This project is mirrored from https://github.com/DefTruth/Awesome-LLM-Inference.git.

- Mar 30, 2025
DefTruth authored

- Mar 25, 2025
DefTruth authored

- Mar 04, 2025
skejriwal44 authored
Thanks for this great list! We’d love to add CacheCraft, a chunk-aware KV-reuse approach for RAG that minimizes redundant computation while preserving generation quality. Our work is concurrent with CacheBlend, with key differences in chunk-level reuse, selective recompute planning, and optimizations designed for real-world production systems. CacheCraft has been accepted at SIGMOD 2025, and we will soon open-source a vLLM-based extension. Results on real RAG traces show strong efficiency gains in production. Recent works such as CacheFocus and EPIC build on related ideas, highlighting the growing relevance of this research direction.
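
As a rough illustration of the chunk-level KV-reuse idea described above, here is a minimal Python sketch. It is not CacheCraft's code or API; `chunk_kv_cache`, `compute_kv`, and `build_prefill_kv` are hypothetical names, and a real system would store tensor KV blocks and plan recomputation much more carefully. The point is only the pattern: per-chunk KV blocks are cached under a content hash, and prefill recomputes only cache misses or chunks flagged by a recompute plan.

```python
# Illustrative sketch only (not the CacheCraft implementation): chunk-level KV reuse
# for RAG. Per-chunk KV blocks are cached by content hash; only unseen chunks, or
# chunks flagged by a (hypothetical) recompute plan, are recomputed at prefill time.
import hashlib
from typing import Dict, List, Tuple

KVBlock = Tuple[list, list]              # stand-in for a chunk's (keys, values) tensors
chunk_kv_cache: Dict[str, KVBlock] = {}  # hypothetical global chunk-level KV cache


def chunk_id(chunk_text: str) -> str:
    """Content hash used as the reuse key for a retrieved chunk."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def compute_kv(chunk_text: str) -> KVBlock:
    """Placeholder for a real prefill pass over one chunk."""
    toks = chunk_text.split()
    return ([f"k({t})" for t in toks], [f"v({t})" for t in toks])


def build_prefill_kv(chunks: List[str], recompute_plan=None) -> List[KVBlock]:
    """Reuse cached chunk KVs; recompute only cache misses or planned chunk indices."""
    recompute_plan = recompute_plan or set()
    kv_blocks = []
    for i, chunk in enumerate(chunks):
        key = chunk_id(chunk)
        if key in chunk_kv_cache and i not in recompute_plan:
            kv_blocks.append(chunk_kv_cache[key])  # reuse: no prefill compute for this chunk
        else:
            kv = compute_kv(chunk)                 # selective recompute
            chunk_kv_cache[key] = kv
            kv_blocks.append(kv)
    return kv_blocks
```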

- Mar 03, 2025
DefTruth authored

- Mar 02, 2025
DefTruth authored

- Mar 01, 2025
Jintao Zhang authored
DefTruth authored
🔥 [MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs (#122)

- Feb 27, 2025
Blank-z0 authored
Add paper "Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification". Dynamic-LLaVA is the first MLLM acceleration framework that sparsifies both the vision and the language context simultaneously, integrating inference-efficiency optimizations for different MLLM inference modes into a unified framework. In practice, Dynamic-LLaVA improves inference efficiency throughout the entire generation process, with negligible degradation in understanding and generation ability, and sometimes even performance gains, compared to full-context inference baselines. GitHub: https://github.com/Osilly/dynamic_llava
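
As a toy illustration of the context-sparsification idea above, here is a minimal sketch. It is not the Dynamic-LLaVA implementation; `sparsify_context` and its inputs are hypothetical, and the actual framework uses learned importance predictors inside the model rather than random scores. The sketch only shows the general mechanism: low-importance vision and language tokens are dropped so that subsequent decoding attends to a smaller context.

```python
# Hypothetical sketch of dynamic context sparsification (not Dynamic-LLaVA's code):
# keep only the top-k highest-scoring vision and language tokens before the next
# decoding step, shrinking the context the model must attend to.
import numpy as np


def sparsify_context(hidden: np.ndarray, scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Drop low-importance tokens; `scores` would come from a learned predictor."""
    seq_len = hidden.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-k:])  # top-k by score, original order preserved
    return hidden[keep_idx]


# Toy usage: 576 "vision tokens" and 128 "language tokens" with random importance scores.
rng = np.random.default_rng(0)
vision_tokens = rng.normal(size=(576, 32))
text_tokens = rng.normal(size=(128, 32))
vision_kept = sparsify_context(vision_tokens, rng.random(576), keep_ratio=0.25)
text_kept = sparsify_context(text_tokens, rng.random(128), keep_ratio=0.5)
print(vision_kept.shape, text_kept.shape)  # (144, 32) (64, 32)
```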

- Feb 24, 2025
Shaoyu Yang authored

- Feb 19, 2025
DefTruth authored
🔥 [DeepSeek-NSA] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (#119)

- Feb 13, 2025

- Jan 31, 2025

- Jan 24, 2025
Shaoyu Yang authored
* add deepseek-r1
* fix: fix the update time

- Jan 23, 2025
DefTruth authored

- Jan 16, 2025

- Jan 15, 2025
Shaoyu Yang authored
* add minimax-01
* fix: fix typos
* feat: add Lightning Attention
* fix: fix some typos

- Jan 08, 2025
DefTruth authored

- Jan 06, 2025
DefTruth authored
🔥 🔥 [FFPA] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA (@DefTruth) (#111)
DefTruth authored
🔥 🔥 [SP: TokenRing] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication (#110)

- Jan 03, 2025
DefTruth authored

- Dec 27, 2024
DefTruth authored

- Dec 22, 2024

- Dec 08, 2024
DefTruth authored
🔥 [BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching (#104)
DefTruth authored
🔥 [ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (#103)

- Dec 01, 2024
DefTruth authored
🔥 [KV Cache Recomputation] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation (#102)

- Nov 28, 2024
DefTruth authored
🔥 [Star-Attention: ~11x speedup] Star Attention: Efficient LLM Inference over Long Sequences

- Nov 25, 2024
DefTruth authored
🔥 [SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference (#100)

- Nov 24, 2024