This project is mirrored from https://github.com/DefTruth/Awesome-LLM-Inference.git.

- Mar 30, 2025
DefTruth authored

- Mar 25, 2025
DefTruth authored

- Mar 04, 2025
skejriwal44 authored
Thanks for this great list! We’d love to add CacheCraft, a chunk-aware KV-reuse approach for RAG that minimizes redundant computation while preserving generation quality. Our work is concurrent with CacheBlend, with key differences in chunk-level reuse, selective recompute planning, and optimizations designed for real-world production systems. CacheCraft has been accepted at SIGMOD 2025, and we will soon open-source a vLLM-based extension. Results on real RAG traces show strong efficiency gains in production. Recent works such as CacheFocus and EPIC build on related ideas, highlighting the growing relevance of this research direction.
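
As a rough illustration of the chunk-level KV-reuse idea described above, here is a minimal Python sketch. It is not CacheCraft's code or API; `chunk_kv_cache`, `compute_kv`, and `build_prefill_kv` are hypothetical names, and a real system would store tensor KV blocks and plan recomputation much more carefully. The point is only the pattern: per-chunk KV blocks are cached under a content hash, and prefill recomputes only cache misses or chunks flagged by a recompute plan.

```python
# Illustrative sketch only (not the CacheCraft implementation): chunk-level KV reuse
# for RAG. Per-chunk KV blocks are cached by content hash; only unseen chunks, or
# chunks flagged by a (hypothetical) recompute plan, are recomputed at prefill time.
import hashlib
from typing import Dict, List, Tuple

KVBlock = Tuple[list, list]              # stand-in for a chunk's (keys, values) tensors
chunk_kv_cache: Dict[str, KVBlock] = {}  # hypothetical global chunk-level KV cache


def chunk_id(chunk_text: str) -> str:
    """Content hash used as the reuse key for a retrieved chunk."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def compute_kv(chunk_text: str) -> KVBlock:
    """Placeholder for a real prefill pass over one chunk."""
    toks = chunk_text.split()
    return ([f"k({t})" for t in toks], [f"v({t})" for t in toks])


def build_prefill_kv(chunks: List[str], recompute_plan=None) -> List[KVBlock]:
    """Reuse cached chunk KVs; recompute only cache misses or planned chunk indices."""
    recompute_plan = recompute_plan or set()
    kv_blocks = []
    for i, chunk in enumerate(chunks):
        key = chunk_id(chunk)
        if key in chunk_kv_cache and i not in recompute_plan:
            kv_blocks.append(chunk_kv_cache[key])  # reuse: no prefill compute for this chunk
        else:
            kv = compute_kv(chunk)                 # selective recompute
            chunk_kv_cache[key] = kv
            kv_blocks.append(kv)
    return kv_blocks
```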

- Mar 03, 2025
DefTruth authored

- Mar 02, 2025
DefTruth authored

- Mar 01, 2025
Jintao Zhang authored
DefTruth authored
🔥 [MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs (#122)

- Feb 27, 2025
Blank-z0 authored
Add paper "Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification". Dynamic-LLaVA is the first MLLM acceleration framework that sparsifies both the vision and the language context simultaneously, integrating inference-efficiency optimizations for different MLLM inference modes into a unified framework. In practice, Dynamic-LLaVA improves inference efficiency throughout the entire generation process, with negligible degradation in understanding and generation ability, and sometimes even performance gains, compared to full-context inference baselines. GitHub: https://github.com/Osilly/dynamic_llava
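
As a toy illustration of the context-sparsification idea above, here is a minimal sketch. It is not the Dynamic-LLaVA implementation; `sparsify_context` and its inputs are hypothetical, and the actual framework uses learned importance predictors inside the model rather than random scores. The sketch only shows the general mechanism: low-importance vision and language tokens are dropped so that subsequent decoding attends to a smaller context.

```python
# Hypothetical sketch of dynamic context sparsification (not Dynamic-LLaVA's code):
# keep only the top-k highest-scoring vision and language tokens before the next
# decoding step, shrinking the context the model must attend to.
import numpy as np


def sparsify_context(hidden: np.ndarray, scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Drop low-importance tokens; `scores` would come from a learned predictor."""
    seq_len = hidden.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-k:])  # top-k by score, original order preserved
    return hidden[keep_idx]


# Toy usage: 576 "vision tokens" and 128 "language tokens" with random importance scores.
rng = np.random.default_rng(0)
vision_tokens = rng.normal(size=(576, 32))
text_tokens = rng.normal(size=(128, 32))
vision_kept = sparsify_context(vision_tokens, rng.random(576), keep_ratio=0.25)
text_kept = sparsify_context(text_tokens, rng.random(128), keep_ratio=0.5)
print(vision_kept.shape, text_kept.shape)  # (144, 32) (64, 32)
```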

- Feb 24, 2025
Shaoyu Yang authored

- Feb 19, 2025
DefTruth authored
🔥 [DeepSeek-NSA] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (#119)

- Feb 13, 2025

- Jan 31, 2025

- Jan 24, 2025
Shaoyu Yang authored
* add deepseek-r1
* fix: fix the update time

- Jan 23, 2025
DefTruth authored

- Jan 16, 2025

- Jan 15, 2025
Shaoyu Yang authored
* add minimax-01
* fix: fix typos
* feat: add Lightning Attention
* fix: fix some typos

- Jan 08, 2025
DefTruth authored

- Jan 06, 2025
DefTruth authored
🔥 🔥 [FFPA] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA (@DefTruth) (#111)
DefTruth authored
🔥 🔥 [SP: TokenRing] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication (#110)

- Jan 03, 2025
DefTruth authored

- Dec 27, 2024
DefTruth authored

- Dec 22, 2024

- Dec 08, 2024
DefTruth authored
🔥 [BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching (#104)
DefTruth authored
🔥 [ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (#103)

- Dec 01, 2024
DefTruth authored
🔥 [KV Cache Recomputation] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation (#102)

- Nov 28, 2024
DefTruth authored
🔥 [Star-Attention: ~11x speedup] Star Attention: Efficient LLM Inference over Long Sequences

- Nov 25, 2024
DefTruth authored
🔥 [SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference (#100)

- Nov 24, 2024