# 🚀 KVSplit

Differentiated KV Cache Quantization for Apple Silicon

## 📌 Overview

Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache. KVSplit enables you to:

- Reduce KV cache memory by up to 72% (K4V4), or by 59% with under 1% perplexity change (K8V4)
- Run 2-3x longer contexts in the same memory budget
- Maintain or improve inference speed compared to FP16
- Optimize for Apple Silicon with full Metal support

## Key Findings

| Configuration | VRAM @ 8K tokens (% of FP16) | Tokens/sec | Perplexity Change |
|---|---|---|---|
| FP16 (base) | 176.00 MB (100%) | 54,360 | -- |
| K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03% |
| K8V4 | 71.50 MB (41%) | 57,438 | +0.86% |
| K4V8 | 71.50 MB (41%) | 58,690 | +6.06% |
| K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15% |

## Memory Savings by Sequence Length

| Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|---|---|---|---|---|
| FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB |
| K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB |
| K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
| K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
| K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB |

## Features

- Independent quantization of keys and values in the KV cache
- Optimized for Apple Silicon with Metal support
- Comprehensive benchmarking suite with perplexity measurement
- Memory usage and performance analysis tools
- Publication-quality visualization tools
- Easy setup and usage

## Prerequisites

- macOS (tested on Apple Silicon)
- Homebrew package manager
- Xcode Command Line Tools

## ⚡ One-Command Installation

    # Clone the repository
    git clone https://github.com/dipampaul17/KVSplit.git
    cd KVSplit

    # Run the installer script
    chmod +x scripts/install_kvsplit.sh
    ./scripts/install_kvsplit.sh

The installer will:

- Set up the project structure
- Clone and build llama.cpp with Metal support
- Configure for differentiated KV cache quantization
- Download a small test model (optional)
- Set up Python environment for visualization

## 🏎️ Quick Comparison

Want to see the benefits...
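
The figures in the Memory Savings by Sequence Length table can be reproduced with a few lines of arithmetic. The sketch below is a rough model, not KVSplit's own benchmarking code. It assumes llama.cpp's block formats, where each 32-element block carries an FP16 scale, giving `q8_0` 8.5 bits per element and `q4_0` 4.5 bits per element. It also assumes 5,632 cached elements per token for keys and the same again for values. The source does not state the model dimensions: `ELEMENTS_PER_TOKEN` is inferred from the 176.00 MB FP16 figure at 8192 tokens, and `kv_cache_mib` is a hypothetical helper name.

```python
# Rough KV cache size model (a sketch under the assumptions stated above).

# Effective bits per cached element for each storage format.
# q8_0: 32 int8 values + one FP16 scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values + one FP16 scale = 18 bytes per 32 elements.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

# ASSUMPTION: elements cached per token for keys (and, separately, for values).
# Inferred from the table's FP16 row; not stated in the source.
ELEMENTS_PER_TOKEN = 5632


def kv_cache_mib(n_tokens, key_type, value_type,
                 elements_per_token=ELEMENTS_PER_TOKEN):
    """Estimate KV cache size in MiB for the given key/value formats."""
    bits_per_token = elements_per_token * (
        BITS_PER_ELEMENT[key_type] + BITS_PER_ELEMENT[value_type]
    )
    total_bytes = n_tokens * bits_per_token / 8
    return total_bytes / (1024 * 1024)


# The configurations from the table, as (key format, value format) pairs.
CONFIGS = {
    "FP16": ("f16", "f16"),
    "K8V8": ("q8_0", "q8_0"),
    "K8V4": ("q8_0", "q4_0"),
    "K4V8": ("q4_0", "q8_0"),
    "K4V4": ("q4_0", "q4_0"),
}

if __name__ == "__main__":
    for name, (k_type, v_type) in CONFIGS.items():
        sizes = [kv_cache_mib(n, k_type, v_type) for n in (2048, 4096, 8192)]
        print(name, " ".join(f"{s:8.2f} MiB" for s in sizes))
```

Under these assumptions the function returns 176.0 for FP16 and 71.5 for K8V4 at 8192 tokens, matching the table. K8V4 and K4V8 come out identical because only the sum of the two bit widths matters for memory; the difference between them shows up in perplexity, not in size. The 128-token column does not follow the same linear scaling: 5.50 MB is the value this model gives for 256 tokens, so the sketch covers only the longer sequence lengths.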