Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

ICLR 2026
1EPIC Lab, Shanghai Jiao Tong University
2Sichuan University   3Huazhong University of Science and Technology

*Equal contribution. Corresponding author: zhanglinfeng@sjtu.edu.cn.

💡 The first work to identify heterogeneous head-wise redundancy in the KV caches of both LVLMs and LLMs.

Contributions

(1) Semantic Redundancy Analysis. We conduct in-depth analyses of KV caches in LVLMs, revealing substantial inherent semantic redundancy. Moreover, we demonstrate that importance-based methods fail to preserve full coverage of the KV distribution, exposing a fundamental limitation.

(2) Mixing Importance with Diversity. Building on this analysis, we propose MixKV, a head-wise adaptive mechanism that quantifies semantic redundancy to weight importance and diversity scores in a principled way, jointly optimizing KV cache compression.

(3) Comprehensive Experimental Validation. Extensive experiments across diverse multi-modal and text benchmarks demonstrate that MixKV yields consistent performance improvements for existing importance-based compression methods while maintaining inference efficiency.

Core Findings

Finding 1

(I) Vision-Language Redundancy Differences: Visual information in LVLMs contains significantly more semantic redundancy than textual information in LLMs. Images often contain repetitive visual elements (e.g., similar textures, repeated patterns), leading to higher semantic similarity among KV pairs during vision-language processing. Figure (a) shows that Qwen2-VL exhibits much denser high-similarity regions than the more diverse patterns of Qwen2, while Figure (b) reveals that keys in Qwen2 peak around an average similarity of 0.2-0.4, whereas Qwen2-VL keys peak around 0.6-0.8, a 2-3x increase. This demonstrates that KV pairs in LVLMs exhibit substantially higher semantic redundancy than in LLMs.
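The redundancy statistics behind Findings 1 and 2 can be approximated in a few lines. The sketch below is our own minimal reconstruction (not the released analysis script): it computes the mean pairwise cosine similarity of the cached key vectors for each attention head of a layer.

```python
# Minimal sketch (our assumption of the measurement): per-head average pairwise
# cosine similarity of cached keys, used as a semantic-redundancy statistic.
import torch

def avg_key_similarity(keys: torch.Tensor) -> torch.Tensor:
    """keys: [num_heads, seq_len, head_dim] cached key states of one layer.
    Returns the mean pairwise cosine similarity per head, shape [num_heads]."""
    k = torch.nn.functional.normalize(keys, dim=-1)   # unit-norm keys
    sim = k @ k.transpose(-1, -2)                     # [H, L, L] cosine similarities
    n = sim.shape[-1]
    off_diag = sim.sum(dim=(-1, -2)) - n              # drop the diagonal of self-similarities (all ones)
    return off_diag / (n * (n - 1))                   # average over all key pairs

# Example: redundancy of random "keys" for a layer with 8 heads, 256 tokens, head dim 64.
print(avg_key_similarity(torch.randn(8, 256, 64)))
```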

Finding 2

(II) Head-wise Redundancy Differences: Within LVLMs, different attention heads focus on distinct multi-modal aspects. Some heads capture global features with lower redundancy, while others focus on local details with higher semantic similarity. The figure below illustrates this phenomenon across multiple tasks: for Qwen2-VL-7B, certain heads show extremely high average similarity (exceeding 0.9), while other heads maintain relatively low similarity (below 0.3). This pattern is consistent across different vision-language tasks, indicating that KV pairs in LVLMs show varying degrees of semantic redundancy across the attention heads of the LLM backbone.

Heterogeneous head-wise redundancy

Heterogeneous head-wise redundancy in LLMs and LVLMs. For both pure-text and vision-language data, different heads exhibit markedly different redundancy levels, and their overall patterns are highly similar: a head that is relatively more redundant on text remains relatively more redundant on vision-language inputs. We hypothesize that this is because different heads focus on different types of information: some heads primarily attend to local patterns and therefore exhibit higher semantic redundancy, while others capture more global information and consequently show much lower redundancy.

Overview

We argue that beyond importance, preserving diverse KV pairs at per-head granularity is essential for minimizing semantic redundancy while maintaining comprehensive information coverage. To this end, we propose MixKV, which adopts a principled "mixing importance with diversity" approach. Specifically, MixKV extends existing importance-based KV compression methods by incorporating head-wise semantic diversity evaluation. By independently measuring semantic similarity within each attention head, MixKV adaptively balances importance and diversity per head to achieve fine-grained joint optimization of KV cache compression in LVLMs.
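As a rough illustration of this idea, the sketch below mixes a per-head importance score (taken from any existing method, e.g. SnapKV-style observation-window attention) with a similarity-based diversity score, using each head's measured redundancy as the mixing weight. The function name, normalization, and blending rule are our own assumptions for exposition; this is not the released MixKV implementation.

```python
# Hedged sketch of "mixing importance with diversity" with a head-wise adaptive weight.
import torch
import torch.nn.functional as F

def mixkv_select(keys, importance, budget):
    """keys: [num_heads, seq_len, head_dim]; importance: [num_heads, seq_len];
    budget: number of KV pairs kept per head. Returns indices [num_heads, budget]."""
    H, L, _ = keys.shape
    k = F.normalize(keys, dim=-1)
    sim = k @ k.transpose(-1, -2)                           # [H, L, L] cosine similarity
    mean_sim = (sim.sum(-1) - 1) / (L - 1)                  # [H, L] each token's similarity to the rest
    redundancy = mean_sim.mean(-1).clamp(0, 1)              # [H] per-head redundancy estimate
    diversity = 1.0 - mean_sim                              # tokens unlike the rest score high

    # Min-max normalize both terms, then weight diversity more on highly redundant heads.
    imp = (importance - importance.amin(-1, keepdim=True)) / (
        importance.amax(-1, keepdim=True) - importance.amin(-1, keepdim=True) + 1e-6)
    div = (diversity - diversity.amin(-1, keepdim=True)) / (
        diversity.amax(-1, keepdim=True) - diversity.amin(-1, keepdim=True) + 1e-6)
    alpha = redundancy.unsqueeze(-1)                        # [H, 1] head-wise mixing weight
    mixed = (1 - alpha) * imp + alpha * div
    return mixed.topk(budget, dim=-1).indices               # KV pairs retained for each head

# Example with random tensors: 8 heads, 1024 cached tokens, budget 128.
idx = mixkv_select(torch.randn(8, 1024, 64), torch.rand(8, 1024), budget=128)
print(idx.shape)  # torch.Size([8, 128])
```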

MixKV algorithm pipeline

MixKV enables existing KV cache compression methods (e.g., SnapKV, AdaKV, and SparseMM) to more faithfully approximate the semantic distribution of the full, uncompressed KV cache.

PCA analysis

PCA analysis of semantic coverage under KV compression.
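One plausible way to reproduce such a coverage view (the exact procedure is our assumption, not taken from the paper) is to project a head's cached keys onto their top-2 principal components and compare the spread of the retained subset against the full cache.

```python
# Hedged sketch: PCA projection of cached keys as a semantic-coverage check.
import torch

def pca_2d(x: torch.Tensor) -> torch.Tensor:
    """x: [n, d] key vectors. Returns their projection onto the top-2 principal axes."""
    x = x - x.mean(0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=2)   # columns of v are principal directions
    return x @ v[:, :2]

keys = torch.randn(1024, 64)              # full cache of one head (toy data)
kept = torch.randperm(1024)[:128]         # indices kept by some compression method
proj_full = pca_2d(keys)
proj_kept = proj_full[kept]
# Coverage proxy: how much of the full cache's variance the kept subset spans.
print(proj_kept.var(0) / proj_full.var(0))
```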

Experimental Results

Performance on multiple image understanding benchmarks.

Since SparseMM does not provide head importance scores for InternVL3-8B, we cannot reproduce its results on this model. "Full KV" denotes caching all KV pairs (the upper bound). Results are reported at a KV cache budget of 128.

| Methods | DocVQA (%) | OCRBench (%) | TextVQA (%) | ChartQA (%) | TextCaps |
|---|---|---|---|---|---|
| LLaVA-NeXT-Mistral-7B | | | | | |
| Full KV | 63.6 | 52.9 | 65.7 | 52.9 | 0.707 |
| SnapKV | 55.2 | 39.0 | 61.0 | 47.5 | 0.558 |
| + MixKV | 58.1 | 44.7 | 64.3 | 47.7 | 0.659 |
| Δ | +2.9 | +5.7 | +3.3 | +0.2 | +0.101 |
| PyramidKV | 54.3 | 39.4 | 60.9 | 47.1 | 0.553 |
| + MixKV | 57.2 | 43.7 | 63.8 | 47.5 | 0.656 |
| Δ | +2.9 | +4.3 | +2.9 | +0.4 | +0.103 |
| AdaKV | 55.9 | 40.4 | 60.5 | 47.8 | 0.566 |
| + MixKV | 58.3 | 44.9 | 63.7 | 48.5 | 0.660 |
| Δ | +2.4 | +4.5 | +3.2 | +0.7 | +0.094 |
| SparseMM | 60.8 | 50.7 | 64.7 | 51.2 | 0.634 |
| + MixKV | 61.0 | 50.4 | 65.0 | 51.5 | 0.652 |
| Δ | +0.2 | -0.3 | +0.3 | +0.3 | +0.018 |
| InternVL3-8B | | | | | |
| Full KV | 90.96 | 84.2 | 81.1 | 86.36 | 1.042 |
| SnapKV | 85.4 | 69.0 | 78.2 | 84.6 | 0.901 |
| + MixKV | 86.2 | 71.1 | 78.8 | 84.8 | 0.949 |
| Δ | +0.8 | +2.1 | +0.6 | +0.2 | +0.048 |
| PyramidKV | 82.7 | 58.4 | 75.3 | 84.0 | 0.809 |
| + MixKV | 83.5 | 60.0 | 76.6 | 84.4 | 0.850 |
| Δ | +0.8 | +1.6 | +1.3 | +0.4 | +0.041 |
| AdaKV | 86.0 | 70.2 | 78.0 | 84.4 | 0.921 |
| + MixKV | 86.7 | 71.6 | 78.7 | 85.2 | 0.955 |
| Δ | +0.7 | +1.4 | +0.7 | +0.8 | +0.034 |
| Qwen2-VL-7B-Instruct | | | | | |
| Full KV | 93.9 | 82.1 | 82.1 | 81.5 | 1.469 |
| SnapKV | 80.1 | 71.9 | 77.5 | 79.6 | 1.142 |
| + MixKV | 82.6 | 75.4 | 80.6 | 81.2 | 1.342 |
| Δ | +2.5 | +3.5 | +3.1 | +1.6 | +0.200 |
| PyramidKV | 74.0 | 67.9 | 74.6 | 79.2 | 0.951 |
| + MixKV | 76.3 | 72.6 | 77.1 | 80.7 | 1.119 |
| Δ | +2.3 | +4.7 | +2.5 | +1.5 | +0.168 |
| AdaKV | 81.2 | 71.0 | 77.0 | 79.6 | 1.146 |
| + MixKV | 82.1 | 74.7 | 79.6 | 80.9 | 1.275 |
| Δ | +0.9 | +3.7 | +2.6 | +1.3 | +0.129 |
| SparseMM | 91.5 | 79.0 | 81.6 | 81.5 | 1.430 |
| + MixKV | 92.7 | 81.0 | 82.0 | 81.8 | 1.459 |
| Δ | +1.2 | +2.0 | +0.4 | +0.3 | +0.029 |
Performance on the ScreenSpot-v2 GUI grounding benchmark with Qwen2.5-VL-7B-Instruct.

"Full KV" refers to caching all KV pairs of the LLM (upper bound). Results here report only budget = 128.

| Methods | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | | | | | | | |
| Full KV | 97.2 | 87.7 | 91.2 | 77.1 | 88.5 | 82.3 | 88.5 |
| SnapKV | 65.5 | 78.7 | 86.1 | 74.3 | 76.9 | 74.4 | 75.3 |
| + MixKV | 86.6 | 85.3 | 87.1 | 75.0 | 85.0 | 76.4 | 83.3 |
| Δ | +21.1 | +6.6 | +1.0 | +0.7 | +8.1 | +2.0 | +7.9 |
| PyramidKV | 45.5 | 62.1 | 82.0 | 75.0 | 69.2 | 71.4 | 65.6 |
| + MixKV | 64.1 | 74.4 | 87.1 | 74.3 | 76.9 | 71.9 | 74.1 |
| Δ | +18.6 | +12.3 | +5.1 | -0.7 | +7.7 | +0.5 | +8.5 |
| AdaKV | 80.7 | 84.8 | 90.2 | 74.3 | 82.1 | 75.9 | 81.6 |
| + MixKV | 94.1 | 88.6 | 89.7 | 75.0 | 85.0 | 76.9 | 86.0 |
| Δ | +13.4 | +3.8 | -0.5 | +0.7 | +2.9 | +1.0 | +4.4 |
Performance on LongBench with Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct.

"Full KV" refers to caching all KV pairs of the LLM (upper bound). Results here report only budget = 512.

| Methods | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | GovReport | QMSum | MultiNews | TREC | TriviaQA | SAMSum | PCount | PRe | Lcc | RB-P | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 (KV cache budget = 512) | | | | | | | | | | | | | | | | | |
| Full KV | 26.81 | 33.19 | 49.26 | 43.02 | 27.12 | 18.78 | 32.80 | 24.16 | 27.02 | 71.00 | 86.23 | 42.64 | 2.75 | 86.98 | 55.09 | 53.01 | 42.49 |
| SnapKV | 23.69 | 27.71 | 49.16 | 39.70 | 25.44 | 17.38 | 23.31 | 23.28 | 24.20 | 66.00 | 86.17 | 41.54 | 3.24 | 86.29 | 53.71 | 51.19 | 40.13 |
| + MixKV | 23.56 | 28.19 | 48.96 | 40.36 | 25.86 | 17.34 | 24.63 | 23.36 | 25.32 | 66.00 | 86.23 | 42.25 | 3.02 | 87.66 | 53.87 | 51.40 | 40.50 |
| Δ | -0.13 | +0.48 | -0.20 | +0.66 | +0.42 | -0.04 | +1.32 | +0.08 | +1.12 | 0.00 | +0.06 | +0.71 | -0.22 | +1.37 | +0.16 | +0.21 | +0.37 |
| AdaKV | 24.35 | 27.33 | 48.76 | 40.07 | 26.38 | 17.97 | 23.73 | 23.51 | 24.31 | 67.50 | 86.38 | 42.53 | 3.06 | 86.65 | 53.90 | 51.57 | 40.50 |
| + MixKV | 24.26 | 28.39 | 48.90 | 40.86 | 26.33 | 17.07 | 24.63 | 23.32 | 25.41 | 69.00 | 86.51 | 42.67 | 3.07 | 86.44 | 54.46 | 51.69 | 40.81 |
| Δ | -0.09 | +1.06 | +0.14 | +0.79 | -0.05 | -0.90 | +0.90 | -0.19 | +1.10 | +1.50 | +0.13 | +0.14 | +0.01 | -0.21 | +0.56 | +0.12 | +0.31 |
| Llama-3.1-8B-Instruct (KV cache budget = 512) | | | | | | | | | | | | | | | | | |
| Full KV | 30.22 | 45.37 | 55.80 | 55.97 | 45.00 | 31.26 | 35.12 | 25.38 | 27.20 | 72.50 | 91.64 | 43.57 | 9.41 | 99.50 | 62.88 | 56.43 | 49.20 |
| SnapKV | 27.42 | 38.95 | 53.57 | 55.20 | 44.68 | 29.75 | 25.55 | 24.21 | 24.28 | 64.50 | 92.35 | 41.04 | 9.98 | 99.50 | 62.50 | 54.93 | 46.53 |
| + MixKV | 26.76 | 41.77 | 53.77 | 55.19 | 44.72 | 30.02 | 26.03 | 24.28 | 25.27 | 69.00 | 91.44 | 42.24 | 9.98 | 99.50 | 61.84 | 55.17 | 47.37 |
| Δ | -0.66 | +2.82 | +0.20 | -0.01 | +0.04 | +0.27 | +0.48 | +0.07 | +0.99 | +4.50 | -0.91 | +1.20 | +0.00 | +0.00 | -0.66 | +0.24 | +0.84 |
| AdaKV | 25.96 | 40.26 | 52.82 | 54.55 | 43.83 | 30.43 | 25.76 | 24.06 | 24.69 | 69.00 | 92.05 | 42.10 | 9.45 | 99.50 | 62.58 | 55.59 | 46.42 |
| + MixKV | 26.13 | 42.08 | 53.18 | 55.47 | 43.88 | 28.80 | 26.68 | 24.03 | 25.35 | 70.00 | 91.01 | 42.79 | 9.41 | 99.50 | 62.92 | 55.82 | 46.75 |
| Δ | +0.17 | +1.82 | +0.36 | +0.92 | +0.05 | -1.63 | +0.92 | -0.03 | +0.66 | +1.00 | -1.04 | +0.69 | -0.04 | +0.00 | +0.34 | +0.23 | +0.33 |
Performance of applying MixKV to InternVL3-38B.

Results are reported at a KV cache budget of 128.

| Methods | DocVQA (%) | OCRBench (%) | TextVQA (%) | ChartQA (%) | TextCaps |
|---|---|---|---|---|---|
| InternVL3-38B | | | | | |
| Full KV | 93.5 | 85.9 | 83.8 | 88.6 | 0.953 |
| SnapKV | 87.5 | 77.8 | 82.0 | 87.5 | 0.932 |
| + MixKV | 92.1 | 79.3 | 82.8 | 88.2 | 0.959 |
| Δ | +4.6 | +1.5 | +0.8 | +0.7 | +0.027 |
| AdaKV | 92.0 | 79.6 | 82.0 | 87.4 | 0.940 |
| + MixKV | 92.3 | 81.1 | 82.9 | 88.2 | 0.961 |
| Δ | +0.3 | +1.5 | +0.9 | +0.8 | +0.021 |
Performance of applying MixKV to Qwen3-VL-30B-A3B-Instruct.

Results are reported at a KV cache budget of 128.

| Methods | DocVQA (%) | OCRBench (%) | TextVQA (%) | ChartQA (%) | TextCaps |
|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | | | | | |
| Full KV | 94.5 | 84.0 | 83.5 | 85.1 | 0.287 |
| SnapKV | 91.9 | 71.0 | 75.3 | 83.8 | 0.314 |
| + MixKV | 93.2 | 80.7 | 80.8 | 84.5 | 0.411 |
| Δ | +1.3 | +9.7 | +5.5 | +0.7 | +0.097 |
Efficiency comparisons of total latency and peak memory.

For a context length of 32,000, "Full KV" refers to caching the entire sequence, whereas the KV compression strategies use a budget of 64. The upper part of the figure shows total latency; the lower part shows peak memory.

Citation

If you find this project helpful, please consider citing our paper with:

@article{liu2025mixkv,
  title={Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models},
  author={Liu, Xuyang and Gui, Xiyan and Zhang, Yuchao and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2510.20707},
  year={2025}
}