Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

ICLR 2026
1EPIC Lab, Shanghai Jiao Tong University
2Sichuan University   3Huazhong University of Science and Technology

*Equal contribution. Corresponding author: zhanglinfeng@sjtu.edu.cn.

💡 To our knowledge, the first work to identify heterogeneous head-wise redundancy in the KV caches of both LVLMs and LLMs.

Contributions

(1) Semantic Redundancy Analysis. We conduct in-depth analyses of KV caches in LVLMs, revealing substantial inherent semantic redundancy. In addition, we demonstrate that importance-only methods fail to preserve full coverage of the KV distribution, exposing a fundamental limitation.

(2) Mixing Importance with Diversity. Based on our analysis, we propose MixKV, a head-wise adaptive mechanism that quantifies semantic redundancy to create principled weighting between importance and diversity scores for joint optimization of KV cache compression.

(3) Comprehensive Experimental Validation. Extensive experiments across diverse multi-modal and text benchmarks demonstrate that MixKV yields consistent performance improvements for existing importance-based compression methods while maintaining inference efficiency.

Core Findings

Finding 1

(I) Vision-Language Redundancy Differences: Visual information in LVLMs contains significantly more semantic redundancy than textual information in LLMs. Images often contain repetitive visual elements (e.g., similar textures, repeated patterns), leading to higher semantic similarity among KV pairs during vision-language processing. Figure (a) shows that Qwen2-VL exhibits much denser high-similarity regions compared to the more diverse patterns of Qwen2, while Figure (b) reveals that keys in Qwen2 peak around 0.2-0.4 average similarity whereas Qwen2-VL keys peak around 0.6-0.8, a 2-3x increase. This demonstrates that KV pairs in LVLMs exhibit substantially higher semantic redundancy than in LLMs.
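The redundancy statistic behind this finding can be sketched as the mean pairwise cosine similarity among a head's cached keys. The code below is our own illustration of that measurement; the shapes and function name are assumptions, not the paper's implementation:

```python
import numpy as np

def head_key_redundancy(keys: np.ndarray) -> np.ndarray:
    """Average pairwise cosine similarity of cached keys, per head.

    keys: array of shape (num_heads, seq_len, head_dim).
    Returns one redundancy score per head; higher means more redundant.
    """
    # Normalize each key vector to unit length.
    norms = np.linalg.norm(keys, axis=-1, keepdims=True)
    unit = keys / np.clip(norms, 1e-8, None)
    # Cosine similarity matrix per head: (H, L, L).
    sim = unit @ unit.transpose(0, 2, 1)
    _, seq_len, _ = sim.shape
    # Mean over off-diagonal entries (exclude each key's self-similarity of 1).
    off_diag_sum = sim.sum(axis=(1, 2)) - seq_len
    return off_diag_sum / (seq_len * (seq_len - 1))
```

Under this metric, a head whose keys are near-duplicates scores close to 1, while a head with mutually orthogonal keys scores close to 0, matching the 0.2-0.4 vs. 0.6-0.8 peaks described above.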

Finding 2

(II) Head-wise Redundancy Differences: Within LVLMs, different attention heads focus on distinct multi-modal aspects. Some heads capture global features with lower redundancy, while others focus on local details with higher semantic similarity. The figure illustrates this phenomenon across multiple tasks: for Qwen2-VL-7B, certain heads show extremely high average similarity exceeding 0.9, while others maintain relatively low similarity below 0.3. This pattern is consistent across different vision-language tasks, indicating that the degree of semantic redundancy in KV pairs varies substantially across attention heads of the LLM backbone.

Heterogeneous head-wise redundancy

Heterogeneous head-wise redundancy in LLMs and LVLMs. For both pure-text and vision-language data, different heads exhibit markedly different redundancy levels, and their overall patterns are highly similar: a head that is relatively more redundant on text remains relatively more redundant on vision-language inputs. We hypothesize that this is because different heads focus on different types of information: some heads primarily attend to local patterns and therefore exhibit higher semantic redundancy, while others capture more global information and consequently show much lower redundancy.
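One way to check that head-wise redundancy rankings persist across modalities is a rank correlation between per-head redundancy scores measured on text inputs and on vision-language inputs. A minimal sketch under our own assumptions (the helper name and the lack of tie handling are ours, not the paper's):

```python
import numpy as np

def spearman_rank_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation between two per-head score vectors.

    Ties are not handled, which is adequate for continuous redundancy scores.
    A value near +1 means heads keep the same redundancy ordering.
    """
    # Convert scores to ranks (0 = smallest).
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    # Pearson correlation of the centered ranks.
    rank_a -= rank_a.mean()
    rank_b -= rank_b.mean()
    return float((rank_a @ rank_b) / np.sqrt((rank_a @ rank_a) * (rank_b @ rank_b)))
```

Feeding in per-head redundancy scores from a pure-text batch and a vision-language batch, a correlation near +1 would support the hypothesis that the same heads stay relatively more redundant in both settings.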

Overview

We argue that beyond importance, preserving diverse KV pairs at per-head granularity is essential for minimizing semantic redundancy while maintaining comprehensive information coverage. To this end, we propose MixKV, which adopts a principled "mixing importance with diversity" approach. Specifically, MixKV extends existing importance-based KV compression methods by incorporating head-wise semantic diversity evaluation. By independently measuring semantic similarity within each attention head, MixKV adaptively balances importance and diversity per head to achieve fine-grained joint optimization of KV cache compression in LVLMs.
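In spirit, the per-head selection can be sketched as a convex mix of normalized importance and diversity scores, with the head's redundancy acting as the mixing weight. This is an illustrative sketch, not the paper's exact scoring formula; the diversity definition and min-max normalization below are our assumptions:

```python
import numpy as np

def mixkv_select(importance: np.ndarray, keys: np.ndarray,
                 budget: int, redundancy: float) -> np.ndarray:
    """Keep `budget` KV positions for one head by mixing importance with diversity.

    importance: (L,) importance scores from an existing method (e.g., SnapKV).
    keys:       (L, D) cached key vectors for this head.
    redundancy: scalar in [0, 1]; more redundant heads weight diversity more.
    Returns the indices of the retained positions.
    """
    unit = keys / np.clip(np.linalg.norm(keys, axis=-1, keepdims=True), 1e-8, None)
    sim = unit @ unit.T
    # Diversity score: a key crowded by near-duplicates gets a low score.
    diversity = 1.0 - (sim.sum(axis=1) - 1.0) / (len(keys) - 1)

    def norm01(x: np.ndarray) -> np.ndarray:
        # Min-max normalize so the two score scales are comparable.
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    mixed = (1.0 - redundancy) * norm01(importance) + redundancy * norm01(diversity)
    return np.argsort(mixed)[-budget:]
```

With redundancy = 0 this degenerates to the underlying importance-only method, and with redundancy = 1 it keeps only the least-crowded keys; the head-wise adaptivity comes from estimating `redundancy` per head rather than fixing it globally.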

MixKV algorithm pipeline

MixKV enables existing KV cache compression methods (e.g., SnapKV, AdaKV, and SparseMM) to approximate the full semantic distribution of the uncompressed KV cache more faithfully.

PCA analysis

PCA analysis of semantic coverage under KV compression.

Experimental Results

Performance on multiple image understanding benchmarks.

Since SparseMM does not provide head importance scores for InternVL3-8B, we cannot reproduce their results on this model. “Full KV” means caching all KV pairs (upper bound).

Methods  DocVQA (%)  OCRBench (%)  TextVQA (%)  ChartQA (%)  TextCaps
(each cell lists budgets 256 / 128 / 64)
LLaVA-NeXT-Mistral-7B
Full KV  63.6  52.9  65.7  52.9  0.707
SnapKV  59.7/55.2/47.3  45.0/39.0/31.9  63.5/61.0/57.1  50.2/47.5/42.7  0.650/0.558/0.444
+ MixKV  61.7/58.1/48.8  49.9/44.7/36.1  65.2/64.3/60.1  50.8/47.7/43.6  0.708/0.659/0.514
Δ  +2.0/+2.9/+1.5  +4.9/+5.7/+4.2  +1.7/+3.3/+3.0  +0.6/+0.2/+0.9  +0.058/+0.101/+0.070
PyramidKV  58.2/54.3/43.4  44.1/39.4/29.1  62.9/60.9/54.8  49.1/47.1/40.8  0.621/0.553/0.407
+ MixKV  60.8/57.2/45.1  49.7/43.7/32.0  64.9/63.8/57.8  50.8/47.5/41.3  0.687/0.656/0.466
Δ  +2.6/+2.9/+1.7  +5.6/+4.3/+2.9  +2.0/+2.9/+3.0  +1.7/+0.4/+0.5  +0.066/+0.103/+0.059
AdaKV  59.6/55.9/48.7  45.1/40.4/32.8  62.9/60.5/56.9  50.4/47.8/44.6  0.646/0.566/0.440
+ MixKV  61.3/58.3/50.8  49.8/44.9/36.6  65.3/63.7/59.6  50.9/48.5/45.2  0.704/0.660/0.509
Δ  +1.7/+2.4/+2.1  +4.7/+4.5/+3.8  +2.4/+3.2/+2.7  +0.5/+0.7/+0.6  +0.058/+0.094/+0.069
SparseMM  61.6/60.8/57.6  51.9/50.7/46.2  65.1/64.7/62.8  51.9/51.2/48.9  0.680/0.634/0.524
+ MixKV  61.9/61.0/59.2  50.8/50.4/49.5  65.2/65.0/64.4  51.8/51.5/50.6  0.682/0.652/0.575
Δ  +0.3/+0.2/+1.6  -1.1/-0.3/+3.3  +0.1/+0.3/+1.6  -0.1/+0.3/+1.7  +0.002/+0.018/+0.051
InternVL3-8B
Full KV  91.0  84.2  81.1  86.4  1.042
SnapKV  89.2/85.4/75.7  80.6/69.0/53.1  80.4/78.2/71.9  86.2/84.6/79.8  1.009/0.901/0.734
+ MixKV  89.4/86.2/76.3  81.9/71.1/52.3  80.9/78.8/72.9  86.3/84.8/80.7  1.029/0.949/0.753
Δ  +0.2/+0.8/+0.6  +1.3/+2.1/-0.8  +0.5/+0.6/+1.0  +0.1/+0.2/+0.9  +0.020/+0.048/+0.019
PyramidKV  87.2/82.7/69.7  70.9/58.4/41.8  78.3/75.3/67.2  85.7/84.0/78.0  0.896/0.809/0.632
+ MixKV  87.5/83.5/70.4  72.3/60.0/41.2  79.0/76.6/68.2  85.8/84.4/78.6  0.941/0.850/0.646
Δ  +0.3/+0.8/+0.7  +1.4/+1.6/-0.6  +0.7/+1.3/+1.0  +0.1/+0.4/+0.6  +0.045/+0.041/+0.014
AdaKV  89.2/86.0/77.2  80.8/70.2/53.1  80.4/78.0/71.8  86.2/84.4/80.4  1.013/0.921/0.759
+ MixKV  89.5/86.7/78.1  82.4/71.6/52.3  80.8/78.7/72.9  86.2/85.2/80.9  1.034/0.955/0.782
Δ  +0.3/+0.7/+0.9  +1.6/+1.4/-0.8  +0.4/+0.7/+1.1  +0.0/+0.8/+0.5  +0.021/+0.034/+0.023
Qwen2-VL-7B-Instruct
Full KV  93.9  82.1  82.1  81.5  1.469
SnapKV  88.0/80.1/66.5  77.3/71.9/62.4  80.3/77.5/69.9  81.3/79.6/75.5  1.360/1.142/0.794
+ MixKV  90.5/82.6/67.9  79.3/75.4/66.0  81.9/80.6/72.5  81.6/81.2/77.6  1.470/1.342/0.878
Δ  +2.5/+2.5/+1.4  +2.0/+3.5/+3.6  +1.6/+3.1/+2.6  +0.3/+1.6/+2.1  +0.110/+0.200/+0.084
PyramidKV  81.7/74.0/59.9  74.5/67.9/56.8  78.3/74.6/65.3  81.1/79.2/73.5  1.115/0.951/0.569
+ MixKV  84.0/76.3/60.8  76.6/72.6/58.4  80.4/77.1/67.0  81.3/80.7/75.5  1.348/1.119/0.633
Δ  +2.3/+2.3/+0.9  +2.1/+4.7/+1.6  +2.1/+2.5/+1.7  +0.2/+1.5/+2.0  +0.233/+0.168/+0.064
AdaKV  87.4/81.2/67.1  77.8/71.0/62.1  79.9/77.0/70.3  80.8/79.6/75.9  1.345/1.146/0.775
+ MixKV  90.3/82.1/67.8  79.3/74.7/65.5  81.8/79.6/71.2  81.5/80.9/77.4  1.448/1.275/0.878
Δ  +2.9/+0.9/+0.7  +1.5/+3.7/+3.4  +1.9/+2.6/+0.9  +0.7/+1.3/+1.5  +0.103/+0.129/+0.103
SparseMM  93.5/91.5/84.9  81.2/79.0/74.3  82.0/81.6/77.3  82.0/81.5/80.1  1.482/1.430/1.038
+ MixKV  93.8/92.7/86.4  82.0/81.0/77.1  82.0/82.0/80.9  81.6/81.8/81.4  1.480/1.459/1.303
Δ  +0.3/+1.2/+1.5  +0.8/+2.0/+2.8  +0.0/+0.4/+3.6  -0.4/+0.3/+1.3  -0.002/+0.029/+0.265
Performance on ScreenSpot-v2 GUI grounding benchmark with Qwen2.5-VL-7B-Instruct.

“Full KV” refers to caching all KV pairs of the LLM (upper bound).

Methods  Mobile Text  Mobile Icon/Widget  Desktop Text  Desktop Icon/Widget  Web Text  Web Icon/Widget  Average
(each cell lists budgets 128 / 64)
Qwen2.5-VL-7B-Instruct
Full KV  97.2  87.7  91.2  77.1  88.5  82.3  88.5
SnapKV  65.5/28.6  78.7/53.1  86.1/57.2  74.3/57.1  76.9/46.6  74.4/49.8  75.3/46.9
+ MixKV  86.6/35.5  85.3/60.2  87.1/71.1  75.0/65.7  85.0/53.0  76.4/56.2  83.3/54.9
Δ  +21.1/+6.9  +6.6/+7.1  +1.0/+13.9  +0.7/+8.6  +8.1/+6.4  +2.0/+6.4  +7.9/+8.0
PyramidKV  45.5/11.0  62.1/34.1  82.0/33.0  75.0/47.9  69.2/20.5  71.4/24.1  65.6/26.1
+ MixKV  64.1/15.9  74.4/42.2  87.1/41.8  74.3/47.1  76.9/24.8  71.9/27.1  74.1/31.1
Δ  +18.6/+4.9  +12.3/+8.1  +5.1/+8.8  -0.7/-0.8  +7.7/+4.3  +0.5/+3.0  +8.5/+5.0
AdaKV  80.7/35.2  84.8/59.2  90.2/70.6  74.3/63.6  82.1/49.6  75.9/56.2  81.6/53.7
+ MixKV  94.1/49.0  88.6/66.8  89.7/75.3  75.0/68.6  85.0/61.5  76.9/63.1  86.0/62.7
Δ  +13.4/+13.8  +3.8/+7.6  -0.5/+4.7  +0.7/+5.0  +2.9/+12.0  +1.0/+6.9  +4.4/+9.0
Performance on LongBench with Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct.

“Full KV” refers to caching all KV pairs of the LLM (upper bound).

Methods  NrtvQA  Qasper  MF-en  HotpotQA  2WikiMQA  Musique  GovReport  QMSum  MultiNews  TREC  TriviaQA  SAMSum  PCount  PRe  Lcc  RB-P  Avg
Mistral-7B-Instruct-v0.2
Full KV  26.81  33.19  49.26  43.02  27.12  18.78  32.80  24.16  27.02  71.00  86.23  42.64  2.75  86.98  55.09  53.01  42.49
KV Cache Budget = 1024
SnapKV  24.98  30.24  49.03  41.45  27.11  18.26  25.69  23.87  25.97  68.00  86.25  42.30  2.82  87.93  54.95  52.00  41.30
+ MixKV  25.55  31.04  48.19  41.31  27.18  19.24  26.98  23.88  26.74  70.00  86.46  43.77  2.90  85.99  55.02  51.28  41.60
Δ  +0.57  +0.80  -0.84  -0.14  +0.07  +0.98  +1.29  +0.01  +0.77  +2.00  +0.21  +1.47  +0.08  -1.94  +0.07  -0.72  +0.30
AdaKV  25.15  30.60  49.06  40.93  26.92  18.81  25.88  23.96  25.84  69.00  86.24  43.01  2.85  88.68  55.19  52.46  41.54
+ MixKV  25.31  30.56  48.83  41.96  26.95  18.27  26.77  23.85  26.37  70.50  86.63  43.44  2.62  86.52  55.65  51.87  41.63
Δ  +0.16  -0.04  -0.23  +1.03  +0.03  -0.54  +0.89  -0.11  +0.53  +1.50  +0.39  +0.43  -0.23  -2.16  +0.46  -0.59  +0.09
KV Cache Budget = 512
SnapKV  23.69  27.71  49.16  39.70  25.44  17.38  23.31  23.28  24.20  66.00  86.17  41.54  3.24  86.29  53.71  51.19  40.13
+ MixKV  23.56  28.19  48.96  40.36  25.86  17.34  24.63  23.36  25.32  66.00  86.23  42.25  3.02  87.66  53.87  51.40  40.50
Δ  -0.13  +0.48  -0.20  +0.66  +0.42  -0.04  +1.32  +0.08  +1.12  0.00  +0.06  +0.71  -0.22  +1.37  +0.16  +0.21  +0.37
AdaKV  24.35  27.33  48.76  40.07  26.38  17.97  23.73  23.51  24.31  67.50  86.38  42.53  3.06  86.65  53.90  51.57  40.50
+ MixKV  24.26  28.39  48.90  40.86  26.33  17.07  24.63  23.32  25.41  69.00  86.51  42.67  3.07  86.44  54.46  51.69  40.81
Δ  -0.09  +1.06  +0.14  +0.79  -0.05  -0.90  +0.90  -0.19  +1.10  +1.50  +0.13  +0.14  +0.01  -0.21  +0.56  +0.12  +0.31
Llama-3.1-8B-Instruct
Full KV  30.22  45.37  55.80  55.97  45.00  31.26  35.12  25.38  27.20  72.50  91.64  43.57  9.41  99.50  62.88  56.43  49.20
KV Cache Budget = 1024
SnapKV  27.10  43.91  55.07  55.60  45.17  30.47  27.84  24.44  25.75  69.00  91.89  42.69  9.44  99.50  62.49  56.30  48.86
+ MixKV  27.50  44.19  55.42  55.82  45.40  30.65  28.83  24.75  26.26  70.00  91.62  42.88  8.96  99.50  62.69  56.41  49.30
Δ  +0.40  +0.28  +0.35  +0.22  +0.23  +0.18  +0.99  +0.31  +0.51  +1.00  -0.27  +0.19  -0.48  +0.00  +0.20  +0.11  +0.44
AdaKV  28.16  43.98  54.68  56.14  45.19  30.30  28.35  24.80  26.11  72.50  91.72  42.48  8.74  99.50  62.94  56.51  49.27
+ MixKV  27.98  44.28  55.03  56.03  45.58  30.55  29.06  24.58  26.70  72.50  91.42  43.37  9.46  99.50  62.65  56.97  49.37
Δ  -0.18  +0.30  +0.35  -0.11  +0.39  +0.25  +0.71  -0.22  +0.59  +0.00  -0.30  +0.89  +0.72  +0.00  -0.29  +0.46  +0.10
KV Cache Budget = 512
SnapKV  27.42  38.95  53.57  55.20  44.68  29.75  25.55  24.21  24.28  64.50  92.35  41.04  9.98  99.50  62.50  54.93  46.53
+ MixKV  26.76  41.77  53.77  55.19  44.72  30.02  26.03  24.28  25.27  69.00  91.44  42.24  9.98  99.50  61.84  55.17  47.37
Δ  -0.66  +2.82  +0.20  -0.01  +0.04  +0.27  +0.48  +0.07  +0.99  +4.50  -0.91  +1.20  +0.00  +0.00  -0.66  +0.24  +0.84
AdaKV  25.96  40.26  52.82  54.55  43.83  30.43  25.76  24.06  24.69  69.00  92.05  42.10  9.45  99.50  62.58  55.59  46.42
+ MixKV  26.13  42.08  53.18  55.47  43.88  28.80  26.68  24.03  25.35  70.00  91.01  42.79  9.41  99.50  62.92  55.82  46.75
Δ  +0.17  +1.82  +0.36  +0.92  +0.05  -1.63  +0.92  -0.03  +0.66  +1.00  -1.04  +0.69  -0.04  +0.00  +0.34  +0.23  +0.33
Performance of applying MixKV to InternVL3-38B.

Results report budgets = 128 / 64.

Methods  DocVQA (%)  OCRBench (%)  TextVQA (%)  ChartQA (%)  TextCaps
(each cell lists budgets 128 / 64)
InternVL3-38B
Full KV  93.5  85.9  83.8  88.6  0.953
SnapKV  87.5/85.2  77.8/64.3  82.0/78.5  87.5/85.2  0.932/0.822
+ MixKV  92.1/86.9  79.3/65.8  82.8/79.4  88.2/85.8  0.959/0.859
Δ  +4.6/+1.7  +1.5/+1.5  +0.8/+0.9  +0.7/+0.6  +0.027/+0.037
AdaKV  92.0/87.6  79.6/67.8  82.0/79.3  87.4/85.3  0.940/0.841
+ MixKV  92.3/88.5  81.1/69.2  82.9/80.2  88.2/86.0  0.961/0.859
Δ  +0.3/+0.9  +1.5/+1.4  +0.9/+0.9  +0.8/+0.7  +0.021/+0.018
Performance of applying MixKV to Qwen3-VL-30B-A3B-Instruct.

Results report budgets = 128 / 64.

Methods  DocVQA (%)  OCRBench (%)  TextVQA (%)  ChartQA (%)  TextCaps
(each cell lists budgets 128 / 64)
Qwen3-VL-30B-A3B-Instruct
Full KV  94.5  84.0  83.5  85.1  0.287
SnapKV  91.9/83.8  71.0/55.2  75.3/75.3  83.8/79.8  0.314/0.272
+ MixKV  93.2/86.2  80.7/68.8  80.8/79.7  84.5/80.8  0.411/0.349
Δ  +1.3/+2.4  +9.7/+13.6  +5.5/+4.4  +0.7/+1.0  +0.097/+0.077
Efficiency comparisons of total latency and peak memory.

For a context length of 32,000 tokens, “Full KV” caches the entire sequence, whereas the KV compression strategies use a budget of 64. The upper part of the figure shows total latency; the lower part shows peak memory.

Citation

If you find this project helpful, please consider citing our paper with:

@article{liu2025mixkv,
  title={Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models},
  author={Liu, Xuyang and Gui, Xiyan and Zhang, Yuchao and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2510.20707},
  year={2025}
}