Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

ICLR 2026
1EPIC Lab, Shanghai Jiao Tong University
2Sichuan University   3Huazhong University of Science and Technology

*Equal contribution. Corresponding author: zhanglinfeng@sjtu.edu.cn.

💡 The first work to identify heterogeneous head-wise redundancy in the KV caches of both LVLMs and LLMs.

Contributions

(1) Semantic Redundancy Analysis. We conduct in-depth analyses of KV caches in LVLMs, revealing substantial inherent semantic redundancy. Moreover, we demonstrate that importance-based methods fail to preserve full coverage of the KV distribution, exposing a fundamental limitation.

(2) Mixing Importance with Diversity. Building on this analysis, we propose MixKV, a head-wise adaptive mechanism that quantifies semantic redundancy to weight importance and diversity scores in a principled way, jointly optimizing KV cache compression.

(3) Comprehensive Experimental Validation. Extensive experiments across diverse multi-modal and text benchmarks demonstrate that MixKV yields consistent performance improvements for existing importance-based compression methods while maintaining inference efficiency.

Core Findings

Finding 1

(I) Vision-Language Redundancy Differences: Visual information in LVLMs contains significantly more semantic redundancy than textual information in LLMs. Images often contain repetitive visual elements (e.g., similar textures, repeated patterns), leading to higher semantic similarity among KV pairs during vision-language processing. Figure (a) shows that Qwen2-VL exhibits much denser high-similarity regions than the more diverse patterns of Qwen2, while Figure (b) reveals that keys in Qwen2 peak around an average similarity of 0.2-0.4, whereas Qwen2-VL keys peak around 0.6-0.8, a 2-3x increase. This demonstrates that KV pairs in LVLMs exhibit substantially higher semantic redundancy than in LLMs.
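The redundancy statistics behind Findings 1 and 2 can be approximated in a few lines. The sketch below is our own minimal reconstruction (not the released analysis script): it computes the mean pairwise cosine similarity of the cached key vectors for each attention head of a layer.

```python
# Minimal sketch (our assumption of the measurement): per-head average pairwise
# cosine similarity of cached keys, used as a semantic-redundancy statistic.
import torch

def avg_key_similarity(keys: torch.Tensor) -> torch.Tensor:
    """keys: [num_heads, seq_len, head_dim] cached key states of one layer.
    Returns the mean pairwise cosine similarity per head, shape [num_heads]."""
    k = torch.nn.functional.normalize(keys, dim=-1)   # unit-norm keys
    sim = k @ k.transpose(-1, -2)                     # [H, L, L] cosine similarities
    n = sim.shape[-1]
    off_diag = sim.sum(dim=(-1, -2)) - n              # drop the diagonal of self-similarities (all ones)
    return off_diag / (n * (n - 1))                   # average over all key pairs

# Example: redundancy of random "keys" for a layer with 8 heads, 256 tokens, head dim 64.
print(avg_key_similarity(torch.randn(8, 256, 64)))
```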

Finding 2

(II) Head-wise Redundancy Differences: Within LVLMs, different attention heads focus on distinct multi-modal aspects. Some heads capture global features with lower redundancy, while others focus on local details with higher semantic similarity. The figure below illustrates this phenomenon across multiple tasks: for Qwen2-VL-7B, certain heads show extremely high average similarity (exceeding 0.9), while other heads maintain relatively low similarity (below 0.3). This pattern is consistent across different vision-language tasks, indicating that KV pairs in LVLMs show varying degrees of semantic redundancy across the attention heads of the LLM backbone.

Heterogeneous head-wise redundancy

Heterogeneous head-wise redundancy in LLMs and LVLMs. For both pure-text and vision-language data, different heads exhibit markedly different redundancy levels, and their overall patterns are highly similar: a head that is relatively more redundant on text remains relatively more redundant on vision-language inputs. We hypothesize that this is because different heads focus on different types of information: some heads primarily attend to local patterns and therefore exhibit higher semantic redundancy, while others capture more global information and consequently show much lower redundancy.

Overview

We argue that beyond importance, preserving diverse KV pairs at per-head granularity is essential for minimizing semantic redundancy while maintaining comprehensive information coverage. To this end, we propose MixKV, which adopts a principled "mixing importance with diversity" approach. Specifically, MixKV extends existing importance-based KV compression methods by incorporating head-wise semantic diversity evaluation. By independently measuring semantic similarity within each attention head, MixKV adaptively balances importance and diversity per head to achieve fine-grained joint optimization of KV cache compression in LVLMs.
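As a rough illustration of this idea, the sketch below mixes a per-head importance score (taken from any existing method, e.g. SnapKV-style observation-window attention) with a similarity-based diversity score, using each head's measured redundancy as the mixing weight. The function name, normalization, and blending rule are our own assumptions for exposition; this is not the released MixKV implementation.

```python
# Hedged sketch of "mixing importance with diversity" with a head-wise adaptive weight.
import torch
import torch.nn.functional as F

def mixkv_select(keys, importance, budget):
    """keys: [num_heads, seq_len, head_dim]; importance: [num_heads, seq_len];
    budget: number of KV pairs kept per head. Returns indices [num_heads, budget]."""
    H, L, _ = keys.shape
    k = F.normalize(keys, dim=-1)
    sim = k @ k.transpose(-1, -2)                           # [H, L, L] cosine similarity
    mean_sim = (sim.sum(-1) - 1) / (L - 1)                  # [H, L] each token's similarity to the rest
    redundancy = mean_sim.mean(-1).clamp(0, 1)              # [H] per-head redundancy estimate
    diversity = 1.0 - mean_sim                              # tokens unlike the rest score high

    # Min-max normalize both terms, then weight diversity more on highly redundant heads.
    imp = (importance - importance.amin(-1, keepdim=True)) / (
        importance.amax(-1, keepdim=True) - importance.amin(-1, keepdim=True) + 1e-6)
    div = (diversity - diversity.amin(-1, keepdim=True)) / (
        diversity.amax(-1, keepdim=True) - diversity.amin(-1, keepdim=True) + 1e-6)
    alpha = redundancy.unsqueeze(-1)                        # [H, 1] head-wise mixing weight
    mixed = (1 - alpha) * imp + alpha * div
    return mixed.topk(budget, dim=-1).indices               # KV pairs retained for each head

# Example with random tensors: 8 heads, 1024 cached tokens, budget 128.
idx = mixkv_select(torch.randn(8, 1024, 64), torch.rand(8, 1024), budget=128)
print(idx.shape)  # torch.Size([8, 128])
```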

MixKV algorithm pipeline

MixKV enables existing KV cache compression methods (e.g., SnapKV, AdaKV, and SparseMM) to more faithfully approximate the semantic distribution of the full, uncompressed KV cache.

PCA analysis

PCA analysis of semantic coverage under KV compression.
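One plausible way to reproduce such a coverage view (the exact procedure is our assumption, not taken from the paper) is to project a head's cached keys onto their top-2 principal components and compare the spread of the retained subset against the full cache.

```python
# Hedged sketch: PCA projection of cached keys as a semantic-coverage check.
import torch

def pca_2d(x: torch.Tensor) -> torch.Tensor:
    """x: [n, d] key vectors. Returns their projection onto the top-2 principal axes."""
    x = x - x.mean(0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=2)   # columns of v are principal directions
    return x @ v[:, :2]

keys = torch.randn(1024, 64)              # full cache of one head (toy data)
kept = torch.randperm(1024)[:128]         # indices kept by some compression method
proj_full = pca_2d(keys)
proj_kept = proj_full[kept]
# Coverage proxy: how much of the full cache's variance the kept subset spans.
print(proj_kept.var(0) / proj_full.var(0))
```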

Experimental Results

Performance on multiple image understanding benchmarks.

Since SparseMM does not provide head importance scores for InternVL3-8B, we cannot reproduce its results on this model. "Full KV" denotes caching all KV pairs (the upper bound). Results are reported at a KV cache budget of 128.

| Methods | DocVQA (%) | OCRBench (%) | TextVQA (%) | ChartQA (%) | TextCaps |
|---|---|---|---|---|---|
| LLaVA-NeXT-Mistral-7B | | | | | |
| Full KV | 63.6 | 52.9 | 65.7 | 52.9 | 0.707 |
| SnapKV | 55.2 | 39.0 | 61.0 | 47.5 | 0.558 |
| + MixKV | 58.1 | 44.7 | 64.3 | 47.7 | 0.659 |
| Δ | +2.9 | +5.7 | +3.3 | +0.2 | +0.101 |
| PyramidKV | 54.3 | 39.4 | 60.9 | 47.1 | 0.553 |
| + MixKV | 57.2 | 43.7 | 63.8 | 47.5 | 0.656 |
| Δ | +2.9 | +4.3 | +2.9 | +0.4 | +0.103 |
| AdaKV | 55.9 | 40.4 | 60.5 | 47.8 | 0.566 |
| + MixKV | 58.3 | 44.9 | 63.7 | 48.5 | 0.660 |
| Δ | +2.4 | +4.5 | +3.2 | +0.7 | +0.094 |
| SparseMM | 60.8 | 50.7 | 64.7 | 51.2 | 0.634 |
| + MixKV | 61.0 | 50.4 | 65.0 | 51.5 | 0.652 |
| Δ | +0.2 | -0.3 | +0.3 | +0.3 | +0.018 |
| InternVL3-8B | | | | | |
| Full KV | 90.96 | 84.2 | 81.1 | 86.36 | 1.042 |
| SnapKV | 85.4 | 69.0 | 78.2 | 84.6 | 0.901 |
| + MixKV | 86.2 | 71.1 | 78.8 | 84.8 | 0.949 |
| Δ | +0.8 | +2.1 | +0.6 | +0.2 | +0.048 |
| PyramidKV | 82.7 | 58.4 | 75.3 | 84.0 | 0.809 |
| + MixKV | 83.5 | 60.0 | 76.6 | 84.4 | 0.850 |
| Δ | +0.8 | +1.6 | +1.3 | +0.4 | +0.041 |
| AdaKV | 86.0 | 70.2 | 78.0 | 84.4 | 0.921 |
| + MixKV | 86.7 | 71.6 | 78.7 | 85.2 | 0.955 |
| Δ | +0.7 | +1.4 | +0.7 | +0.8 | +0.034 |
| Qwen2-VL-7B-Instruct | | | | | |
| Full KV | 93.9 | 82.1 | 82.1 | 81.5 | 1.469 |
| SnapKV | 80.1 | 71.9 | 77.5 | 79.6 | 1.142 |
| + MixKV | 82.6 | 75.4 | 80.6 | 81.2 | 1.342 |
| Δ | +2.5 | +3.5 | +3.1 | +1.6 | +0.200 |
| PyramidKV | 74.0 | 67.9 | 74.6 | 79.2 | 0.951 |
| + MixKV | 76.3 | 72.6 | 77.1 | 80.7 | 1.119 |
| Δ | +2.3 | +4.7 | +2.5 | +1.5 | +0.168 |
| AdaKV | 81.2 | 71.0 | 77.0 | 79.6 | 1.146 |
| + MixKV | 82.1 | 74.7 | 79.6 | 80.9 | 1.275 |
| Δ | +0.9 | +3.7 | +2.6 | +1.3 | +0.129 |
| SparseMM | 91.5 | 79.0 | 81.6 | 81.5 | 1.430 |
| + MixKV | 92.7 | 81.0 | 82.0 | 81.8 | 1.459 |
| Δ | +1.2 | +2.0 | +0.4 | +0.3 | +0.029 |
Performance on the ScreenSpot-v2 GUI grounding benchmark with Qwen2.5-VL-7B-Instruct.

"Full KV" refers to caching all KV pairs of the LLM (upper bound). Results here report only budget = 128.

| Methods | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | | | | | | | |
| Full KV | 97.2 | 87.7 | 91.2 | 77.1 | 88.5 | 82.3 | 88.5 |
| SnapKV | 65.5 | 78.7 | 86.1 | 74.3 | 76.9 | 74.4 | 75.3 |
| + MixKV | 86.6 | 85.3 | 87.1 | 75.0 | 85.0 | 76.4 | 83.3 |
| Δ | +21.1 | +6.6 | +1.0 | +0.7 | +8.1 | +2.0 | +7.9 |
| PyramidKV | 45.5 | 62.1 | 82.0 | 75.0 | 69.2 | 71.4 | 65.6 |
| + MixKV | 64.1 | 74.4 | 87.1 | 74.3 | 76.9 | 71.9 | 74.1 |
| Δ | +18.6 | +12.3 | +5.1 | -0.7 | +7.7 | +0.5 | +8.5 |
| AdaKV | 80.7 | 84.8 | 90.2 | 74.3 | 82.1 | 75.9 | 81.6 |
| + MixKV | 94.1 | 88.6 | 89.7 | 75.0 | 85.0 | 76.9 | 86.0 |
| Δ | +13.4 | +3.8 | -0.5 | +0.7 | +2.9 | +1.0 | +4.4 |
Performance on LongBench with Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct.

"Full KV" refers to caching all KV pairs of the LLM (upper bound). Results here report only budget = 512.

| Methods | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | GovReport | QMSum | MultiNews | TREC | TriviaQA | SAMSum | PCount | PRe | Lcc | RB-P | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 (KV cache budget = 512) | | | | | | | | | | | | | | | | | |
| Full KV | 26.81 | 33.19 | 49.26 | 43.02 | 27.12 | 18.78 | 32.80 | 24.16 | 27.02 | 71.00 | 86.23 | 42.64 | 2.75 | 86.98 | 55.09 | 53.01 | 42.49 |
| SnapKV | 23.69 | 27.71 | 49.16 | 39.70 | 25.44 | 17.38 | 23.31 | 23.28 | 24.20 | 66.00 | 86.17 | 41.54 | 3.24 | 86.29 | 53.71 | 51.19 | 40.13 |
| + MixKV | 23.56 | 28.19 | 48.96 | 40.36 | 25.86 | 17.34 | 24.63 | 23.36 | 25.32 | 66.00 | 86.23 | 42.25 | 3.02 | 87.66 | 53.87 | 51.40 | 40.50 |
| Δ | -0.13 | +0.48 | -0.20 | +0.66 | +0.42 | -0.04 | +1.32 | +0.08 | +1.12 | 0.00 | +0.06 | +0.71 | -0.22 | +1.37 | +0.16 | +0.21 | +0.37 |
| AdaKV | 24.35 | 27.33 | 48.76 | 40.07 | 26.38 | 17.97 | 23.73 | 23.51 | 24.31 | 67.50 | 86.38 | 42.53 | 3.06 | 86.65 | 53.90 | 51.57 | 40.50 |
| + MixKV | 24.26 | 28.39 | 48.90 | 40.86 | 26.33 | 17.07 | 24.63 | 23.32 | 25.41 | 69.00 | 86.51 | 42.67 | 3.07 | 86.44 | 54.46 | 51.69 | 40.81 |
| Δ | -0.09 | +1.06 | +0.14 | +0.79 | -0.05 | -0.90 | +0.90 | -0.19 | +1.10 | +1.50 | +0.13 | +0.14 | +0.01 | -0.21 | +0.56 | +0.12 | +0.31 |
| Llama-3.1-8B-Instruct (KV cache budget = 512) | | | | | | | | | | | | | | | | | |
| Full KV | 30.22 | 45.37 | 55.80 | 55.97 | 45.00 | 31.26 | 35.12 | 25.38 | 27.20 | 72.50 | 91.64 | 43.57 | 9.41 | 99.50 | 62.88 | 56.43 | 49.20 |
| SnapKV | 27.42 | 38.95 | 53.57 | 55.20 | 44.68 | 29.75 | 25.55 | 24.21 | 24.28 | 64.50 | 92.35 | 41.04 | 9.98 | 99.50 | 62.50 | 54.93 | 46.53 |
| + MixKV | 26.76 | 41.77 | 53.77 | 55.19 | 44.72 | 30.02 | 26.03 | 24.28 | 25.27 | 69.00 | 91.44 | 42.24 | 9.98 | 99.50 | 61.84 | 55.17 | 47.37 |
| Δ | -0.66 | +2.82 | +0.20 | -0.01 | +0.04 | +0.27 | +0.48 | +0.07 | +0.99 | +4.50 | -0.91 | +1.20 | +0.00 | +0.00 | -0.66 | +0.24 | +0.84 |
| AdaKV | 25.96 | 40.26 | 52.82 | 54.55 | 43.83 | 30.43 | 25.76 | 24.06 | 24.69 | 69.00 | 92.05 | 42.10 | 9.45 | 99.50 | 62.58 | 55.59 | 46.42 |
| + MixKV | 26.13 | 42.08 | 53.18 | 55.47 | 43.88 | 28.80 | 26.68 | 24.03 | 25.35 | 70.00 | 91.01 | 42.79 | 9.41 | 99.50 | 62.92 | 55.82 | 46.75 |
| Δ | +0.17 | +1.82 | +0.36 | +0.92 | +0.05 | -1.63 | +0.92 | -0.03 | +0.66 | +1.00 | -1.04 | +0.69 | -0.04 | +0.00 | +0.34 | +0.23 | +0.33 |
Performance of applying MixKV to InternVL3-38B.

Results are reported at a KV cache budget of 128.

| Methods | DocVQA (%) | OCRBench (%) | TextVQA (%) | ChartQA (%) | TextCaps |
|---|---|---|---|---|---|
| InternVL3-38B | | | | | |
| Full KV | 93.5 | 85.9 | 83.8 | 88.6 | 0.953 |
| SnapKV | 87.5 | 77.8 | 82.0 | 87.5 | 0.932 |
| + MixKV | 92.1 | 79.3 | 82.8 | 88.2 | 0.959 |
| Δ | +4.6 | +1.5 | +0.8 | +0.7 | +0.027 |
| AdaKV | 92.0 | 79.6 | 82.0 | 87.4 | 0.940 |
| + MixKV | 92.3 | 81.1 | 82.9 | 88.2 | 0.961 |
| Δ | +0.3 | +1.5 | +0.9 | +0.8 | +0.021 |
Performance of applying MixKV to Qwen3-VL-30B-A3B-Instruct.

Results are reported at a KV cache budget of 128.

| Methods | DocVQA (%) | OCRBench (%) | TextVQA (%) | ChartQA (%) | TextCaps |
|---|---|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | | | | | |
| Full KV | 94.5 | 84.0 | 83.5 | 85.1 | 0.287 |
| SnapKV | 91.9 | 71.0 | 75.3 | 83.8 | 0.314 |
| + MixKV | 93.2 | 80.7 | 80.8 | 84.5 | 0.411 |
| Δ | +1.3 | +9.7 | +5.5 | +0.7 | +0.097 |
Efficiency comparisons of total latency and peak memory.

For a context length of 32,000, "Full KV" refers to caching the entire sequence, whereas the KV compression strategies use a budget of 64. The upper part of the figure shows total latency; the lower part shows peak memory.

Citation

If you find this project helpful, please consider citing our paper with:

@article{liu2025mixkv,
  title={Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models},
  author={Liu, Xuyang and Gui, Xiyan and Zhang, Yuchao and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2510.20707},
  year={2025}
}