Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

May 7, 2025

Summary

  • Our method, Trimmed Llama, reduces the key-value (KV) cache size and latency of cross-attention-based Large Vision Language Models (LVLMs) without sacrificing performance.
  • We identify sparsity in LVLM cross-attention maps and show a consistent layer-wise pattern: the salient visual features are largely determined in the early layers and vary little in later layers.
  • Our work has been accepted to the CVPR 2025 ELVM (Efficient Large Vision Language Model) Workshop.

Key Messages of the Paper

Visual token reduction lowers inference costs in large vision-language models (LVLMs) by pruning redundant image features. Unlike prior work focused on self-attention-only LVLMs, we target cross-attention-based models, which deliver superior performance. We observe that the key-value (KV) cache for image tokens in cross-attention layers is significantly larger than that for text tokens in self-attention layers, creating a major compute bottleneck.

To address this, our method, Trimmed Llama, leverages sparsity in cross-attention maps to prune unnecessary visual features without additional training. By reducing visual features by 50%, Trimmed Llama cuts KV cache demands, lowers inference latency, and reduces memory usage, all while maintaining benchmark performance.

Significance/Importance of the Paper

We observe that the KV cache for image tokens in cross-attention layers is much larger than for text tokens in self-attention layers, creating a compute bottleneck. To address this, we selectively prune redundant visual features using cross-attention sparsity.

Summary of Methodology

Image features are pruned in the first cross-attention block using a criterion derived from attention weights.

Our method uses headwise attention scores from language sequences to remove unimportant image features. In the first cross-attention layer, each head selects the top-k most salient features based on attention scores. The union of these top-k sets across all heads forms the final selection, providing a focused image representation.
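The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function name `select_image_features`, the nested-list layout `attn[head][text_token][image_feature]`, and the choice to score each image feature by summing its attention weights over the text tokens are all assumptions made for the example.

```python
from typing import List, Set


def select_image_features(attn: List[List[List[float]]], k: int) -> List[int]:
    """Return the sorted indices of image features kept after pruning.

    attn[h][t][i] is the attention weight that head h assigns from
    text token t to image feature i (hypothetical layout).
    """
    kept: Set[int] = set()
    for head in attn:
        num_features = len(head[0])
        # Assumption: a feature's importance for this head is the sum of
        # the attention it receives from all text tokens.
        scores = [sum(row[i] for row in head) for i in range(num_features)]
        # Indices of the k highest-scoring features for this head.
        top_k = sorted(range(num_features), key=lambda i: scores[i], reverse=True)[:k]
        # The final selection is the union of the per-head top-k sets.
        kept.update(top_k)
    return sorted(kept)


# Toy input: 2 heads, 2 text tokens, 4 image features.
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]],  # head 0 favors feature 0
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.2, 0.1, 0.6]],  # head 1 favors feature 3
]
kept_k1 = select_image_features(attn, k=1)
kept_k2 = select_image_features(attn, k=2)
```

Because the result is a union across heads, the number of kept features can exceed k and depends on how much the heads agree, which is one way a fixed per-head k can still yield different overall pruning ratios for different inputs.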

Experimental Results

Our method consistently achieves comparable or better performance while using only 40-50% of the image features. Notably, the pruning ratio adapts to each task: LLaVA-Bench, an open-ended generation task, retains more image features than the other benchmarks.

Our method reduces latency by pruning key and value inputs in the cross-attention layers. Since image features are pruned after the first cross-attention layer, both the key-value projections and the attention operations are consequently reduced. Furthermore, the impact of the reduction grows more significant with larger batch sizes.
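To make the scaling argument concrete, here is a rough back-of-the-envelope sketch of how many key/value elements the cross-attention layers would hold with and without pruning. Everything in it is an assumption for illustration: the function name, the accounting rule that the first cross-attention layer still sees all image features while later layers see only the kept ones, and the configuration numbers, which are hypothetical and not taken from LLaMA-3.2-Vision.

```python
def cross_attn_kv_elements(batch: int, n_img: int, n_layers: int,
                           n_kv_heads: int, head_dim: int,
                           keep_ratio: float = 1.0) -> int:
    """Count key + value elements stored across all cross-attention layers."""
    # One key vector and one value vector per image feature, per layer.
    per_token = 2 * n_kv_heads * head_dim
    kept = round(n_img * keep_ratio)
    # Assumption: the first layer uses every feature; the remaining
    # layers use only the features that survived pruning.
    tokens_across_layers = n_img + (n_layers - 1) * kept
    return batch * per_token * tokens_across_layers


# Hypothetical configuration, chosen only to keep the arithmetic readable.
cfg = dict(n_img=1000, n_layers=8, n_kv_heads=8, head_dim=128)

full_b1 = cross_attn_kv_elements(batch=1, **cfg)
pruned_b1 = cross_attn_kv_elements(batch=1, keep_ratio=0.5, **cfg)
saved_b1 = full_b1 - pruned_b1

full_b4 = cross_attn_kv_elements(batch=4, **cfg)
pruned_b4 = cross_attn_kv_elements(batch=4, keep_ratio=0.5, **cfg)
saved_b4 = full_b4 - pruned_b4
```

Under these assumptions the absolute saving grows linearly with batch size (`saved_b4` is four times `saved_b1`), which is consistent with the observation that the benefit becomes more pronounced for larger batches.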

Conclusion

Cross-attention-based models, like LLaMA-3.2-Vision, achieve outstanding performance and efficiency using high-quality, proprietary datasets. We expect future open-source models with similar architectures to drive further advancements.


If you have any further inquiries about this research, please feel free to reach out to us at the following 📧 email address: contact@nota.ai.

Furthermore, if you have an interest in AI optimization technologies, you can visit our website at 🔗 netspresso.ai.
