Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Summary

Our method, Trimmed-Llama, reduces the key-value cache (KV cache) and latency of cross-attention-based Large Vision Language Models (LVLMs) without sacrificing performance.

We identify sparsity in LVLM cross-attention maps, showing a consistent layer-wise pattern where most visual features are selected in early layers, with little variation in later layers.

Our work has been accepted to the CVPR 2025 ELVM (Efficient Large Vision Language Model) Workshop.

Key Messages of the Paper

Visual token reduction lowers inference costs in large vision-language models (LVLMs) by pruning redundant image features. Unlike prior work focused on self-attention-only LVLMs, we target cross-attention-based models, which deliver superior performance. We observe that the key-value (KV) cache for image tokens in cross-attention layers is significantly larger than that for text tokens in self-attention layers, creating a major compute bottleneck.

To address this, our method, Trimmed-Llama, leverages sparsity in cross-attention maps to prune unnecessary visual features without additional training. By reducing visual features by 50%, Trimmed-Llama cuts KV cache demands, lowers inference latency, and reduces memory usage, all while maintaining benchmark performance.

Significance/Importance of the Paper

We observe that the KV cache for image tokens in cross-attention layers is much larger than for text tokens in self-attention layers, creating a compute bottleneck. To address this, we selectively prune redundant visual features using cross-attention sparsity.

Summary of Methodology

Image features are pruned in the first cross-attention block using a criterion derived from attention weights.

Our method uses headwise attention scores from language sequences to remove unimportant image features. In the first cross-attention layer, each head selects the top-k most salient features based on attention scores. The union of these top-k sets across all heads forms the final selection, providing a focused image representation.
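The selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function name `select_image_features`, the nested-list layout `attn[head][text_token][image_feature]`, and the choice to score each image feature by summing its attention weights over all text tokens are assumptions made for the example.

```python
from typing import List, Set


def select_image_features(attn: List[List[List[float]]], k: int) -> List[int]:
    """Return sorted indices of image features kept after head-wise top-k.

    attn[h][t][i] is the attention weight that head h assigns from text
    token t to image feature i in the first cross-attention layer.
    (Hypothetical layout chosen for this sketch.)
    """
    keep: Set[int] = set()
    for head in attn:
        num_features = len(head[0])
        # Assumption: a feature's score for this head is its attention
        # weight summed over all text tokens.
        scores = [sum(row[i] for row in head) for i in range(num_features)]
        # Indices of the k highest-scoring features for this head.
        top_k = sorted(range(num_features), key=lambda i: scores[i], reverse=True)[:k]
        # The final selection is the union over heads.
        keep.update(top_k)
    return sorted(keep)


# Toy input: 2 heads, 2 text tokens, 4 image features.
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]],  # head 0 favors feature 0
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.2, 0.6]],  # head 1 favors feature 3
]
kept = select_image_features(attn, k=1)
print(kept)
```

Because the final set is a union, its size is not fixed: it ranges from k (all heads agree) to k times the number of heads (no overlap), so the number of retained features depends on the input.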

Experimental Results


Our method consistently achieves comparable or better performance while using only 40-50% of the image features. Notably, the pruning ratio is allocated adaptively for each task: LLaVA-Bench, an open-ended generation task, retains more image features than the other benchmarks.

Our method reduces latency by pruning key and value inputs in the cross-attention layers. Since image features are pruned after the first cross-attention layer, both the key-value projections and the attention operations are consequently reduced. Furthermore, the impact of the reduction grows more significant with larger batch sizes.
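To make the cache savings concrete, the following back-of-the-envelope sketch estimates the cross-attention KV cache size with and without pruning. All configuration values (8 cross-attention layers, 6000 image tokens, 8 KV heads, head dimension 128, 2 bytes per element) are illustrative placeholders rather than figures from the paper, and the function name `cross_attn_kv_bytes` is hypothetical. The sketch also assumes that the first cross-attention layer keeps the full set of image features and only the later layers cache the reduced set.

```python
def cross_attn_kv_bytes(
    batch: int,
    num_layers: int,
    num_image_tokens: int,
    kept_tokens: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """Estimate the cross-attention KV cache size in bytes."""
    # Each cached token stores one key and one value vector per KV head.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem
    # Assumption: the first layer sees all image features; pruning only
    # affects the layers after it.
    cached_tokens = num_image_tokens + (num_layers - 1) * kept_tokens
    return batch * per_token * cached_tokens


# Placeholder configuration, not taken from the paper.
cfg = dict(num_layers=8, num_image_tokens=6000, num_kv_heads=8, head_dim=128)

full = cross_attn_kv_bytes(batch=1, kept_tokens=6000, **cfg)
trimmed = cross_attn_kv_bytes(batch=1, kept_tokens=3000, **cfg)
print(f"full: {full / 1e6:.1f} MB, trimmed: {trimmed / 1e6:.1f} MB")

# The absolute number of bytes saved scales linearly with batch size.
for batch in (1, 4, 16):
    saved = cross_attn_kv_bytes(batch=batch, kept_tokens=6000, **cfg) - \
        cross_attn_kv_bytes(batch=batch, kept_tokens=3000, **cfg)
    print(f"batch {batch}: saves {saved / 1e6:.1f} MB")
```

Under these assumptions, halving the image features leaves roughly 56% of the original cross-attention cache rather than 50%, because the first layer is not pruned.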

Conclusion

Cross-attention-based models, like LLaMA-3.2-Vision, achieve outstanding performance and efficiency using high-quality, proprietary datasets. We expect future open-source models with similar architectures to drive further advancements.

If you have any further inquiries about this research, please feel free to reach out to us at the following 📧 email address: contact@nota.ai.

Furthermore, if you have an interest in AI optimization technologies, you can visit our website at 🔗 netspresso.ai.
