Tech

Visual token reduction lowers inference costs in large vision-language models (LVLMs) by pruning redundant image features. Unlike prior work focused on self-attention-only LVLMs, we target cross-attention-based models, which deliver superior performance. We observe that the key-value (KV) cache for image tokens in cross-attention layers is significantly larger than that for text tokens in self-attention layers, creating a major compute bottleneck.
To address this, our method, Trimmed Llama, leverages sparsity in cross-attention maps to prune unnecessary visual features without additional training. By reducing visual features by 50%, Trimmed Llama cuts KV cache demands, lowers inference latency, and reduces memory usage—all while maintaining benchmark performance.
We observe that the KV cache for image tokens in cross-attention layers is much larger than for text tokens in self-attention layers, creating a compute bottleneck. To address this, we selectively prune redundant visual features using cross-attention sparsity.

Image features are pruned in the first cross-attention block using a criterion derived from attention weights.

Our method uses headwise attention scores from language sequences to remove unimportant image features. In the first cross-attention layer, each head selects the top-k most salient features based on attention scores. The union of these top-k sets across all heads forms the final selection, providing a focused image representation.
As described below, we achieved third place out of 25 teams in Subtask B (binary multilingual MGT detection) of Shared Task 1, with an F1 Macro Score of 0.7532.

Our method consistently outperforms or achieves comparable performance while leveraging 40∼50% of the image features. Notably, the pruning ratios are adaptively allocated for each task, as evidenced by LLaVA-Bench, an open-ended generation task utilizing more image features compared to other benchmarks.

Our method reduces latency by pruning key and value inputs in the cross-attention layers. Since image features are pruned after the first cross-attention layer, both the key-value projections and the attention operations are consequently reduced. Furthermore, the impact of the reduction grows more significant with larger batch sizes.
Cross-attention-based models, like LLaMA-3.2-Vision, achieve outstanding performance and efficiency using high-quality, proprietary datasets. We expect future open-source models with similar architectures to drive further advancements.
If you have any further inquiries about this research, please feel free to reach out to us at following 📧 email address: contact@nota.ai.
Furthermore, if you have an interest in AI optimization technologies, you can visit our website at 🔗 netspresso.ai.