Jun 18, 2026
Researchers Introduce CaVe-VLM-CoT Framework to Combat Hallucinations in Vision-Language Models

Researchers have introduced CaVe-VLM-CoT, an interpretable framework designed to reduce hallucinations in vision-language models by enforcing evidence-grounded reasoning through a five-stage pipeline with built-in verification and correction mechanisms. The framework achieved 87.1% accuracy on ScienceQA and 55.2% on MMMU. The researchers also evaluated it with a new suite of 23 component-wise metrics anchored by a composite measure, CaVeScore.

Quick Facts
Who: Computer Science Research Community
What: Submitted CaVe-VLM-CoT framework to arXiv
When: Submitted on June 16, 2026
Where: arXiv
- Developed five-stage pipeline architecture: Extractor, Retriever, Solver, Citation Injector, and Verifier
- Created suite of 23 component-wise metrics with CaVeScore composite metric
- Achieved 87.1% accuracy and 56.6% CaVeScore on ScienceQA
- Achieved 55.2% accuracy and 35.7% CaVeScore on MMMU
A new interpretable vision-language model framework called CaVe-VLM-CoT has been proposed to address a persistent challenge in artificial intelligence: hallucinations in vision-language models (VLMs). These systems produce fluent outputs, but they often generate descriptions that are visually unfaithful or disconnected from the actual image content. The framework was submitted to arXiv on June 16, 2026, and aims to ground model reasoning in verifiable evidence.
CaVe-VLM-CoT operates through a modular, reflection-based agentic retrieval-augmented generation (RAG) pipeline comprising five stages: Extractor, Retriever, Solver, Citation Injector, and Verifier. The framework enforces evidence-grounded reasoning with a closed loop: when the Verifier detects an ungrounded claim, it sends structured feedback to the Extractor, which triggers targeted re-retrieval and correction. According to the researchers, this differs from existing chain-of-thought and retrieval-augmented methods in two ways: it enforces citation grounding at the level of individual reasoning steps, and it routes verification failures back through the retrieval process rather than leaving them uncorrected.
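The closed loop described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the stage signatures, the `Stages` and `Step` types, the `max_rounds` budget, and the toy stand-in functions are all assumptions made for illustration, and a real system would call a VLM and a retriever where the toy functions use a two-entry lookup table.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One reasoning step plus the indices of the evidence passages it cites."""
    text: str
    citations: List[int] = field(default_factory=list)


@dataclass
class Stages:
    """Hypothetical callables standing in for the five stages (signatures are assumptions)."""
    extract: Callable[[str, List[str]], List[str]]        # (question, feedback) -> queries
    retrieve: Callable[[List[str]], List[str]]            # queries -> evidence passages
    solve: Callable[[str, List[str]], List[str]]          # (question, evidence) -> step texts
    inject: Callable[[List[str], List[str]], List[Step]]  # attach citations to each step
    verify: Callable[[List[Step], List[str]], List[str]]  # -> feedback on ungrounded steps


def run_closed_loop(question: str, stages: Stages, max_rounds: int = 3) -> Tuple[List[Step], int]:
    """Run the stages in order, looping back to the extractor while the verifier objects."""
    feedback: List[str] = []
    steps: List[Step] = []
    for round_no in range(1, max_rounds + 1):
        queries = stages.extract(question, feedback)
        evidence = stages.retrieve(queries)
        draft = stages.solve(question, evidence)
        steps = stages.inject(draft, evidence)
        feedback = stages.verify(steps, evidence)
        if not feedback:  # every step cites evidence, so stop
            return steps, round_no
    return steps, max_rounds  # round budget exhausted; some steps may still be ungrounded


# Toy stand-ins so the sketch runs end to end. A real system would call a VLM and a retriever.
KB = {"leaf": "Leaves contain chlorophyll.", "green": "Chlorophyll reflects green light."}


def toy_extract(question: str, feedback: List[str]) -> List[str]:
    first_pass = [key for key in KB if key in question.lower()][:1]  # deliberately incomplete
    return first_pass + [key for key in feedback if key not in first_pass]


def toy_retrieve(queries: List[str]) -> List[str]:
    return [KB[q] for q in queries]


def toy_solve(question: str, evidence: List[str]) -> List[str]:
    return list(KB.values())  # always asserts both claims, grounded or not


def toy_inject(draft: List[str], evidence: List[str]) -> List[Step]:
    return [Step(t, [i for i, e in enumerate(evidence) if e == t]) for t in draft]


def toy_verify(steps: List[Step], evidence: List[str]) -> List[str]:
    ungrounded = {s.text for s in steps if not s.citations}
    return [key for key, passage in KB.items() if passage in ungrounded]


toy = Stages(toy_extract, toy_retrieve, toy_solve, toy_inject, toy_verify)
steps, rounds = run_closed_loop("Why is the leaf green?", toy)
print(rounds, [s.citations for s in steps])
```

In this toy run the first pass retrieves only one passage, so the second claim has no citation. The verifier's feedback names the missing term, the extractor adds it as a query, and the second pass ends with both steps cited. The round limit is a design choice of the sketch: without some budget, a claim that can never be grounded would loop forever.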
To evaluate the framework, the researchers developed a suite of 23 component-wise metrics spanning all pipeline stages. These metrics are anchored by CaVeScore, a composite metric that weights accuracy, citation precision and recall, attribution, and evidence grounding to give a single overall assessment of the system. Evaluating each stage separately reflects the difficulty of measuring retrieval quality and cross-modal grounding at the same time.
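To make the idea of a weighted composite concrete, here is a short Python sketch. The article does not give the CaVeScore formula or its weights, so everything below is an assumption: the equal weights, the component names, and the function `composite_score` are hypothetical. Only the citation precision and recall definitions follow the standard set-overlap form.

```python
from typing import Dict, Set, Tuple


def citation_precision_recall(cited: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Standard set-overlap precision and recall over evidence passage ids."""
    hits = len(cited & relevant)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# ASSUMPTION: equal weights. The actual CaVeScore weighting is not stated in the article.
HYPOTHETICAL_WEIGHTS: Dict[str, float] = {
    "accuracy": 1.0,
    "citation_precision": 1.0,
    "citation_recall": 1.0,
    "attribution": 1.0,
    "evidence_grounding": 1.0,
}


def composite_score(components: Dict[str, float],
                    weights: Dict[str, float] = HYPOTHETICAL_WEIGHTS) -> float:
    """Weighted average of component scores, each expected in the range [0, 1]."""
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total


# Example: a step cites passages {0, 1, 2} but only {1, 2, 3, 4} actually support it.
precision, recall = citation_precision_recall({0, 1, 2}, {1, 2, 3, 4})
example = composite_score({
    "accuracy": 1.0,
    "citation_precision": precision,
    "citation_recall": recall,
    "attribution": 0.5,          # made-up value for illustration
    "evidence_grounding": 0.5,   # made-up value for illustration
})
print(round(precision, 3), recall, round(example, 3))
```

Under any weighted average of this kind, the composite falls below raw accuracy whenever the citation or grounding components are weaker than the answer accuracy. That is one possible reading of why a reported CaVeScore can sit well below the accuracy figure on the same benchmark, though the actual weighting would be needed to confirm it.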
The researchers report performance gains on standard benchmarks without any architectural or prompt modifications. On the ScienceQA dataset, the framework achieved 87.1 percent accuracy with a CaVeScore of 56.6 percent. On the more challenging MMMU benchmark, which covers 30 subjects, it achieved 55.2 percent accuracy with a CaVeScore of 35.7 percent. These results suggest the framework could offer a practical way to reduce hallucinations while preserving the fluency and capabilities of vision-language models.
Why This Matters
Vision-language models power critical applications from medical imaging to autonomous systems, but hallucinations, where a model generates plausible-sounding but false descriptions, undermine trust and safety. CaVe-VLM-CoT's evidence-grounded reasoning pipeline offers a modular approach that does not require architectural changes to existing models. For enterprises deploying VLMs in high-stakes domains, the CaVeScore metric gives a measurable indicator of how well outputs are grounded, which could support better risk assessment and more reliable AI outputs.
Timeline & Sources
Jun 16, 2026
Wire: CaVe-VLM-CoT framework submitted to arXiv
Jun 18, 2026
Wire: CaVe-VLM-CoT framework published and announced