Hallucination Attenuated Language and Vision Assistant

We use LLaVA-v1.5, a widely used open-source MLLM, as our base model and train it with our contrastive tuning framework (HALVA). We then evaluate its performance on object hallucination mitigation and general visual question answering (VQA) tasks against fine-tuning–based approaches, HA-DPO and EOS. We treat LLaVA-v1.5 as the lower bound and GPT-4V as a strong reference point, given its performance on standard benchmarks.

We use the AMBER benchmark with the Caption Hallucination Assessment with Image Relevance (CHAIR) metric to evaluate MLLM performance on image description tasks, assessing both the hallucination rate and the level of detail of the generated image descriptions. The latter is quantified as the percentage of ground-truth objects present in the image that are accurately captured in the model’s output. Our goal is to mitigate hallucinations while retaining or improving the richness of image descriptions. As shown in the left plot below, HALVA captures more ground-truth objects while hallucinating less than HA-DPO. Moreover, while EOS achieves a slightly lower hallucination rate, it degrades the level of detail in the image descriptions, performing worse than HALVA.
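To make the two quantities concrete, here is a minimal sketch of how a CHAIR-style hallucination rate and an object-coverage (detail) score can be computed for a single image description. The function and variable names are our own illustration, not the official AMBER or CHAIR tooling, which additionally handles synonym matching and object extraction from free-form text.

```python
def chair_and_coverage(described_objects, ground_truth_objects):
    """Toy CHAIR-style hallucination rate and object coverage for one image.

    described_objects: objects mentioned in the generated description
    ground_truth_objects: objects actually present in the image
    """
    described = set(described_objects)
    truth = set(ground_truth_objects)

    hallucinated = described - truth   # mentioned but not in the image
    captured = described & truth       # correctly mentioned

    # Hallucination rate: fraction of mentioned objects that do not exist in the image.
    chair = len(hallucinated) / len(described) if described else 0.0
    # Coverage (level of detail): fraction of ground-truth objects the description captures.
    coverage = len(captured) / len(truth) if truth else 0.0
    return chair, coverage


# Example: the description hallucinates a "dog" and misses the "cup".
chair, coverage = chair_and_coverage(
    described_objects={"person", "table", "dog"},
    ground_truth_objects={"person", "table", "cup"},
)
print(f"CHAIR: {chair:.2f}, coverage: {coverage:.2f}")  # CHAIR: 0.33, coverage: 0.67
```

A lower CHAIR value with an unchanged or higher coverage value corresponds to the desired behavior: fewer hallucinated objects without sacrificing descriptive detail.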

We also use the F1-score to compare the performance of MLLMs on visual question answering tasks, using the AMBER benchmark for object hallucination and the TextVQA benchmark for general vision-language accuracy. As shown in the right plot below, both HA-DPO and EOS underperform HALVA in mitigating object hallucination, and they even deteriorate general vision-language abilities relative to the base model.
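For reference, the discriminative portion of such benchmarks poses yes/no questions (e.g., "Is there a dog in the image?"), and the F1-score balances precision and recall on the positive class. The following sketch shows the standard computation under that assumption; the answer parsing and question format are simplified relative to the actual benchmark harnesses.

```python
def f1_score(predictions, labels, positive="yes"):
    """F1 on binary yes/no answers, treating `positive` as the positive class.

    predictions, labels: parallel lists of "yes"/"no" strings.
    """
    tp = sum(p == positive and y == positive for p, y in zip(predictions, labels))
    fp = sum(p == positive and y != positive for p, y in zip(predictions, labels))
    fn = sum(p != positive and y == positive for p, y in zip(predictions, labels))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


# Example: three questions; the model answers "yes" for one absent object,
# i.e., it hallucinates, which lowers precision and hence F1.
print(f1_score(["yes", "yes", "no"], ["yes", "no", "no"]))  # ~0.67
```

Because a hallucinating model tends to over-answer "yes" for objects that are not in the image, its precision drops, so a higher F1 on the hallucination benchmark indicates better-grounded answers.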
