Related papers: MASS: Overcoming Language Bias in Image-Text Matching

URL: http://arxiv.org/abs/2501.11469v1
Date: Mon, 20 Jan 2025 12:56:28 GMT
Title: MASS: Overcoming Language Bias in Image-Text Matching
Authors: Jiwan Chung, Seungwon Lim, Sangkyu Lee, Youngjae Yu
Abstract summary: Multimodal ASsociation Score (MASS) is a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality.
Score: 15.922356794782965
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.

Related papers

Chatting with Images for Introspective Visual Thinking [50.7747647794877]
"Chatting with images" is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z)
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data [3.364569898365253]
We propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. This minimal training setup enables robust multilingual alignment even for languages with limited supervision.
arXiv Detail & Related papers (2025-11-17T06:34:49Z)
Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models [11.385588803559733]
Over-reliance on language priors is a major cause of hallucinations in Large Vision-Language Models (LVLMs). Recent studies have explored contrastive decoding as a training-free solution. We propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method that uses unrelated images as contrastive visual inputs.
arXiv Detail & Related papers (2025-05-15T18:16:56Z)
Learning the Visualness of Text Using Large Vision-Language Models [42.75864384249245]
Visual text evokes an image in a person's mind, while non-visual text fails to do so. A method to automatically detect visualness in text will enable text-to-image retrieval and generation models to augment text with relevant images. We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators.
arXiv Detail & Related papers (2023-05-11T17:45:16Z)
Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework. We generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image. We identify the regions relevant to each word by computing the word-conditional visual attention using a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs. Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling. We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union. VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM). Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images. "Vokenization" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.