Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
- URL: http://arxiv.org/abs/2402.11159v3
- Date: Fri, 7 Jun 2024 00:56:09 GMT
- Title: Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
- Authors: Yejun Yoon, Seunghyun Yoon, Kunwoo Park
- Abstract summary: We focus on whether a news image represents the actors discussed in the news text.
We introduce NewsTT, a dataset of 1000 news thumbnail images and text pairs.
We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to the hypothesis.
- Score: 5.111382868644429
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper addresses the critical challenge of assessing the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the actors discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of 1,000 news thumbnail image-text pairs. We found that pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, pretrained models may have limited capability to match news actors' visual and textual appearances. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to this hypothesis. We found that this simple method can boost performance on assessing news thumbnail representativeness, supporting our hypothesis. Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24.
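The abstract sketches a simple recipe: build a counterfactual of the news text by swapping out its named entities, then fine-tune the bi-encoders so the thumbnail embeds closer to the original text than to the counterfactual. Below is a minimal sketch of that idea; the spaCy-based entity replacement, the placeholder tokens, and the exact loss form are illustrative assumptions rather than the authors' pipeline (the linked repository contains the actual implementation).

```python
# Sketch of counterfactual-contrastive fine-tuning in the spirit of CFT-CLIP.
# The entity-replacement strategy and loss form are illustrative assumptions.
import spacy
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

PLACEHOLDERS = {"PERSON": "someone", "ORG": "some organization"}

def make_counterfactual(text: str) -> str:
    """Replace PERSON/ORG entities so the text no longer names the actors."""
    doc = nlp(text)
    out = text
    for ent in reversed(doc.ents):  # right-to-left keeps character offsets valid
        if ent.label_ in PLACEHOLDERS:
            out = out[:ent.start_char] + PLACEHOLDERS[ent.label_] + out[ent.end_char:]
    return out

def counterfactual_contrastive_loss(image, text: str, temperature: float = 0.07):
    """Pull the thumbnail toward the original text, push it from the counterfactual."""
    inputs = processor(text=[text, make_counterfactual(text)], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img_emb = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt_emb = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                  attention_mask=inputs["attention_mask"]), dim=-1)
    logits = img_emb @ txt_emb.T / temperature  # shape [1, 2]: original vs. counterfactual
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))  # index 0 = original
```

At inference time, the fine-tuned encoders can score representativeness directly by thresholding the image-text cosine similarity.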
Related papers
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image; a minimal sketch of this selection step appears after this list.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Focus! Relevant and Sufficient Context Selection for News Image Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
arXiv Detail & Related papers (2022-12-01T20:00:27Z)
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [32.49636188029509]
We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
arXiv Detail & Related papers (2022-11-17T18:52:19Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss; a minimal sketch of this objective appears after this list.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Visual News: Benchmark and Challenges in News Image Captioning [18.865262609683676]
We propose Visual News Captioner, an entity-aware model for the task of news image captioning.
We also introduce Visual News, a large-scale benchmark consisting of more than one million news images.
arXiv Detail & Related papers (2020-10-08T03:07:00Z)
- Transform and Tell: Entity-Aware News Image Captioning [77.4898875082832]
We propose an end-to-end model which generates captions for images embedded in news articles.
News images pose two challenges: they rely on real-world knowledge about named entities, and they feature linguistically rich captions with uncommon words.
We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism.
We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts.
arXiv Detail & Related papers (2020-04-17T05:44:37Z)
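Two of the news-captioning entries above (Visually-Aware Context Modeling and Focus!) use CLIP to pick out the article text that is actually grounded in the image, as referenced in those entries. A minimal sketch of that sentence-selection step follows; the checkpoint and the top-k cutoff are assumptions for illustration, not the papers' settings.

```python
# Sketch of CLIP-based context selection: rank article sentences by image
# similarity and keep the best ones. Checkpoint and top_k are assumptions.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_context(image, sentences: list[str], top_k: int = 3) -> list[str]:
    """Return the article sentences most semantically similar to the image."""
    inputs = processor(text=sentences, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
        txt_emb = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                      attention_mask=inputs["attention_mask"]), dim=-1)
    scores = (txt_emb @ img_emb.T).squeeze(-1)  # one cosine score per sentence
    top = scores.topk(min(top_k, len(sentences))).indices
    return [sentences[i] for i in top.tolist()]
```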
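The Scaling Up entry above describes the symmetric dual-encoder contrastive objective that also underlies CLIP-style training. A minimal sketch of that batch-level loss follows; the temperature is an assumed hyperparameter, and the encoders producing img_emb and txt_emb are left abstract.

```python
# Sketch of the symmetric dual-encoder (InfoNCE) contrastive loss: matched
# image-text pairs are positives, all other pairs in the batch are negatives.
import torch
import torch.nn.functional as F

def dual_encoder_contrastive_loss(img_emb: torch.Tensor,
                                  txt_emb: torch.Tensor,
                                  temperature: float = 0.05) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature                    # [B, B] similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    # Symmetric loss: image-to-text and text-to-image retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```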
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.