Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification
- URL: http://arxiv.org/abs/2406.15816v1
- Date: Sat, 22 Jun 2024 10:49:34 GMT
- Title: Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification
- Authors: Honori Udo, Takafumi Koshinaka
- Abstract summary: We revisit language bottleneck models as an approach to ensuring the explainability of deep learning models for image classification.
We experimentally show that a language bottleneck model that combines a modern image captioner with a pre-trained language model can achieve image classification accuracy that exceeds that of black-box models.
- Score: 4.1205832766381985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit language bottleneck models as an approach to ensuring the explainability of deep learning models for image classification. Because of the inevitable information loss incurred when converting images into language, the accuracy of language bottleneck models is generally considered inferior to that of standard black-box models. Recent image captioners built on large-scale vision-and-language foundation models, however, can describe images in verbal detail to a degree previously believed not to be realistically possible. On a disaster image classification task, we experimentally show that a language bottleneck model combining a modern image captioner with a pre-trained language model can achieve image classification accuracy exceeding that of black-box models. We also demonstrate that a language bottleneck model and a black-box model may be thought of as extracting different features from images, and that fusing the two can create a synergistic effect, resulting in even higher classification accuracy.
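Although the paper's exact models and training recipe are not given in this summary, the two-stage bottleneck is easy to sketch with off-the-shelf components. The snippet below is a minimal illustration assuming Hugging Face `transformers`, with a BLIP captioner and a zero-shot text classifier standing in for the paper's captioner and fine-tuned language model; the model names and the disaster labels are illustrative stand-ins, not the authors' actual setup.

```python
# Minimal sketch of a language bottleneck classifier (illustrative only).
# Assumes: pip install transformers torch pillow
from PIL import Image
from transformers import pipeline

# Stage 1: image -> language. BLIP is an example stand-in for the
# "modern image captioner" mentioned in the abstract.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Stage 2: language -> label. Zero-shot NLI classification stands in for
# the pre-trained language model fine-tuned on caption text.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_image(path: str, labels: list[str]) -> dict:
    """Classify an image via its caption; the caption doubles as the
    human-readable explanation of the decision."""
    caption = captioner(Image.open(path))[0]["generated_text"]
    result = classifier(caption, candidate_labels=labels)
    return {"caption": caption, "label": result["labels"][0]}

# Hypothetical disaster labels mirroring the paper's task domain.
print(classify_image("example.jpg", ["flood", "wildfire", "earthquake", "no disaster"]))
```

The fusion result described in the abstract could be exercised with equally simple late fusion, e.g. averaging the class probabilities of such a bottleneck model with those of a black-box image classifier; the authors' actual fusion scheme is not specified in this summary.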
Related papers
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
- Visual Conceptual Blending with Large-scale Language and Vision Models [54.251383721475655]
Given two objects, we generate a single-sentence description of their blend using a language model.
We generate a visual depiction of the blend using a text-based image generation model.
arXiv Detail & Related papers (2021-06-27T02:48:39Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.