What's in a Caption? Dataset-Specific Linguistic Diversity and Its
Effect on Visual Description Models and Metrics
- URL: http://arxiv.org/abs/2205.06253v1
- Date: Thu, 12 May 2022 17:55:08 GMT
- Title: What's in a Caption? Dataset-Specific Linguistic Diversity and Its
Effect on Visual Description Models and Metrics
- Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A.
Ross, Bryan Seybold, John F. Canny
- Abstract summary: We find that caption diversity is a major driving factor behind the generation of generic and uninformative captions.
We show that state-of-the-art models even outperform held-out ground truth captions on modern metrics.
- Score: 14.624063829492764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there have been significant gains in the field of automated video
description, the generalization performance of automated description models to
novel domains remains a major barrier to using these systems in the real world.
Most visual description methods are known to capture and exploit patterns in
the training data that lead to increases in evaluation metrics, but what are those
patterns? In this work, we examine several popular visual description datasets,
and capture, analyze, and understand the dataset-specific linguistic patterns
that models exploit but do not generalize to new domains. At the token level,
sample level, and dataset level, we find that caption diversity is a major
driving factor behind the generation of generic and uninformative captions. We
further show that state-of-the-art models even outperform held-out ground truth
captions on modern metrics, and that this effect is an artifact of linguistic
diversity in datasets. Understanding this linguistic diversity is key to
building strong captioning models; we recommend several methods and approaches
for maintaining diversity in the collection of new data, and for dealing with the
consequences of limited diversity when using current models and metrics.
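
Since the paper centers on measuring linguistic diversity at the token, sample, and dataset levels, a small illustrative sketch may help make those levels concrete. The statistics below (corpus type-token ratio, distinct-reference ratio per sample, and unique n-gram fraction) are generic diversity measures chosen for illustration, not the paper's exact formulations; the function names and toy data are hypothetical.

```python
# Minimal sketch of token-, sample-, and dataset-level caption diversity
# statistics. These are generic measures for illustration only, not the
# measures defined in the paper.
from collections import Counter
from typing import List


def token_diversity(captions: List[str]) -> float:
    """Type-token ratio over the whole corpus: unique tokens / total tokens."""
    tokens = [tok for cap in captions for tok in cap.lower().split()]
    return len(set(tokens)) / max(len(tokens), 1)


def sample_diversity(captions_per_sample: List[List[str]]) -> float:
    """Mean fraction of distinct captions among the references for each sample."""
    ratios = [len(set(caps)) / max(len(caps), 1) for caps in captions_per_sample]
    return sum(ratios) / max(len(ratios), 1)


def dataset_diversity(captions: List[str], n: int = 4) -> float:
    """Fraction of n-grams that occur exactly once across the entire dataset."""
    ngrams = Counter()
    for cap in captions:
        toks = cap.lower().split()
        ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return sum(1 for count in ngrams.values() if count == 1) / max(len(ngrams), 1)


if __name__ == "__main__":
    # Hypothetical toy references: two samples, two captions each.
    refs = [["a man is cooking", "a person prepares food"],
            ["a man is cooking", "a man is cooking"]]
    flat = [cap for group in refs for cap in group]
    print(token_diversity(flat), sample_diversity(refs), dataset_diversity(flat, n=2))
```

Under the paper's argument, low values on statistics like these would be a warning sign that models trained on the data are likely to drift toward generic, uninformative captions.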
Related papers
- Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations [5.065947993017157]
This study introduces an approach to curate vision-language datasets by employing an image decoding machine learning model.
We amassed approximately 9.6 million vision-language paired datasets in VHR imagery.
The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets.
arXiv Detail & Related papers (2024-09-11T06:36:08Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z)
- Effective Data Augmentation With Diffusion Models [65.09758931804478]
We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models.
Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples.
We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
arXiv Detail & Related papers (2023-02-07T20:42:28Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Distribution Aware Metrics for Conditional Natural Language Generation [3.6350564275444173]
We argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse.
We propose a novel paradigm for multi-candidate evaluation of conditional language generation models.
arXiv Detail & Related papers (2022-09-15T17:58:13Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Efficient Multi-Modal Embeddings from Structured Data [0.0]
Multi-modal word semantics aims to enhance embeddings with perceptual input.
Visual grounding can contribute to linguistic applications as well.
The new embeddings convey information that is complementary to text-based embeddings.
arXiv Detail & Related papers (2021-10-06T08:42:09Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)