Goal-driven text descriptions for images
- URL: http://arxiv.org/abs/2108.12575v1
- Date: Sat, 28 Aug 2021 05:10:38 GMT
- Title: Goal-driven text descriptions for images
- Authors: Ruotian Luo
- Abstract summary: This thesis focuses on generating textual output given visual input.
We use a comprehension machine to guide the generated referring expressions to be more discriminative.
In Chapter 5, we study how training objectives and sampling methods affect the models' ability to generate diverse captions.
- Score: 7.059848512713061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A big part of achieving Artificial General Intelligence (AGI) is to build a
machine that can see and listen like humans. Much work has focused on designing
models for image classification, video classification, object detection, pose
estimation, speech recognition, etc., and has achieved significant progress in
recent years thanks to deep learning. However, understanding the world is not
enough. An AI agent also needs to know how to talk, especially how to
communicate with humans. While perception (vision, for example) is shared across
many animal species, the use of complex language is unique to humans and is one
of the most important aspects of intelligence.
In this thesis, we focus on generating textual output given visual input. In
Chapter 3, we focus on generating the referring expression, a text description
for an object in the image so that a receiver can infer which object is being
described. We use a comprehension machine to directly guide the generated
referring expressions to be more discriminative. In Chapter 4, we introduce a
method that encourages discriminability in image caption generation. We show
that more discriminative captioning models generate more descriptive captions.
In Chapter 5, we study how training objectives and sampling methods affect the
models' ability to generate diverse captions. We find that a popular captioning
training strategy is detrimental to the diversity of generated captions.
In Chapter 6, we propose a model that can control the length of generated
captions. By changing the desired length, one can influence the style and
descriptiveness of the captions. Finally, in Chapter 7, we rank and generate
informative image tags according to their information utility. The proposed
method better matches what humans consider the most important tags for the
images.
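
The abstract stays at a high level, but the core mechanism behind Chapters 3 and 4 (using a comprehension model to make generated text discriminative) can be illustrated with a small sketch. Everything below is an illustrative assumption rather than the thesis implementation: the `ToySpeaker`/`ToyListener` modules, the one-word "expression", and the 0.5 loss weight are placeholders. The sketch only shows how a maximum-likelihood captioning loss can be combined with a comprehension loss that asks a frozen listener to pick the described region out of a set of candidates.

```python
# Illustrative sketch only (not the thesis code): a speaker is trained with both a
# maximum-likelihood loss and a comprehension loss from a frozen listener, so the
# generated referring expression must let the listener identify the target region.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySpeaker(nn.Module):
    """Maps a region feature to a distribution over a toy one-word vocabulary."""
    def __init__(self, feat_dim=128, vocab_size=1000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feat):
        return self.proj(region_feat)                    # (batch, vocab) word logits

class ToyListener(nn.Module):
    """Scores how well an expression matches each candidate region."""
    def __init__(self, feat_dim=128, vocab_size=1000):
        super().__init__()
        self.embed = nn.Linear(vocab_size, feat_dim)

    def forward(self, word_probs, region_feats):
        expr = self.embed(word_probs)                    # (batch, feat_dim)
        return torch.einsum('bd,bkd->bk', expr, region_feats)  # region scores

speaker, listener = ToySpeaker(), ToyListener()
for p in listener.parameters():                          # comprehension model is frozen
    p.requires_grad_(False)
opt = torch.optim.Adam(speaker.parameters(), lr=1e-3)

batch, n_regions, feat_dim = 4, 5, 128
region_feats = torch.randn(batch, n_regions, feat_dim)   # candidate regions per image
target_region = torch.zeros(batch, dtype=torch.long)     # region 0 is the one described
gt_word = torch.randint(0, 1000, (batch,))               # toy ground-truth "expression"

logits = speaker(region_feats[:, 0])                     # describe the target region
gen_loss = F.cross_entropy(logits, gt_word)              # standard max-likelihood term

# Soft relaxation of the generated expression, fed to the frozen listener, which
# should single out the target region among all candidates.
word_probs = F.softmax(logits, dim=-1)
comp_logits = listener(word_probs, region_feats)
comp_loss = F.cross_entropy(comp_logits, target_region)

loss = gen_loss + 0.5 * comp_loss                        # 0.5 is an arbitrary weight
loss.backward()
opt.step()
```

A real system would generate full word sequences and typically needs a REINFORCE- or Gumbel-softmax-style estimator to pass gradients through discrete sampling; the soft relaxation here is only meant to show how the generation and comprehension objectives combine.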
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Image Captioners Are Scalable Vision Learners Too [61.98796478791261]
Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones.
Our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
arXiv Detail & Related papers (2023-06-13T17:18:01Z)
- Implicit and Explicit Commonsense for Multi-sentence Video Captioning [33.969215964292395]
We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
arXiv Detail & Related papers (2023-03-14T00:19:11Z)
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [32.49636188029509]
We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
arXiv Detail & Related papers (2022-11-17T18:52:19Z)
- Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia [10.21762162291523]
We propose the novel task of captioning Wikipedia images by integrating contextual knowledge.
Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions.
arXiv Detail & Related papers (2022-09-21T16:14:15Z)
- Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [78.07495777674747]
We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can generate image paragraph captions without any extra cross-modal training.
Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image.
We use a large language model to produce a series of comprehensive descriptions for the visual content, which are then verified by the vision model again to select the candidate that aligns best with the image.
arXiv Detail & Related papers (2022-06-03T22:33:09Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- On Guiding Visual Attention with Language Specification [76.08326100891571]
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z)
- Neural Twins Talk & Alternative Calculations [3.198144010381572]
Image captioning bridges a gap between computer vision and natural language processing.
Inspired by how the human brain employs a higher number of neural pathways when describing a highly focused subject, we show that deep attentive models can be extended to achieve better performance.
arXiv Detail & Related papers (2021-08-05T18:41:34Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.