MultiSubs: A Large-scale Multimodal and Multilingual Dataset
- URL: http://arxiv.org/abs/2103.01910v1
- Date: Tue, 2 Mar 2021 18:09:07 GMT
- Title: MultiSubs: A Large-scale Multimodal and Multilingual Dataset
- Authors: Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala,
Lucia Specia
- Abstract summary: This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language.
The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles.
We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank; (ii) lexical translation.
- Score: 32.48454703822847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a large-scale multimodal and multilingual dataset that
aims to facilitate research on grounding words to images in their contextual
usage in language. The dataset consists of images selected to unambiguously
illustrate concepts expressed in sentences from movie subtitles. The dataset is
a valuable resource as (i) the images are aligned to text fragments rather than
whole sentences; (ii) multiple images are possible for a text fragment and a
sentence; (iii) the sentences are free-form and real-world like; (iv) the
parallel texts are multilingual. We set up a fill-in-the-blank game for humans
to evaluate the quality of the automatic image selection process of our
dataset. We show the utility of the dataset on two automatic tasks: (i)
fill-in-the-blank; (ii) lexical translation. Results of the human evaluation
and automatic models demonstrate that images can be a useful complement to the
textual context. The dataset will benefit research on visual grounding of words,
especially in the context of free-form sentences.
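To make the dataset's structure concrete, below is a minimal sketch of what one MultiSubs-style example and its fill-in-the-blank view might look like. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch of one MultiSubs-style example. Field names and values
# are illustrative assumptions, not the dataset's actual schema.

example = {
    "sentence": "She poured the ___ into two glasses.",  # subtitle with fragment blanked
    "fragment": "wine",                                  # the grounded text fragment
    "images": ["img_0421.jpg", "img_0877.jpg"],          # multiple images per fragment
    "translations": {                                    # parallel multilingual subtitles
        "es": "Ella sirvió el vino en dos copas.",
        "fr": "Elle a versé le vin dans deux verres.",
    },
}

def as_fill_in_the_blank(ex):
    """Format an example for the fill-in-the-blank task: predict the masked
    fragment from the sentence and, optionally, its aligned images."""
    return {"input": ex["sentence"], "images": ex["images"], "target": ex["fragment"]}

print(as_fill_in_the_blank(example))
```

The lexical translation task would instead map the fragment to its counterpart in a target-language subtitle.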
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model designed for tasks that involve multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
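As a rough illustration of what dynamically allocating visual sequence length could mean, the sketch below splits a fixed token budget across images in proportion to their resolution. This is an assumed scheme, not Leopard's actual module.

```python
# Hypothetical sketch of adaptive visual sequence-length allocation: split a
# fixed token budget across images in proportion to their pixel area, while
# guaranteeing each image a minimum share. Not Leopard's actual algorithm.

def allocate_visual_tokens(image_sizes, budget=2048, floor=64):
    """image_sizes: list of (width, height); returns visual tokens per image."""
    areas = [w * h for w, h in image_sizes]
    total = sum(areas)
    # Proportionally split whatever remains after the per-image floor.
    spare = budget - floor * len(image_sizes)
    alloc = [floor + spare * a // total for a in areas]
    # Hand tokens left over from integer division to the largest image.
    alloc[max(range(len(alloc)), key=lambda i: areas[i])] += budget - sum(alloc)
    return alloc

# Higher-resolution images receive a larger slice of the visual sequence.
print(allocate_visual_tokens([(1280, 720), (640, 480), (320, 240)]))
```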
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by models trained on parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
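The retrieval step lends itself to a short sketch: look up topic words from a sentence in a table built from existing sentence-image pairs. The table contents, tokenization, and image cap below are illustrative assumptions.

```python
# Minimal sketch of the retrieval step described above: map topic words in a
# sentence to images via a lookup table. Table contents and the cap of three
# images are illustrative assumptions.

topic_to_images = {
    "dog":   ["dog_001.jpg", "dog_014.jpg"],
    "beach": ["beach_003.jpg"],
}

def retrieve_images(sentence, table=topic_to_images, max_images=3):
    """Collect images for every topic word found in the sentence."""
    images = []
    for token in sentence.lower().split():
        images.extend(table.get(token, []))
    return images[:max_images]  # a flexible (but bounded) number of images

print(retrieve_images("A dog runs on the beach"))
# The retrieved images would then be encoded by a CNN and fused with the
# Transformer-encoded text, per the description above.
```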
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
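A toy sketch of such a spatio-textual input follows: each segment of the user-provided map carries an embedding of its local description, used alongside the global prompt. The embed() stub and array layout are placeholders, not SpaText's actual representation.

```python
# Rough sketch of a spatio-textual input: a segmentation map whose segments
# each carry a local text description. embed() is a stand-in for a real text
# encoder; the layout is an assumption, not SpaText's actual representation.

import numpy as np

def embed(text, dim=8):
    """Placeholder for a real text encoder (e.g. CLIP); deterministic toy."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def spatio_textual_map(seg_map, segment_prompts, dim=8):
    """seg_map: (H, W) int array of segment ids; returns an (H, W, dim) array
    where every pixel holds the embedding of its segment's local prompt."""
    out = np.zeros(seg_map.shape + (dim,))
    for seg_id, prompt in segment_prompts.items():
        out[seg_map == seg_id] = embed(prompt, dim)
    return out

seg = np.zeros((4, 4), dtype=int)
seg[1:3, 1:3] = 1
stmap = spatio_textual_map(seg, {0: "a sandy beach", 1: "a red umbrella"})
print(stmap.shape)  # conditioning signal used alongside the global prompt
```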
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Revising Image-Text Retrieval via Multi-Modal Entailment [25.988058843564335]
The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
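As a schematic of the decision being made, the toy function below asks whether a sentence is "entailed" by an image represented only through its linked captions; the word-overlap proxy and threshold stand in for a trained multi-modal classifier and are purely assumptions.

```python
# Toy stand-in for the entailment decision described above. A real classifier
# would fuse visual features too; this proxy uses only caption word overlap.

def entailed(sentence, linked_captions, threshold=0.5):
    """Hypothetical decision rule: a sentence is 'entailed' if enough of its
    words appear in the image's linked captions."""
    words = set(sentence.lower().split())
    caption_words = set(" ".join(linked_captions).lower().split())
    overlap = len(words & caption_words) / max(len(words), 1)
    return overlap >= threshold

caps = ["a man rides a brown horse", "a rider on a horse in a field"]
print(entailed("a man rides a horse", caps))   # True
print(entailed("a woman walks a dog", caps))   # False
```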
arXiv Detail & Related papers (2022-08-22T07:58:54Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
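Image-set retrieval can be sketched as scoring a whole set of images against a single article embedding; the mean-similarity aggregation below is an assumed baseline shape, not necessarily the paper's method.

```python
# Illustrative sketch of zero-shot image-set retrieval: rank candidate image
# sets for an article by aggregating per-image similarities to the article
# embedding. Mean aggregation is an assumption, not the paper's baseline.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_image_set(article_emb, image_embs):
    """Aggregate article-to-image similarity over the whole set."""
    return np.mean([cosine(article_emb, img) for img in image_embs])

rng = np.random.default_rng(0)
article = rng.standard_normal(16)
sets = [rng.standard_normal((3, 16)) for _ in range(5)]   # 5 candidate sets
best = max(range(5), key=lambda i: score_image_set(article, sets[i]))
print("best image set:", best)
```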
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Multimodal Neural Machine Translation with Search Engine Based Image Retrieval [4.662583832063716]
We propose an open-vocabulary image retrieval method to collect descriptive images for a bilingual parallel corpus.
Our proposed method achieves significant improvements over strong baselines.
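The collection step can be sketched as querying an image search engine with content words from each source sentence. search_images() below is a hypothetical placeholder, not a real engine API.

```python
# Sketch of the collection step suggested above: query an image search engine
# with words from the source sentence and keep the top hits as visual context.
# search_images() is a hypothetical placeholder for a real engine API.

def search_images(query, top_k):
    """Placeholder: a real implementation would call a search engine here."""
    return [f"{query}_{i}.jpg" for i in range(top_k)]

def collect_images_for_sentence(sentence, per_word=2, stopwords=("the", "a", "on")):
    """Open-vocabulary retrieval: any content word can serve as a query."""
    images = []
    for word in sentence.lower().split():
        if word not in stopwords:
            images.extend(search_images(word, per_word))
    return images

print(collect_images_for_sentence("the cat sat on a mat"))
# These images would then condition a multimodal NMT model during translation.
```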
arXiv Detail & Related papers (2022-07-26T08:42:06Z)
- Backretrieval: An Image-Pivoted Evaluation Metric for Cross-Lingual Text Representations Without Parallel Corpora [19.02834713111249]
Backretrieval is shown to correlate with ground truth metrics on annotated datasets.
Our experiments conclude with a case study on a recipe dataset without parallel cross-lingual data.
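The metric's name suggests a round trip through images as pivots. The sketch below scores embeddings by retrieving from one language into images and back into the other language; every detail here is an assumption about the general idea, not the paper's exact procedure.

```python
# Heavily hedged sketch of the pivot idea the name "Backretrieval" suggests:
# judge cross-lingual embeddings by retrieving through images instead of
# parallel text. Details are assumptions, not the paper's exact metric.

import numpy as np

def nearest(query, pool):
    return int(np.argmax(pool @ query))  # embeddings assumed L2-normalized

def backretrieval_accuracy(text_a, images, text_b):
    """text_a[i], images[i], text_b[i] describe the same item. Score: go
    text_a -> nearest image -> nearest text_b; count round trips that land
    on the matching index."""
    hits = 0
    for i in range(len(text_a)):
        j = nearest(text_a[i], images)   # retrieve the pivot image
        k = nearest(images[j], text_b)   # retrieve back into language B
        hits += (k == i)
    return hits / len(text_a)

rng = np.random.default_rng(1)
base = rng.standard_normal((10, 16))
norm = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
a, im, b = (norm(base + 0.1 * rng.standard_normal((10, 16))) for _ in range(3))
print(backretrieval_accuracy(a, im, b))
```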
arXiv Detail & Related papers (2021-05-11T12:14:24Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
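A single graph-convolution step over a mixed graph of visual-object and scene-text nodes can be sketched as follows; the adjacency, features, and one-layer setup are illustrative rather than the paper's actual architecture.

```python
# Minimal sketch of one graph-convolution step over a mixed graph of visual
# object nodes and scene-text nodes, as described above. The adjacency,
# features, and single-layer setup are illustrative, not the paper's model.

import numpy as np

def gcn_layer(features, adj, weight):
    """One GCN propagation: add self-loops, mean-aggregate neighbors, project."""
    adj_hat = adj + np.eye(adj.shape[0])      # self-loops keep each node's own signal
    deg = adj_hat.sum(axis=1, keepdims=True)
    h = (adj_hat / deg) @ features @ weight   # aggregate, then linear projection
    return np.maximum(h, 0.0)                 # ReLU

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))   # 2 object nodes + 2 scene-text nodes
adj = np.array([[0, 1, 1, 0],         # edges link objects to nearby text
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)
w = rng.standard_normal((8, 8))
print(gcn_layer(feats, adj, w).shape)  # relationship-enhanced node features
```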
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understanding our surroundings.
To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)