Cross-modal Language Generation using Pivot Stabilization for Web-scale
Language Coverage
- URL: http://arxiv.org/abs/2005.00246v1
- Date: Fri, 1 May 2020 06:58:18 GMT
- Title: Cross-modal Language Generation using Pivot Stabilization for Web-scale
Language Coverage
- Authors: Ashish V. Thapliyal and Radu Soricut
- Abstract summary: Cross-modal language generation tasks such as image captioning are directly hurt by the trend of data-hungry models combined with the lack of non-English annotations.
We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages both existing English annotations and their machine-translated versions directly at training time.
We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages.
- Score: 23.71195344840051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal language generation tasks such as image captioning are directly
hurt in their ability to support non-English languages by the trend of
data-hungry models combined with the lack of non-English annotations. We
investigate potential solutions for combining existing language-generation
annotations in English with translation capabilities in order to create
solutions at web-scale in both domain and language coverage. We describe an
approach called Pivot-Language Generation Stabilization (PLuGS), which
leverages, directly at training time, both existing English annotations (gold
data) and their machine-translated versions (silver data); at run-time, it
first generates an English caption and then a corresponding target-language
caption. We show that PLuGS models outperform other candidate solutions in
evaluations performed over 5 different target languages, on a large-domain
test set using images from the Open Images dataset. Furthermore, we find an
interesting effect where the English captions generated by the PLuGS models are
better than the captions generated by the original, monolingual English model.
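To make the gold/silver training setup and the two-stage run-time behavior concrete, here is a minimal Python sketch; the translate stub, the <sep> delimiter, and the exact concatenation format are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    image_id: str
    target_text: str  # what the captioning model is trained to emit


SEP = "<sep>"  # assumed delimiter between the English pivot and the target caption


def translate(text: str, target_lang: str) -> str:
    """Placeholder for an off-the-shelf MT system producing 'silver' captions."""
    return f"[{target_lang}] {text}"  # stub only


def build_plugs_examples(
    gold: List[Tuple[str, str]],  # (image_id, English caption) gold annotations
    target_lang: str,
) -> List[Example]:
    """Each training target is the English (pivot) caption followed by its
    machine-translated (silver) counterpart, so the model learns to generate
    both, with English acting as a stabilizing pivot."""
    examples = []
    for image_id, en_caption in gold:
        silver = translate(en_caption, target_lang)
        examples.append(Example(image_id, f"{en_caption} {SEP} {silver}"))
    return examples


def split_output(generated: str) -> Tuple[str, str]:
    """At run-time the model emits 'English caption <sep> target caption';
    split it into the pivot and the target-language caption."""
    en, _, tgt = generated.partition(SEP)
    return en.strip(), tgt.strip()


if __name__ == "__main__":
    data = build_plugs_examples([("img_001", "a dog runs on the beach")], "de")
    print(data[0].target_text)
    print(split_output(data[0].target_text))
```

Under this reading, the English pivot is always decoded first, which is also why the PLuGS model's English captions can be compared against, and per the abstract improve over, the monolingual English baseline.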
Related papers
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this multilingually diverse dataset outperforms pre-training on English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Pixel Aligned Language Models [94.32841818609914]
We develop a vision-language model that can take locations as either inputs or outputs.
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention.
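As a generic illustration of "locations as inputs or outputs" (a sketch of the general idea only; the paper's own model aligns words to pixel locations rather than using the quantized location tokens assumed below):

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
NUM_BINS = 1000  # assumed quantization of coordinates into discrete location tokens


def box_to_tokens(box: Box) -> str:
    """Location as input: quantize a box into discrete tokens the model can read."""
    return " ".join(f"<loc_{int(round(v * (NUM_BINS - 1)))}>" for v in box)


def caption_prompt(box: Box) -> str:
    """Prompt asking for a caption of the indicated region (location-conditioned)."""
    return f"describe region {box_to_tokens(box)} :"


def tokens_to_boxes(text: str) -> List[Box]:
    """Location as output: recover boxes from location tokens in generated text."""
    vals = [int(m) / (NUM_BINS - 1) for m in re.findall(r"<loc_(\d+)>", text)]
    return [tuple(vals[i:i + 4]) for i in range(0, len(vals) - 3, 4)]


if __name__ == "__main__":
    print(caption_prompt((0.1, 0.2, 0.5, 0.8)))
    print(tokens_to_boxes("a dog <loc_99> <loc_199> <loc_499> <loc_799>"))
```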
arXiv Detail & Related papers (2023-12-14T18:57:58Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
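A rough sketch of the population-comparison idea follows; the image generator, the embedding model, and the mean cosine-similarity aggregation are all assumptions, since the summary above does not specify the benchmark's exact scoring.

```python
import numpy as np

rng = np.random.default_rng(0)


def generate_images(prompt: str, n: int = 8) -> list:
    """Placeholder for a text-to-image system; returns n images for a prompt."""
    return [f"{prompt}#{i}" for i in range(n)]  # stubs standing in for images


def embed(image) -> np.ndarray:
    """Placeholder for an image embedding model (e.g., a CLIP-style encoder)."""
    return rng.normal(size=128)  # stub embedding


def coverage(noun_src: str, noun_tgt: str) -> float:
    """Assumed score: mean cosine similarity between the population of images
    generated from the source-language noun and from its translation."""
    src = np.stack([embed(x) for x in generate_images(noun_src)])
    tgt = np.stack([embed(x) for x in generate_images(noun_tgt)])
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    return float((src @ tgt.T).mean())


if __name__ == "__main__":
    # e.g., English "dog" vs. its Spanish translation "perro"
    print(coverage("dog", "perro"))
```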
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- "Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks [20.837515947519524]
First sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, provide strong signals for a seed of parallel data from which to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia.
For image captioning, we train a multi-task machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a "wikily" translation of the English captioning data.
Our Arabic captioning results are slightly better than those of the corresponding supervised model.
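A small sketch of the seed-extraction step described above; the page dictionaries, the langlinks pairing, and the naive sentence splitter are hypothetical stand-ins for real Wikipedia processing.

```python
from typing import Dict, List, Tuple


def first_sentence(text: str) -> str:
    """Naive first-sentence splitter; a real pipeline would use a proper segmenter."""
    return text.split(". ")[0].strip()


def seed_parallel_pairs(
    src_pages: Dict[str, dict],        # hypothetical: page_id -> {"title": ..., "body": ...}
    tgt_pages: Dict[str, dict],
    langlinks: List[Tuple[str, str]],  # hypothetical cross-language page links
) -> List[Tuple[str, str]]:
    """Titles and first sentences of linked pages form a seed of parallel data,
    from which bilingual dictionaries / cross-lingual embeddings can be learned
    and used to mine more parallel text from Wikipedia."""
    pairs = []
    for src_id, tgt_id in langlinks:
        s, t = src_pages[src_id], tgt_pages[tgt_id]
        pairs.append((s["title"], t["title"]))
        pairs.append((first_sentence(s["body"]), first_sentence(t["body"])))
    return pairs
```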
arXiv Detail & Related papers (2021-04-16T21:49:12Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
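A simplified sketch of a VTLM-style example builder, as described above (mask tokens across a caption and its translation and predict them, conditioned on the image); the masking rate, separator token, and pairing format are assumptions.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"


def mask_for_vtlm(
    en_tokens: List[str],
    tgt_tokens: List[str],
    mask_prob: float = 0.15,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Build one Visual Translation Language Modeling-style example: the model
    would see image region features (not shown here) plus the concatenated
    caption pair with some tokens masked, and be trained to recover the labels."""
    rng = random.Random(seed)
    tokens = en_tokens + ["[SEP]"] + tgt_tokens
    inputs, labels = [], []
    for tok in tokens:
        if tok != "[SEP]" and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append("-")  # ignored position
    return inputs, labels


if __name__ == "__main__":
    inp, lab = mask_for_vtlm("a dog on the beach".split(), "ein Hund am Strand".split())
    print(inp)
    print(lab)
```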
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
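A minimal PyTorch sketch of the decomposition idea; the projection-head sizes and the MINE-style mutual-information lower bound below are assumptions about how such an estimator could be wired in, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class FeatureDecomposer(nn.Module):
    """Split a pretrained cross-lingual representation into a domain-invariant
    part and a domain-specific part via two projection heads."""

    def __init__(self, hidden: int = 768, part: int = 256):
        super().__init__()
        self.invariant_head = nn.Linear(hidden, part)
        self.specific_head = nn.Linear(hidden, part)

    def forward(self, h: torch.Tensor):
        return self.invariant_head(h), self.specific_head(h)


def mine_lower_bound(stats_net: nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """MINE-style estimate of I(x; y): joint samples vs. shuffled (marginal) samples.
    Training could use such a term to keep the two parts disentangled."""
    joint = stats_net(torch.cat([x, y], dim=-1)).mean()
    y_shuffled = y[torch.randperm(y.size(0))]
    marginal = torch.logsumexp(stats_net(torch.cat([x, y_shuffled], dim=-1)), dim=0) \
        - torch.log(torch.tensor(float(y.size(0))))
    return joint - marginal


if __name__ == "__main__":
    h = torch.randn(32, 768)  # pretrained cross-lingual representations
    dec = FeatureDecomposer()
    inv, spec = dec(h)
    stats = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
    print(mine_lower_bound(stats, inv, spec).shape)
```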
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
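One simple reading of such fusion is late fusion over the output vocabulary; the interpolation weight and the stub distributions below are assumptions rather than the paper's specific strategies.

```python
import numpy as np


def fuse_next_token_logprobs(
    caption_logprobs: np.ndarray,  # from the visual captioning decoder (e.g., Show, Attend, and Tell)
    mlm_logprobs: np.ndarray,      # from a pretrained masked language model scoring the same position
    alpha: float = 0.7,            # assumed interpolation weight
) -> np.ndarray:
    """Late fusion: interpolate the two distributions over a shared vocabulary,
    letting the language model smooth or correct the caption model's predictions."""
    fused = alpha * caption_logprobs + (1.0 - alpha) * mlm_logprobs
    return fused - np.log(np.exp(fused).sum())  # renormalize to log-probabilities


if __name__ == "__main__":
    vocab = 5
    rng = np.random.default_rng(0)
    cap = np.log(rng.dirichlet(np.ones(vocab)))
    mlm = np.log(rng.dirichlet(np.ones(vocab)))
    print(fuse_next_token_logprobs(cap, mlm))
```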
arXiv Detail & Related papers (2020-10-28T21:55:25Z)
- UNISON: Unpaired Cross-lingual Image Captioning [17.60054750276632]
We present a novel unpaired cross-lingual method to generate image captions without relying on any caption corpus in the source or the target language.
Specifically, our method consists of two phases: (i) a cross-lingual auto-encoding process, which utilizes a sentence-parallel (bitext) corpus to learn the mapping from the source to the target language in the scene-graph encoding space and to decode sentences in the target language, and (ii) a cross-modal unsupervised feature mapping, which seeks to map the encoded scene-graph features from the image modality to the language modality.
arXiv Detail & Related papers (2020-10-03T06:14:06Z)
- Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models [25.86785379429413]
We show that selecting content words as skeletons helps in generating improved and denoised captions.
We also show that the predicted English skeletons can be further cross-lingually leveraged to generate non-English captions.
We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression.
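A rough sketch of the skeleton idea; the stopword heuristic below is a deliberately simple stand-in for the paper's learned content selection model.

```python
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "at", "with", "and", "or", "for",
    "to", "is", "are", "was", "were", "this", "that", "my", "our", "photo",
}


def select_skeleton(alt_text: str, max_words: int = 6) -> str:
    """Keep likely content words from noisy alt-text; a generator would then be
    conditioned on this skeleton to produce a denoised caption (and the English
    skeleton could be reused cross-lingually for non-English captions)."""
    words = [w.strip(".,!?").lower() for w in alt_text.split()]
    content = [w for w in words if w and w not in STOPWORDS]
    return " ".join(content[:max_words])


if __name__ == "__main__":
    print(select_skeleton("A photo of my dog playing with a red ball in the park!"))
    # -> "dog playing red ball park"
```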
arXiv Detail & Related papers (2020-09-10T23:31:38Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit to the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
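One way to make a labeler insensitive to source word order is to permute training sentences together with their labels; the sketch below illustrates that general idea and is not necessarily the technique used in the paper.

```python
import random
from typing import List, Tuple


def shuffle_tokens_and_labels(
    tokens: List[str],
    labels: List[str],
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Permute a source-language training sentence (keeping token-label alignment)
    so the sequence labeler cannot rely on source word order."""
    assert len(tokens) == len(labels)
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)
    return [tokens[i] for i in order], [labels[i] for i in order]


if __name__ == "__main__":
    toks = ["Alice", "visited", "Berlin", "yesterday"]
    tags = ["B-PER", "O", "B-LOC", "O"]
    print(shuffle_tokens_and_labels(toks, tags))
```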
arXiv Detail & Related papers (2020-01-30T03:35:44Z)