Practical Comparable Data Collection for Low-Resource Languages via
Images
- URL: http://arxiv.org/abs/2004.11954v2
- Date: Tue, 28 Apr 2020 19:51:43 GMT
- Title: Practical Comparable Data Collection for Low-Resource Languages via
Images
- Authors: Aman Madaan, Shruti Rijhwani, Antonios Anastasopoulos, Yiming Yang,
Graham Neubig
- Abstract summary: We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators.
Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all.
- Score: 126.64069379167975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method of curating high-quality comparable training data for
low-resource languages with monolingual annotators. Our method involves using a
carefully selected set of images as a pivot between the source and target
languages by getting captions for such images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our
method show that 81.1% of the pairs are acceptable translations, and only 2.47%
of the pairs are not translations at all. We further establish the potential of
the dataset collected through our approach by experimenting on two downstream
tasks - machine translation and dictionary extraction. All code and data are
available at https://github.com/madaan/PML4DC-Comparable-Data-Collection.
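The core idea, pairing captions written independently in two languages via a shared pivot image, can be sketched as follows. This is an illustrative reconstruction, not the released code; the function and field names are hypothetical.

```python
# Hypothetical sketch of the paper's pivot-image idea: monolingual
# annotators caption the same images in each language without seeing
# the other language, and captions are then paired by image ID to form
# a comparable corpus.

def build_comparable_corpus(src_captions, tgt_captions):
    """Pair captions that describe the same pivot image.

    src_captions / tgt_captions: dicts mapping image_id -> caption,
    each collected independently by monolingual annotators.
    Returns a list of (source, target) comparable sentence pairs.
    """
    shared_ids = src_captions.keys() & tgt_captions.keys()
    return [(src_captions[i], tgt_captions[i]) for i in sorted(shared_ids)]


pairs = build_comparable_corpus(
    {"img1": "A dog runs on the beach.", "img2": "Two children play chess."},
    {"img1": "एक कुत्ता समुद्र तट पर दौड़ता है।"},
)
# Only img1 was captioned in both languages, so a single pair is produced.
```

Because the two caption sets are collected independently, the resulting pairs are comparable rather than strictly parallel, which is what the human evaluation above quantifies.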
Related papers
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We present extensive and comprehensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired setting.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Synergy with Translation Artifacts for Training and Inference in Multilingual Tasks [11.871523410051527]
This paper shows that using both translations simultaneously can improve results on various multilingual sentence classification tasks.
We propose a cross-lingual fine-tuning algorithm called MUSC, which uses SupCon and MixUp jointly and improves the performance.
arXiv Detail & Related papers (2022-10-18T04:55:24Z)
- Low-resource Neural Machine Translation with Cross-modal Alignment [15.416659725808822]
We propose a cross-modal contrastive learning method to learn a shared space for all languages.
Experimental results and further analysis show that our method can effectively learn the cross-modal and cross-lingual alignment with a small amount of image-text pairs.
arXiv Detail & Related papers (2022-10-13T04:15:43Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance [8.395430195053061]
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.
We develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages.
These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs.
arXiv Detail & Related papers (2020-01-31T05:14:16Z)
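The document-alignment pipeline described in the last entry can be sketched in simplified form. This stand-in represents each document by the mean of its cross-lingual sentence embeddings and pairs documents greedily by cosine distance; the paper itself uses a sentence-mover's distance, so treat this only as an outline of the overall approach, with made-up helper names.

```python
import numpy as np

def doc_embedding(sentence_embeddings):
    """Mean-pool a document's sentence embeddings into one vector."""
    return np.mean(np.asarray(sentence_embeddings, dtype=float), axis=0)

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means semantically closer."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_documents(docs_a, docs_b):
    """Greedily pair each document in docs_a with its nearest unused
    document in docs_b, based on cross-lingual semantic distance.

    docs_a / docs_b: lists of documents, each a list of sentence
    embedding vectors (assumed to live in a shared cross-lingual space).
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    used, pairs = set(), []
    for i, doc_a in enumerate(docs_a):
        ea = doc_embedding(doc_a)
        best = min(
            (j for j in range(len(docs_b)) if j not in used),
            key=lambda j: cosine_distance(ea, doc_embedding(docs_b[j])),
        )
        used.add(best)
        pairs.append((i, best))
    return pairs
```

A real system would obtain the sentence embeddings from a multilingual encoder and replace the greedy matching with the paper's scoring function; the sketch only shows how semantic distances can drive cross-lingual document pairing.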
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.