ARPA: Armenian Paraphrase Detection Corpus and Models
- URL: http://arxiv.org/abs/2009.12615v1
- Date: Sat, 26 Sep 2020 14:56:57 GMT
- Title: ARPA: Armenian Paraphrase Detection Corpus and Models
- Authors: Arthur Malajyan, Karen Avetisyan, Tsolak Ghukasyan
- Abstract summary: We employ a semi-automatic method to generate a sentential paraphrase corpus for the Armenian language.
The initial collection of sentences is translated from Armenian to English and back twice, resulting in pairs of lexically distant but semantically similar sentences.
The generated paraphrases are then manually reviewed and annotated.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we employ a semi-automatic method based on back translation to
generate a sentential paraphrase corpus for the Armenian language. The initial
collection of sentences is translated from Armenian to English and back twice,
resulting in pairs of lexically distant but semantically similar sentences. The
generated paraphrases are then manually reviewed and annotated. Using this
method, train and test datasets are created, containing 2,360 paraphrases in
total. In addition, the datasets are used to train and evaluate BERT-based
models for detecting paraphrases in Armenian, achieving results comparable to
the state-of-the-art for other languages.
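The double round-trip described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: `translate` is a placeholder for whichever Armenian-English MT system is used (the abstract does not name one), and the returned pairs are only candidates that still require the manual review and annotation step.

```python
def back_translate_twice(sentence: str, translate) -> tuple[str, str]:
    """Pair a sentence with a candidate paraphrase produced by two
    Armenian -> English -> Armenian round trips."""
    paraphrase = sentence
    for _ in range(2):  # "translated from Armenian to English and back twice"
        english = translate(paraphrase, src="hy", tgt="en")
        paraphrase = translate(english, src="en", tgt="hy")
    return sentence, paraphrase


def build_candidate_pairs(sentences, translate):
    # Candidate pairs; the abstract's manual review/annotation happens after this.
    return [back_translate_twice(s, translate) for s in sentences]
```

Because each round trip compounds the MT system's lexical choices, the final Armenian sentence tends to drift lexically while (ideally) preserving meaning, which is what yields the "lexically distant but semantically similar" pairs.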
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Sõnajaht: Definition Embeddings and Semantic Search for Reverse Dictionary Creation [0.21485350418225246]
We present an information retrieval based reverse dictionary system using modern pre-trained language models and approximate nearest neighbors search algorithms.
The proposed approach is applied to an existing Estonian language lexicon resource, Sonaveeb (word web), with the purpose of enhancing and enriching it by introducing cross-lingual reverse dictionary functionality powered by semantic search.
arXiv Detail & Related papers (2024-04-30T10:21:14Z)
- ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation [59.91139600152296]
ParaAMR is a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation.
We show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning.
arXiv Detail & Related papers (2023-05-26T02:27:33Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Semantic Search as Extractive Paraphrase Span Detection [0.8137055256093007]
We approach semantic search by framing the search task as paraphrase span detection.
On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, we find that our paraphrase span detection model outperforms two strong retrieval baselines.
We introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources are not available.
arXiv Detail & Related papers (2021-12-09T13:16:42Z)
- Extracting and filtering paraphrases by bridging natural language inference and paraphrasing [0.0]
We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets.
The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
arXiv Detail & Related papers (2021-11-13T14:06:37Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z) - A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with
Bilingual Semantic Similarity Rewards [40.17497211507507]
Cross-lingual text summarization is a practically important but under-explored task.
We propose an end-to-end cross-lingual text summarization model.
arXiv Detail & Related papers (2020-06-27T21:51:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.