Related papers: A French Version of the OLDI Seed Corpus

URL: http://arxiv.org/abs/2508.02290v1
Date: Mon, 04 Aug 2025 10:57:54 GMT
Title: A French Version of the OLDI Seed Corpus
Authors: Malik Marmonier, Benoît Sagot, Rachel Bawden
Abstract summary: We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. This French corpus is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
Score: 20.630120942837564
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.

Related papers

Multilingual corpora for the study of new concepts in the social sciences and humanities: [0.0]
This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences. The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata.
arXiv Detail & Related papers (2025-12-08T10:04:50Z)
LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries. Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech. We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
The Claire French Dialogue Dataset [9.45456707528025]
This paper describes the 24 individual corpora of which CFDD is composed and provides links and citations to their original sources. It also provides our proposed breakdown of the full CFDD dataset into eight categories of subcorpora and describes the process we followed to standardize the format of the final dataset.
arXiv Detail & Related papers (2023-11-28T14:55:22Z)
One-for-All: Towards Universal Domain Translation with a Single StyleGAN [86.33216867136639]
We propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains. The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations. UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
arXiv Detail & Related papers (2023-10-22T08:02:55Z)
A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction. Our approach creates diverse parallel GEC data without any language-specific operations. It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
Named Entity Recognition and Linking Augmented with Large-Scale Structured Data [3.211619859724085]
We describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021. The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection. Our solution takes advantage of large collections of both unstructured and structured documents.
arXiv Detail & Related papers (2021-04-27T20:10:18Z)
PENELOPIE: Enabling Open Information Extraction for the Greek Language through Machine Translation [0.30938904602244344]
We present our submission for the EACL 2021 SRW: a methodology that aims at bridging the gap between high and low-resource languages. We build Neural Machine Translation (NMT) models for English-to-Greek and Greek-to-English based on the Transformer architecture. We leverage these NMT models to produce English translations of Greek text as input for our NLP pipeline, to which we apply a series of pre-processing and triple extraction tasks.
arXiv Detail & Related papers (2021-03-28T08:01:58Z)
VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language. The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages [0.0]
This article presents the strategy for developing a platform containing Language Processing Chains for European Union languages. These chains are part of the first step of an event-centric knowledge processing pipeline whose aim is to process multilingual media information about major events that can cause an impact in Europe and the rest of the world.
arXiv Detail & Related papers (2020-10-23T14:26:30Z)
Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query. Existing methods independently extract the features of videos and sentences. We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.