M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
- URL: http://arxiv.org/abs/2406.08255v1
- Date: Wed, 12 Jun 2024 14:28:25 GMT
- Title: M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
- Authors: Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari,
- Abstract summary: Document translation poses a challenge for Neural Machine Translation (NMT) systems.
Most document-level NMT systems rely on meticulously curated sentence-level parallel data.
Real-world documents often possess intricate text layouts that defy these assumptions.
This paper introduces M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents.
- Score: 32.39711525205274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications.
Related papers
- Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents [7.358946120326249]
We introduce 'Eclair, a text-extraction tool specifically designed to process a wide range of document types.
Given an image, 'Eclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes.
'Eclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics.
arXiv Detail & Related papers (2025-02-06T17:07:22Z) - Investigating Length Issues in Document-level Machine Translation [19.397788794005862]
We design and implement a new approach designed to precisely measure the effect of length increments on machine translation outputs.
Experiments show that (a)translation performance decreases with the length of the input text; (b)the position of sentences within the document matters and translation quality is higher for sentences occurring earlier in a document.
Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.
arXiv Detail & Related papers (2024-12-23T14:08:45Z) - The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection.
alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z) - OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share the unified encoder-decoder architecture, the unified objective point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question how to best utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z) - Exploring Paracrawl for Document-level Neural Machine Translation [21.923881766940088]
Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets.
We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents.
arXiv Detail & Related papers (2023-04-20T11:21:34Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine
Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - Diving Deep into Context-Aware Neural Machine Translation [36.17847243492193]
This paper analyzes the performance of document-level NMT models on four diverse domains.
We find that there is no single best approach to document-level NMT, but rather that different architectures come out on top on different tasks.
arXiv Detail & Related papers (2020-10-19T13:23:12Z) - Document-level Neural Machine Translation with Document Embeddings [82.4684444847092]
This work focuses on exploiting detailed document-level context in terms of multiple forms of document embeddings.
The proposed document-aware NMT is implemented to enhance the Transformer baseline by introducing both global and local document-level clues on the source end.
arXiv Detail & Related papers (2020-09-16T19:43:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.