Related papers: Detecting Cross-Language Plagiarism using Open Knowledge Graphs

URL: http://arxiv.org/abs/2111.09749v1
Date: Thu, 18 Nov 2021 15:23:27 GMT
Title: Detecting Cross-Language Plagiarism using Open Knowledge Graphs
Authors: Johannes Stegmüller, Fabian Bauer-Marquart, Norman Meuschke, Terry Ruas, Moritz Schubotz, Bela Gipp
Abstract summary: We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections.
Score: 7.378348990383349
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
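
Illustration: the abstract describes representing documents as entity vectors and retrieving candidate documents by similarity. The following is a minimal sketch of that general idea only, not the authors' CL-OSA implementation. It assumes a hypothetical upstream entity linker has already mapped each document to language-independent entity identifiers; the identifiers, document names, and function names below are placeholders, and cosine similarity is an assumed choice of measure.

```python
from collections import Counter
from math import sqrt


def entity_vector(entity_ids):
    """Build a sparse count vector from a list of entity identifiers."""
    return Counter(entity_ids)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(query_entities, candidates):
    """Rank candidate documents by similarity to the query document.

    `candidates` maps a document name to its list of entity identifiers.
    """
    query = entity_vector(query_entities)
    scored = [
        (name, cosine_similarity(query, entity_vector(entities)))
        for name, entities in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder identifiers standing in for knowledge-graph entities.
    suspicious = ["E_A", "E_B", "E_B", "E_C"]
    candidates = {
        "doc_en": ["E_A", "E_B", "E_C"],
        "doc_other": ["E_D", "E_E"],
    }
    for name, score in rank_candidates(suspicious, candidates):
        print(name, round(score, 3))
```

Because both documents are compared in a shared entity space, this kind of comparison needs no translation step; the quality of the result depends entirely on the assumed entity-linking stage, which the sketch does not cover.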

Related papers

Building and Aligning Comparable Corpora [0.0]
A comparable corpus is a set of topic-aligned documents in multiple languages. We present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French, and Arabic. We also experiment with a method to automatically align comparable documents using cross-lingual similarity measures.
arXiv Detail & Related papers (2025-08-04T16:05:36Z)
The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora [6.594531626178451]
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. We study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. We propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages.
arXiv Detail & Related papers (2025-07-10T08:38:31Z)
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents [2.0277446818410994]
This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform.
arXiv Detail & Related papers (2025-04-22T20:55:08Z)
Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries. Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese. We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems. We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media [40.26888469822391]
Claim span identification (CSI) is an important step in fact-checking pipelines. Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem. We create a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English.
arXiv Detail & Related papers (2023-10-27T15:28:12Z)
Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding. We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart. Our approach can significantly improve sentence embedding.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language. To collect large-scale CLS data, existing datasets typically involve translation in their creation. In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages. We present the first fact-checking framework augmented with crosslingual retrieval. We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text. We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not reached using the general-purpose multilingual text encoders off-the-shelf, but rather by relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.