Detecting Cross-Language Plagiarism using Open Knowledge Graphs
- URL: http://arxiv.org/abs/2111.09749v1
- Date: Thu, 18 Nov 2021 15:23:27 GMT
- Title: Detecting Cross-Language Plagiarism using Open Knowledge Graphs
- Authors: Johannes Stegmüller, Fabian Bauer-Marquart, Norman Meuschke, Terry Ruas, Moritz Schubotz, Bela Gipp
- Abstract summary: We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis.
CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata.
It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections.
- Score: 7.378348990383349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying cross-language plagiarism is challenging, especially for distant
language pairs and sense-for-sense translations. We introduce the new
multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis
(CL-OSA) for this task. CL-OSA represents documents as entity
vectors obtained from the open knowledge graph Wikidata. Unlike other
methods, CL-OSA requires neither computationally expensive machine
translation nor pre-training on comparable or parallel corpora. It reliably
disambiguates homonyms and scales to allow its application to Web-scale
document collections. We show that CL-OSA outperforms state-of-the-art methods
for retrieving candidate documents from five large, topically diverse test
corpora that include distant language pairs like Japanese-English. For
identifying cross-language plagiarism at the character level, CL-OSA primarily
improves the detection of sense-for-sense translations. For these challenging
cases, CL-OSA's performance in terms of the well-established PlagDet score
exceeds that of the best competitor by more than a factor of two. The code and data
of our study are openly available.
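
The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the placeholder entity IDs (`Q_BANK`, `Q_MONEY`, `Q_RIVER`), and the function names `entity_vector` and `cosine_similarity` are assumptions made for illustration only, and the sketch performs no homonym disambiguation, which the abstract says CL-OSA does.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon: surface forms per language mapped to placeholder
# entity IDs. These are NOT real Wikidata identifiers, and a real system would
# need proper entity linking and disambiguation instead of a lookup table.
TOY_LEXICON = {
    "en": {"bank": "Q_BANK", "money": "Q_MONEY", "river": "Q_RIVER"},
    "de": {"bank": "Q_BANK", "geld": "Q_MONEY", "fluss": "Q_RIVER"},
}


def entity_vector(text, lang):
    """Count the entities found in a text via a naive whitespace tokenizer."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse entity-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no linked entities in at least one document
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    english = entity_vector("the bank holds money", "en")
    german = entity_vector("die bank hat geld", "de")
    print(cosine_similarity(english, german))
```

Because both documents map to the same entity IDs regardless of language, the comparison needs no translation step; documents that share no linked entities score 0.0.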
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media [40.26888469822391]
Claim span identification (CSI) is an important step in fact-checking pipelines.
Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem.
We create a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English.
arXiv Detail & Related papers (2023-10-27T15:28:12Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embedding.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, peak performance is not reached by using the general-purpose multilingual text encoders off-the-shelf, but rather by relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.