Crafting Tomorrow's Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian
- URL: http://arxiv.org/abs/2408.10724v3
- Date: Mon, 4 Nov 2024 12:42:18 GMT
- Title: Crafting Tomorrow's Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian
- Authors: Cem Üyük, Danica Rovó, Shaghayegh Kolli, Rabia Varol, Georg Groh, Daryna Dementieva,
- Abstract summary: We introduce a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian.
The dataset incorporates outputs from multiple multilingual generators (in both, zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4.
We present the detection results aiming to delve into the interpretablity and robustness of machine-generated texts detectors across all target languages.
- Score: 9.267227655791443
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In the era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both, zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLMs prompting. We present the detection results aiming to delve into the interpretablity and robustness of machine-generated texts detectors across all target languages.
Related papers
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z) - Lost in Translation, Found in Spans: Identifying Claims in Multilingual
Social Media [40.26888469822391]
Claim span identification (CSI) is an important step in fact-checking pipelines.
Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem.
We create a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English.
arXiv Detail & Related papers (2023-10-27T15:28:12Z) - Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z) - XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages [11.581072296148031]
We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset.
Our experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to best results on average across the twelve languages.
arXiv Detail & Related papers (2022-09-22T18:01:27Z) - A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Transferring Knowledge Distillation for Multilingual Social Event
Detection [42.663309895263666]
Recently published graph neural networks (GNNs) show promising performance at social event detection tasks.
We present a GNN that incorporates cross-lingual word embeddings for detecting events in multilingual data streams.
Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection in both multilingual data and in languages where training samples are scarce.
arXiv Detail & Related papers (2021-08-06T12:38:42Z) - Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent
Prediction and Slot Filling [29.17194639368877]
We propose a novel method to augment the monolingual source data using multilingual code-switching via random translations.
Experiments on the benchmark dataset of MultiATIS++ yielded an average improvement of +4.2% in accuracy for intent task and +1.8% in F1 for slot task.
We present an application of our method for crisis informatics using a new human-annotated tweet dataset of slot filling in English and Haitian Creole.
arXiv Detail & Related papers (2021-03-13T21:05:09Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Data and Representation for Turkish Natural Language Inference [6.135815931215188]
We offer a positive response for natural language inference (NLI) in Turkish.
We translate two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels.
We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large.
arXiv Detail & Related papers (2020-04-30T17:12:52Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.