From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned BERT Classifier
- URL: http://arxiv.org/abs/2602.13504v1
- Date: Fri, 13 Feb 2026 22:29:00 GMT
- Title: From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned BERT Classifier
- Authors: Ozancan Ozdemir
- Abstract summary: This study fine-tunes a Turkish-specific BERT model on a labeled dataset of 3,600 articles from three major Turkish outlets. Deployment reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96. It is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves an F1 score of 0.9708 on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles spanning 2023 to 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percent of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.
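The binary fine-tuning setup the abstract describes can be sketched with the Hugging Face `transformers` and `datasets` libraries. The checkpoint name `dbmdz/bert-base-turkish-cased` comes from the paper; the dataset file `articles.csv`, its `text`/`label` columns, and all hyperparameters below are illustrative assumptions, not the authors' actual configuration. A plain-Python F1 helper stands in for the paper's evaluation metric.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class, computed without external dependencies."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def finetune():
    # Requires: pip install transformers datasets torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-cased",
        num_labels=2)  # assumed labels: 0 = human-written, 1 = AI-rewritten

    # "articles.csv" (columns: text, label) is a placeholder for the
    # paper's 3,600-article dataset, which is not public here.
    data = load_dataset("csv", data_files="articles.csv")["train"]
    data = data.train_test_split(test_size=0.2)
    data = data.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True)

    args = TrainingArguments(
        output_dir="turkish-ai-news-detector",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-5)
    Trainer(model=model, args=args,
            train_dataset=data["train"],
            eval_dataset=data["test"]).train()

# finetune() is not invoked here: it downloads the checkpoint and trains.
```

Deploying the trained model on unseen articles, as the abstract describes for 2023-2026, would then amount to batch inference with the same tokenizer and an argmax (or thresholded softmax) over the two logits per article.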
Related papers
- A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News [1.8737506366172099]
This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis. We explore BAN-ABSA, a dataset of 9,014 news headlines, used here for the first time for simultaneous headline and sentiment categorization. The proposed BERT-CNN-BiLSTM model significantly outperforms all baseline models in the classification tasks.
arXiv Detail & Related papers (2025-11-23T21:22:56Z) - CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English [53.32175252285023]
Cross-lingual news comparison offers a promising approach to verifying information. Existing datasets for cross-lingual news analysis were manually curated by journalists and experts. We introduce a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment.
arXiv Detail & Related papers (2025-10-22T14:23:50Z) - Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish [0.9217021281095907]
We introduce the FCTR dataset, consisting of 3238 real-world claims.
This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations.
arXiv Detail & Related papers (2024-03-01T09:57:46Z) - Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models [0.07696728525672149]
We propose a methodology consisting of four distinct approaches to classifying fake news articles in Bengali. Our approach includes translating English news articles and using augmentation techniques to address the shortage of fake news articles. We show the effectiveness of summarization and augmentation for Bengali fake news detection.
arXiv Detail & Related papers (2023-07-13T14:50:55Z) - Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis of the usage of cross-lingual evidence as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z) - Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021 [55.41644538483948]
The goal of the shared task is to motivate the community to come up with efficient methods for solving this vital problem.
The training set contains 1300 annotated news articles -- 750 real news, 550 fake news, while the testing set contains 300 news articles -- 200 real, 100 fake news.
The best performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro.
arXiv Detail & Related papers (2022-07-11T18:58:36Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z) - Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset [0.0]
We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture.
We release DaReCzech, a unique data set of 1.6 million Czech user query-document pairs with manually assigned relevance levels.
We also release Small-E-Czech, an Electra-small language model pre-trained on a large Czech corpus.
arXiv Detail & Related papers (2021-12-03T09:45:18Z) - A Heuristic-driven Uncertainty based Ensemble Framework for Fake News Detection in Tweets and News Articles [5.979726271522835]
We describe a novel fake news detection system that automatically identifies whether a news item is "real" or "fake".
We have used an ensemble model consisting of pre-trained models followed by a statistical feature fusion network.
Our proposed framework also quantifies reliable predictive uncertainty along with class-wise output confidence levels for the classification task.
arXiv Detail & Related papers (2021-04-05T06:35:30Z) - InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.