TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
- URL: http://arxiv.org/abs/2503.24115v3
- Date: Wed, 02 Apr 2025 13:32:22 GMT
- Title: TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
- Authors: Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang,
- Abstract summary: We present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio); (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning.
- Score: 16.3518216479454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
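To make the three annotation tasks concrete, the sketch below models one dataset record as a Python dataclass. The field names are illustrative assumptions drawn from the abstract, not the released schema; the project repository linked above is the authoritative format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TeleAntiFraudSample:
    """One audio-text pair as described in the abstract.
    Field names are assumptions for illustration, not the real schema."""
    audio_path: str            # TTS-regenerated, privacy-preserved call audio
    transcript: str            # ASR transcript the audio was regenerated from
    reasoning: str             # slow-thinking chain explaining the fraud judgement
    scenario: str              # task 1: scenario classification label
    is_fraud: bool             # task 2: fraud detection label
    fraud_type: Optional[str]  # task 3: fraud type label, None for benign calls
```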
Related papers
- Speculative End-Turn Detector for Efficient Speech Chatbot Assistant [11.136112399898481]
We introduce the ETD dataset, the first public dataset for end-turn detection. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low.
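A minimal sketch of the collaborative-inference idea behind SpeculativeETD: a lightweight draft model answers when it is confident, and only uncertain chunks are passed to a larger verifier. The threshold and the two model callables are assumptions, not the paper's implementation.

```python
def speculative_etd(audio_chunk, draft_model, verifier_model, threshold=0.9):
    """Return True if the turn has ended, deferring only uncertain cases."""
    p_end = draft_model(audio_chunk)               # cheap model, runs on every chunk
    if p_end >= threshold or p_end <= 1.0 - threshold:
        return p_end >= 0.5                        # confident: keep the draft decision
    return verifier_model(audio_chunk) >= 0.5      # uncertain: spend compute on accuracy
```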
arXiv Detail & Related papers (2025-03-30T13:34:23Z) - CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking [85.68235482145091]
Large-scale speech datasets have become valuable intellectual property.
We propose a novel dataset ownership verification method.
Our approach introduces a clustering-based backdoor watermark (CBW).
We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks.
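A minimal sketch of the clustering-based watermark idea for speech data: group utterances by speaker embedding and superimpose a trigger on the clusters chosen as the mark; ownership of a suspect model would then be verified by testing whether it reacts to the trigger. The cluster count, trigger form, and selection rule below are assumptions, not the CBW paper's recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed_watermark(embeddings, audios, trigger, n_clusters=8, marked_clusters=(0,)):
    """Cluster utterances by speaker embedding and add a trigger to marked clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    marked = []
    for audio, label in zip(audios, labels):
        if label in marked_clusters:
            audio = audio + trigger        # low-amplitude additive trigger (assumption)
        marked.append(audio)
    return marked, labels
```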
arXiv Detail & Related papers (2025-03-02T02:02:57Z) - Semantic Consistency Regularization with Large Language Models for Semi-supervised Sentiment Analysis [20.503153899462323]
We propose a framework for semi-supervised sentiment analysis.
We introduce two prompting strategies to semantically enhance unlabeled text.
Experiments show our method achieves remarkable improvements over prior semi-supervised methods.
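One way to use the two prompt-enhanced views of an unlabeled text is a consistency-regularization term between the classifier's predictions on the original and the LLM-enhanced variant. The symmetric-KL form below is a generic sketch under that assumption, not the paper's exact objective.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_original, logits_prompted):
    """Symmetric KL between predictions on a text and its LLM-enhanced variant."""
    p, q = softmax(logits_original), softmax(logits_prompted)
    kl_pq = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12)), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))
```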
arXiv Detail & Related papers (2025-01-29T12:03:11Z) - Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention [45.31956918333587]
In multimodal sentiment analysis, collecting text data is often more challenging than collecting video or audio data. We have developed a robust model that integrates multimodal sentiment information, even in the absence of the text modality.
arXiv Detail & Related papers (2024-10-19T07:59:41Z) - A Federated Learning Approach for Multi-stage Threat Analysis in Advanced Persistent Threat Campaigns [25.97800399318373]
Multi-stage threats like advanced persistent threats (APT) pose severe risks by stealing data and destroying infrastructure.
APTs use novel attack vectors and evade signature-based detection by obfuscating their network presence.
This paper proposes a novel 3-phase unsupervised federated learning (FL) framework to detect APTs.
arXiv Detail & Related papers (2024-06-19T03:34:41Z) - Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, and 2) an extending strategy that further expands existing samples.
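The two strategies can be expressed as prompt templates driven by an LLM. In the sketch below, `call_llm` is a hypothetical helper and the prompt wording is an assumption; only the mimic/extend split comes from the summary above.

```python
MIMIC_PROMPT = (
    "Here is a benchmark sample:\n{sample}\n"
    "Write a new sample that tests the same skill with different surface content."
)
EXTEND_PROMPT = (
    "Here is a benchmark sample:\n{sample}\n"
    "Extend it with an additional constraint or reasoning step so it becomes harder."
)

def update_dataset(samples, call_llm, strategy="mimic"):
    """Generate updated samples with either the mimicking or the extending strategy."""
    template = MIMIC_PROMPT if strategy == "mimic" else EXTEND_PROMPT
    return [call_llm(template.format(sample=s)) for s in samples]
```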
arXiv Detail & Related papers (2024-02-19T07:15:59Z) - Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks [64.02867484165476]
To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems. The commonly used FL approach (i.e., FedAvg) in S2T tasks typically suffers from extensive communication overhead. We propose a personalized federated S2T framework that introduces FedLoRA, a lightweight LoRA module for client-side tuning and interaction with the server, and FedMem, a global model equipped with a $k$-nearest-neighbor classifier.
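The communication saving in a FedLoRA-style setup comes from exchanging only the low-rank adapter instead of the full weight matrix. The sketch below shows that structure; rank, initialization, and scaling are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

class LoRALinear:
    """A frozen dense weight W plus a trainable low-rank update B @ A."""
    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight                                      # frozen, shape (out, in)
        self.a = rng.normal(0.0, 0.02, (rank, weight.shape[1]))   # trainable
        self.b = np.zeros((weight.shape[0], rank))                # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):                                         # x: (batch, in)
        return x @ (self.weight + self.scale * (self.b @ self.a)).T

    def client_update_payload(self):
        """Only A and B would be uploaded to the server, keeping communication light."""
        return {"a": self.a, "b": self.b}
```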
arXiv Detail & Related papers (2024-01-18T15:39:38Z) - Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization [66.22007368434633]
We present the first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR).
The proposed approach allows creating millions of realistic examples of corrupted ASR hypotheses and simulating non-trivial biasing lists for the customization task.
We report experiments with training an open-source customization model on the proposed dataset and show that the injection of hard negative biasing phrases decreases WER and the number of false alarms.
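A toy sketch of the two ingredients named above: a corrupted ASR hypothesis and a biasing list that mixes the true phrase with hard negatives. The character-substitution corruption and prefix-based negative mining are simplifying assumptions, not the dataset's actual generation procedure.

```python
import random

def corrupt(phrase, sub_rate=0.15, rng=None):
    """Simulate an ASR error by random character substitution (toy stand-in)."""
    rng = rng or random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(alphabet) if ch.isalpha() and rng.random() < sub_rate else ch
        for ch in phrase
    )

def build_example(target, vocabulary, n_negatives=5, rng=None):
    """Pair a corrupted hypothesis with a biasing list containing hard negatives."""
    rng = rng or random.Random(1)
    hypothesis = corrupt(target, rng=rng)
    negatives = [w for w in vocabulary if w != target and w[:3] == target[:3]][:n_negatives]
    biasing_list = [target] + negatives
    rng.shuffle(biasing_list)
    return {"hypothesis": hypothesis, "target": target, "biasing_list": biasing_list}
```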
arXiv Detail & Related papers (2023-09-29T14:18:59Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base detectors and GPTZero.
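As a hedged illustration of the sample-size argument (a standard Le Cam-style bound for a balanced test, not the paper's exact result): if the human and machine text distributions h and m are a total-variation distance delta apart, then

```latex
% Optimal accuracy of any detector given n i.i.d. samples, using
% TV(h^{\otimes n}, m^{\otimes n}) \le n \, \mathrm{TV}(h, m):
\text{accuracy}(n)
  \;\le\; \tfrac{1}{2} + \tfrac{1}{2}\,\mathrm{TV}\!\left(h^{\otimes n}, m^{\otimes n}\right)
  \;\le\; \tfrac{1}{2} + \tfrac{n\,\delta}{2},
\qquad \delta = \mathrm{TV}(h, m).
```

So a constant advantage over random guessing requires n on the order of 1/delta, which diverges as machine-generated text approaches human text (delta tending to zero).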
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - Fraud Dataset Benchmark and Applications [25.184342958800293]
The Fraud Dataset Benchmark (FDB) is a compilation of publicly available datasets catered to fraud detection.
FDB comprises a variety of fraud-related tasks, ranging from identifying fraudulent card-not-present transactions, detecting bot attacks, classifying malicious URLs, and estimating the risk of loan default, to content moderation.
A Python-based library for FDB provides a consistent API for data loading with standardized training and testing splits.
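A hypothetical sketch of what a "consistent API for data loading with standardized training and testing splits" can look like; the function and column names below are illustrative assumptions, not the FDB library's actual interface.

```python
import pandas as pd

def load_fraud_task(csv_path: str, label_column: str = "is_fraud", test_fraction: float = 0.2):
    """Load one fraud task from a CSV file and return standardized train/test splits."""
    frame = pd.read_csv(csv_path)
    cutoff = int(len(frame) * (1.0 - test_fraction))
    train, test = frame.iloc[:cutoff], frame.iloc[cutoff:]   # deterministic split
    feature_cols = [c for c in frame.columns if c != label_column]
    return (train[feature_cols], train[label_column]), (test[feature_cols], test[label_column])
```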
arXiv Detail & Related papers (2022-08-30T17:35:39Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained increasing attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)