Beyond a Single Perspective: Text Anomaly Detection with Multi-View Language Representations
- URL: http://arxiv.org/abs/2601.17786v1
- Date: Sun, 25 Jan 2026 10:52:59 GMT
- Title: Beyond a Single Perspective: Text Anomaly Detection with Multi-View Language Representations
- Authors: Yixin Liu, Kehan Yan, Shiyuan Li, Qingfeng Chen, Shirui Pan
- Abstract summary: Text anomaly detection (TAD) plays a critical role in various language-driven real-world applications, including harmful content moderation, phishing detection, and spam review filtering. While two-step "embedding-detector" TAD methods have shown state-of-the-art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. We propose to exploit the embeddings from multiple pretrained language models and integrate them into $MCA^2$, a multi-view TAD framework.
- Score: 48.7146621463489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text anomaly detection (TAD) plays a critical role in various language-driven real-world applications, including harmful content moderation, phishing detection, and spam review filtering. While two-step "embedding-detector" TAD methods have shown state-of-the-art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. To address these limitations, we propose to exploit the embeddings from multiple pretrained language models and integrate them into $MCA^2$, a multi-view TAD framework. $MCA^2$ adopts a multi-view reconstruction model to effectively extract normal textual patterns from multiple embedding perspectives. To exploit inter-view complementarity, a contrastive collaboration module is designed to leverage and strengthen the interactions across different views. Moreover, an adaptive allocation module is developed to automatically assign the contribution weight of each view, thereby improving the adaptability to diverse datasets. Extensive experiments on 10 benchmark datasets verify the effectiveness of $MCA^2$ against strong baselines. The source code of $MCA^2$ is available at https://github.com/yankehan/MCA2.
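To make the two-step "embedding-detector" idea concrete, below is a minimal sketch of multi-view, reconstruction-based anomaly scoring. Everything here is an illustrative assumption rather than $MCA^2$'s actual architecture: the per-view autoencoder, the fixed `view_weights` (which $MCA^2$ instead learns adaptively), and the names `ViewAutoencoder` and `multi_view_anomaly_scores` are all hypothetical; the contrastive collaboration module is omitted entirely. See the linked repository for the real implementation.

```python
import torch
import torch.nn as nn


class ViewAutoencoder(nn.Module):
    """Illustrative per-view reconstruction model: trained on normal
    text embeddings only, so anomalous inputs reconstruct poorly."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def multi_view_anomaly_scores(view_embeddings, autoencoders, view_weights):
    """Fuse per-view reconstruction errors into a single anomaly score.

    view_embeddings: list of (n, d_v) tensors, one per embedding model.
    view_weights:    per-view contribution weights (fixed here; the
                     paper's adaptive allocation module learns these).
    """
    scores = torch.zeros(view_embeddings[0].shape[0])
    for z, ae, w in zip(view_embeddings, autoencoders, view_weights):
        with torch.no_grad():
            err = ((ae(z) - z) ** 2).mean(dim=1)  # per-sample MSE
        scores += w * err
    return scores  # higher = more anomalous


if __name__ == "__main__":
    # Random tensors stand in for embeddings from two pretrained LMs
    # (e.g., a 384-dim and a 768-dim sentence encoder).
    views = [torch.randn(8, 384), torch.randn(8, 768)]
    aes = [ViewAutoencoder(z.shape[1]) for z in views]
    weights = torch.softmax(torch.ones(len(views)), dim=0)
    print(multi_view_anomaly_scores(views, aes, weights))
```

The key design point this sketch captures is that each embedding model contributes an independent "view" of normality, and the final score is a weighted fusion across views, so no single embedding space dominates the decision.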
Related papers
- MultiCaption: Detecting disinformation using multilingual visual claims [10.69065586825833]
We present MultiCaption, a dataset specifically designed for detecting contradictions in visual claims. The resulting dataset comprises 11,088 visual claims in 64 languages. The gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines.
arXiv Detail & Related papers (2026-01-16T11:57:07Z) - A$^2$M$^2$-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition [56.79651392604733]
A$^2$M$^2$-Net handles the challenging temporal misalignment problem by establishing an adaptive alignment protocol for strong representation. Experiments on five widely used FSAR benchmarks show that A$^2$M$^2$-Net achieves very competitive performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-09-22T11:44:14Z) - Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment [21.36633828492347]
We study Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD) and introduce a meta-learning-based framework designed to leverage rich textual semantics as an auxiliary modality for effective domain adaptation. We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
arXiv Detail & Related papers (2025-02-23T06:59:22Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a dedicated pipeline designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Cross-domain Multi-modal Few-shot Object Detection via Rich Text [21.36633828492347]
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks. We study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning-based multi-modal few-shot object detection method.
arXiv Detail & Related papers (2024-03-24T15:10:22Z) - Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$^2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as few as 50K multimodal examples, M$^2$IXT can significantly boost few-shot ICL performance.
arXiv Detail & Related papers (2023-10-08T10:47:24Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims to resolve ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.