TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer
- URL: http://arxiv.org/abs/2403.12481v2
- Date: Thu, 23 Jan 2025 07:18:03 GMT
- Title: TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer
- Authors: Eunjee Choi, Jong-Kook Kim
- Abstract summary: This paper introduces an end-to-end model called TT-BLIP that applies bootstrapping language-image pretraining for unified vision-language understanding and generation.
The experiments are performed using two fake news datasets, Weibo and Gossipcop.
- Abstract: Detecting fake news has received a lot of attention. Many previous methods concatenate independently encoded unimodal data, ignoring the benefits of integrated multimodal information. Also, the absence of specialized feature extraction for text and images further limits these methods. This paper introduces an end-to-end model called TT-BLIP that applies the bootstrapping language-image pretraining for unified vision-language understanding and generation (BLIP) for three types of information: BERT and BLIPTxt for text, ResNet and BLIPImg for images, and bidirectional BLIP encoders for multimodal information. The Multimodal Tri-Transformer fuses tri-modal features using three types of multi-head attention mechanisms, ensuring integrated modalities for enhanced representations and improved multimodal data analysis. The experiments are performed using two fake news datasets, Weibo and Gossipcop. The results indicate TT-BLIP outperforms the state-of-the-art models.
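The Multimodal Tri-Transformer's fusion step can be illustrated with a minimal sketch: each modality's feature vector queries the stack of all three (text, image, multimodal) via scaled dot-product attention, and the per-modality results are concatenated. The function names and toy 4-dimensional features below are hypothetical; the actual model uses learned multi-head projections rather than raw features.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of key/value vectors."""
    d = len(query)
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def tri_modal_fuse(txt, img, mm):
    """Each modality queries the stack of all three; attended outputs are concatenated."""
    stack = [txt, img, mm]
    fused = []
    for q in stack:
        fused.extend(attend(q, stack, stack))
    return fused

# Toy 4-dim features, one vector per modality
txt = [1.0, 0.0, 0.0, 0.5]
img = [0.0, 1.0, 0.5, 0.0]
mm = [0.5, 0.5, 0.5, 0.5]
fused = tri_modal_fuse(txt, img, mm)  # 12 values: 3 modalities x 4 dims
```

Because each output dimension is a convex combination of the corresponding input dimensions, the fused vector stays within the range of its inputs; the real model would pass this through further projection and classification layers.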
Related papers
- CroMe: Multimodal Fake News Detection using Cross-Modal Tri-Transformer and Metric Learning [0.22499166814992438]
Multimodal Fake News Detection has received increasing attention recently.
Existing methods rely on independently encoded unimodal data.
To address these issues, Cross-Modal Tri-Transformer and Metric Learning for Multimodal Fake News Detection (CroMe) is proposed.
arXiv Detail & Related papers (2025-01-21T09:36:27Z)
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation [57.84148140637513]
Multi-Prompts Sinkhorn Attention (MPSA) effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings.
OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks.
arXiv Detail & Related papers (2024-03-21T07:15:37Z)
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
- Multimodal Fake News Detection via CLIP-Guided Learning [26.093561485807832]
This paper proposes FND-CLIP, a multimodal Fake News Detection network based on Contrastive Language-Image Pretraining (CLIP).
Given a targeted multimodal news item, we extract deep representations from the image and text using a ResNet-based encoder, a BERT-based encoder and two pair-wise CLIP encoders.
The multimodal feature is a concatenation of the CLIP-generated features weighted by the standardized cross-modal similarity of the two modalities.
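The similarity-weighted concatenation described above can be sketched as follows. This is a simplified stand-in: the paper standardizes the cross-modal similarity, while the sketch below just clamps a raw cosine similarity; all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def clip_weighted_concat(txt_feat, img_feat):
    """Concatenate CLIP text/image features, each scaled by their cross-modal similarity.

    Dissimilar (likely mismatched) text-image pairs therefore contribute
    weaker features to the fused representation.
    """
    w = max(cosine(txt_feat, img_feat), 0.0)  # clamp negatives to zero
    return [w * x for x in txt_feat] + [w * x for x in img_feat]
```

For a perfectly aligned pair the weight is 1 and the concatenation passes both features through unchanged; for orthogonal features the fused vector collapses toward zero.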
arXiv Detail & Related papers (2022-05-28T02:43:18Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
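The masked token prediction objective behind M3AE can be sketched in a few lines: a random fraction of the input tokens is replaced with a mask id, and the model is trained to reconstruct the originals at those positions. The function below is a hypothetical illustration of only the masking step, not the M3AE encoder itself.

```python
import random

def mask_tokens(tokens, mask_ratio=0.75, mask_id=-1, seed=0):
    """Replace a random fraction of tokens with mask_id.

    Returns the masked sequence and the masked positions, which serve
    as the prediction targets during training.
    """
    rng = random.Random(seed)
    k = int(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), k))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_id
    return masked, positions
```

A high mask ratio (M3AE-style autoencoders typically mask most of the input) forces the model to infer content from limited context across both modalities.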
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
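The text-to-pixel contrastive learning that CRIS relies on can be sketched with an InfoNCE-style objective: the text embedding should score higher against its matching pixel embeddings than against the rest. The function below is a hypothetical single-positive simplification, not the paper's exact loss.

```python
import math

def text_to_pixel_loss(text_feat, pixel_feats, pos_idx, temperature=0.07):
    """InfoNCE-style loss pulling a text embedding toward one positive pixel embedding.

    pixel_feats is a list of pixel embeddings; pos_idx marks the one that
    should align with the text. Lower loss means better alignment.
    """
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    logits = [cos(text_feat, p) / temperature for p in pixel_feats]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_idx] / sum(exps))
```

When the text embedding points at the positive pixel, the loss is near zero; pairing it with a misaligned pixel drives the loss up, which is what pushes the decoder toward text-to-pixel alignment.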
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation [52.037766778458504]
We propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation.
OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality.
OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
arXiv Detail & Related papers (2021-07-01T06:59:44Z) - MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical
Attention [5.584060970507506]
This paper presents MAST, a new model for Multimodal Abstractive Text Summarization.
We examine the usefulness and challenges of deriving information from the audio modality.
We present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges.
arXiv Detail & Related papers (2020-10-15T21:08:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.