Multimodal Short Video Rumor Detection System Based on Contrastive
Learning
- URL: http://arxiv.org/abs/2304.08401v3
- Date: Wed, 17 May 2023 13:12:27 GMT
- Title: Multimodal Short Video Rumor Detection System Based on Contrastive
Learning
- Authors: Yuxing Yang, Junhao Zhao, Siyi Wang, Xiangyu Min, Pengchao Wang and
Haizhou Wang
- Abstract summary: Short video platforms in China have gradually evolved into fertile grounds for the proliferation of fake news.
Distinguishing short video rumors poses a significant challenge due to the substantial amount of information and shared features.
Our research group proposes a methodology encompassing multimodal feature fusion and the integration of external knowledge.
- Score: 3.4192832062683842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of short video platforms as prominent channels for news
dissemination, major platforms in China have gradually evolved into fertile
grounds for the proliferation of fake news. However, distinguishing short video
rumors poses a significant challenge due to the substantial amount of
information and shared features among videos, resulting in homogeneity. To
address the dissemination of short video rumors effectively, our research group
proposes a methodology encompassing multimodal feature fusion and the
integration of external knowledge, considering the merits and drawbacks of each
algorithm. The proposed detection approach entails the following steps: (1)
creation of a comprehensive dataset comprising multiple features extracted from
short videos; (2) development of a multimodal rumor detection model: first, we
employ the Temporal Segment Networks (TSN) video coding model to extract video
features, followed by the utilization of Optical Character Recognition (OCR)
and Automatic Speech Recognition (ASR) to extract textual features.
Subsequently, the BERT model is employed to fuse textual and video features;
(3) distinction is achieved through contrastive learning: we acquire external
knowledge by crawling relevant sources and leverage a vector database to
incorporate this knowledge into the classification output. Our research process
is driven by practical considerations, and the knowledge derived from this
study will hold significant value in practical scenarios, such as short video
rumor identification and the management of social opinions.
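As a rough illustration of steps (2) and (3) above, the sketch below (a minimal, hypothetical example with placeholder feature shapes and model choices, not the authors' released code) fuses TSN-style clip features with BERT-encoded OCR/ASR text, then scores the fused embedding against an in-memory stand-in for the vector database of crawled external knowledge:

```python
# Minimal sketch of steps (2) and (3): late fusion of TSN video features
# with BERT-encoded OCR/ASR text, then nearest-neighbor lookup against
# external knowledge. Feature dimensions and model names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class RumorFusionClassifier(nn.Module):
    def __init__(self, video_dim=2048, hidden=768, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.video_proj = nn.Linear(video_dim, hidden)   # map TSN features into BERT space
        self.classifier = nn.Linear(hidden * 2, num_classes)

    def forward(self, video_feats, input_ids, attention_mask):
        v = self.video_proj(video_feats.mean(dim=1))     # average TSN segment features
        t = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]  # [CLS]
        fused = torch.cat([v, t], dim=-1)                # simple late fusion
        return self.classifier(fused), fused

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = RumorFusionClassifier().eval()

ocr_asr_text = "..."                   # concatenated OCR and ASR output for one video
enc = tokenizer(ocr_asr_text, return_tensors="pt", truncation=True, max_length=512)
video_feats = torch.randn(1, 8, 2048)  # placeholder TSN features: 8 segments x 2048 dims

with torch.no_grad():
    logits, fused = model(video_feats, enc["input_ids"], enc["attention_mask"])
    # Step (3): compare the fused embedding against a vector store of
    # crawled external knowledge (a random matrix stands in here).
    knowledge_bank = torch.randn(1000, fused.size(-1))
    sims = F.cosine_similarity(fused, knowledge_bank)
    top_matches = sims.topk(5).indices  # nearest knowledge entries feed the final decision
```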
Related papers
- VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos [14.551693267228345]
This paper presents a novel fake news detection method based on multimodal information, designed to identify misinformation through a multi-level analysis of video content.
The proposed framework successfully integrates multimodal features within videos, significantly enhancing the accuracy and reliability of fake news detection.
arXiv Detail & Related papers (2024-11-15T08:20:26Z)
- Video Summarization Techniques: A Comprehensive Review [1.6381055567716192]
The paper surveys the approaches and methods developed for video summarization, covering both extractive and abstractive strategies.
Extractive summarization selects key frames or segments from the source video using methods such as shot boundary detection and clustering (see the sketch after this entry).
Abstractive summarization instead generates new content that distills the video's essence, drawing on deep neural networks, natural language processing, reinforcement learning, attention mechanisms, generative adversarial networks, and multi-modal learning.
arXiv Detail & Related papers (2024-10-06T11:17:54Z)
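As a concrete illustration of the extractive strategy described in that review (an independent sketch, not code from any surveyed paper; the difference threshold is an assumption), shot boundaries can be detected from inter-frame difference peaks, keeping one representative frame per shot:

```python
# Sketch of extractive summarization: treat large inter-frame changes as
# shot boundaries and keep the first frame of each new shot as a key frame.
import cv2
import numpy as np

def extract_keyframes(video_path, diff_threshold=30.0):
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            keyframes.append((frame_idx, frame))  # shot boundary -> key frame
        prev_gray = gray
        frame_idx += 1
    cap.release()
    return keyframes
```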
- Language as the Medium: Multimodal Video Classification through text only [3.744589644319257]
We propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information.
Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2.
Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks.
arXiv Detail & Related papers (2023-09-19T17:32:21Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Experimental results on an existing multi-modal video summarization dataset show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model (a plain CLIP video-classification baseline is sketched after this entry).
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
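BIKE's bidirectional bridge is specific to that paper, but the CLIP-based video recognition it builds on can be sketched as a plain zero-shot baseline (illustrative frame paths and class names; mean-pooled frame embeddings matched against text prompts):

```python
# Plain CLIP zero-shot video classification baseline (not the BIKE bridge
# itself): embed sampled frames, mean-pool, and match against class prompts.
import torch
import clip                      # OpenAI's released CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]   # hypothetical sampled frames
class_names = ["playing guitar", "riding a bike", "cooking"]

with torch.no_grad():
    imgs = torch.stack([preprocess(f) for f in frames]).to(device)
    video_emb = model.encode_image(imgs).mean(dim=0, keepdim=True)  # mean-pool frames
    text = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
    text_emb = model.encode_text(text)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * video_emb @ text_emb.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```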
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
A primary challenge of this task is the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text (the standard contrastive objective is sketched after this entry).
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
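The contrastive alignment these pretrained models rely on (and that the main paper's title invokes) usually reduces to a symmetric InfoNCE objective over matched video-text pairs; a minimal sketch with random embeddings:

```python
# Symmetric InfoNCE loss over a batch of matched video/text embedding
# pairs: diagonal entries are positives, all others are negatives.
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
    # Average the video->text and text->video directions.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss = info_nce(torch.randn(32, 512), torch.randn(32, 512))
```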
- You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos [19.711703590063976]
We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level.
Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
arXiv Detail & Related papers (2022-05-25T16:15:46Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z)