SoMeLVLM: A Large Vision Language Model for Social Media Processing
- URL: http://arxiv.org/abs/2402.13022v1
- Date: Tue, 20 Feb 2024 14:02:45 GMT
- Title: SoMeLVLM: A Large Vision Language Model for Social Media Processing
- Authors: Xinnong Zhang, Haoyu Kuang, Xinyi Mou, Hanjia Lyu, Kun Wu, Siming Chen, Jiebo Luo, Xuanjing Huang, Zhongyu Wei
- Abstract summary: We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
- Score: 78.47310657638567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growth of social media, characterized by its multimodal nature, has led
to the emergence of diverse phenomena and challenges, which calls for an
effective approach to uniformly solve automated tasks. The powerful Large
Vision Language Models make it possible to handle a variety of tasks
simultaneously, but even with carefully designed prompting methods, the general
domain models often fall short in aligning with the unique speaking style and
context of social media tasks. In this paper, we introduce a Large Vision
Language Model for Social Media Processing (SoMeLVLM), which is a cognitive
framework equipped with five key capabilities including knowledge &
comprehension, application, analysis, evaluation, and creation. SoMeLVLM is
designed to understand and generate realistic social media behavior. We have
developed a 654k multimodal social media instruction-tuning dataset to support
our cognitive framework and fine-tune our model. Our experiments demonstrate
that SoMeLVLM achieves state-of-the-art performance in multiple social media
tasks. Further analysis shows its significant advantages over baselines in
terms of cognitive abilities.
Related papers
- VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms [25.73585435351771]
This paper introduces MM-Soc, a benchmark designed to evaluate Multimodal Large Language Models' understanding of social media content.
MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset.
Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks.
arXiv Detail & Related papers (2024-02-21T22:27:40Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.