Related papers: SoMeLVLM: A Large Vision Language Model for Social Media Processing

SoMeLVLM: A Large Vision Language Model for Social Media Processing

URL: http://arxiv.org/abs/2402.13022v1
Date: Tue, 20 Feb 2024 14:02:45 GMT
Title: SoMeLVLM: A Large Vision Language Model for Social Media Processing
Authors: Xinnong Zhang, Haoyu Kuang, Xinyi Mou, Hanjia Lyu, Kun Wu, Siming Chen, Jiebo Luo, Xuanjing Huang, Zhongyu Wei
Abstract summary: We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM) SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
Score: 78.47310657638567
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.

Related papers

MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind [28.25540132218273]
Theory of Mind is essential for building socially intelligent multimodal agents.<n>We introduce MOMENTS, a benchmark designed to assess the ToM capabilities of multimodal large language models.
arXiv Detail & Related papers (2025-07-06T15:06:30Z)
Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage.<n>This dataset enables the development of AI technologies that understand dyadic embodied dynamics.<n>We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z)
SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models [41.68365456601248]
We introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning.<n>SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty.<n>It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms.
arXiv Detail & Related papers (2025-05-29T17:47:36Z)
Towards Online Multi-Modal Social Interaction Understanding [36.37278022436327]
We propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. We develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting. Our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI.
arXiv Detail & Related papers (2025-03-25T17:17:19Z)
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [105.88658935310605]
We propose a multi-stage training methodology that progressively trains LLM to understand both visual and speech information. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities.
arXiv Detail & Related papers (2025-01-03T18:59:52Z)
VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding. We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems. This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer. We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms [25.73585435351771]
This paper introduces MM-Soc, a benchmark designed to evaluate Multimodal Large Language Models' understanding of social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks.
arXiv Detail & Related papers (2024-02-21T22:27:40Z)
An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data. We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.