OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
- URL: http://arxiv.org/abs/2410.12219v1
- Date: Wed, 16 Oct 2024 04:29:46 GMT
- Title: OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
- Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong
- Abstract summary: We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
- Score: 124.05360767047539
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models (OLMs), such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. In particular, the user message may consist of multiple modalities, so that OLMs have to establish holistic understanding and reasoning across modalities to accomplish the task. Existing benchmarks are limited to single-modality or dual-modality tasks, overlooking comprehensive multi-modal assessments of model reasoning. To address this, OmnixR offers two evaluation variants: (1) a synthetic subset, generated automatically by translating text into multiple modalities -- audio, images, video, and hybrids (Omnify); (2) a realistic subset, manually curated and annotated by experts, for evaluating cross-modal reasoning in natural settings. OmnixR assesses OLMs over a diverse mix of modalities, such as a question that involves video, audio, and text, providing a rigorous cross-modal reasoning testbed unlike any existing benchmark. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior, underscoring the challenges of omni-modal AI alignment.
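The Omnify idea, translating a text benchmark into other modalities so that the same question must be answered from audio, pixels, or video frames, can be illustrated with a minimal sketch. The code below is an assumption-laden stand-in (using gTTS, Pillow, and OpenCV), not the paper's actual Omnify implementation.

```python
# Hypothetical sketch of an Omnify-style pipeline: render the same benchmark
# question as audio, image, and video so an OLM must reason across modalities.
# Library choices (gTTS, Pillow, OpenCV) are assumptions, not the paper's code.
from gtts import gTTS
from PIL import Image, ImageDraw
import cv2
import numpy as np

def text_to_audio(question: str, path: str = "question.mp3") -> str:
    """Speak the question aloud; the text itself is never shown to the model."""
    gTTS(text=question, lang="en").save(path)
    return path

def text_to_image(question: str, path: str = "question.png") -> str:
    """Render the question as pixels, so it must be read visually."""
    img = Image.new("RGB", (1024, 256), "white")
    ImageDraw.Draw(img).text((16, 16), question, fill="black")
    img.save(path)
    return path

def text_to_video(question: str, path: str = "question.mp4", fps: int = 1) -> str:
    """Reveal the question word by word, one frame per word."""
    words = question.split()
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (1024, 256))
    for i in range(1, len(words) + 1):
        frame = Image.new("RGB", (1024, 256), "white")
        ImageDraw.Draw(frame).text((16, 16), " ".join(words[:i]), fill="black")
        writer.write(cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR))
    writer.release()
    return path
```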
Related papers
- OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs [6.365802395342737]
We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task.
We evaluate on two widely used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models.
We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning.
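As a rough illustration, an acoustic-prompting template along the lines the abstract describes could be assembled as below; the wording, structure, and label set are assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of an acoustic-prompting template in the spirit of
# OmniVox; the exact prompt wording and emotion labels are assumptions.
def acoustic_prompt(context_turns: list[str]) -> str:
    history = "\n".join(f"- {turn}" for turn in context_turns)
    return (
        "You will hear an utterance from a conversation.\n"
        "1. Acoustic analysis: describe pitch, energy, speaking rate, and voice quality.\n"
        f"2. Context analysis: consider the prior turns:\n{history}\n"
        "3. Step-by-step reasoning: combine the acoustic and contextual cues.\n"
        "4. Answer with exactly one label: angry, happy, sad, or neutral."
    )
```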
arXiv Detail & Related papers (2025-03-27T13:12:49Z)
- Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data.
We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities?
Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios?
Third, what strategies can be employed to curate and obtain high-quality, real-life scenario datasets?
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding.
Existing solutions often rely on task-specific architectures and objectives for individual tasks.
In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
- Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention [45.31956918333587]
In multimodal sentiment analysis, collecting text data is often more challenging than collecting video or audio.
We have developed a robust model that integrates multimodal sentiment information, even in the absence of text modality.
arXiv Detail & Related papers (2024-10-19T07:59:41Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- BlendX: Complex Multi-Intent Detection with Blended Patterns [4.852816974803059]
We present BlendX, a suite of refined datasets featuring more diverse patterns than their predecessors.
For dataset construction, we utilize both rule-based heuristics and a generative tool -- OpenAI's ChatGPT -- which is augmented with a similarity-driven strategy for utterance selection (sketched after this entry).
Experiments on BlendX reveal that state-of-the-art MID models struggle with the challenges posed by the new datasets.
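A similarity-driven utterance-selection step could look roughly like the sketch below, assuming sentence-transformers embeddings; BlendX's actual construction code may differ.

```python
# Sketch of a similarity-driven utterance-selection step, assuming
# sentence-transformers embeddings; not the BlendX authors' actual code.
from sentence_transformers import SentenceTransformer, util

def select_similar(seed: str, candidates: list[str], k: int = 3) -> list[str]:
    """Return the k candidate utterances most similar to the seed utterance."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    seed_emb = model.encode(seed, convert_to_tensor=True)
    cand_emb = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(seed_emb, cand_emb)[0]        # cosine similarity to the seed
    top = scores.topk(min(k, len(candidates))).indices  # k most similar utterances
    return [candidates[int(i)] for i in top]
```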
arXiv Detail & Related papers (2024-03-27T06:13:04Z)
- AiGen-FoodReview: A Multimodal Dataset of Machine-Generated Restaurant Reviews and Images on Social Media [57.70351255180495]
AiGen-FoodReview is a dataset of 20,144 restaurant review-image pairs divided into authentic and machine-generated.
We explore unimodal and multimodal detection models, achieving 99.80% multimodal accuracy with FLAVA.
The paper contributes by open-sourcing the dataset and releasing fake review detectors, recommending its use in unimodal and multimodal fake review detection tasks, and evaluating linguistic and visual features in synthetic versus authentic data.
arXiv Detail & Related papers (2024-01-16T20:57:36Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it (a simplified probe is sketched after this entry).
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, and also with auditory speaker identification.
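The paper's two-step method starts from a human-annotated seed; as a much simpler illustration of the underlying question (which modalities suffice to answer an instance), a brute-force probe could look like this. Here model.answer is a hypothetical OLM interface, not a real API.

```python
# Sketch of a single-modality sufficiency check in the spirit of the paper's
# analysis: try each modality subset alone and record which ones already yield
# the gold answer. `model.answer` is a hypothetical OLM interface.
from itertools import combinations

MODALITIES = ("text", "video", "audio")

def sufficient_modalities(model, instance) -> list[tuple[str, ...]]:
    """Return all modality subsets that alone produce the correct answer."""
    sufficient = []
    for r in range(1, len(MODALITIES) + 1):
        for subset in combinations(MODALITIES, r):
            inputs = {m: instance[m] for m in subset}
            if model.answer(instance["question"], **inputs) == instance["gold"]:
                sufficient.append(subset)
    return sufficient
```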
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
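Schematically, the retrieval-augmented flow is: encode the question, retrieve the nearest entries from the non-parametric multimodal memory, and condition generation on them. The sketch below uses hypothetical encode/generate stand-ins rather than the paper's actual architecture.

```python
# Schematic of the retrieval-augmented flow MuRAG describes; encoder and
# generator interfaces here are hypothetical stand-ins, not MuRAG's code.
import numpy as np

def murag_style_answer(question, memory_items, encode, generate, k: int = 4):
    """memory_items: e.g. (image, caption) pairs; encode() maps a question or
    a memory item to a dense vector; generate() is the answer generator."""
    q_vec = encode(question)
    mem_vecs = np.stack([encode(item) for item in memory_items])
    scores = mem_vecs @ q_vec              # inner-product retrieval scores
    top_k = np.argsort(scores)[::-1][:k]   # k nearest memory entries
    retrieved = [memory_items[i] for i in top_k]
    return generate(question=question, context=retrieved)
```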
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model [51.42415340921237]
We propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to fuse the two modalities (audio and text).
We further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio.
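The distillation objective behind such a module is plausibly the standard soft-target formulation sketched below, where a unimodal (text-only or audio-only) student matches the fused multimodal teacher; the temperature and loss weighting are assumptions, not values from the paper.

```python
# Sketch of a knowledge-distillation objective in the spirit of the MKD module;
# hyperparameters (T, alpha) are assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def mkd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # Soft targets: the unimodal student mimics the multimodal teacher's
    # softened answer distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the student still learns the ground-truth answers.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```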
arXiv Detail & Related papers (2021-07-04T08:35:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.