Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding
- URL: http://arxiv.org/abs/2505.23990v2
- Date: Sat, 14 Jun 2025 20:12:01 GMT
- Title: Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding
- Authors: Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, Tinoosh Mohsenin
- Abstract summary: Multi-RAG is a retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams.
- Score: 2.3390724500399838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
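The abstract's core idea can be sketched in a few lines: reduce heterogeneous streams (video, audio, text) to textual chunks, retrieve the chunks most relevant to a query, and condition a language model on them. The sketch below is an illustrative assumption, not the paper's implementation: the bag-of-words embedding, the function names, and the toy corpus all stand in for the learned encoders and LLM the real system would use.

```python
# Minimal multimodal-RAG sketch (illustrative only).
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; stands in for a learned text encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Multi-source streams already reduced to text (e.g. frame captions from a
# vision model, an ASR transcript, raw notes); that reduction is assumed.
corpus = [
    "video: a person places a kettle on the stove",
    "audio: the kettle begins to whistle loudly",
    "text: reminder to turn off the stove after boiling water",
]

context = retrieve("is the kettle boiling", corpus)
# Retrieved chunks would be concatenated into the LLM prompt as context.
prompt = "Context:\n" + "\n".join(context) + "\nQuestion: is the kettle boiling?"
print(context[0])  # the audio chunk ranks highest for this query
```

The retrieval step is modality-agnostic once everything is text, which is what lets a single ranker fuse video, audio, and textual evidence before generation.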
Related papers
- Enhancing Explainability with Multimodal Context Representations for Smarter Robots [0.0]
A key issue in Human-Robot Interaction is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision. We propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities.
arXiv Detail & Related papers (2025-02-28T13:36:47Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Selective Exploration and Information Gathering in Search and Rescue Using Hierarchical Learning Guided by Natural Language Input [5.522800137785975]
We introduce a system that integrates social interaction via large language models (LLMs) with a hierarchical reinforcement learning (HRL) framework.
The proposed system is designed to translate verbal inputs from human stakeholders into actionable RL insights and adjust its search strategy.
By leveraging human-provided information through LLMs and structuring task execution through HRL, our approach significantly improves the agent's learning efficiency and decision-making process in environments characterised by long horizons and sparse rewards.
arXiv Detail & Related papers (2024-09-20T12:27:47Z)
- Multidimensional Human Activity Recognition With Large Language Model: A Conceptual Framework [0.0]
In high-stakes environments like emergency response or elder care, the integration of large language models (LLMs) revolutionizes risk assessment, resource allocation, and emergency responses.
We propose a conceptual framework that utilizes various wearable devices, each considered as a single dimension, to support a multidimensional learning approach within Human Activity Recognition (HAR) systems.
arXiv Detail & Related papers (2024-09-16T21:36:23Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
- MISAR: A Multimodal Instructional System with Augmented Reality [38.79160527414268]
Augmented reality (AR) requires seamless integration of visual, auditory, and linguistic channels for optimized human-computer interaction.
Our study introduces an innovative method harnessing large language models (LLMs) to assimilate information from visual, auditory, and contextual modalities.
arXiv Detail & Related papers (2023-10-18T04:15:12Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.