LLaVAction: evaluating and training multi-modal large language models for action recognition
- URL: http://arxiv.org/abs/2503.18712v1
- Date: Mon, 24 Mar 2025 14:24:17 GMT
- Title: LLaVAction: evaluating and training multi-modal large language models for action recognition
- Authors: Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis,
- Abstract summary: We focus on evaluating and improving MLLMs to perform action recognition.<n>We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets.<n>We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions.
- Score: 46.473599879244716
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
Related papers
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.<n>It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.<n>We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.<n>By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance.<n>Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z) - Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? [6.7065734065794835]
We introduce a novel learning paradigm termed MLLM4WTAL.
It harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors.
It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR)
arXiv Detail & Related papers (2024-11-13T09:37:24Z) - Grounding Multimodal Large Language Models in Actions [65.88208317380793]
We study how to best ground a MLLM into different embodiments and their associated action spaces.<n>For continuous actions, we show that a learned tokenization allows for sufficient modeling precision.<n>For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance.
arXiv Detail & Related papers (2024-06-12T06:12:04Z) - A Self-Correcting Vision-Language-Action Model for Fast and Slow System Manipulation [30.207690822989292]
Self-corrected (SC-)VLA framework integrates fast system for directly predicting actions and slow system for reflecting on failed actions.<n>For the fast system, we incorporate parameter-efficient fine-tuning to equip the model with pose prediction capabilities.<n>For the slow system, we propose a Chain-of-Thought training strategy for failure correction, designed to mimic human reflection after a manipulation failure.
arXiv Detail & Related papers (2024-05-27T17:58:48Z) - MLLMReID: Multimodal Large Language Model-based Person Re-identification [14.68436005777866]
Multimodal large language models (MLLM) have achieved satisfactory results in many tasks.
This paper will investigate how to adapt them for the task of ReID.
arXiv Detail & Related papers (2024-01-24T03:07:26Z) - Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected
Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.