Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
- URL: http://arxiv.org/abs/2506.03174v1
- Date: Thu, 29 May 2025 01:47:43 GMT
- Title: Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks
- Authors: Koki Matsuishi, Kosuke Ukita, Tsuyoshi Okita
- Abstract summary: We propose a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity.
- Score: 3.1976901430982063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using inertial measurement unit (IMU) data. While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short of providing a detailed analysis of full-body human activity. To address this limitation, we propose the Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.
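The paper's exact architecture and training objective are not given in this summary, so the following is only a minimal sketch of the kind of setup the abstract describes: a Transformer-based IMU encoder projected into a shared embedding space and aligned with a text encoder via a CLIP-style contrastive loss, which in turn supports retrieval and zero-shot activity classification. All module names, dimensions, and hyperparameters below are illustrative assumptions, and the third-person video and motion-capture branches are omitted.

```python
# Minimal sketch only (not the authors' implementation): a Transformer-based
# IMU encoder aligned with a text encoder via a CLIP-style contrastive loss.
# All names, dimensions, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IMUTransformerEncoder(nn.Module):
    """Encodes an IMU window of shape (B, T, C) into a unit-norm embedding (B, D)."""

    def __init__(self, in_channels: int = 6, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 4, embed_dim: int = 512):
        super().__init__()
        self.input_proj = nn.Linear(in_channels, d_model)      # per-timestep projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, embed_dim)               # project into the shared space

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        x = self.input_proj(imu)                                # (B, T, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)          # prepend a [CLS] token
        x = self.encoder(torch.cat([cls, x], dim=1))
        return F.normalize(self.head(x[:, 0]), dim=-1)          # embedding of the [CLS] token


def contrastive_loss(imu_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning paired IMU and text embeddings."""
    logits = imu_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


@torch.no_grad()
def zero_shot_classify(imu_emb: torch.Tensor, class_text_embs: torch.Tensor) -> torch.Tensor:
    """Predict the activity whose (unit-norm) text embedding is most similar to each IMU embedding."""
    return (imu_emb @ class_text_embs.t()).argmax(dim=-1)
```

Extending such a sketch to the full four-modality setting would add video and motion-capture encoders that project into the same shared embedding space.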
Related papers
- Human-Centric Evaluation for Foundation Models [31.400215906308546]
We propose a Human-Centric subjective Evaluation framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. We conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind.
arXiv Detail & Related papers (2025-06-02T15:33:29Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is generalization. We propose the Model Utilization Index (MUI), a metric enhanced with mechanism interpretability that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - How Good Are We? Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment [11.60167559546617]
Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges.
While many of these models have been developed for tasks like disease diagnosis and tissue quantification, their readiness for deployment on some of the arguably simplest tasks, such as nuclei segmentation within a single organ, remains uncertain.
This paper seeks to answer this key question, "How good are we?" by thoroughly evaluating the performance of recent cell foundation models on a curated dataset.
arXiv Detail & Related papers (2024-10-31T17:00:33Z) - Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models [1.9890559505377343]
Current vision-language multimodal models are well-adapted for general visual understanding tasks. We introduce a method for generating such instruction-following data by integrating human keypoints with traditional visual features such as captions and bounding boxes. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate it on human pose and action understanding benchmarks, achieving significant improvements.
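The generation procedure itself is not detailed in this summary; as a purely hypothetical illustration of the idea, keypoints could be serialized alongside the caption and bounding box into a prompt for an instruction-data generator (the field names and prompt wording below are assumptions, not the paper's format).

```python
# Hypothetical illustration only: serialize human keypoints together with a
# caption and bounding box into a prompt for an instruction-data generator.
# Field names and prompt wording are assumptions, not the paper's format.
def build_keypoint_prompt(sample: dict) -> str:
    kp_text = "; ".join(f"{name}: ({x:.0f}, {y:.0f})"
                        for name, (x, y) in sample["keypoints"].items())
    return (f"Caption: {sample['caption']}\n"
            f"Person bounding box: {sample['bbox']}\n"
            f"Body keypoints: {kp_text}\n"
            "Generate question-answer pairs about this person's pose and action.")


example = {
    "caption": "A man swings a tennis racket on a clay court.",
    "bbox": [120, 40, 310, 400],
    "keypoints": {"right_wrist": (290.0, 150.0), "right_elbow": (260.0, 200.0)},
}
print(build_keypoint_prompt(example))
```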
arXiv Detail & Related papers (2024-09-14T05:07:57Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances in self-supervised learning (SSL) for pre-training strong multimodal encoders.
We propose a different perspective on the problem and investigate improving multimodal dynamic facial expression recognition (DFER) performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses generative models, marrying the abilities of ChatGPT and text-to-image models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Robustness Analysis on Foundational Segmentation Models [28.01242494123917]
In this work, we perform a robustness analysis of Visual Foundation Models (VFMs) for segmentation tasks.
We benchmark seven state-of-the-art segmentation architectures using two different datasets.
Our findings reveal several key insights: VFMs are vulnerable to compression-induced corruptions; multimodal models, despite not outperforming all unimodal models in robustness, show competitive resilience in zero-shot scenarios; and VFMs demonstrate enhanced robustness for certain object categories.
arXiv Detail & Related papers (2023-06-15T16:59:42Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study with a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between the weights to be merged and can serve as indicators of the merging outcome.
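The two proposed metrics are not defined in this summary; as a rough, hypothetical illustration of the underlying idea, the distance between two checkpoints' weights could be examined per parameter tensor, e.g. with cosine and L2 distances.

```python
# Hypothetical illustration only: the paper's two metrics are not specified in
# this summary. One natural notion of "distance between weights to be merged"
# is a per-parameter comparison of two checkpoints with identical architectures.
import torch


def weight_distances(model_a: torch.nn.Module, model_b: torch.nn.Module) -> dict:
    """Per-parameter cosine and L2 distances between two models of the same architecture."""
    dists = {}
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    for name, wa in sd_a.items():
        wa_f = wa.flatten().float()
        wb_f = sd_b[name].flatten().float()
        cos = torch.nn.functional.cosine_similarity(wa_f, wb_f, dim=0)
        dists[name] = {"cosine_distance": (1.0 - cos).item(),
                       "l2_distance": torch.norm(wa_f - wb_f).item()}
    return dists
```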
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - A Multi-modal and Multi-task Learning Method for Action Unit and Expression Recognition [18.478011167414223]
We introduce a multi-modal and multi-task learning method that uses both visual and audio information.
We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set.
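The method details are not given in this summary; a minimal sketch of a generic multi-task setup of this kind, with a shared audio-visual representation feeding a multi-label action-unit (AU) head and a single-label expression head (all dimensions and the fusion scheme are assumptions), might look like:

```python
# Illustrative sketch only: a shared audio-visual representation feeding a
# multi-label action-unit (AU) head and a single-label expression head,
# trained with a summed loss. Dimensions and fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskAVModel(nn.Module):
    def __init__(self, visual_dim: int = 512, audio_dim: int = 128,
                 hidden: int = 256, n_aus: int = 12, n_expressions: int = 8):
        super().__init__()
        self.fusion = nn.Sequential(nn.Linear(visual_dim + audio_dim, hidden), nn.ReLU())
        self.au_head = nn.Linear(hidden, n_aus)             # multi-label AU logits
        self.expr_head = nn.Linear(hidden, n_expressions)   # single-label expression logits

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor):
        h = self.fusion(torch.cat([visual_feat, audio_feat], dim=-1))
        return self.au_head(h), self.expr_head(h)


def multitask_loss(au_logits, expr_logits, au_targets, expr_targets):
    # Binary cross-entropy over AUs (float 0/1 targets) plus cross-entropy over expression classes.
    return (F.binary_cross_entropy_with_logits(au_logits, au_targets)
            + F.cross_entropy(expr_logits, expr_targets))
```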
arXiv Detail & Related papers (2021-07-09T03:28:17Z) - Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We first demonstrate the usefulness of our approach on fMRI data, where our model shows improved sensitivity in identifying common sources among subjects.
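Schematically, and only as one reading of the sentence above (the paper's exact noise model may differ), subject $i$'s data can be written as

$$ x_i = A_i s + n_i, \qquad i = 1, \dots, m, $$

where $x_i$ is the observed data for subject $i$, $A_i$ a subject-specific mixing matrix, $s$ the independent sources shared across all subjects, and $n_i$ noise.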
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.