On-device Large Multi-modal Agent for Human Activity Recognition
- URL: http://arxiv.org/abs/2512.19742v1
- Date: Wed, 17 Dec 2025 22:05:05 GMT
- Title: On-device Large Multi-modal Agent for Human Activity Recognition
- Authors: Md Shakhrul Iman Siam, Ishtiaque Ahmed Showmik, Guanqun Song, Ting Zhu
- Abstract summary: Human Activity Recognition (HAR) has been an active area of research, with applications ranging from healthcare to smart environments. Recent advancements in Large Language Models (LLMs) have opened new possibilities to leverage their capabilities in HAR. We present a Large Multi-Modal Agent designed for HAR, which integrates the power of LLMs to enhance both performance and user engagement.
- Score: 1.9342524451932614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human Activity Recognition (HAR) has been an active area of research, with applications ranging from healthcare to smart environments. Recent advancements in Large Language Models (LLMs) have opened new possibilities to leverage their capabilities in HAR, enabling not just activity classification but also interpretability and human-like interaction. In this paper, we present a Large Multi-Modal Agent designed for HAR, which integrates the power of LLMs to enhance both performance and user engagement. The proposed framework not only delivers activity classification but also bridges the gap between technical outputs and user-friendly insights through its reasoning and question-answering capabilities. We conduct extensive evaluations using widely adopted HAR datasets, including HHAR, Shoaib, and MotionSense, to assess the performance of our framework. The results demonstrate that our model achieves classification accuracy comparable to state-of-the-art methods while significantly improving interpretability through its reasoning and Q&A capabilities.
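As a concrete illustration of the flow the abstract describes (sensor window -> textual features -> on-device LLM -> label plus Q&A), here is a minimal Python sketch. The feature set, prompt wording, activity list, and the `query_llm` stub are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hypothetical label set; the actual datasets define their own activities.
ACTIVITIES = ["walking", "sitting", "standing", "jogging", "biking", "upstairs"]

def describe_window(accel: np.ndarray, gyro: np.ndarray) -> str:
    """Turn a raw IMU window into a compact textual description.
    The statistics chosen here (mean/std/magnitude) are illustrative."""
    feats = {
        "accel_mean": accel.mean(axis=0).round(3).tolist(),
        "accel_std": accel.std(axis=0).round(3).tolist(),
        "accel_magnitude_mean": float(np.linalg.norm(accel, axis=1).mean().round(3)),
        "gyro_std": gyro.std(axis=0).round(3).tolist(),
    }
    return "; ".join(f"{k}={v}" for k, v in feats.items())

def query_llm(prompt: str) -> str:
    """Placeholder for the on-device LLM call (e.g. a llama.cpp binding).
    Returns a canned answer so the sketch runs end to end."""
    return "walking"

def classify_and_explain(accel, gyro, user_question=None):
    """Classification plus optional Q&A over the same sensor context."""
    context = describe_window(accel, gyro)
    prompt = (
        f"Sensor features: {context}\n"
        f"Choose the activity from {ACTIVITIES} and explain briefly."
    )
    label = query_llm(prompt)
    answer = query_llm(f"{prompt}\nUser question: {user_question}") if user_question else None
    return label, answer

# Example: a 2-second window at 50 Hz of synthetic 3-axis data.
accel = np.random.randn(100, 3) * 0.5 + [0.0, 0.0, 9.8]
gyro = np.random.randn(100, 3) * 0.1
print(classify_and_explain(accel, gyro, "Why walking rather than jogging?"))
```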
Related papers
- RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition [5.089700375729287]
We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for Human Activity Recognition (HAR). RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification.
arXiv Detail & Related papers (2025-12-06T01:53:02Z)
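A minimal sketch of the retrieve-then-prompt loop RAG-HAR describes, assuming an in-memory nearest-neighbor index in place of a real vector database and a stubbed LLM call; the descriptor statistics are also assumptions:

```python
import numpy as np

def descriptor(window: np.ndarray) -> np.ndarray:
    """Lightweight statistical descriptor of a sensor window
    (per-axis mean, std, min, max); the paper's exact statistics may differ."""
    return np.concatenate([window.mean(0), window.std(0), window.min(0), window.max(0)])

class VectorIndex:
    """Minimal in-memory stand-in for a vector database."""
    def __init__(self):
        self.vecs, self.labels = [], []
    def add(self, vec, label):
        self.vecs.append(vec)
        self.labels.append(label)
    def topk(self, query, k=3):
        dists = [np.linalg.norm(query - v) for v in self.vecs]
        return [self.labels[i] for i in np.argsort(dists)[:k]]

def rag_har(window, index, llm=lambda prompt: "walking"):
    """Retrieve similar labeled samples, then ask the LLM to decide.
    `llm` is a stub for the actual model call; no training is involved."""
    neighbors = index.topk(descriptor(window))
    prompt = (
        f"Nearest labeled samples: {neighbors}\n"
        f"Query descriptor: {descriptor(window).round(2).tolist()}\n"
        "Which activity is this window?"
    )
    return llm(prompt)

index = VectorIndex()
for label in ["walking", "sitting", "jogging"]:
    index.add(descriptor(np.random.randn(100, 3)), label)
print(rag_har(np.random.randn(100, 3), index))
```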
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
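AGILE's interactive formulation can be illustrated with a toy environment in which an agent swaps tiles and is rewarded by placement accuracy; the greedy policy below merely stands in for the trained vision-language model, and all environment details are assumptions:

```python
import random

class JigsawEnv:
    """Toy jigsaw environment: the agent swaps tiles of a shuffled sequence
    and is rewarded by the fraction of correctly placed tiles."""
    def __init__(self, n_tiles=4, seed=0):
        self.goal = list(range(n_tiles))
        self.state = self.goal[:]
        random.Random(seed).shuffle(self.state)
    def reward(self):
        return sum(s == g for s, g in zip(self.state, self.goal)) / len(self.goal)
    def step(self, i, j):
        self.state[i], self.state[j] = self.state[j], self.state[i]
        return self.state, self.reward()

env = JigsawEnv()
print("start:", env.state, "reward:", env.reward())
# A policy proposes swaps; here a brute-force greedy rule replaces the VLM.
for i in range(len(env.state)):
    j = env.state.index(i)
    if j != i:
        env.step(i, j)
print("end:", env.state, "reward:", env.reward())
```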
- Towards Generalizable Human Activity Recognition: A Survey [4.08377734173712]
IMU-based Human Activity Recognition (HAR) has attracted increasing attention from both academia and industry in recent years. HAR performance has improved considerably in specific scenarios, but its generalization capability remains a key barrier to widespread real-world adoption. In this survey, we explore the rapidly evolving field of IMU-based generalizable HAR, reviewing 229 research papers alongside 25 publicly available datasets.
arXiv Detail & Related papers (2025-08-17T03:04:39Z)
- Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO [63.140883026848286]
Active vision refers to the process of actively selecting where and how to look in order to gather task-relevant information. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention.
arXiv Detail & Related papers (2025-05-27T17:29:31Z)
- A Comparative Study of Human Activity Recognition: Motion, Tactile, and Multi-modal Approaches [43.97520291340696]
This study evaluates the ability of a vision-based tactile sensor to classify 15 activities. We propose a multi-modal framework combining tactile and motion data to leverage their complementary strengths.
arXiv Detail & Related papers (2025-05-13T15:20:21Z)
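One simple instantiation of such a multi-modal framework is late fusion of per-modality classifiers. The sketch below (synthetic data, assumed per-channel features, weighted probability averaging) is an assumption, not the paper's architecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(signal):
    """Per-channel mean and std; the paper's exact features are not specified."""
    return np.concatenate([signal.mean(0), signal.std(0)])

rng = np.random.default_rng(0)
# Synthetic stand-in data: 40 windows, 2 classes, tactile (16-dim) + IMU (3-axis).
y = rng.integers(0, 2, 40)
tactile = [rng.normal(y[i], 1.0, (50, 16)) for i in range(40)]
motion = [rng.normal(-y[i], 1.0, (50, 3)) for i in range(40)]

clf_t = LogisticRegression().fit([extract_features(x) for x in tactile], y)
clf_m = LogisticRegression().fit([extract_features(x) for x in motion], y)

def fused_predict(t_win, m_win, w=0.5):
    """Weighted average of per-modality class probabilities (late fusion)."""
    p = w * clf_t.predict_proba([extract_features(t_win)])[0] \
        + (1 - w) * clf_m.predict_proba([extract_features(m_win)])[0]
    return int(np.argmax(p))

print("predicted:", fused_predict(tactile[0], motion[0]), "true:", int(y[0]))
```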
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code-generation settings by evaluating the performance of proprietary and open-weight models.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
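The failure mode studied here, acting on underspecified instructions instead of asking, can be sketched as a small interaction loop; the ambiguity check and question wording below are crude stand-ins for the LLM's judgment:

```python
def is_ambiguous(instruction: str) -> bool:
    """Stub ambiguity check; in the paper's setting an LLM would judge
    whether the instruction underspecifies the task."""
    return "a function" in instruction and "that" not in instruction

def interactive_codegen(instruction: str, ask_user) -> str:
    """Minimal interaction loop: ask clarifying questions before acting,
    instead of silently assuming. Question wording is illustrative."""
    while is_ambiguous(instruction):
        clarification = ask_user("Could you specify the expected input/output?")
        instruction += " " + clarification
    return f"# code generated for: {instruction!r}"

print(interactive_codegen(
    "Write a function",
    ask_user=lambda q: (print("AGENT:", q) or "that reverses a string."),
))
```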
- Are Visual-Language Models Effective in Action Recognition? A Comparative Study [22.97135293252601]
This paper provides a large-scale study of state-of-the-art vision foundation models, comparing their transfer ability on zero-shot and frame-wise action recognition tasks. Experiments are conducted on recent fine-grained, human-centric action recognition datasets.
arXiv Detail & Related papers (2024-10-22T16:28:21Z)
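A common zero-shot, frame-wise recipe of the kind such studies evaluate scores each frame against text prompts with a CLIP-style model and averages over time; the prompt set below is a hypothetical example, not the paper's protocol:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical action prompts; the paper's label space differs.
ACTIONS = ["a photo of a person walking", "a photo of a person drinking",
           "a photo of a person jumping"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def framewise_zero_shot(frames):
    """Score every frame against every action prompt with CLIP, then
    average per-frame probabilities over time (frame-wise transfer)."""
    inputs = processor(text=ACTIONS, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (n_frames, n_actions)
    probs = logits.softmax(dim=-1).mean(dim=0)     # average over frames
    return ACTIONS[int(probs.argmax())]

# Stand-in 'video': three blank frames; real use would decode video frames.
frames = [Image.new("RGB", (224, 224)) for _ in range(3)]
print(framewise_zero_shot(frames))
```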
- DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the Decomposition-Alignment-Reasoning Agent (DARA) framework.
DARA effectively parses questions into formal queries through a dual mechanism.
We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
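The decompose-align-compose pattern DARA describes can be sketched as follows; the toy schema, both stubs, and the (non-executable) logical form are all assumptions, since DARA uses an LLM for each step:

```python
# Schematic decomposition-alignment-reasoning pass in the spirit of DARA.
KG_RELATIONS = {
    "director": "film.film.directed_by",
    "birthplace": "people.person.place_of_birth",
}

def decompose(question: str) -> list[str]:
    """Stub for the LLM decomposition step (question -> subtasks)."""
    return ["find the director of Inception", "find that person's birthplace"]

def align(subtask: str) -> str:
    """Stub for aligning a subtask to a schema relation."""
    return KG_RELATIONS["director" if "director" in subtask else "birthplace"]

def to_logical_form(question: str, anchor="Inception") -> str:
    """Chain the aligned relations into a nested logical form
    (schematic notation, not a valid executable s-expression)."""
    form = anchor
    for subtask in decompose(question):
        form = f"(JOIN {align(subtask)} {form})"
    return form

print(to_logical_form("Where was the director of Inception born?"))
# (JOIN people.person.place_of_birth (JOIN film.film.directed_by Inception))
```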
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- Multi-level Contrast Network for Wearables-based Joint Activity Segmentation and Recognition [10.828099015828693]
Human activity recognition (HAR) with wearables is a promising research area that can be widely adopted in many smart healthcare applications.
Most HAR algorithms are susceptible to the multi-class windows problem, an important yet rarely addressed issue.
We introduce segmentation into HAR, yielding joint activity segmentation and recognition.
arXiv Detail & Related papers (2022-08-16T05:39:02Z)
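The segmentation side can be illustrated by converting dense per-sample predictions into variable-length segments, which is exactly what a single fixed window label cannot express; the dense labels below are synthetic, and the network producing them is out of scope for this sketch:

```python
import numpy as np

def segments_from_dense_labels(labels: np.ndarray):
    """Group a per-sample label sequence into (start, end, label) segments,
    so one window can contain multiple activities (the 'multi-class
    windows' problem that a single per-window label cannot capture)."""
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate([[0], boundaries])
    ends = np.concatenate([boundaries, [len(labels)]])
    return [(int(s), int(e), int(labels[s])) for s, e in zip(starts, ends)]

# Dense predictions would come from the segmentation network; here they
# are synthetic: 100 samples of 'sitting' (0) then 60 of 'walking' (1).
dense = np.array([0] * 100 + [1] * 60)
print(segments_from_dense_labels(dense))  # [(0, 100, 0), (100, 160, 1)]
```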
- FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data [106.76845921324704]
We propose a novel method named Feature Interaction Via Edge Search (FIVES).
FIVES formulates the task of interactive feature generation as searching for edges on the defined feature graph.
In this paper, we present theoretical evidence that motivates searching for useful interactive features of increasing order.
arXiv Detail & Related papers (2020-07-29T03:33:18Z)
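A greedy stand-in for FIVES' edge search can convey the idea (the paper's actual search is learned, not greedy): each round tests candidate edges (i, j) as crossed features and keeps the best one if it improves a validation score, so later rounds can build higher-order interactions from previously added columns. The data, model, and scoring choices below are assumptions:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data where the product x0*x1 (an order-2 interaction) drives the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

def score(features):
    """Validation score used to decide whether an edge (interaction) helps."""
    return cross_val_score(LogisticRegression(), features, y, cv=3).mean()

def greedy_edge_search(X, rounds=1):
    """Greedy stand-in for FIVES' edge search: per round, test every
    candidate edge (i, j) as a new crossed feature and keep the best one
    if it improves the score."""
    feats, best = X.copy(), score(X)
    for _ in range(rounds):
        gains = []
        for i, j in combinations(range(feats.shape[1]), 2):
            trial = np.column_stack([feats, feats[:, i] * feats[:, j]])
            gains.append((score(trial) - best, i, j))
        gain, i, j = max(gains)
        if gain <= 0:
            break
        feats = np.column_stack([feats, feats[:, i] * feats[:, j]])
        best += gain
    return feats, best

feats, best = greedy_edge_search(X)
print(f"baseline={score(X):.3f}, after edge search={best:.3f}")
```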