LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA
- URL: http://arxiv.org/abs/2509.18576v1
- Date: Tue, 23 Sep 2025 02:57:25 GMT
- Title: LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA
- Authors: Zeyi Kang, Liang He, Yanxin Zhang, Zuheng Ming, Kaixing Zhao
- Abstract summary: This study proposes the lightweight LCMF cascaded attention framework, introducing a multi-level cross-modal parameter sharing mechanism into the Mamba module. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% in VQA tasks. Its lightweight design achieves a 4.35-fold reduction in FLOPs relative to the average of comparable baselines.
- Score: 6.035222621379327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal semantic learning plays a critical role in embodied intelligence, especially when robots perceive their surroundings, understand human instructions, and make intelligent decisions. However, the field faces technical challenges such as effective fusion of heterogeneous data and computational efficiency in resource-constrained environments. To address these challenges, this study proposes the lightweight LCMF cascaded attention framework, introducing a multi-level cross-modal parameter sharing mechanism into the Mamba module. By integrating the advantages of Cross-Attention and Selective parameter-sharing State Space Models (SSMs), the framework achieves efficient fusion of heterogeneous modalities and semantic complementary alignment. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% in VQA tasks and achieves competitive mid-tier performance within the distribution cluster of Large Language Model Agents (LLM Agents) in EQA video tasks. Its lightweight design achieves a 4.35-fold reduction in FLOPs relative to the average of comparable baselines while using only 166.51M parameters (image-text) and 219M parameters (video-text), providing an efficient solution for Human-Robot Interaction (HRI) applications in resource-constrained scenarios with strong multimodal decision generalization capabilities.
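To make the fusion idea concrete, below is a minimal PyTorch sketch of a block that combines cross-attention between text and image tokens with a simplified selective state-space (Mamba-style) mixer whose parameters are shared across both modalities. All module names, dimensions, and the sequential scan are illustrative assumptions for exposition; they are not the authors' LCMF implementation or its exact parameter-sharing scheme.

```python
# Minimal sketch (assumed names/shapes): cross-attention fusion followed by a
# parameter-shared, simplified selective SSM applied to both modalities.
import torch
import torch.nn as nn


class SimpleSelectiveSSM(nn.Module):
    """Toy selective SSM: input-dependent gating over a diagonal recurrence."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate_proj = nn.Linear(dim, dim)            # input-dependent (selective) gate
        self.decay_logit = nn.Parameter(torch.zeros(dim))  # per-channel state decay
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        u = self.in_proj(x)
        gate = torch.sigmoid(self.gate_proj(x))         # per-token, per-channel selectivity
        decay = torch.sigmoid(self.decay_logit)         # (dim,)
        state = torch.zeros_like(u[:, 0])
        outputs = []
        for t in range(u.size(1)):                      # sequential scan, written for clarity
            state = decay * state + gate[:, t] * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


class CrossModalFusionBlock(nn.Module):
    """Text queries attend to image tokens; one SSM is reused for both modalities."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        self.shared_ssm = SimpleSelectiveSSM(dim)       # crude stand-in for cross-modal parameter sharing
        self.norm_ssm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        fused, _ = self.cross_attn(query=text, key=image, value=image)
        text = self.norm_attn(text + fused)
        text = self.norm_ssm(text + self.shared_ssm(text))
        image = image + self.shared_ssm(image)          # same SSM parameters applied to image tokens
        return text, image


if __name__ == "__main__":
    block = CrossModalFusionBlock(dim=256)
    text_tokens = torch.randn(2, 32, 256)               # (batch, text_len, dim)
    image_tokens = torch.randn(2, 64, 256)               # (batch, num_patches, dim)
    t_out, i_out = block(text_tokens, image_tokens)
    print(t_out.shape, i_out.shape)
```

In this sketch, parameter sharing is reduced to reusing one SSM instance across modalities; the paper's multi-level sharing and cascaded attention are more elaborate, but the residual cross-attention plus shared sequence mixer captures the basic fusion pattern the abstract describes.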
Related papers
- PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration [17.887453138676964]
We introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. This framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting.
arXiv Detail & Related papers (2026-02-22T13:36:48Z) - Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts. We propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z) - Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges [31.57528074626831]
Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthroughs. This article focuses on task-oriented autonomous communications with LLMs/LMMs. We show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques.
arXiv Detail & Related papers (2025-10-23T15:08:58Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking [18.8210909297317]
Multimodal Entity Linking aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). We propose DeepMEL, a novel framework based on multi-agent collaborative reasoning.
arXiv Detail & Related papers (2025-08-21T11:24:26Z) - Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains. Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored. We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z) - A Survey on Collaborative Mechanisms Between Large and Small Language Models [5.1159419867547085]
Large Language Models (LLMs) deliver powerful AI capabilities but face deployment challenges due to high resource costs and latency. Small Language Models (SLMs) offer efficiency and deployability at the cost of reduced performance.
arXiv Detail & Related papers (2025-05-12T11:48:42Z) - Cooperative Multi-Agent Planning with Adaptive Skill Synthesis [16.228784877899976]
We present a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. We demonstrate its strong performance against state-of-the-art MARL baselines across both symmetric and asymmetric scenarios.
arXiv Detail & Related papers (2025-02-14T13:23:18Z) - ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving [19.388562622309838]
Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models. We propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling.
arXiv Detail & Related papers (2025-02-02T22:10:40Z) - R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge [78.26352952957909]
Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. The concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLLMs via task vectors is studied, under the assumption of worst-case adversarial attacks.
arXiv Detail & Related papers (2024-11-27T10:57:06Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)