OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance
- URL: http://arxiv.org/abs/2504.04781v1
- Date: Mon, 07 Apr 2025 07:15:26 GMT
- Title: OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance
- Authors: Chaoyi Wang, Baoqing Li, Xinhan Di
- Abstract summary: OCC-MLLM-CoT-Alpha is a multi-modal large vision-language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. In the evaluation, the proposed methods demonstrate decision score improvements of 15.75%, 15.30%, 16.98%, 14.62% and 4.42%, 3.63%, 6.94%, 10.70% across two settings for a variety of state-of-the-art models.
- Score: 3.832135091367811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comprehending occluded objects is not well studied in existing large-scale visual-language multi-modal models. Current state-of-the-art multi-modal large models struggle to provide satisfactory results in understanding occluded objects through universal visual encoders and supervised learning strategies. Therefore, we propose OCC-MLLM-CoT-Alpha, a multi-modal large vision-language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. Particularly, (1) we build a multi-modal large vision-language framework that consists of a large multi-modal vision-language model and a 3D reconstruction expert model. (2) The corresponding multi-modal Chain-of-Thoughts is learned through a combination of supervised and reinforcement training strategies, allowing the multi-modal vision-language model to enhance its recognition ability with learned multi-modal chain-of-thoughts guidance. (3) A large-scale multi-modal chain-of-thoughts reasoning dataset, consisting of $110k$ samples of occluded objects held in hand, is built. In the evaluation, the proposed methods demonstrate decision score improvements of 15.75%, 15.30%, 16.98%, 14.62% and 4.42%, 3.63%, 6.94%, 10.70% across two settings for a variety of state-of-the-art models.
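To make the three-part framework described in the abstract more concrete, here is a minimal, illustrative sketch of the inference flow: a vision-language model emits a chain of thoughts about an occluded hand-held object while a 3D reconstruction expert supplies geometric cues, and the two are fused into a final decision. All module names, the dummy outputs, and the fusion rule are hypothetical placeholders, not the authors' code.

```python
# Illustrative sketch only: stand-in modules for the MLLM, the 3D expert,
# and a toy fusion rule. None of this is taken from OCC-MLLM-CoT-Alpha.

from dataclasses import dataclass


@dataclass
class Observation:
    image_path: str
    question: str


def mllm_chain_of_thought(obs: Observation) -> list[str]:
    """Stand-in for the multi-modal LLM that emits reasoning steps."""
    return [
        "Step 1: the hand occludes roughly half of the object.",
        "Step 2: the visible contour suggests a cylindrical body.",
        "Step 3: likely a bottle held in the palm.",
    ]


def reconstruction_expert(obs: Observation) -> dict:
    """Stand-in for the 3D expert; returns coarse geometric attributes."""
    return {"shape_prior": "cylinder", "occlusion_ratio": 0.55}


def decide(cot: list[str], geometry: dict) -> str:
    """Toy fusion: accept the CoT conclusion only if it agrees with the 3D prior."""
    conclusion = cot[-1]
    if geometry["shape_prior"] == "cylinder" and "bottle" in conclusion:
        return "bottle"
    return "uncertain"


if __name__ == "__main__":
    obs = Observation(image_path="hand_object.jpg", question="What object is held?")
    steps = mllm_chain_of_thought(obs)
    geo = reconstruction_expert(obs)
    print(decide(steps, geo))  # -> "bottle"
```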
Related papers
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research. Multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and cross-modal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z) - The Multi-Faceted Monosemanticity in Multimodal Representations [42.64636740703632]
We leverage recent advancements in feature monosemanticity to extract interpretable features from deep multimodal models. Our findings reveal that this categorization aligns closely with human cognitive understandings of different modalities. These results indicate that large-scale multimodal models, equipped with task-agnostic interpretability tools, offer valuable insights into key connections and distinctions between different modalities.
arXiv Detail & Related papers (2025-02-16T14:51:07Z) - CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [60.08485416687596]
The Chain of Multi-modal Thought (CoMT) benchmark aims to mimic human-like reasoning that inherently integrates visual operations. We evaluate various LVLMs and strategies on CoMT, revealing key insights into the capabilities and limitations of current approaches.
arXiv Detail & Related papers (2024-12-17T14:10:16Z) - Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond [51.141270065306514]
This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI.
We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language.
Hands-on laboratories will offer practical experience with state-of-the-art multimodal models.
arXiv Detail & Related papers (2024-10-08T01:41:56Z) - OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning [3.544352024775253]
We introduce a multi-modal large language framework and a corresponding self-supervised learning strategy with the support of 3D generation.
The initial results demonstrate an improvement of 16.92% in comparison with state-of-the-art VLM models.
arXiv Detail & Related papers (2024-10-02T06:52:39Z) - Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment [16.733970553781887]
Recent findings suggest high semantic similarity between well-trained unimodal encoders. We propose a novel framework that aligns vision and language using frozen unimodal encoders.
arXiv Detail & Related papers (2024-09-28T17:57:32Z) - Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLMs) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z) - S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, the S3 model, which achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z) - Noise-powered Multi-modal Knowledge Graph Representation Framework [52.95468915728721]
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph representation learning framework.
We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.
Our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z) - On Robustness in Multimodal Learning [75.03719000820388]
Multimodal learning is defined as learning over multiple input modalities such as video, audio, and text.
We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods.
arXiv Detail & Related papers (2023-04-10T05:02:07Z) - 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition [31.992543274210835]
We identify and integrate several approaches to achieve further improvements for ASR tasks.
Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-Experts (MoE) architecture; a generic sketch of such a joint loss appears after this list.
We evaluate our proposed method on the public WenetSpeech dataset, and experimental results show that it provides a 12.2%-17.6% relative CER improvement.
arXiv Detail & Related papers (2022-04-07T03:10:49Z)
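The 3M entry above refers to a joint CTC/AED loss. Below is a generic, hedged sketch of how such a joint objective is commonly formed as a weighted sum of a CTC branch and an attention-decoder cross-entropy branch in PyTorch; the `ctc_weight` value, tensor shapes, and function name are illustrative assumptions, not the 3M implementation.

```python
# Generic sketch of a joint CTC/attention (AED) training objective: the two
# losses are mixed with a weight. Random tensors stand in for model outputs.

import torch
import torch.nn.functional as F


def joint_ctc_aed_loss(log_probs_ctc, targets, input_lengths, target_lengths,
                       decoder_logits, ctc_weight: float = 0.3):
    """Weighted sum of CTC loss (encoder branch) and cross-entropy (AED branch)."""
    ctc = F.ctc_loss(log_probs_ctc, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # AED branch: token-level cross-entropy over the decoder outputs.
    aed = F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                          targets.reshape(-1))
    return ctc_weight * ctc + (1.0 - ctc_weight) * aed


# Toy shapes: T=50 encoder frames, B=2 utterances, V=30 output tokens, U=10 labels.
T, B, V, U = 50, 2, 30, 10
log_probs = torch.randn(T, B, V).log_softmax(-1)   # (T, B, V) for CTC
targets = torch.randint(1, V, (B, U))              # label ids (0 is the blank)
decoder_logits = torch.randn(B, U, V)              # (B, U, V) for AED
loss = joint_ctc_aed_loss(log_probs, targets,
                          torch.full((B,), T), torch.full((B,), U),
                          decoder_logits)
print(loss.item())
```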