Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework
for Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2301.13335v2
- Date: Mon, 25 Dec 2023 12:59:02 GMT
- Title: Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework
for Visual Commonsense Reasoning
- Authors: Jian Zhu, Hanli Wang, Miaojing Shi
- Abstract summary: Representative works first recognize objects in images and then associate them with key words in texts.
An MLLM enhanced pseudo 3D perception framework is designed for visual commonsense reasoning.
Experiments on the VCR dataset demonstrate the superiority of the proposed framework over state-of-the-art approaches.
- Score: 24.29849761674329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The visual commonsense reasoning (VCR) task is to choose an answer and
provide a justifying rationale based on the given image and textual question.
Representative works first recognize objects in images and then associate them
with key words in texts. However, existing approaches do not consider the exact
positions of objects in a human-like three-dimensional (3D) manner, leaving them
unable to accurately distinguish objects or understand visual relations.
Recently, multi-modal large language models (MLLMs) have served as powerful
tools for several multi-modal tasks, but not yet for VCR, which requires
elaborate reasoning about specific visual objects referred to by texts. In light of
the above, an MLLM enhanced pseudo 3D perception framework is designed for VCR.
Specifically, we first demonstrate that the relation between objects is
relevant to object depths in images, and hence introduce object depth into VCR
frameworks to infer 3D positions of objects in images. Then, a depth-aware
Transformer is proposed to encode depth differences between objects into the
attention mechanism of Transformer to discriminatively associate objects with
visual scenes guided by depth. To further associate the answer with the depth
of visual scene, each word in the answer is tagged with a pseudo depth to
realize depth-aware association between answer words and objects. In addition,
BLIP-2 is employed as an MLLM to process images and texts, and the
referring expressions in texts involving specific visual objects are modified
with linguistic object labels to serve as comprehensible MLLM inputs. Finally,
a parameter optimization technique is devised to fully consider the quality of
data batches based on multi-level reasoning confidence. Experiments on the VCR
dataset demonstrate the superiority of the proposed framework over
state-of-the-art approaches.
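The depth-aware Transformer described in the abstract amounts to a standard multi-head self-attention whose logits receive an additive bias computed from pairwise depth differences between objects, with answer words assigned a pseudo depth so they can participate in the same mechanism. The PyTorch snippet below is a minimal sketch of that idea, not the authors' implementation; the module name, the bias MLP, and all hyperparameters are illustrative assumptions.
```python
import torch
import torch.nn as nn


class DepthAwareAttention(nn.Module):
    """Sketch of depth-aware multi-head self-attention.

    A bias derived from pairwise depth differences between objects is added
    to the attention logits, so tokens attend differently to objects at
    similar vs. distant depths. Names and the bias MLP are illustrative
    assumptions, not the paper's exact design.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical mapping: scalar depth difference -> per-head bias.
        self.depth_bias = nn.Sequential(
            nn.Linear(1, 2 * num_heads), nn.ReLU(), nn.Linear(2 * num_heads, num_heads)
        )

    def forward(self, x: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) object/token features; depth: (B, N) estimated depths
        # (answer words would carry a pseudo depth, per the abstract).
        B, N, dim = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Pairwise depth differences (B, N, N, 1) -> per-head bias (B, heads, N, N).
        diff = (depth.unsqueeze(2) - depth.unsqueeze(1)).unsqueeze(-1)
        bias = self.depth_bias(diff).permute(0, 3, 1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 + bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)


# Example: 4 detected objects plus 3 answer words tagged with a pseudo depth.
feats = torch.randn(1, 7, 256)
depths = torch.tensor([[0.9, 2.3, 4.1, 1.2, 1.0, 1.0, 1.0]])
out = DepthAwareAttention(dim=256)(feats, depths)  # (1, 7, 256)
```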
Related papers
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning.
Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation.
We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D aims to explore the comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
- Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis [45.62657605766754]
Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities.
Proximity QA is a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images.
We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis.
arXiv Detail & Related papers (2024-01-31T14:21:49Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after the transformation.
arXiv Detail & Related papers (2021-03-01T06:24:17Z)