Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
- URL: http://arxiv.org/abs/2210.12513v4
- Date: Fri, 9 Jun 2023 04:06:39 GMT
- Title: Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
- Authors: Jiaming Chen, Weixin Luo, Ran Song, Xiaolin Wei, Lin Ma, Wei Zhang
- Abstract summary: This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner.
We extract key points and proposal points to model 3D contexts and instances, and propose point-language alignment with context modulation.
To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme.
- Score: 35.17185775314988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel hierarchical alignment model (HAM) that learns
multi-granularity visual and linguistic representations in an end-to-end
manner. We extract key points and proposal points to model 3D contexts and
instances, and propose a point-language alignment with context modulation (PLACM)
mechanism, which learns to gradually align word-level and sentence-level
linguistic embeddings with visual representations, while the modulation with
the visual context captures latent informative relationships. To further
capture both global and local relationships, we propose a spatially
multi-granular modeling scheme that applies PLACM to both global and local
fields. Experimental results demonstrate the superiority of HAM, with
visualized results showing that it can dynamically model fine-grained visual
and linguistic representations. HAM outperforms existing methods by a
significant margin and achieves state-of-the-art performance on two publicly
available datasets, and won the championship in the ECCV 2022 ScanRefer challenge.
Code is available at https://github.com/PPjmchen/HAM.
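To make the abstract's description more concrete, here is a minimal, hypothetical PyTorch sketch of point-language alignment with context modulation. The module name, feature dimensions, and the gating form of the modulation are assumptions made for illustration only; they are not taken from the HAM codebase (https://github.com/PPjmchen/HAM), which may implement the mechanism differently.

```python
# Hypothetical sketch: align point features with word- and sentence-level
# language embeddings, modulated by a global visual context vector.
# All names and dimensions are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class PointLanguageAlignment(nn.Module):
    """Aligns point features with word- and sentence-level embeddings,
    gated by a pooled visual context vector (assumed modulation form)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.word_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modulation = nn.Linear(dim, dim)  # context -> channel-wise gate
        self.norm = nn.LayerNorm(dim)

    def forward(self, points, words, sentence):
        # points:   (B, Np, D) key/proposal point features
        # words:    (B, Nw, D) word-level embeddings
        # sentence: (B, 1, D)  sentence-level embedding
        ctx = points.mean(dim=1, keepdim=True)                 # global visual context
        gate = torch.sigmoid(self.modulation(ctx))             # context modulation
        aligned, _ = self.word_attn(points, words, words)      # word-level alignment
        aligned = aligned * gate                               # modulate by context
        refined, _ = self.sent_attn(aligned, sentence, sentence)  # sentence-level
        return self.norm(points + refined)


# Example usage with random tensors.
plam = PointLanguageAlignment()
pts = torch.randn(2, 1024, 256)
wds = torch.randn(2, 20, 256)
snt = torch.randn(2, 1, 256)
out = plam(pts, wds, snt)  # (2, 1024, 256)
```

Under these assumptions, applying the same block once over all points (global field) and again over points grouped around each proposal (local fields) would mirror the spatially multi-granular scheme described in the abstract.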
Related papers
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
- Joint Visual Grounding and Tracking with Natural Language Specification [6.695284124073918]
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description.
We propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task.
Our method performs favorably against state-of-the-art algorithms for both tracking and grounding.
arXiv Detail & Related papers (2023-03-21T17:09:03Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension [16.66775734538439]
Referring expression comprehension aims to localize a text-related region in a given image by a referring expression in natural language.
We argue that for REC the referring expression and the target region are semantically correlated.
We propose a novel approach called MutAtt to construct mutual guidance between vision and language.
arXiv Detail & Related papers (2020-03-18T03:14:58Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.