Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
- URL: http://arxiv.org/abs/2404.13013v1
- Date: Fri, 19 Apr 2024 17:22:51 GMT
- Title: Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
- Authors: Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
- Abstract summary: We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability.
Groma is adept at region-level tasks such as region captioning and visual grounding.
By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs.
- Score: 62.36769498166312
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: https://groma-mllm.github.io/.
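To make the localized visual tokenization idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation): region proposals are encoded into per-region embeddings and referenced in the text stream through placeholder tokens such as <r0> and <r1>. The function names, the 256-dimensional features, and the <rK> token format are illustrative assumptions.

```python
# Minimal, hypothetical sketch of localized visual tokenization:
# regions of interest are encoded into per-region embeddings and
# referenced in the text stream via placeholder tokens (<r0>, <r1>, ...).
# Function and token names are illustrative, not Groma's actual API.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Region:
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
    embedding: np.ndarray            # pooled visual feature for this region


def propose_regions(image: np.ndarray, num_regions: int = 2) -> List[Region]:
    """Stand-in for a region proposer + region encoder (random boxes/features)."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    regions = []
    for _ in range(num_regions):
        x1, y1 = int(rng.integers(0, w // 2)), int(rng.integers(0, h // 2))
        x2, y2 = int(rng.integers(w // 2, w)), int(rng.integers(h // 2, h))
        regions.append(Region((x1, y1, x2, y2), rng.standard_normal(256)))
    return regions


def build_grounded_prompt(instruction: str, regions: List[Region]) -> str:
    """Interleave region placeholder tokens with the user instruction.

    A language model would consume this text while the matching region
    embeddings are injected at the <rK> positions.
    """
    region_tokens = " ".join(f"<r{i}>" for i in range(len(regions)))
    return f"Image regions: {region_tokens}\nUser: {instruction}"


if __name__ == "__main__":
    image = np.zeros((480, 640, 3), dtype=np.uint8)
    regions = propose_regions(image)
    print(build_grounded_prompt("Describe <r0> and locate the dog.", regions))
```

In this style, a grounded response can cite the same placeholders (e.g. "The dog <r1> is next to the chair <r0>"), which is what ties the textual output back to image regions.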
Related papers
- ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding [67.63933036920012]
Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location.
This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens.
ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (short as RGPT) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied to a range of region-level tasks, significantly enhancing performance.
arXiv Detail & Related papers (2024-03-04T18:58:08Z)
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation [22.347590874621865]
We introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation.
GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone.
Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-02-26T18:59:33Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- Ferret: Refer and Ground Anything Anywhere at Any Granularity [93.80461625100826]
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image.
Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image (a minimal illustrative sketch of this idea appears after this list).
Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes.
arXiv Detail & Related papers (2023-10-11T17:55:15Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- AttnGrounder: Talking to Cars with Attention [6.09170287691728]
We propose a single-stage end-to-end trainable model for the task of visual grounding.
Visual grounding aims to localize a specific object in an image based on a given natural language text query.
We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing methods.
arXiv Detail & Related papers (2020-09-11T23:18:55Z)
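As a companion to the Ferret entry above, here is a minimal, hypothetical sketch of a hybrid region representation that pairs discrete (quantized) box coordinates with a continuous feature vector pooled over the region. The class, function names, and bin count are assumptions for illustration, not Ferret's actual interface.

```python
# Hypothetical sketch of a hybrid region representation in the spirit of
# Ferret: discrete (quantized) coordinates plus a continuous feature
# vector pooled over the region. Names and bin count are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class HybridRegion:
    coord_tokens: List[int]   # discrete coordinate bins for (x1, y1, x2, y2)
    feature: np.ndarray       # continuous feature averaged over the region


def quantize_box(box: Tuple[int, int, int, int],
                 image_size: Tuple[int, int], num_bins: int = 1000) -> List[int]:
    """Map pixel coordinates to discrete bin indices in [0, num_bins - 1]."""
    w, h = image_size
    x1, y1, x2, y2 = box
    return [int(x1 / w * (num_bins - 1)), int(y1 / h * (num_bins - 1)),
            int(x2 / w * (num_bins - 1)), int(y2 / h * (num_bins - 1))]


def make_hybrid_region(box: Tuple[int, int, int, int],
                       image_size: Tuple[int, int],
                       feature_map: np.ndarray) -> HybridRegion:
    """Pair quantized coordinates with an average-pooled region feature."""
    x1, y1, x2, y2 = box
    pooled = feature_map[y1:y2, x1:x2].mean(axis=(0, 1))
    return HybridRegion(quantize_box(box, image_size), pooled)


if __name__ == "__main__":
    feat = np.random.rand(480, 640, 256)               # dummy per-pixel features
    region = make_hybrid_region((40, 60, 200, 300), (640, 480), feat)
    print(region.coord_tokens, region.feature.shape)   # [62, 124, 312, 624] (256,)
```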
This list is automatically generated from the titles and abstracts of the papers on this site.