Ferret: Refer and Ground Anything Anywhere at Any Granularity
- URL: http://arxiv.org/abs/2310.07704v1
- Date: Wed, 11 Oct 2023 17:55:15 GMT
- Title: Ferret: Refer and Ground Anything Anywhere at Any Granularity
- Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
- Abstract summary: We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image.
Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image.
Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes.
- Score: 93.80461625100826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and
accurately grounding open-vocabulary descriptions. To unify referring and
grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid
region representation that integrates discrete coordinates and continuous
features jointly to represent a region in the image. To extract the continuous
features of versatile regions, we propose a spatial-aware visual sampler, adept
at handling varying sparsity across different shapes. Consequently, Ferret can
accept diverse region inputs, such as points, bounding boxes, and free-form
shapes. To bolster the desired capability of Ferret, we curate GRIT, a
comprehensive refer-and-ground instruction tuning dataset including 1.1M
samples that contain rich hierarchical spatial knowledge, with 95K hard
negative data to promote model robustness. The resulting model not only
achieves superior performance in classical referring and grounding tasks, but
also greatly outperforms existing MLLMs in region-based and
localization-demanded multimodal chatting. Our evaluations also reveal a
significantly improved capability of describing image details and a remarkable
alleviation in object hallucination. Code and data will be available at
https://github.com/apple/ml-ferret
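The hybrid region representation and the spatial-aware visual sampler are the core mechanisms described in the abstract. The sketch below is a minimal illustration of that idea, assuming a binary region mask and a precomputed image feature map; the bin count, point budget, and average pooling are assumptions for exposition, not Ferret's actual design (the authors' implementation is to be released at https://github.com/apple/ml-ferret).

```python
# Minimal sketch of a hybrid region representation in the spirit of Ferret:
# a region (point, box, or free-form mask) is encoded as (a) discrete
# coordinate tokens for its bounding box and (b) a continuous feature pooled
# from points sampled inside the region. Names, bin counts, and the pooling
# scheme are illustrative assumptions, not Ferret's exact implementation.
import torch
import torch.nn.functional as F


def quantize_box(mask: torch.Tensor, num_bins: int = 1000) -> list[int]:
    """Discrete part: bounding-box corners quantized into coordinate bins."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    h, w = mask.shape
    x1, x2 = xs.min().item() / w, xs.max().item() / w
    y1, y2 = ys.min().item() / h, ys.max().item() / h
    return [int(v * (num_bins - 1)) for v in (x1, y1, x2, y2)]


def sample_region_feature(mask: torch.Tensor,
                          feat_map: torch.Tensor,
                          num_points: int = 512) -> torch.Tensor:
    """Continuous part: sample points inside the mask (works for sparse or
    irregular shapes) and average their bilinearly interpolated features."""
    ys, xs = torch.nonzero(mask, as_tuple=True)        # pixels inside region
    idx = torch.randint(0, ys.numel(), (num_points,))  # sample with replacement
    h, w = mask.shape
    # normalize to [-1, 1] for grid_sample: x -> width axis, y -> height axis
    grid_x = xs[idx].float() / (w - 1) * 2 - 1
    grid_y = ys[idx].float() / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).view(1, 1, num_points, 2)
    sampled = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return sampled.view(feat_map.shape[0], num_points).mean(dim=-1)  # (C,)


if __name__ == "__main__":
    mask = torch.zeros(224, 224)
    mask[60:150, 40:120] = 1                 # a free-form region would work too
    feat_map = torch.randn(256, 16, 16)      # e.g. C x h x w ViT/CNN features
    box_tokens = quantize_box(mask)          # four coordinate bins in [0, 999]
    region_feat = sample_region_feature(mask, feat_map)
    print(box_tokens, region_feat.shape)     # ... torch.Size([256])
```

In this reading, the discrete tokens give the language model an explicit, quantized location, while the pooled continuous feature carries appearance information that coordinates alone cannot.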
Related papers
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models [62.36769498166312]
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability.
Groma is adept at region-level tasks such as region captioning and visual grounding.
By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs.
arXiv Detail & Related papers (2024-04-19T17:22:51Z)
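The Groma entry above turns on splicing region tokens into user instructions and model responses. As a loose illustration only (the token strings, prompt template, and helper function are hypothetical, not Groma's interface), a user-specified region can be rendered as placeholder tokens that the model later binds to region features:

```python
# Illustrative only: splice placeholder region tokens into an instruction so a
# region-aware MLLM can bind them to visual features. Token names and the
# prompt format are assumptions, not Groma's actual interface.
def format_region_instruction(template: str, regions: dict[str, tuple]) -> str:
    """Replace {name} placeholders with region tokens carrying a region id."""
    rendered = {name: f"<region_start><r{i}><region_end>"
                for i, name in enumerate(regions)}
    return template.format(**rendered)


# The box coordinates themselves would be consumed by the visual branch,
# which produces the features bound to <r0>, <r1>, ...
prompt = format_region_instruction(
    "What is the object in {area} doing?",
    regions={"area": (40, 60, 120, 150)},   # a box; a mask id would also work
)
print(prompt)  # What is the object in <region_start><r0><region_end> doing?
```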
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models [119.63480600733715]
We unveil Ferret-v2, a significant upgrade to Ferret, with three key designs.
One is a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail.
By integrating the additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information.
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
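For the DINOv2 integration mentioned in the Ferret-v2 entry, one way to picture multi-granularity visual encoding is to project tokens from a low-resolution global encoder and a higher-resolution local encoder into the language model's width and concatenate them. The stub encoders, dimensions, and fusion-by-concatenation below are assumptions for illustration, not Ferret-v2's actual architecture.

```python
# Rough sketch of multi-granularity visual encoding: fuse a low-resolution
# "global" encoder with a higher-resolution "local" encoder (DINOv2-like) by
# projecting both to a shared width and concatenating token sequences.
import torch
import torch.nn as nn


class StubEncoder(nn.Module):
    """Stand-in for a ViT: image -> (num_patches, dim) tokens."""
    def __init__(self, patch: int, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.proj(img).flatten(2).transpose(1, 2)   # (B, N, dim)


global_enc = StubEncoder(patch=14, dim=1024)    # CLIP-like, low-res input
local_enc = StubEncoder(patch=14, dim=768)      # DINOv2-like, high-res input
to_lm_g = nn.Linear(1024, 4096)                 # project to LM hidden size
to_lm_l = nn.Linear(768, 4096)

img_lowres = torch.randn(1, 3, 224, 224)
img_highres = torch.randn(1, 3, 448, 448)
tokens = torch.cat([to_lm_g(global_enc(img_lowres)),
                    to_lm_l(local_enc(img_highres))], dim=1)
print(tokens.shape)   # (1, 16*16 + 32*32, 4096) = (1, 1280, 4096)
```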
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation [22.347590874621865]
We introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation.
GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone.
Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-02-26T18:59:33Z)
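The GROUNDHOG entry describes a masked feature extractor that converts features into visual entity tokens for the MLLM backbone. A rough sketch of that idea follows; the mask pooling, projection width, and function name are illustrative assumptions rather than GROUNDHOG's actual extractor.

```python
# Illustrative mask pooling: turn each candidate segmentation mask into one
# "visual entity token" embedding for an LLM backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


def masks_to_entity_tokens(feat_map: torch.Tensor,   # (C, h, w) image features
                           masks: torch.Tensor,      # (N, H, W) binary masks
                           proj: nn.Module) -> torch.Tensor:
    """Average image features under each mask, then project to LM width."""
    n, c = masks.shape[0], feat_map.shape[0]
    # downsample masks to the feature-map resolution
    m = F.interpolate(masks.float().unsqueeze(1), size=feat_map.shape[1:],
                      mode="nearest").view(n, -1)                   # (N, h*w)
    f = feat_map.view(c, -1)                                        # (C, h*w)
    pooled = (m @ f.t()) / m.sum(dim=1, keepdim=True).clamp(min=1)  # (N, C)
    return proj(pooled)                                             # (N, d_model)


if __name__ == "__main__":
    feat_map = torch.randn(256, 16, 16)
    masks = (torch.rand(3, 224, 224) > 0.7)          # 3 candidate entity masks
    proj = nn.Linear(256, 4096)                      # project to LM hidden size
    tokens = masks_to_entity_tokens(feat_map, masks, proj)
    print(tokens.shape)                              # torch.Size([3, 4096])
```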
- MuRF: Multi-Baseline Radiance Fields [117.55811938988256]
We present Multi-Baseline Radiance Fields (MuRF), a feed-forward approach to solving sparse view synthesis.
MuRF achieves state-of-the-art performance across multiple different baseline settings.
We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset.
arXiv Detail & Related papers (2023-12-07T18:59:56Z)
- Portrait Neural Radiance Fields from a Single Image [68.66958204066721]
We present a method for estimating Neural Radiance Fields (NeRF) from a single portrait.
We propose to pretrain the weights of a multilayer perceptron (MLP), which implicitly models the volumetric density.
To improve the generalization to unseen faces, we train the canonical coordinate space approximated by 3D face morphable models.
We quantitatively evaluate the method using controlled captures and demonstrate the generalization to real portrait images, showing favorable results against state-of-the-art methods.
arXiv Detail & Related papers (2020-12-10T18:59:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.