Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
- URL: http://arxiv.org/abs/2404.07973v1
- Date: Thu, 11 Apr 2024 17:56:05 GMT
- Title: Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
- Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
- Abstract summary: We unveil Ferret-v2, a significant upgrade to Ferret, with three key designs.
A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail.
By integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information.
- Score: 119.63480600733715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by its pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
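The two architectural ideas in the abstract, any-resolution tiling and multi-granularity (global plus local) encoding, can be sketched in a few lines. This is an illustrative sketch only: `encode` is a hypothetical random-projection stand-in for the real frozen encoders (a CLIP-style encoder on the global path, DINOv2 on the local path), and additive fusion is one simple choice, not the paper's exact mechanism.

```python
import numpy as np

def split_into_tiles(img, grid=2):
    """Split a high-resolution image into grid x grid sub-images
    (the 'any resolution' idea: process local crops at native detail)."""
    h, w, _ = img.shape
    th, tw = h // grid, w // grid
    return [img[i*th:(i+1)*th, j*tw:(j+1)*tw]
            for i in range(grid) for j in range(grid)]

def encode(img, dim=8, seed=0):
    """Toy stand-in for a frozen visual encoder; maps an image
    to a single feature vector via a fixed random projection."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((img.size, dim)) / np.sqrt(img.size)
    return img.reshape(-1) @ proj

def multi_granularity_features(img, dim=8):
    """Fuse a global image feature with per-tile local features."""
    global_feat = encode(img, dim, seed=0)        # global-context path
    local_feats = [encode(t, dim, seed=1)         # fine-grained path
                   for t in split_into_tiles(img)]
    # one simple fusion choice: add global context to every local feature
    return np.stack([global_feat + f for f in local_feats])

feats = multi_granularity_features(np.ones((8, 8, 3)))
print(feats.shape)  # (4, 8): one fused feature per sub-image
```

In the real model each path would emit a sequence of patch tokens rather than a single vector, but the shape of the computation, tile the high-resolution input, encode globally and locally, then merge, is the same.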
Related papers
- Expressive Gaussian Human Avatars from Monocular RGB Video [69.56388194249942]
We introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X.
We highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning.
We propose a context-aware adaptive density control strategy, which adaptively adjusts the gradient thresholds.
arXiv Detail & Related papers (2024-07-03T15:36:27Z) - Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation [10.374944534302234]
"lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE)
Rich semantic and texture information in images can contribute to a more accurate "lifting" procedure.
In this paper, we give new insight into the cause of poor generalization problems and the effectiveness of image features.
arXiv Detail & Related papers (2023-12-25T07:50:58Z) - DFU: scale-robust diffusion model for zero-shot super-resolution image generation [15.689418447376587]
We present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions.
We propose a fine-tuning strategy to further enhance the zero-shot super-resolution image-generation capability of our model, leading to a FID of 11.3 at 1.66 times the maximum training resolution on FFHQ.
arXiv Detail & Related papers (2023-11-30T23:31:33Z) - Ferret: Refer and Ground Anything Anywhere at Any Granularity [93.80461625100826]
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image.
Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image.
Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes.
arXiv Detail & Related papers (2023-10-11T17:55:15Z) - DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models [91.94566873400277]
DiffDreamer is an unsupervised framework capable of synthesizing novel views depicting a long camera trajectory.
We show that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods.
arXiv Detail & Related papers (2022-11-22T10:06:29Z) - Any-resolution Training for High-resolution Image Synthesis [55.19874755679901]
Generative models operate at fixed resolution, even though natural images come in a variety of sizes.
We argue that every pixel matters and create datasets with variable-size images, collected at their native resolutions.
We introduce continuous-scale training, a process that samples patches at random scales to train a new generator with variable output resolutions.
arXiv Detail & Related papers (2022-04-14T17:59:31Z) - Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
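The ALBEF entry above describes aligning unimodal image and text representations with a contrastive loss before any cross-modal fusion. A minimal sketch of that symmetric, InfoNCE-style objective, assuming precomputed embedding matrices in which matched image-text pairs sit on the diagonal (the function name and toy inputs are illustrative, not the paper's implementation):

```python
import numpy as np

def align_before_fuse_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric image-text contrastive loss on unimodal embeddings.
    Row i of img_emb is assumed to match row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp                  # pairwise cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))          # target for row i is column i

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))
matched = align_before_fuse_loss(emb, emb)       # identical pairs: low loss
shuffled = align_before_fuse_loss(emb, emb[::-1])
print(matched < shuffled)  # True
```

The full method additionally trains against soft pseudo-targets from a momentum (EMA) copy of the model, which this sketch omits.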
This list is automatically generated from the titles and abstracts of the papers in this site.