Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
- URL: http://arxiv.org/abs/2404.07973v1
- Date: Thu, 11 Apr 2024 17:56:05 GMT
- Title: Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
- Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
- Abstract summary: We unveil Ferret-v2, a significant upgrade to Ferret, with three key designs.
A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail.
By integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information.
- Score: 119.63480600733715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by its pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
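The two architectural ideas in the abstract, any-resolution tiling and multi-granularity (global plus local) encoding, can be sketched in a few lines. This is an illustrative sketch only: `encode` is a hypothetical random-projection stand-in for the real frozen encoders (a CLIP-style encoder on the global path, DINOv2 on the local path), and additive fusion is one simple choice, not the paper's exact mechanism.

```python
import numpy as np

def split_into_tiles(img, grid=2):
    """Split a high-resolution image into grid x grid sub-images
    (the 'any resolution' idea: process local crops at native detail)."""
    h, w, _ = img.shape
    th, tw = h // grid, w // grid
    return [img[i*th:(i+1)*th, j*tw:(j+1)*tw]
            for i in range(grid) for j in range(grid)]

def encode(img, dim=8, seed=0):
    """Toy stand-in for a frozen visual encoder; maps an image
    to a single feature vector via a fixed random projection."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((img.size, dim)) / np.sqrt(img.size)
    return img.reshape(-1) @ proj

def multi_granularity_features(img, dim=8):
    """Fuse a global image feature with per-tile local features."""
    global_feat = encode(img, dim, seed=0)        # global-context path
    local_feats = [encode(t, dim, seed=1)         # fine-grained path
                   for t in split_into_tiles(img)]
    # one simple fusion choice: add global context to every local feature
    return np.stack([global_feat + f for f in local_feats])

feats = multi_granularity_features(np.ones((8, 8, 3)))
print(feats.shape)  # (4, 8): one fused feature per sub-image
```

In the real model each path would emit a sequence of patch tokens rather than a single vector, but the shape of the computation, tile the high-resolution input, encode globally and locally, then merge, is the same.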
Related papers
- Expressive Gaussian Human Avatars from Monocular RGB Video [69.56388194249942]
We introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X.
We highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning.
We propose a context-aware adaptive density control strategy, which adaptively adjusts the gradient thresholds.
arXiv Detail & Related papers (2024-07-03T15:36:27Z) - Lifting by Image -- Leveraging Image Cues for Accurate 3D Human Pose Estimation [10.374944534302234]
"lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE)
Rich semantic and texture information in images can contribute to a more accurate "lifting" procedure.
In this paper, we give new insight into the cause of poor generalization problems and the effectiveness of image features.
arXiv Detail & Related papers (2023-12-25T07:50:58Z) - DFU: scale-robust diffusion model for zero-shot super-resolution image generation [15.689418447376587]
We present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions.
We propose a fine-tuning strategy to further enhance the zero-shot super-resolution image-generation capability of our model, leading to a FID of 11.3 at 1.66 times the maximum training resolution on FFHQ.
arXiv Detail & Related papers (2023-11-30T23:31:33Z) - Ferret: Refer and Ground Anything Anywhere at Any Granularity [93.80461625100826]
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image.
Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image.
Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes.
arXiv Detail & Related papers (2023-10-11T17:55:15Z) - DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models [91.94566873400277]
DiffDreamer is an unsupervised framework capable of synthesizing novel views depicting a long camera trajectory.
We show that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods.
arXiv Detail & Related papers (2022-11-22T10:06:29Z) - Any-resolution Training for High-resolution Image Synthesis [55.19874755679901]
Generative models operate at fixed resolution, even though natural images come in a variety of sizes.
We argue that every pixel matters and create datasets with variable-size images, collected at their native resolutions.
We introduce continuous-scale training, a process that samples patches at random scales to train a new generator with variable output resolutions.
arXiv Detail & Related papers (2022-04-14T17:59:31Z) - Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
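The ALBEF entry above describes aligning unimodal image and text representations with a contrastive loss before any cross-modal fusion. A minimal sketch of that symmetric, InfoNCE-style objective, assuming precomputed embedding matrices in which matched image-text pairs sit on the diagonal (the function name and toy inputs are illustrative, not the paper's implementation):

```python
import numpy as np

def align_before_fuse_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric image-text contrastive loss on unimodal embeddings.
    Row i of img_emb is assumed to match row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp                  # pairwise cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))          # target for row i is column i

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))
matched = align_before_fuse_loss(emb, emb)       # identical pairs: low loss
shuffled = align_before_fuse_loss(emb, emb[::-1])
print(matched < shuffled)  # True
```

The full method additionally trains against soft pseudo-targets from a momentum (EMA) copy of the model, which this sketch omits.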
This list is automatically generated from the titles and abstracts of the papers in this site.