Griffon: Spelling out All Object Locations at Any Granularity with Large
Language Models
- URL: http://arxiv.org/abs/2311.14552v2
- Date: Mon, 27 Nov 2023 09:54:00 GMT
- Title: Griffon: Spelling out All Object Locations at Any Granularity with Large
Language Models
- Authors: Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao
Wang
- Abstract summary: Current Large Vision Language Models (LVLMs) are predominantly constrained to grounding a single, pre-existing object.
We introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs.
$\textbf{Griffon}$ achieves state-of-the-art performance on the fine-grained RefCOCO series.
It also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO.
- Score: 32.01009756533755
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Replicating the innate human ability to detect all objects based on free-form
texts at any granularity remains a formidable challenge for Vision-Language
models. Current Large Vision Language Models (LVLMs) are predominantly
constrained to grounding a single, pre-existing object, relying solely on data
from Referring Expression Comprehension tasks. This limitation leads to a
compromise in model design, necessitating the introduction of visual expert
models or the integration of customized head structures. Beyond these
constraints, our research delves into the untapped potential of LVLMs and
uncovers their inherent capability for basic object perception, allowing them to
accurately identify and locate objects of interest. Building on this insight,
we introduce a novel language-prompted localization dataset designed to fully
unleash the capabilities of LVLMs in integrating fine-grained object perception
with precise location awareness. More importantly, we present
$\textbf{Griffon}$, a purely LVLM-based baseline, which does not require the
introduction of any special tokens, expert models, or additional detection
modules. It simply maintains a consistent structure with popular LVLMs by
unifying data formats across various localization-related scenarios and is
trained end-to-end through a well-designed pipeline. Comprehensive experiments
demonstrate that $\textbf{Griffon}$ not only achieves state-of-the-art
performance on the fine-grained RefCOCO series but also approaches the
capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO.
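The abstract's key design point is that Griffon spells out box coordinates as ordinary text, so referring-expression data and detection data can share one conversation format with no special location tokens or extra heads. As a rough illustration only (the prompt wording, field names, and normalized [x1, y1, x2, y2] layout below are assumptions, not the paper's released format), a unified sample might look like this:

```python
# Hypothetical sketch of a unified, plain-text localization sample.
# Field names, prompt wording, and the normalized [x1, y1, x2, y2] layout
# are illustrative assumptions, not Griffon's released data format.

def format_boxes(objects, width, height):
    """Spell out boxes as plain text with coordinates normalized to [0, 1]."""
    lines = []
    for name, (x1, y1, x2, y2) in objects:
        lines.append(
            f"{name}: [{x1 / width:.3f}, {y1 / height:.3f}, "
            f"{x2 / width:.3f}, {y2 / height:.3f}]"
        )
    return "\n".join(lines)

# A single-object referring query and an exhaustive detection query share the
# same text-in, text-out conversation, so the LLM itself is the only decoder.
sample = {
    "image": "example.jpg",
    "conversations": [
        {"from": "human", "value": "Locate every cat in the image."},
        {"from": "gpt", "value": format_boxes(
            [("cat", (10, 52, 310, 470)), ("cat", (340, 24, 635, 472))],
            width=640, height=480)},
    ],
}
print(sample["conversations"][1]["value"])
```

Because every localization scenario reduces to the same conversation format, a single end-to-end training pipeline can cover both single-object referring and exhaustive detection, which is what the RefCOCO and MSCOCO experiments evaluate.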
Related papers
- Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data.
FieldLVLM consists of two main components: a field-aware language generation strategy and data-compressed multimodal model tuning.
Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z)
- Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method [10.748210940033484]
Large language models (LLMs) and vision-language models (VLMs) have achieved significant success.
Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehension.
This letter explores the application of VLMs for object detection in remote sensing images.
arXiv Detail & Related papers (2025-03-11T08:02:54Z)
- New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression Comprehension (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.
It serves as an essential testing ground for Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2025-02-27T13:58:44Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z)
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding [26.888343140449948]
Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image.
We introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models.
Our method achieves universal visual grounding, which allows for the detection of arbitrary objects specified by arbitrary human language input.
arXiv Detail & Related papers (2024-05-27T12:23:08Z)
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [27.45225442048711]
We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts.
We design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models.
Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
arXiv Detail & Related papers (2024-03-14T12:21:37Z)
- PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems.
These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions.
We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM.
Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
arXiv Detail & Related papers (2024-02-13T18:39:18Z)
- General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions (a sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
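The unsupervised object discovery entry above mentions extracting high-level semantic features and then using PCA to localize object regions. The following is a minimal sketch of that general idea, not the paper's exact method (the stand-in features, the sign convention for the first component, and the zero threshold are assumptions): project each spatial feature onto the first principal component and threshold the projection into a coarse foreground mask.

```python
# Minimal sketch: PCA over per-location features as a coarse object localizer.
# The random "features" stand in for a real backbone's output; the sign
# heuristic and zero threshold are illustrative assumptions.
import numpy as np

def pca_object_mask(features):
    """features: (H, W, C) array of high-level features for one image."""
    h, w, c = features.shape
    x = features.reshape(-1, c).astype(np.float64)
    x -= x.mean(axis=0, keepdims=True)
    # First principal direction of the per-location features.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    scores = x @ vt[0]              # projection onto the first component
    # Orient the component so the minority side (assumed foreground) is positive.
    if (scores > 0).mean() > 0.5:
        scores = -scores
    return (scores > 0).reshape(h, w)

def mask_to_box(mask):
    """Tightest (x1, y1, x2, y2) box around the mask, in feature-grid units."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

feat = np.random.rand(14, 14, 256)   # stand-in for a 14x14 grid of features
print(mask_to_box(pca_object_mask(feat)))
```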
This list is automatically generated from the titles and abstracts of the papers on this site.