Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
- URL: http://arxiv.org/abs/2311.14552v3
- Date: Tue, 08 Oct 2024 03:05:09 GMT
- Title: Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
- Authors: Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang,
- Abstract summary: Griffon is a language-prompted localization dataset for large vision language models.
It is trained end-to-end through a well-designed pipeline.
It achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities.
- Score: 30.20915403608803
- License:
- Abstract: Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Large Vision Language Models (LVLMs). Current LVLMs are predominantly constrained to locate a single, pre-existing object. This limitation leads to a compromise in model design, necessitating the introduction of visual expert models or customized head structures. Beyond these constraints, our research uncovers LVLMs' capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline, which does not introduce any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that Griffon not only achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO. Data, codes, and models are released at https://github.com/jefferyZhan/Griffon.
Related papers
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding [26.888343140449948]
Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image.
We introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models.
Our method achieves universal visual grounding, which allows for the detection of arbitrary objects specified by arbitrary human language input.
arXiv Detail & Related papers (2024-05-27T12:23:08Z) - Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [27.45225442048711]
We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts.
We design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models.
Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
arXiv Detail & Related papers (2024-03-14T12:21:37Z) - PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems.
These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions.
We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM.
Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
arXiv Detail & Related papers (2024-02-13T18:39:18Z) - General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.