Related papers: Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

URL: http://arxiv.org/abs/2403.09333v1
Date: Thu, 14 Mar 2024 12:21:37 GMT
Title: Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Authors: Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
Abstract summary: We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. We design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
Score: 27.45225442048711
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete context and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon.

Related papers

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning [30.218743514199016]
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files". Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
arXiv Detail & Related papers (2025-03-27T17:53:50Z)
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface [25.898592418636603]
UFO is a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies.
arXiv Detail & Related papers (2025-03-03T09:27:24Z)
Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension [6.29665399879184]
We present Aquila, an advanced visual language foundation model for remote sensing images. Aquila enables richer visual feature representation and more precise visual-language feature alignment. We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses.
arXiv Detail & Related papers (2024-11-09T05:31:56Z)
Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references. Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts [38.59120110371588]
We introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings.
arXiv Detail & Related papers (2023-12-01T18:59:56Z)
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision. We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z)
PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model. Our model achieves new levels of performance on a wide range of varied and complex tasks. We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z)
Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.