Related papers: The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

URL: http://arxiv.org/abs/2402.19474v4
Date: Fri, 23 Aug 2024 07:20:57 GMT
Title: The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Authors: Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai
Abstract summary: We present the All-Seeing Project V2, a new model and dataset designed for understanding object relations in images. We propose the All-Seeing Model V2 that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation task. Our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them.
Score: 58.40101895719467
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. Specifically, we propose the All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation (ReC) task. Leveraging this unified task, our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them, diminishing the relation hallucination often encountered by Multi-modal Large Language Models (MLLMs). To facilitate training and evaluation of MLLMs in relation understanding, we created the first high-quality ReC dataset (AS-V2), which is aligned with the format of standard instruction tuning data. In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE), for comprehensively evaluating the relation comprehension capabilities of MLLMs. Notably, our ASMv2 achieves an overall accuracy of 52.04 on this relation-aware benchmark, surpassing the 43.14 of LLaVA-1.5 by a large margin. We hope that our work can inspire more future research and contribute to the evolution towards artificial general intelligence. Our project is released at https://github.com/OpenGVLab/all-seeing.

Related papers

Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation [50.22361866757033]
Unified vision-language models (VLMs) integrate both visual understanding and generation capabilities. This paper systematically investigates the generalization across understanding and generation tasks in unified VLMs.
arXiv Detail & Related papers (2025-05-29T03:40:21Z)
Can Multimodal Large Language Models Understand Spatial Relations? [16.76001474065412]
We introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO 2017. Results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.
arXiv Detail & Related papers (2025-05-25T07:37:34Z)
Post-Training Language Models for Continual Relation Extraction [0.0]
This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to knowledge graphs. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets.
arXiv Detail & Related papers (2025-04-07T16:01:22Z)
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN). Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Experiments on both the discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z)
Evaluating the Generation of Spatial Relations in Text and Image Generative Models [4.281091463408283]
Spatial relations are naturally understood in a visuo-spatial manner. We develop an approach to convert LLM outputs into an image, thereby allowing us to evaluate both T2I models and LLMs. Surprisingly, we found that T2I models only achieve subpar performance despite their impressive general image-generation abilities.
arXiv Detail & Related papers (2024-11-12T09:30:02Z)
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. We have established a new REC dataset characterized by two key features. It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding [4.266920365127677]
Under the new LaGD paradigm, the old datasets are no longer suitable for fire-new tasks. We designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information.
arXiv Detail & Related papers (2024-06-18T10:34:28Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
Integrating Graphs with Large Language Models: Methods and Prospects [68.37584693537555]
Large language models (LLMs) have emerged as frontrunners, showcasing unparalleled prowess in diverse applications. Merging the capabilities of LLMs with graph-structured data has been a topic of keen interest. This paper bifurcates such integrations into two predominant categories.
arXiv Detail & Related papers (2023-10-09T07:59:34Z)
Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks. Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information. We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z)
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks. We also introduce e-SNLI-VE, the largest existing dataset with NLEs. We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.