Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
- URL: http://arxiv.org/abs/2403.18252v2
- Date: Mon, 17 Jun 2024 09:57:09 GMT
- Title: Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
- Authors: Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang
- Abstract summary: We propose Visual Table, a novel form of visual representation tailored for visual reasoning.
Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions.
They deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning.
- Score: 38.558250602212425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
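To make the abstract's description concrete, below is a minimal sketch of what a visual table could look like as a data structure: one scene-level description plus per-object records covering category, attributes, and instance-level knowledge, serialized to text for a downstream model. The schema, field names, and example content are illustrative assumptions based only on the abstract, not the authors' actual format.

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the hierarchical "visual table" described in the abstract:
# a scene description plus object-centric records (category, attributes,
# knowledge). Field names are assumptions, not the paper's schema.

@dataclass
class ObjectRecord:
    category: str                                          # e.g. "dog"
    attributes: List[str] = field(default_factory=list)    # e.g. ["golden fur", "mid-jump"]
    knowledge: List[str] = field(default_factory=list)     # instance-level world knowledge

@dataclass
class VisualTable:
    scene_description: str                                  # global summary of the image
    objects: List[ObjectRecord] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize the table into plain text, e.g. to pass to an LLM."""
        lines = [f"Scene: {self.scene_description}"]
        for i, obj in enumerate(self.objects, 1):
            lines.append(
                f"Object {i}: {obj.category}; "
                f"attributes: {', '.join(obj.attributes) or 'n/a'}; "
                f"knowledge: {', '.join(obj.knowledge) or 'n/a'}"
            )
        return "\n".join(lines)

# Example usage with made-up content:
table = VisualTable(
    scene_description="A dog catching a frisbee in a park.",
    objects=[
        ObjectRecord("dog", ["golden fur", "mid-jump"], ["retrievers are bred to fetch"]),
        ObjectRecord("frisbee", ["red", "plastic"], ["used in the sport of disc golf"]),
    ],
)
print(table.to_prompt())
```

Serializing to text as above is one way such a table could be fed to a multimodal LLM alongside visual embeddings, consistent with the abstract's emphasis on interpretability and controllable editing.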
Related papers
- OntView: What you See is What you Meant [40.572754656757475]
OntView is an intuitive visual representation of concepts and their definitions through a user-friendly visualizer.
OntView has been released under an open-source license for the whole community.
arXiv Detail & Related papers (2025-07-18T09:06:49Z) - Capturing Visualization Design Rationale [5.051297047598238]
We present a new dataset and methodology for probing visualization design rationale through natural language.
We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course.
We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks.
arXiv Detail & Related papers (2025-06-19T19:52:53Z) - Visual Adaptive Prompting for Compositional Zero-Shot Learning [0.0]
Vision-Language Models (VLMs) have demonstrated impressive capabilities in learning joint representations of visual and textual data.
CZSL requires models to generalize to novel combinations of visual primitives-such as attributes and objects-that were not explicitly encountered during training.
We propose Visual Adaptive Prompting System (VAPS) to bridge the gap between semantic and visual features.
arXiv Detail & Related papers (2025-02-27T17:17:43Z) - UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
We introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model.
UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
arXiv Detail & Related papers (2024-09-20T01:26:32Z) - ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided
Code-Vision Representation [82.88378582161717]
State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction.
We present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction.
arXiv Detail & Related papers (2023-11-22T09:23:34Z) - ReSee: Responding through Seeing Fine-grained Visual Knowledge in
Open-domain Dialogue [34.223466503256766]
We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity.
To boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset.
By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
arXiv Detail & Related papers (2023-05-23T02:08:56Z) - Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know
How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering.
In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module.
We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth.
arXiv Detail & Related papers (2022-12-20T14:36:45Z) - Learning Structured Representations of Visual Scenes [1.6244541005112747]
We study how machines can describe the content of individual images or videos using visual relationships as structured representations.
Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both the static-image and video settings.
arXiv Detail & Related papers (2022-07-09T05:40:08Z) - Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z) - PTR: A Benchmark for Part-based Conceptual, Relational, and Physical
Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z) - Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) predicts masked words in captions by relying on visual cues (a loose illustrative sketch follows this list).
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.