Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer
Grounding
- URL: http://arxiv.org/abs/2207.05703v1
- Date: Tue, 21 Jun 2022 03:15:27 GMT
- Title: Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer
Grounding
- Authors: Junwen Pan, Guanlin Chen, Yi Liu, Jiexiang Wang, Cheng Bian, Pengfei
Zhu, Zhicheng Zhang
- Abstract summary: We propose Dual Visual-Linguistic Interaction (DaVI), a novel unified end-to-end framework with the capability for both linguistic answering and visual grounding.
DaVI introduces two visual-linguistic interaction mechanisms: 1) a visual-based linguistic encoder that interprets questions together with visual features and produces linguistic-oriented evidence for further answer decoding, and 2) a linguistic-based visual decoder that focuses visual features on the evidence-related regions for answer grounding.
- Score: 27.9150632791267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answer grounding aims to reveal the visual evidence for visual question
answering (VQA), which entails highlighting relevant positions in the image
when answering questions about images. Previous attempts typically tackle this
problem with pretrained object detectors, which lack the flexibility to handle
objects outside a predefined vocabulary. Moreover, such black-box methods
concentrate solely on linguistic generation and ignore visual
interpretability. In this paper, we propose Dual Visual-Linguistic Interaction
(DaVI), a novel unified end-to-end framework with the capability for both
linguistic answering and visual grounding. DaVI introduces two
visual-linguistic interaction mechanisms: 1) a visual-based linguistic encoder
that interprets questions together with visual features and produces
linguistic-oriented evidence for further answer decoding, and 2) a
linguistic-based visual decoder that focuses visual features on the
evidence-related regions for answer grounding. With this design, our approach
ranked first in the answer grounding track of the 2022 VizWiz Grand Challenge.
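Below is a minimal, illustrative sketch of how the two interaction mechanisms described above could be wired together. It is not the authors' released implementation: the use of transformer-style cross-attention via torch.nn.MultiheadAttention, the module names, the feature dimensions, and the per-region grounding head are all assumptions made for illustration only.

```python
# Minimal, illustrative sketch of DaVI-style dual visual-linguistic interaction.
# NOT the authors' implementation: module names, dimensions, and the use of
# torch.nn.MultiheadAttention are assumptions made for illustration only.
import torch
import torch.nn as nn


class VisualBasedLinguisticEncoder(nn.Module):
    """Question tokens attend to visual features, producing
    linguistic-oriented evidence for answer decoding."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question_tokens: torch.Tensor, visual_features: torch.Tensor):
        # question_tokens: (B, L, D), visual_features: (B, N, D)
        attended, _ = self.cross_attn(
            query=question_tokens, key=visual_features, value=visual_features
        )
        return self.norm(question_tokens + attended)  # linguistic-oriented evidence


class LinguisticBasedVisualDecoder(nn.Module):
    """Visual features attend to the linguistic evidence; a light head then
    scores each visual region for answer grounding."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.grounding_head = nn.Linear(dim, 1)

    def forward(self, visual_features: torch.Tensor, evidence: torch.Tensor):
        focused, _ = self.cross_attn(
            query=visual_features, key=evidence, value=evidence
        )
        focused = self.norm(visual_features + focused)
        # One logit per visual token/region; reshape to a 2-D mask for patch grids.
        return self.grounding_head(focused).squeeze(-1)  # (B, N)


if __name__ == "__main__":
    B, L, N, D = 2, 16, 196, 768      # batch, question length, patches, feature dim
    question = torch.randn(B, L, D)   # question token embeddings
    visual = torch.randn(B, N, D)     # visual patch features
    evidence = VisualBasedLinguisticEncoder(D)(question, visual)
    grounding_logits = LinguisticBasedVisualDecoder(D)(visual, evidence)
    print(grounding_logits.shape)     # torch.Size([2, 196])
```

In this sketch the question first attends to the image to form evidence, and the image then attends back to that evidence to score each region, mirroring the two mechanisms listed in the abstract; the actual DaVI model and its answer decoder are described in the paper itself.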
Related papers
- Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? [62.984473889987605]
We present a zero-shot framework for fine-grained visual concept learning that leverages a large language model (LLM) and a Visual Question Answering (VQA) system.
These LLM-derived questions are posed, along with the query image, to a VQA system, and the answers are aggregated to determine the presence or absence of an object in the test images.
Our experiments demonstrate performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches.
arXiv Detail & Related papers (2024-10-17T15:16:10Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z) - Equivariant and Invariant Grounding for Video Question Answering [68.33688981540998]
Most leading VideoQA models work as black boxes, obscuring the visual-linguistic alignment behind the answering process.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment.
arXiv Detail & Related papers (2022-07-26T10:01:02Z) - Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z) - Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [35.01174511816063]
We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training.
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images.
We develop a visual-language model equipped with a multi-level cross-modality attention mechanism.
arXiv Detail & Related papers (2022-03-16T09:17:41Z) - Learning to Ground Visual Objects for Visual Dialog [26.21407651331964]
We propose a novel approach to Learn to Ground visual objects for visual dialog.
A posterior distribution over visual objects is inferred from both context (history and questions) and answers.
A prior distribution, which is inferred from context only, is used to approximate the posterior distribution so that appropriate visual objects can be grounded even without answers.
arXiv Detail & Related papers (2021-09-13T14:48:44Z) - From Two to One: A New Scene Text Recognizer with Visual Language
Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z) - Vision and Language: from Visual Perception to Content Creation [100.36776435627962]
"vision to language" is probably one of the most popular topics in the past five years.
This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision"
arXiv Detail & Related papers (2019-12-26T14:07:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.