Progressive Evidence Refinement for Open-domain Multimodal Retrieval
Question Answering
- URL: http://arxiv.org/abs/2310.09696v1
- Date: Sun, 15 Oct 2023 01:18:39 GMT
- Title: Progressive Evidence Refinement for Open-domain Multimodal Retrieval
Question Answering
- Authors: Shuwen Yang, Anran Wu, Xingjiao Wu, Luwei Xiao, Tianlong Ma, Cheng
Jin, Liang He
- Abstract summary: Current multimodal retrieval question-answering models face two main challenges: using compressed evidence features as model input loses fine-grained information within the evidence, and a gap between the feature extraction of the evidence and the question hinders the model from extracting question-relevant features. We propose a two-stage framework for evidence retrieval and question answering to alleviate these issues.
- Score: 20.59485758381809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained multimodal models have achieved significant success in
retrieval-based question answering. However, current multimodal retrieval
question-answering models face two main challenges. Firstly, utilizing
compressed evidence features as input to the model results in the loss of
fine-grained information within the evidence. Secondly, a gap exists between
the feature extraction of evidence and the question, which hinders the model
from effectively extracting critical features from the evidence based on the
given question. We propose a two-stage framework for evidence retrieval and
question-answering to alleviate these issues. First and foremost, we propose a
progressive evidence refinement strategy for selecting crucial evidence. This
strategy employs an iterative evidence retrieval approach to uncover the
logical sequence among the evidence pieces. It incorporates two rounds of
filtering to optimize the solution space, thus further ensuring temporal
efficiency. Subsequently, we introduce a semi-supervised contrastive learning
training strategy based on negative samples to expand the scope of the question
domain, allowing for a more thorough exploration of latent knowledge within
known samples. Finally, in order to mitigate the loss of fine-grained
information, we devise a multi-turn retrieval and question-answering strategy
to handle multimodal inputs. This strategy involves incorporating multimodal
evidence directly into the model as part of the historical dialogue and
question. Meanwhile, we leverage a cross-modal attention mechanism to capture
the underlying connections between the evidence and the question, and the
answer is generated through a decoding generation approach. We validate the
model's effectiveness through extensive experiments, achieving outstanding
performance on the WebQA and MultimodalQA benchmarks.
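The abstract describes the refinement pipeline only at a high level, so the following is a minimal sketch of one plausible reading of the progressive evidence refinement step, not the authors' implementation: a cheap first-round filter shrinks the candidate pool, then an iterative second round greedily builds an ordered evidence chain conditioned on what has already been selected. The `coarse_score` and `fine_score` callables, the candidate counts, and the toy word-overlap scorers are assumptions introduced purely for illustration; in the paper their roles would be played by learned cross-modal relevance models.

```python
# A minimal sketch (not the authors' code) of the two-round progressive
# evidence refinement described in the abstract. The scoring callables are
# hypothetical placeholders standing in for learned cross-modal relevance models.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    uid: str
    modality: str   # e.g. "text" or "image"
    content: str    # raw snippet or image caption/reference


def progressive_refinement(
    question: str,
    pool: List[Evidence],
    coarse_score: Callable[[str, Evidence], float],
    fine_score: Callable[[str, List[Evidence], Evidence], float],
    coarse_k: int = 20,
    fine_k: int = 5,
) -> List[Evidence]:
    """Round 1: a cheap coarse filter shrinks the solution space.
    Round 2: iterative re-scoring, conditioned on the evidence already
    selected, greedily builds an ordered evidence chain."""
    # Round 1: keep only the top coarse_k candidates by question-evidence relevance.
    candidates = sorted(pool, key=lambda e: coarse_score(question, e), reverse=True)[:coarse_k]

    # Round 2: grow the chain one piece at a time.
    selected: List[Evidence] = []
    for _ in range(min(fine_k, len(candidates))):
        best = max(candidates, key=lambda e: fine_score(question, selected, e))
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy scorers, purely illustrative: word overlap for the coarse pass,
    # overlap plus a novelty bonus for the chain-conditioned fine pass.
    def coarse(q: str, e: Evidence) -> float:
        return len(set(q.lower().split()) & set(e.content.lower().split()))

    def fine(q: str, chain: List[Evidence], e: Evidence) -> float:
        seen = {w for c in chain for w in c.content.lower().split()}
        return coarse(q, e) + 0.1 * len(set(e.content.lower().split()) - seen)

    pool = [
        Evidence("t1", "text", "The Eiffel Tower is located in Paris"),
        Evidence("i1", "image", "photo caption: Eiffel Tower at night"),
        Evidence("t2", "text", "Paris is the capital of France"),
    ]
    chain = progressive_refinement("Where is the Eiffel Tower located?",
                                   pool, coarse, fine, coarse_k=3, fine_k=2)
    print([e.uid for e in chain])  # prints ['t1', 'i1'] for these toy inputs
```

The greedy, chain-conditioned second round is meant to mirror the abstract's claim that iterative retrieval uncovers a logical sequence among the evidence pieces, while the coarse first round keeps the solution space small for temporal efficiency.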
Related papers
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: a Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- Multimodal Misinformation Detection using Large Vision-Language Models [7.505532091249881]
Large language models (LLMs) have shown remarkable performance in various tasks.
Few approaches consider evidence retrieval as part of misinformation detection.
We propose a novel re-ranking approach for multimodal evidence retrieval.
arXiv Detail & Related papers (2024-07-19T13:57:11Z)
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
- Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models [17.60243337898751]
We present a Chain-of-Action (CoA) framework for multimodal and retrieval-augmented Question-Answering (QA).
Compared to the literature, CoA overcomes two major challenges of current QA applications: (i) unfaithful hallucination that is inconsistent with real-time or domain facts and (ii) weak reasoning performance over compositional information.
arXiv Detail & Related papers (2024-03-26T03:51:01Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework [89.8609061423685]
We propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task.
To validate our estimation of PID (partial information decomposition), we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks.
We demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies.
arXiv Detail & Related papers (2023-02-23T18:59:05Z)
- Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation [33.56304858796142]
Multi-modal multi-hop question answering involves answering a question by reasoning over multiple input sources from different modalities.
Existing methods often retrieve evidence pieces separately and then use a language model to generate an answer based on the retrieved evidence.
We propose a Structured Knowledge and Unified Retrieval-Generation (RG) approach to address these issues.
arXiv Detail & Related papers (2022-12-16T18:12:04Z)
- Causal Intervention-based Prompt Debiasing for Event Argument Extraction [19.057467535856485]
We compare two kinds of prompts, name-based prompts and ontology-based prompts, and reveal how ontology-based prompt methods exceed their counterpart in zero-shot event argument extraction (EAE).
Experiments on two benchmarks demonstrate that, when modified by our debiasing method, the baseline model becomes both more effective and robust, with a significant improvement in resistance to adversarial attacks.
arXiv Detail & Related papers (2022-10-04T12:32:00Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- SRQA: Synthetic Reader for Factoid Question Answering [21.28441702154528]
We introduce a new model called SRQA, which stands for Synthetic Reader for Factoid Question Answering.
This model enhances the question answering system in the multi-document scenario from three aspects.
We evaluate SRQA on the WebQA dataset, and experiments show that our model outperforms the state-of-the-art models.
arXiv Detail & Related papers (2020-09-02T13:16:24Z)