Unified Language Representation for Question Answering over Text,
Tables, and Images
- URL: http://arxiv.org/abs/2306.16762v1
- Date: Thu, 29 Jun 2023 08:02:23 GMT
- Authors: Bowen Yu, Cheng Fu, Haiyang Yu, Fei Huang, Yongbin Li
- Abstract summary: We call for an alternative paradigm, which transforms the images and tables into unified language representations.
This idea takes advantage of the power of pre-trained language models and is implemented in a framework called Solar.
Our experimental results show that Solar outperforms all existing methods by 10.6-32.3 pts on two datasets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When trying to answer complex questions, people often rely on multiple
sources of information, such as visual, textual, and tabular data. Previous
approaches to this problem have focused on designing input features or model
structure in the multi-modal space, which is inflexible for cross-modal
reasoning or data-efficient training. In this paper, we call for an alternative
paradigm that transforms images and tables into unified language
representations, reducing the task to a simpler textual QA problem that can be
solved in three steps: retrieval, ranking, and generation, all within a
language space. This idea takes advantage of the power
of pre-trained language models and is implemented in a framework called Solar.
Our experimental results show that Solar outperforms all existing methods by
10.6-32.3 pts on two datasets, MultimodalQA and MMCoQA, across ten different
metrics. Additionally, Solar achieves the best performance on the WebQA
leaderboard.
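The abstract's pipeline can be sketched in a few lines: tables (and, analogously, image descriptions) are linearized into text, after which retrieval and ranking operate in a purely language space. This is a toy illustration, not the authors' Solar code; the row-by-row linearization format and the token-overlap retriever are simplifying assumptions standing in for learned components.

```python
# Sketch of the unified-language-representation idea: linearize a table
# into sentences, then retrieve answers textually. Illustrative only.

def linearize_table(caption, header, rows):
    """Turn each table row into one sentence, e.g.
    'European capitals. city: Paris ; country: France'."""
    sentences = []
    for row in rows:
        cells = " ; ".join(f"{h}: {v}" for h, v in zip(header, row))
        sentences.append(f"{caption}. {cells}")
    return sentences

def retrieve(question, corpus, k=2):
    """Rank candidate sentences by token overlap with the question
    (a stand-in for a learned retriever/ranker)."""
    q_tokens = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda s: len(q_tokens & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

corpus = linearize_table(
    "European capitals",
    ["city", "country"],
    [["Paris", "France"], ["Berlin", "Germany"]],
)
top = retrieve("Which country is Paris in?", corpus, k=1)
```

A real system would feed the top-ranked linearized evidence to a generator (a pre-trained language model) to produce the final answer, which is what keeps the whole pipeline inside the language space.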
Related papers
- ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Visual embedding models excel at zero-shot tasks like visual retrieval and classification.
Existing CLIP-based approaches embed images and text independently, and fuse the result.
We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z)
- MST5 -- Multilingual Question Answering over Knowledge Graphs
Knowledge Graph Question Answering (KGQA) simplifies querying vast amounts of knowledge stored in a graph-based model using natural language.
Existing multilingual KGQA systems face challenges in achieving performance comparable to English systems.
We propose a simplified approach to enhance multilingual KGQA systems by incorporating linguistic context and entity information directly into the processing pipeline of a language model.
arXiv Detail & Related papers (2024-07-08T15:37:51Z)
- TANQ: An open domain dataset of table answered questions
TANQ is the first open domain question answering dataset where the answers require building tables from information across multiple sources.
We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups.
Our best-performing baseline, GPT-4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points.
arXiv Detail & Related papers (2024-05-13T14:07:20Z)
- Generating Multi-Aspect Queries for Conversational Search
We show that the same retrieval model performs up to 85% better in terms of nDCG@3 when given more than one rewritten query.
We propose a multi-aspect query generation and retrieval framework, called MQ4CS.
arXiv Detail & Related papers (2024-03-28T10:40:22Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to their execution results.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data, created from one resource rich language, e.g., English, to other languages, less rich in terms of resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.