FashionVQA: A Domain-Specific Visual Question Answering System
- URL: http://arxiv.org/abs/2208.11253v1
- Date: Wed, 24 Aug 2022 01:18:13 GMT
- Title: FashionVQA: A Domain-Specific Visual Question Answering System
- Authors: Min Wang, Ata Mahjoubfar, Anupama Joshi
- Abstract summary: We train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images.
The accuracy of the best model surpasses the human expert level, even when answering human-generated questions.
Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language.
- Score: 2.6924405243296134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans apprehend the world through various sensory modalities, yet language
is their predominant communication channel. Machine learning systems need to
draw on the same multimodal richness to have informed discourses with humans in
natural language; this is particularly true for systems specialized in
visually-dense information, such as dialogue, recommendation, and search
engines for clothing. To this end, we train a visual question answering (VQA)
system to answer complex natural language questions about apparel in fashion
photoshoot images. The key to the successful training of our VQA model is the
automatic creation of a visual question-answering dataset with 168 million
samples from item attributes of 207 thousand images using diverse templates.
The sample generation employs a strategy that considers the difficulty of the
question-answer pairs to emphasize challenging concepts. Contrary to the recent
trends in using several datasets for pretraining the visual question answering
models, we focused on keeping the dataset fixed while training various models
from scratch to isolate the improvements from model architecture changes. We
see that using the same transformer for encoding the question and decoding the
answer, as in language models, achieves maximum accuracy, showing that visual
language models (VLMs) make the best visual question answering systems for our
dataset. The accuracy of the best model surpasses the human expert level, even
when answering human-generated questions that are not confined to the template
formats. Our approach for generating a large-scale multimodal domain-specific
dataset provides a path for training specialized models capable of
communicating in natural language. The training of such domain-expert models,
e.g., our fashion VLM model, cannot rely solely on the large-scale
general-purpose datasets collected from the web.
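As a concrete illustration of the dataset construction described in the abstract, the following minimal Python sketch expands item attributes into question-answer pairs from templates and then samples them with a bias toward harder pairs. The template strings, attribute names, and the frequency-based difficulty heuristic are illustrative assumptions, not the authors' released pipeline.

    import random
    from collections import Counter

    # Hypothetical question templates keyed by the attribute they ask about.
    TEMPLATES = {
        "color":    ["What color is the {category}?",
                     "Which color best describes the {category}?"],
        "pattern":  ["What pattern does the {category} have?"],
        "neckline": ["What type of neckline does the {category} have?"],
    }

    def generate_qa_pairs(item):
        """Expand one catalog item (a dict of attributes) into QA samples."""
        pairs = []
        for attribute, templates in TEMPLATES.items():
            answer = item["attributes"].get(attribute)
            if answer is None:
                continue
            for template in templates:
                pairs.append({"image_id": item["image_id"],
                              "question": template.format(category=item["category"]),
                              "answer": answer,
                              "attribute": attribute})
        return pairs

    def sample_dataset(items, num_samples, seed=0):
        """Draw a training set that emphasizes challenging question-answer pairs."""
        rng = random.Random(seed)
        pool = [pair for item in items for pair in generate_qa_pairs(item)]
        # Treat rarer answers as harder concepts, so they are sampled more often.
        answer_counts = Counter(pair["answer"] for pair in pool)
        weights = [1.0 / answer_counts[pair["answer"]] for pair in pool]
        return rng.choices(pool, weights=weights, k=num_samples)

    # Example usage with a single mock catalog item.
    item = {"image_id": "img_001", "category": "dress",
            "attributes": {"color": "red", "pattern": "floral"}}
    print(sample_dataset([item], num_samples=3))

In the paper's setting, this kind of expansion is applied to item attributes of roughly 207 thousand images to yield about 168 million samples; the sketch only shows the mechanics at toy scale.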
Related papers
- Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models [36.56689822791777]
Knowledge-Based Visual Question Answering (KBVQA) advances visual question answering by adding external knowledge alongside images to respond to questions.
Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method.
Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets.
arXiv Detail & Related papers (2024-06-14T13:07:46Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [10.074327344317116]
We propose Q&A Prompts to equip AI models with robust cross-modality reasoning ability.
We first use the image-answer pairs and the corresponding questions in a training set as inputs and outputs to train a visual question generation model.
We then use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers.
arXiv Detail & Related papers (2024-01-19T14:22:29Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [49.82293730925404]
Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on.
We show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue.
arXiv Detail & Related papers (2022-04-01T17:43:13Z)
- Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering [17.51860125438028]
We propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models.
Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters.
arXiv Detail & Related papers (2021-09-15T14:11:29Z)
- What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass along the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.