Exploring External Knowledge for Accurate modeling of Visual and
Language Problems
- URL: http://arxiv.org/abs/2302.08901v1
- Date: Fri, 27 Jan 2023 02:01:50 GMT
- Title: Exploring External Knowledge for Accurate modeling of Visual and
Language Problems
- Authors: Xuewen Yang
- Abstract summary: This dissertation focuses on visual and language understanding, which involves many challenging tasks.
The state-of-the-art methods for solving these problems usually involve only two parts: source data and target labels.
We developed a methodology that first extracts external knowledge and then integrates it with the original models.
- Score: 2.7190267444272056
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The interest in Artificial Intelligence (AI) and its applications has seen
unprecedented growth in the last few years. The success can be partly
attributed to the advancements of deep neural networks made in the sub-fields
of AI such as Computer Vision (CV) and Natural Language Processing (NLP). The
promising research area that this dissertation focuses on is visual and
language understanding, which involves many challenging tasks, e.g.,
classification, detection, segmentation, machine translation, and captioning.
The state-of-the-art methods for solving these problems usually involve
only two parts: source data and target labels, which is rather insufficient,
especially when the dataset is small. Meanwhile, many external tools or sources
can provide extra useful information (external knowledge) that can help improve
the performance of these methods. For example, a detection model has been
applied to provide better object features than state-of-the-art ResNet for
image captioning models. Inspired by this observation, we developed a
methodology that first extracts external knowledge and then integrates it
with the original models. The external knowledge is either extracted from the
dataset or comes directly from external sources, e.g., grammar rules or scene
graphs. We apply this methodology to different AI tasks, including machine
translation and image captioning, and improve the original state-of-the-art
models by a large margin.
Related papers
- Deep Learning and Machine Learning -- Natural Language Processing: From Theory to Application [17.367710635990083]
We focus on natural language processing (NLP) and the role of large language models (LLMs).
This paper discusses advanced data preprocessing techniques and the use of frameworks like Hugging Face for implementing transformer-based models.
It highlights challenges such as handling multilingual data, reducing bias, and ensuring model robustness.
arXiv Detail & Related papers (2024-10-30T09:35:35Z)
- VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models [0.0]
We propose a novel framework named VALE (Visual and Language Explanation).
VALE integrates explainable AI techniques with advanced language models to provide comprehensive explanations.
In this paper, we conduct a pilot study of the VALE framework for image classification tasks.
arXiv Detail & Related papers (2024-08-23T03:02:11Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Towards A Unified Agent with Foundation Models [18.558328028366816]
We investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents.
We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges.
We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets.
arXiv Detail & Related papers (2023-07-18T22:37:30Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.