Classifying Textual Data with Pre-trained Vision Models through Transfer
Learning and Data Transformations
- URL: http://arxiv.org/abs/2106.12479v1
- Date: Wed, 23 Jun 2021 15:53:38 GMT
- Title: Classifying Textual Data with Pre-trained Vision Models through Transfer
Learning and Data Transformations
- Authors: Charaf Eddine Benarab
- Abstract summary: We propose to use knowledge acquired by benchmark Vision Models which are trained on ImageNet to help a much smaller architecture learn to classify text.
An analysis of different domains and the Transfer Learning method is carried out.
The main contribution of this work is a novel approach which links large pretrained models on both language and vision to achieve state-of-the-art results.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge is acquired by humans through experience, and no boundary is set
between the kinds of knowledge or skill levels we can achieve on different
tasks at the same time. When it comes to Neural Networks, that is not the case:
the major breakthroughs in the field are extremely task and domain specific.
Vision and language are dealt with in separate manners, using separate methods
and different datasets. In this work, we propose to use knowledge acquired by
benchmark Vision Models which are trained on ImageNet to help a much smaller
architecture learn to classify text, after transforming the textual data
contained in the IMDB dataset to grayscale images. An analysis of different
domains and the Transfer Learning method is carried out. Despite the challenge
posed by the very different datasets, promising results are achieved. The main
contribution of this work is a novel approach which links large pretrained
models on both language and vision to achieve state-of-the-art results in
different sub-fields from the original task, without needing high compute
capacity resources. Specifically, Sentiment Analysis is achieved after
transferring knowledge between vision and language models. BERT embeddings are
transformed into grayscale images; these images are then used as training
examples for pretrained vision models such as VGG16 and ResNet.
Index Terms: Natural language, Vision, BERT, Transfer Learning, CNN, Domain
Adaptation.
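The embedding-to-image step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the 24 x 32 image shape, the min-max scaling, and the function names `embedding_to_grayscale` and `to_three_channels` are all assumptions, and a random vector stands in for a real 768-dimensional BERT embedding.

```python
import random


def embedding_to_grayscale(embedding, height, width):
    """Map a flat embedding vector to a height x width grid of 0-255 pixel values."""
    if len(embedding) != height * width:
        raise ValueError("embedding length must equal height * width")
    lo, hi = min(embedding), max(embedding)
    span = hi - lo
    if span == 0:
        # Constant vector: no contrast to preserve, so emit a black image.
        pixels = [0] * len(embedding)
    else:
        # Min-max scaling (an assumption; the source does not state the scaling used).
        pixels = [round(255 * (v - lo) / span) for v in embedding]
    # Cut the flat pixel list into rows of `width` pixels each.
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def to_three_channels(gray):
    """Repeat the gray plane three times, since ImageNet models expect RGB input."""
    return [[row[:] for row in gray] for _ in range(3)]


# Stand-in for a 768-dimensional BERT embedding (random values, not real data).
rng = random.Random(0)
fake_embedding = [rng.uniform(-1.0, 1.0) for _ in range(768)]

gray = embedding_to_grayscale(fake_embedding, 24, 32)  # hypothetical 24 x 32 layout
rgb = to_three_channels(gray)
print(len(rgb), len(gray), len(gray[0]))  # prints "3 24 32"
```

In practice the resulting image would still need resizing to the input size expected by the chosen vision model (for example VGG16 or ResNet) before fine-tuning; that step is omitted here.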
Related papers
- Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The aim of the project is to enhance advanced vision models' capabilities in understanding complex, interconnected visual and textual data.
arXiv Detail & Related papers (2024-05-31T15:17:47Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z)
- Vision Learners Meet Web Image-Text Pairs [32.36188289972377]
In this work, we consider self-supervised pre-training on noisy web sourced image-text paired data.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training.
We present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web sourced image-text data.
arXiv Detail & Related papers (2023-01-17T18:53:24Z)
- Context-driven Visual Object Recognition based on Knowledge Graphs [0.8701566919381223]
We propose an approach that enhances deep learning methods by using external contextual knowledge encoded in a knowledge graph.
We conduct a series of experiments to investigate the impact of different contextual views on the learned object representations for the same image dataset.
arXiv Detail & Related papers (2022-10-20T13:09:00Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types [50.1843146606122]
A simple form of transfer learning is common in current state-of-the-art computer vision models.
Previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood.
In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains.
arXiv Detail & Related papers (2021-03-24T16:24:20Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.