Grounded Language Acquisition From Object and Action Imagery
- URL: http://arxiv.org/abs/2309.06335v1
- Date: Tue, 12 Sep 2023 15:52:08 GMT
- Title: Grounded Language Acquisition From Object and Action Imagery
- Authors: James Robert Kubricht and Zhaoyuan Yang and Jianwei Qiu and Peter Henry Tu
- Abstract summary: We explore the development of a private language for visual data representation.
For object recognition, a set of sketches produced by human participants from real imagery was used.
For action recognition, 2D trajectories were generated from 3D motion capture systems.
- Score: 1.5566524830295307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning approaches to natural language processing have made great
strides in recent years. While these models produce symbols that convey vast
amounts of diverse knowledge, it is unclear how such symbols are grounded in
data from the world. In this paper, we explore the development of a private
language for visual data representation by training emergent language (EL)
encoders/decoders in both i) a traditional referential game environment and ii)
a contrastive learning environment utilizing a within-class matching training
paradigm. An additional classification layer utilizing neural machine
translation and random forest classification was used to transform symbolic
representations (sequences of integer symbols) to class labels. These methods
were applied in two experiments focusing on object recognition and action
recognition. For object recognition, a set of sketches produced by human
participants from real imagery was used (Sketchy dataset) and for action
recognition, 2D trajectories were generated from 3D motion capture systems
(MOVI dataset). In order to interpret the symbols produced for data in each
experiment, gradient-weighted class activation mapping (Grad-CAM) methods were
used to identify pixel regions indicating semantic features which contribute
evidence towards symbols in learned languages. Additionally, a t-distributed
stochastic neighbor embedding (t-SNE) method was used to investigate embeddings
learned by CNN feature extractors.
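As a concrete illustration of the referential-game setup described above, the following is a minimal, hypothetical PyTorch sketch: a sender encodes CNN features into a sequence of discrete symbols via Gumbel-softmax, and a receiver must pick the target among distractors. All module names, sizes, and hyperparameters are illustrative assumptions, not the authors' implementation; the integer sequences produced this way are the kind of symbolic representation that downstream classifiers (the abstract mentions neural machine translation and random forests) would map to class labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sizes for illustration only (not taken from the paper).
VOCAB, MSG_LEN, FEAT = 16, 4, 64

class Sender(nn.Module):
    """Maps an image feature vector to a sequence of discrete symbols."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT, MSG_LEN * VOCAB)

    def forward(self, feats, tau=1.0):
        logits = self.proj(feats).view(-1, MSG_LEN, VOCAB)
        # Gumbel-softmax yields differentiable one-hot "symbols".
        return F.gumbel_softmax(logits, tau=tau, hard=True)

class Receiver(nn.Module):
    """Scores candidate images against the received message."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, FEAT)

    def forward(self, message, candidates):
        msg = self.embed(message).mean(dim=1)                # (B, FEAT)
        return torch.einsum("bf,bnf->bn", msg, candidates)   # (B, N) scores

sender, receiver = Sender(), Receiver()
opt = torch.optim.Adam(list(sender.parameters()) + list(receiver.parameters()))

# One referential-game step; the target sits at candidate index 0.
target = torch.randn(8, FEAT)            # stand-in for CNN features
distractors = torch.randn(8, 4, FEAT)
candidates = torch.cat([target.unsqueeze(1), distractors], dim=1)

opt.zero_grad()
scores = receiver(sender(target), candidates)
loss = F.cross_entropy(scores, torch.zeros(8, dtype=torch.long))
loss.backward()
opt.step()

# At inference time, integer symbol sequences come from the argmax.
symbols = sender(target).argmax(dim=-1)  # (B, MSG_LEN) integers
```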
Related papers
- On the Transition from Neural Representation to Symbolic Knowledge [2.2528422603742304]
We propose a Neural-Symbolic Transitional Dictionary Learning (TDL) framework that employs an EM algorithm to learn a transitional representation of data.
We implement the framework with a diffusion model by regarding the decomposition of the input as a cooperative game.
We additionally use RL, enabled by the Markovian property of diffusion models, to further tune the learned prototypes.
arXiv Detail & Related papers (2023-08-03T19:29:35Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels alongside the self-supervised loss function.
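A minimal sketch of what such a combined objective could look like, assuming a PyTorch setting and hypothetical names (the actual LASR formulation may differ):

```python
import torch
import torch.nn.functional as F

def lasr_style_loss(anchor, positive, negative, ssl_loss, margin=0.2, alpha=1.0):
    """Combine a label-driven triplet term with an existing self-supervised
    loss. `anchor` and `positive` share a language label; `negative` does not.
    `alpha` balances the two objectives (all values here are assumptions)."""
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return ssl_loss + alpha * triplet

# Usage with stand-in embeddings and a precomputed self-supervised loss.
a, p, n = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
loss = lasr_style_loss(a, p, n, ssl_loss=torch.tensor(0.5))
```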
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Multi-Domain Norm-referenced Encoding Enables Data Efficient Transfer Learning of Facial Expression Recognition [62.997667081978825]
We propose a biologically-inspired mechanism for transfer learning in facial expression recognition.
Our proposed architecture provides an explanation for how the human brain might innately recognize facial expressions on varying head shapes.
Our model achieves a classification accuracy of 92.15% on the FERG dataset with extreme data efficiency.
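One plausible reading of norm-referenced encoding is to represent an expression as its deviation from a neutral reference face, so that the direction of the deviation carries the expression and the magnitude its intensity; a hypothetical sketch, not the paper's code:

```python
import torch

def norm_referenced_code(expr_feats, neutral_feats, eps=1e-8):
    """Encode expression features relative to a neutral "norm" face.
    Direction ~ which expression; magnitude ~ how intense. Hypothetical."""
    diff = expr_feats - neutral_feats          # deviation from the norm face
    magnitude = diff.norm(dim=-1, keepdim=True)
    direction = diff / (magnitude + eps)       # unit vector across head shapes
    return direction, magnitude
```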
arXiv Detail & Related papers (2023-04-05T09:06:30Z)
- Natural Language-Assisted Sign Language Recognition [28.64871971445024]
We propose the Natural Language-Assisted Sign Language Recognition framework.
It exploits semantic information contained in glosses (sign labels) to mitigate the problem of visually indistinguishable signs (VISigns) in sign languages.
Our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer [55.885555581039895]
Multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding.
We propose a novel open-vocabulary framework, named multimodal knowledge transfer (MKT) for multi-label classification.
arXiv Detail & Related papers (2022-07-05T08:32:18Z)
- Leveraging Systematic Knowledge of 2D Transformations [6.668181653599057]
Humans have a remarkable ability to interpret images, even if the scenes in the images are rare.
This work focuses on 1) the acquisition of systematic knowledge of 2D transformations, and 2) architectural components that can leverage the learned knowledge in image classification tasks.
arXiv Detail & Related papers (2022-06-02T06:46:12Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
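A hedged sketch of a unified image-text-label contrastive objective, where any image-text pair sharing a label counts as a positive (temperature, normalization, and names are assumptions, not the official UniCL implementation):

```python
import torch
import torch.nn.functional as F

def unicl_style_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Bidirectional contrastive loss with label-induced positive sets.
    Requires PyTorch >= 1.10 for probabilistic cross-entropy targets."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) similarities
    pos = (labels[:, None] == labels[None, :]).float()   # shared-label positives
    targets = pos / pos.sum(dim=1, keepdim=True)         # soft multi-positive targets
    # `pos` is symmetric, so the same targets serve both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```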
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- A Transformer-Based Contrastive Learning Approach for Few-Shot Sign Language Recognition [0.0]
We propose a novel contrastive Transformer-based model that learns rich representations from sequences of body key points.
Experiments show that the model generalizes well, achieving competitive results for sign classes never seen during training.
arXiv Detail & Related papers (2022-04-05T11:42:55Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
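As a rough illustration of the background-mixup augmentation (the object mask would come from ContraCAM, which is not reproduced here; names and the blending scheme are assumptions):

```python
import torch

def background_mixup(img, other_img, object_mask, lam=0.5):
    """Keep the object region of `img` intact while blending its background
    with another image, reducing background bias. Hypothetical sketch."""
    mixed_bg = lam * img + (1.0 - lam) * other_img   # blend backgrounds
    return object_mask * img + (1.0 - object_mask) * mixed_bg
```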
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embedding different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
- Extending Maps with Semantic and Contextual Object Information for Robot Navigation: a Learning-Based Framework using Visual and Depth Cues [12.984393386954219]
This paper addresses the problem of building augmented metric representations of scenes with semantic information from RGB-D images.
We propose a complete framework to create an enhanced map representation of the environment with object-level information.
arXiv Detail & Related papers (2020-03-13T15:05:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.