Words are all you need? Capturing human sensory similarity with textual
descriptors
- URL: http://arxiv.org/abs/2206.04105v1
- Date: Wed, 8 Jun 2022 18:09:19 GMT
- Title: Words are all you need? Capturing human sensory similarity with textual
descriptors
- Authors: Raja Marjieh, Pol van Rijn, Ilia Sucholutsky, Theodore R. Sumers,
Harin Lee, Thomas L. Griffiths, Nori Jacoby
- Abstract summary: We explore the relation between human similarity judgments and language.
We introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general.
We show that our prediction pipeline based on text descriptors exhibits excellent performance.
- Score: 12.191617984664683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal training use textual descriptions to
significantly enhance machine understanding of images and videos. Yet, it
remains unclear to what extent language can fully capture sensory experiences
across different modalities. A well-established approach for characterizing
sensory experiences relies on similarity judgments, namely, the degree to which
people perceive two distinct stimuli as similar. We explore the relation
between human similarity judgments and language in a series of large-scale
behavioral studies ($N=1,823$ participants) across three modalities (images,
audio, and video) and two types of text descriptors: simple word tags and
free-text captions. In doing so, we introduce a novel adaptive pipeline for tag
mining that is both efficient and domain-general. We show that our prediction
pipeline based on text descriptors exhibits excellent performance, and we
compare it against a comprehensive array of 611 baseline models based on
vision-, audio-, and video-processing architectures. We further show that the
degree to which textual descriptors and models predict human similarity varies
across and within modalities. Taken together, these studies illustrate the
value of integrating machine learning and cognitive science approaches to
better understand the similarities and differences between human and machine
representations. We present an interactive visualization at
https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the
similarity between stimuli as experienced by humans and different methods
reported in the paper.
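To make the text-descriptor prediction pipeline concrete, here is a minimal sketch (not the authors' code) of one way to predict pairwise similarity from free-text captions and score it against human judgments: captions are embedded with an off-the-shelf sentence encoder, pairwise cosine similarities are computed, and the upper triangle is correlated with a human similarity matrix. The encoder name, captions, and human ratings below are illustrative placeholders.

```python
# Minimal sketch: predict pairwise stimulus similarity from free-text captions
# and compare against human similarity judgments (placeholder data).
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer  # assumed encoder, not the paper's

# One caption per stimulus (placeholders standing in for crowd-sourced captions).
captions = [
    "a golden retriever running on a beach",
    "a small dog playing in shallow water",
    "a red sports car parked on a city street",
]

# Hypothetical human similarity matrix (symmetric, 0-1), e.g. averaged pairwise ratings.
human_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(captions)                           # (n_stimuli, dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
pred_sim = emb @ emb.T                                 # cosine similarity matrix

# Correlate the upper triangles (excluding the diagonal).
iu = np.triu_indices(len(captions), k=1)
rho, p = spearmanr(pred_sim[iu], human_sim[iu])
print(f"Spearman correlation with human judgments: {rho:.2f} (p={p:.3f})")
```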
Related papers
- Autoregressive Pre-Training on Pixels and Texts [35.82610192457444]
We explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts.
Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head.
We find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
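As a rough illustration of the dual-head autoregressive setup summarized above, the sketch below pairs a shared causal Transformer backbone with a regression head for next-patch prediction and a classification head for next-token prediction. Dimensions, losses, and the masking scheme are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a dual-head autoregressive model: one shared causal backbone,
# a regression head for next patch prediction (visual stream) and a classification
# head for next token prediction (text stream). All sizes are illustrative.
import torch
import torch.nn as nn

class DualHeadAutoregressor(nn.Module):
    def __init__(self, d_model=256, patch_dim=768, vocab_size=32000, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.patch_in = nn.Linear(patch_dim, d_model)    # project image patches
        self.tok_in = nn.Embedding(vocab_size, d_model)  # embed text tokens
        self.patch_head = nn.Linear(d_model, patch_dim)  # regression head
        self.token_head = nn.Linear(d_model, vocab_size) # classification head

    def _causal(self, n, device):
        return torch.triu(torch.full((n, n), float("-inf"), device=device), 1)

    def forward_patches(self, patches):                  # (B, T, patch_dim)
        h = self.backbone(self.patch_in(patches),
                          mask=self._causal(patches.size(1), patches.device))
        pred = self.patch_head(h[:, :-1])                # predict the next patch
        return nn.functional.mse_loss(pred, patches[:, 1:])

    def forward_tokens(self, tokens):                    # (B, T) int64
        h = self.backbone(self.tok_in(tokens),
                          mask=self._causal(tokens.size(1), tokens.device))
        logits = self.token_head(h[:, :-1])              # predict the next token
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

model = DualHeadAutoregressor()
img_loss = model.forward_patches(torch.randn(2, 16, 768))
txt_loss = model.forward_tokens(torch.randint(0, 32000, (2, 16)))
print(img_loss.item(), txt_loss.item())
```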
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, these descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
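A minimal sketch of the two-stage pipeline summarized above, assuming a stubbed vision-language model call and placeholder encoders: stage one asks a VLLM for a natural-language description of the apparent emotion, and stage two fuses that description with image features to train a classifier. Every component here is a stand-in, not the paper's model.

```python
# Two-stage sketch: (1) obtain an emotion description from a vision-language model
# (stubbed out here), (2) fuse that description with image features in a classifier.
# All components and feature dimensions are placeholders.
import torch
import torch.nn as nn

def describe_emotion(image: torch.Tensor) -> str:
    """Placeholder for a VLLM prompt such as 'Describe the apparent emotion of the person.'"""
    return "the person appears relaxed and mildly amused"

class FusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=384, n_classes=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

# Stage 1: generate the description (stub); Stage 2: encode and classify.
image = torch.randn(3, 224, 224)
description = describe_emotion(image)
img_feat = torch.randn(1, 512)   # stand-in for a visual encoder's output
txt_feat = torch.randn(1, 384)   # stand-in for a sentence encoder applied to `description`
logits = FusionClassifier()(img_feat, txt_feat)
print(description, logits.shape)
```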
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are tackled via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
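The teacher-student self-training loop mentioned above can be sketched roughly as follows: a teacher trained on the small labeled set assigns pseudo-labels to unlabeled examples (here reduced to generic feature vectors), and a student is trained on the labeled plus confidently pseudo-labeled data. The models, features, and confidence threshold are illustrative assumptions, not the paper's setup.

```python
# Rough sketch of confidence-thresholded teacher-student self-training.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(64, 3)   # stand-in for a model trained on the labeled set
student = nn.Linear(64, 3)   # stand-in for the student to be trained

x_labeled = torch.randn(16, 64)
y_labeled = torch.randint(0, 3, (16,))
x_unlabeled = torch.randn(100, 64)   # e.g. posts with user comments but no labels

# The teacher assigns pseudo-labels; keep only the confident ones.
with torch.no_grad():
    probs = F.softmax(teacher(x_unlabeled), dim=-1)
    confidence, pseudo = probs.max(dim=-1)
    keep = confidence > 0.9
x_pseudo, y_pseudo = x_unlabeled[keep], pseudo[keep]

# Train the student on labeled + pseudo-labeled data.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x_train = torch.cat([x_labeled, x_pseudo])
y_train = torch.cat([y_labeled, y_pseudo])
for _ in range(10):
    opt.zero_grad()
    loss = F.cross_entropy(student(x_train), y_train)
    loss.backward()
    opt.step()
print(f"kept {keep.sum().item()} pseudo-labeled examples, final loss {loss.item():.3f}")
```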
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
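A minimal sketch of the retrieve-and-encode idea summarized above, under the assumption of a toy topic-to-image lookup table: images associated with a topic word in the sentence are retrieved, the sentence is encoded by a placeholder Transformer encoder, the images by a small CNN, and the two representations are concatenated for a downstream task head. Nothing here is the paper's actual table or architecture.

```python
# Toy sketch: retrieve images via a topic-image lookup table, encode text with a
# Transformer encoder and images with a CNN, then fuse. All components are placeholders.
import torch
import torch.nn as nn

# Hypothetical lookup table: topic word -> tensor of associated images (N, 3, 64, 64).
topic_image_table = {
    "dog": torch.randn(2, 3, 64, 64),
    "car": torch.randn(3, 3, 64, 64),
}

def retrieve_images(sentence: str, max_images: int = 2) -> torch.Tensor:
    """Return up to `max_images` images whose topic word appears in the sentence."""
    for topic, imgs in topic_image_table.items():
        if topic in sentence.lower():
            return imgs[:max_images]
    return torch.zeros(1, 3, 64, 64)   # fallback: a blank image

text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)
image_encoder = nn.Sequential(          # tiny CNN standing in for a larger vision model
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())

sentence = "A dog is running across the field"
token_embeddings = torch.randn(1, 8, 128)                          # placeholder token embeddings
text_repr = text_encoder(token_embeddings).mean(dim=1)             # (1, 128)
image_repr = image_encoder(retrieve_images(sentence)).mean(dim=0, keepdim=True)  # (1, 16)
fused = torch.cat([text_repr, image_repr], dim=-1)                 # fed to a downstream task head
print(fused.shape)
```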
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, mainly spanning vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
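A simplified sketch of the masked-reconstruction objective summarized above, assuming a shared encoder over concatenated image patches and text token embeddings: random positions in each modality are masked, and the model reconstructs the missing patches (regression loss) and tokens (classification loss). Dimensions and the mask ratio are illustrative, not the paper's.

```python
# Simplified multi-modal masked autoencoding sketch: mask random image patches and
# text tokens, encode the corrupted joint sequence, and reconstruct what was masked.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, patch_dim, vocab = 128, 256, 1000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True), num_layers=2)
patch_proj, tok_embed = nn.Linear(patch_dim, d_model), nn.Embedding(vocab, d_model)
patch_head, token_head = nn.Linear(d_model, patch_dim), nn.Linear(d_model, vocab)
mask_embed = nn.Parameter(torch.zeros(d_model))     # learned [MASK] embedding

patches = torch.randn(2, 16, patch_dim)             # image patches
tokens = torch.randint(0, vocab, (2, 12))           # text tokens

# Build the joint input sequence and randomly mask 50% of positions in each modality.
x = torch.cat([patch_proj(patches), tok_embed(tokens)], dim=1)      # (2, 28, d_model)
mask = torch.rand(x.shape[:2]) < 0.5
x = torch.where(mask.unsqueeze(-1), mask_embed.expand_as(x), x)

h = encoder(x)
h_patches, h_tokens = h[:, :16], h[:, 16:]
m_patches, m_tokens = mask[:, :16], mask[:, 16:]

# Reconstruct only the masked positions: regression for patches, cross-entropy for tokens.
pixel_loss = F.mse_loss(patch_head(h_patches)[m_patches], patches[m_patches])
token_loss = F.cross_entropy(token_head(h_tokens)[m_tokens], tokens[m_tokens])
print(pixel_loss.item(), token_loss.item())
```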
- Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge [8.208534667678792]
Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks.
We create visually grounded word embeddings by combining English text and images and compare them to popular text-based methods.
Our analysis shows that visually grounded embedding similarities are more predictive of human reaction times than purely text-based embeddings.
arXiv Detail & Related papers (2022-02-21T15:13:48Z)
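One simple way to realize the visually grounded embedding idea summarized above is to concatenate a word's text-based vector with an image-derived vector for that word and then inspect the resulting similarity structure (with real data, this could be correlated against human reaction times). The sketch below uses random placeholder vectors; the paper's fusion method and data are not reproduced here.

```python
# Toy sketch: build visually grounded word vectors by concatenating L2-normalized
# text and image vectors per word, then compare pairwise similarities against a
# text-only baseline. All vectors are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
words = ["banana", "lemon", "truck"]
text_vecs = {w: rng.normal(size=300) for w in words}    # e.g. word2vec-style vectors
image_vecs = {w: rng.normal(size=512) for w in words}   # e.g. averaged CNN features

def normalize(v):
    return v / np.linalg.norm(v)

grounded = {w: np.concatenate([normalize(text_vecs[w]), normalize(image_vecs[w])])
            for w in words}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise similarities of grounded vs. text-only embeddings; with behavioral data,
# each could be tested as a predictor of human reaction times.
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        print(w1, w2,
              f"grounded={cosine(grounded[w1], grounded[w2]):.2f}",
              f"text-only={cosine(text_vecs[w1], text_vecs[w2]):.2f}")
```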
- Exploring the Sensory Spaces of English Perceptual Verbs in Natural Language Data [0.40611352512781856]
We focus on the most frequent perception verbs of English, analyzed in terms of the Agentive vs. Experiential distinction.
In this study we report on a data-driven approach based on distributional-semantic word embeddings and clustering models.
arXiv Detail & Related papers (2021-10-19T03:58:44Z)
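The data-driven approach mentioned above (distributional word embeddings plus clustering) can be sketched as follows, with random placeholder vectors standing in for corpus-derived embeddings of English perception verbs; KMeans is used purely as an example clustering model.

```python
# Sketch of clustering distributional embeddings of perception verbs.
# The vectors are random placeholders, not corpus-derived embeddings.
import numpy as np
from sklearn.cluster import KMeans

verbs = ["see", "look", "hear", "listen", "feel", "touch", "smell", "taste"]
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(len(verbs), 100))   # stand-in for distributional vectors

# Cluster the verbs; with real embeddings one could inspect whether clusters align
# with e.g. an Agentive (look, listen) vs. Experiential (see, hear) distinction.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for verb, label in zip(verbs, labels):
    print(f"{verb}: cluster {label}")
```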
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images containing in-vocabulary words but generalize poorly to images with out-of-vocabulary words.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
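A minimal sketch of a mutual learning step in the spirit described above: two models are trained jointly, each matching the ground truth while also being pulled toward the other's predictive distribution via a KL term. The models, data, and loss weights are placeholders, not the paper's recognizers.

```python
# Sketch of one mutual learning update: each model fits the labels and mimics its peer.
import torch
import torch.nn as nn
import torch.nn.functional as F

model_a = nn.Linear(32, 10)   # stand-in for a vocabulary-guided recognizer
model_b = nn.Linear(32, 10)   # stand-in for a vocabulary-free recognizer
opt = torch.optim.Adam(list(model_a.parameters()) + list(model_b.parameters()), lr=1e-3)

x = torch.randn(8, 32)                 # placeholder features
y = torch.randint(0, 10, (8,))         # placeholder labels

logits_a, logits_b = model_a(x), model_b(x)
kl_ab = F.kl_div(F.log_softmax(logits_a, -1), F.softmax(logits_b, -1).detach(),
                 reduction="batchmean")
kl_ba = F.kl_div(F.log_softmax(logits_b, -1), F.softmax(logits_a, -1).detach(),
                 reduction="batchmean")
loss = (F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
        + 0.5 * (kl_ab + kl_ba))       # mutual learning: each model mimics the other

opt.zero_grad()
loss.backward()
opt.step()
print(f"joint loss: {loss.item():.3f}")
```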
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.