Unsupervised Speech Representation Learning for Behavior Modeling using
Triplet Enhanced Contextualized Networks
- URL: http://arxiv.org/abs/2104.03899v1
- Date: Thu, 1 Apr 2021 22:44:23 GMT
- Authors: Haoqi Li, Brian Baucom, Shrikanth Narayanan, Panayiotis Georgiou
- Abstract summary: We exploit the stationary properties of human behavior within an interaction and present a representation learning method to capture behavioral information from speech.
We present an encoder-decoder based Deep Contextualized Network (DCN) as well as a Triplet-Enhanced DCN (TE-DCN) framework to capture the behavioral context.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech encodes a wealth of information related to human behavior and has been
used in a variety of automated behavior recognition tasks. However, extracting
behavioral information from speech remains challenging, in part due to scarce
training data stemming from the often low occurrence frequency of specific
behavioral patterns. Moreover, supervised behavioral modeling typically relies
on domain-specific construct definitions and corresponding manually annotated
data, making generalization across domains difficult. In this paper, we
exploit the stationary properties of human
behavior within an interaction and present a representation learning method to
capture behavioral information from speech in an unsupervised way. We
hypothesize that nearby segments of speech share the same behavioral context
and hence map onto similar underlying behavioral representations. We present an
encoder-decoder based Deep Contextualized Network (DCN) as well as a
Triplet-Enhanced DCN (TE-DCN) framework to capture the behavioral context and
derive a manifold representation, where speech frames with similar behaviors
are closer while frames of different behaviors maintain larger distances. The
models are trained on movie audio data and validated on diverse domains,
including a couples therapy corpus and other publicly available data (e.g.,
stand-up comedy). The encouraging results demonstrate the feasibility of
unsupervised representation learning for cross-domain behavioral modeling.
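The triplet constraint described in the abstract can be sketched as follows. This is a minimal illustrative implementation of a standard triplet margin loss, not the authors' TE-DCN code; the Euclidean distance and the `margin` value are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: the anchor should lie closer to the
    positive (a nearby speech frame, assumed to share its behavioral
    context) than to the negative (a frame from a different context),
    by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings: a nearby segment acts as the positive, a distant one
# as the negative. The loss is zero once the margin constraint is satisfied.
anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [3.0, 0.0]
print(triplet_loss(anchor, positive, negative))  # 0.0 (constraint satisfied)
```

Minimizing this loss over many such triplets yields the kind of manifold the paper targets, where frames with similar behaviors cluster and frames with different behaviors are pushed apart.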
Related papers
- A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms [3.597220870252727]
We introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models.
We find that fully supervised temporal convolutional networks with the addition of temporal information perform the best on our supervised metrics across all datasets.
arXiv Detail & Related papers (2024-07-23T14:22:16Z)
- player2vec: A Language Modeling Approach to Understand Player Behavior in Games [2.2216044069240657]
Methods for learning latent user representations from historical behavior logs have gained traction for recommendation tasks in e-commerce, content streaming, and other settings.
We present a novel method for overcoming this limitation by extending a long-range Transformer model to player behavior data.
We discuss specifics of behavior tracking in games and propose preprocessing and tokenization approaches by viewing in-game events in an analogous way to words in sentences.
arXiv Detail & Related papers (2024-04-05T17:29:47Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective [68.20531518525273]
We take a closer look into existing self-supervised methods of speech from an information-theoretic perspective.
We use linear probes to estimate the mutual information between the target information and learned representations.
We explore the potential of evaluating representations in a self-supervised fashion, where we estimate the mutual information between different parts of the data without using any labels.
arXiv Detail & Related papers (2024-01-16T21:13:22Z)
- Pretraining on Interactions for Learning Grounded Affordance Representations [22.290431852705662]
We train a neural network to predict objects' trajectories in a simulated interaction.
We show that our network's latent representations differentiate between both observed and unobserved affordances.
Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.
arXiv Detail & Related papers (2022-07-05T19:19:53Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better discriminate between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Beyond Tracking: Using Deep Learning to Discover Novel Interactions in Biological Swarms [3.441021278275805]
We propose training deep network models to predict system-level states directly from generic graphical features from the entire view.
Because the resulting predictive models are not based on human-understood predictors, we use explanatory modules.
This represents an example of augmented intelligence in behavioral ecology -- knowledge co-creation in a human-AI team.
arXiv Detail & Related papers (2021-08-20T22:50:41Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Local and non-local dependency learning and emergence of rule-like representations in speech data by Deep Convolutional Generative Adversarial Networks [0.0]
This paper argues that training GANs on local and non-local dependencies in speech data offers insights into how deep neural networks discretize continuous data.
arXiv Detail & Related papers (2020-09-27T00:02:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.