EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy
Communication in Noisy Environments
- URL: http://arxiv.org/abs/2107.04174v1
- Date: Fri, 9 Jul 2021 02:00:47 GMT
- Title: EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy
Communication in Noisy Environments
- Authors: Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao
Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, Ravish Mehra
- Abstract summary: We release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer.
We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics.
- Score: 43.05826988957987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Augmented Reality (AR) as a platform has the potential to facilitate the
reduction of the cocktail party effect. Future AR headsets could potentially
leverage information from an array of sensors spanning many different
modalities. Training and testing signal processing and machine learning
algorithms on tasks such as beam-forming and speech enhancement require high
quality representative data. To the best of the authors' knowledge, as of
publication there are no available datasets that contain synchronized
egocentric multi-channel audio and video with dynamic movement and
conversations in a noisy environment. In this work, we describe, evaluate and
release a dataset that contains over 5 hours of multi-modal data useful for
training and testing algorithms for the application of improving conversations
for an AR glasses wearer. We provide speech intelligibility, quality and
signal-to-noise ratio improvement results for a baseline method and show
improvements across all tested metrics. The dataset we are releasing contains
AR glasses egocentric multi-channel microphone array audio, wide field-of-view
RGB video, speech source pose, headset microphone audio, annotated voice
activity, speech transcriptions, head bounding boxes, target of speech and
source identification labels. We have created and are releasing this dataset to
facilitate research in multi-modal AR solutions to the cocktail party problem.
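As a rough illustration of the kind of processing this dataset is intended to support, the sketch below averages the channels of a hypothetical egocentric array recording (a deliberately trivial stand-in for a real beamformer and for the paper's baseline method) and estimates the resulting SNR improvement from voice-activity labels. The file names, channel layout, and label format are assumptions made for illustration, not the released dataset's actual structure.

```python
# Minimal sketch only: file names, channel layout, and label format below are
# assumptions for illustration, not the EasyCom release format, and the
# channel average is a trivial stand-in for the paper's baseline enhancer.
import numpy as np
import soundfile as sf


def snr_db(speech, noise, eps=1e-12):
    """Ratio of (approximate) speech power to noise power in dB."""
    return 10.0 * np.log10((np.mean(speech ** 2) + eps) / (np.mean(noise ** 2) + eps))


def crude_snr(x, vad):
    """Treat VAD-active samples as speech+noise and inactive samples as noise."""
    return snr_db(x[vad], x[~vad])


# Hypothetical inputs: a (samples, channels) egocentric array recording and a
# per-sample boolean voice-activity mask aligned to it.
array_audio, fs = sf.read("glasses_array.wav")     # shape (T, C), placeholder file
vad = np.load("voice_activity.npy").astype(bool)   # shape (T,), placeholder file

# Zero-delay channel average: the simplest possible "beamformer". A real
# system would steer delays toward the talker, e.g. using pose labels.
enhanced = array_audio.mean(axis=1)
reference = array_audio[:, 0]                      # single microphone as reference

improvement = crude_snr(enhanced, vad) - crude_snr(reference, vad)
print(f"Estimated SNR improvement: {improvement:.2f} dB")
```

With the released speech source pose annotations, the zero-delay average could be replaced by delays steered toward the annotated talker position, which is closer to the beam-forming use case the abstract describes.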
Related papers
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z) - Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Headset: Human emotion awareness under partial occlusions multimodal
dataset [19.57427512904342]
We present a new multimodal database to help advance the development of immersive technologies.
Our proposed database provides ethically compliant and diverse volumetric data, in particular 27 participants displaying posed facial expressions and subtle body movements while speaking, plus 11 participants wearing head-mounted displays (HMDs).
The dataset can be helpful in the evaluation and performance testing of various XR algorithms, including but not limited to facial expression recognition and reconstruction, facial reenactment, and volumetric video.
arXiv Detail & Related papers (2024-02-14T11:42:15Z) - Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z) - AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient
Crossmodal Learning [21.270905512076425]
We introduce DOLOS, the largest gameshow deception detection dataset with rich deceptive conversations.
We provide train-test, duration, and gender protocols to investigate the impact of different factors.
We exploit multi-task learning to enhance performance by concurrently predicting deception and audio-visual features.
arXiv Detail & Related papers (2023-03-09T08:12:16Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command
Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z) - Audio Tagging by Cross Filtering Noisy Labels [26.14064793686316]
We present a novel framework, named CrossFilter, to combat the noisy labels problem for audio tagging.
Our method achieves state-of-the-art performance and even surpasses the ensemble models.
arXiv Detail & Related papers (2020-07-16T07:55:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.