The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple
Devices in Diverse Scenarios
- URL: http://arxiv.org/abs/2306.13734v2
- Date: Fri, 14 Jul 2023 09:45:21 GMT
- Title: The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple
Devices in Diverse Scenarios
- Authors: Samuele Cornell, Matthew Wiesner, Shinji Watanabe, Desh Raj, Xuankai
Chang, Paola Garcia, Matthew Maciejewski, Yoshiki Masuyama, Zhong-Qiu Wang,
Stefano Squartini, Sanjeev Khudanpur
- Abstract summary: We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge.
This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices.
The goal is for participants to devise a single system that can generalize across different array geometries.
- Score: 61.74042680711718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The CHiME challenges have played a significant role in the development and
evaluation of robust automatic speech recognition (ASR) systems. We introduce
the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task
comprises joint ASR and diarization in far-field settings with multiple, and
possibly heterogeneous, recording devices. Different from previous challenges,
we evaluate systems on 3 diverse scenarios: CHiME-6, DiPCo, and Mixer 6. The
goal is for participants to devise a single system that can generalize across
different array geometries and use cases with no a priori information. Another
departure from earlier CHiME iterations is that participants are allowed to use
open-source pre-trained models and datasets. In this paper, we describe the
challenge design, motivation, and fundamental research questions in detail. We
also present the baseline system, which is fully array-topology agnostic and
features multi-channel diarization, channel selection, guided source separation
and a robust ASR model that leverages self-supervised speech representations
(SSLR).
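To make the baseline concrete, here is a minimal Python sketch of the four stages in order: multi-channel diarization, channel selection, guided source separation (GSS), and SSLR-based ASR. Every function name and stub body below is a hypothetical illustration of the stage ordering described in the abstract, not the actual CHiME-7 baseline recipe that accompanies the challenge.

```python
# Hypothetical sketch of the CHiME-7 DASR baseline chain; the stub
# bodies are placeholders, not the released challenge recipe.
from dataclasses import dataclass
from typing import List

import numpy as np

SAMPLE_RATE = 16_000


@dataclass
class Segment:
    speaker: str        # diarization-assigned speaker label
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    text: str = ""      # ASR hypothesis, filled in by the last stage


def diarize_multichannel(channels: List[np.ndarray]) -> List[Segment]:
    # Stub: a real system estimates speaker activity from all channels;
    # here we pretend the whole recording is one speaker turn.
    duration = len(channels[0]) / SAMPLE_RATE
    return [Segment("spk0", 0.0, duration)]


def select_channels(channels: List[np.ndarray]) -> List[np.ndarray]:
    # Stub: a real selector ranks channels so the pipeline stays
    # agnostic to the number and placement of microphones.
    return channels[: max(1, len(channels) // 2)]


def guided_source_separation(channels, seg: Segment) -> np.ndarray:
    # Stub: real GSS uses the diarization output to steer enhancement
    # toward seg.speaker; here we just average the segment's channels.
    lo, hi = int(seg.start * SAMPLE_RATE), int(seg.end * SAMPLE_RATE)
    return np.mean([c[lo:hi] for c in channels], axis=0)


def transcribe_sslr(audio: np.ndarray) -> str:
    # Stub: the baseline feeds self-supervised speech representations
    # (SSLR) into a robust ASR model; we return a placeholder.
    return "<hypothesis>"


def transcribe_meeting(channels: List[np.ndarray]) -> List[Segment]:
    segments = diarize_multichannel(channels)
    selected = select_channels(channels)
    for seg in segments:
        seg.text = transcribe_sslr(guided_source_separation(selected, seg))
    return segments


# Usage on fake 4-channel audio:
mics = [np.random.randn(SAMPLE_RATE * 5) for _ in range(4)]
print(transcribe_meeting(mics))
```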
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA)
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question; a hedged sketch of the learnable-token idea follows this entry.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
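One possible reading of "source-wise learnable tokens", sketched as a minimal PyTorch module: a set of learned query vectors cross-attends to concatenated audio and visual features so each token can specialize on one sound source. All names and sizes are illustrative assumptions, not SaSR-Net's actual architecture.

```python
# Hedged illustration of source-wise learnable tokens; this is a
# reading of the summary above, not the SaSR-Net design itself.
import torch
import torch.nn as nn


class SourceTokens(nn.Module):
    def __init__(self, n_tokens: int = 4, dim: int = 256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))  # learnable
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (batch, time, dim)
        feats = torch.cat([audio_feats, visual_feats], dim=1)
        queries = self.tokens.unsqueeze(0).expand(feats.size(0), -1, -1)
        # Each token gathers evidence from both modalities via attention.
        out, _ = self.attn(queries, feats, feats)
        return out  # (batch, n_tokens, dim): one embedding per source


# Usage with random features:
a, v = torch.randn(2, 50, 256), torch.randn(2, 30, 256)
print(SourceTokens()(a, v).shape)  # torch.Size([2, 4, 256])
```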
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples; a hedged sketch of such a loop follows this entry.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
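A minimal sketch of what a greedy pseudo-labelling loop can look like: the current model transcribes unlabelled clips, confidently decoded clips are greedily added to the training pool, and the model is retrained. The stub `train` and `transcribe_with_confidence` functions are placeholders, not the paper's implementation.

```python
# Hedged sketch of greedy pseudo-labelling; stub trainer/decoder only.
import random


def train(model, pool):
    # Stub: pretend to update the model on (clip, text) pairs.
    return model


def transcribe_with_confidence(model, clip):
    # Stub: a real decoder returns a hypothesis and its confidence.
    return "<hypothesis>", random.random()


def greedy_pseudo_label(model, labelled, unlabelled, threshold=0.9, rounds=3):
    pool = list(labelled)
    for _ in range(rounds):
        model = train(model, pool)
        remaining = []
        for clip in unlabelled:
            text, conf = transcribe_with_confidence(model, clip)
            if conf >= threshold:
                pool.append((clip, text))   # greedily accept confident labels
            else:
                remaining.append(clip)      # defer to a later round
        unlabelled = remaining
    return model


# Usage with toy data:
greedy_pseudo_label(None, [("clip0", "hello")], ["clip1", "clip2"])
```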
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
arXiv Detail & Related papers (2024-06-19T21:11:01Z)
- All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z)
- AISPACE at SemEval-2024 task 8: A Class-balanced Soft-voting System for Detecting Multi-generator Machine-generated Text [0.0]
SemEval-2024 Task 8 provides a challenge to detect human-written and machine-generated text.
This paper proposes a system that mainly deals with Subtask B.
It aims to detect whether a given full text was written by a human or generated by a specific Large Language Model (LLM), which makes it a multi-class text classification task; a hedged sketch of class-balanced soft voting follows this entry.
arXiv Detail & Related papers (2024-04-01T06:25:47Z)
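A minimal sketch of class-balanced soft voting: predicted class-probability vectors from several classifiers are averaged, and classes can be re-weighted inversely to their training frequency to counter imbalance. The weighting scheme here is an illustrative assumption, not necessarily the AISPACE system's.

```python
# Hedged sketch of class-balanced soft voting over several classifiers.
import numpy as np


def soft_vote(prob_matrices, class_counts=None):
    """prob_matrices: list of (n_samples, n_classes) probability arrays,
    one per classifier. class_counts: training samples per class."""
    avg = np.mean(prob_matrices, axis=0)          # soft vote: mean probs
    if class_counts is not None:
        weights = 1.0 / np.asarray(class_counts)  # up-weight rare classes
        avg = avg * weights / (avg * weights).sum(axis=1, keepdims=True)
    return avg.argmax(axis=1)                     # predicted class ids


# Usage: two classifiers, three classes, imbalanced training data.
p1 = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])
p2 = np.array([[0.4, 0.4, 0.2], [0.1, 0.6, 0.3]])
print(soft_vote([p1, p2], class_counts=[1000, 100, 100]))
```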
- An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders [3.1093882314734285]
Sequential Recommendation (SR) aims to predict future user-item interactions based on historical interactions.
While many SR approaches concentrate on user IDs and item IDs, the human perception of the world through multi-modal signals, like text and images, has inspired researchers to delve into constructing SR from multi-modal information without using IDs.
This paper introduces a simple and universal Multi-Modal Sequential Recommendation (MMSR) framework.
arXiv Detail & Related papers (2024-03-26T04:16:57Z)
- NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription [21.236634241186458]
We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge, alongside datasets and a baseline system.
The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios.
arXiv Detail & Related papers (2024-01-16T23:50:26Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders; a hedged sketch of this fusion pattern follows this entry.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
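A minimal sketch of multi-layer cross-attention fusion: the audio and visual streams exchange information through cross-attention at several encoder depths rather than only once at the top. Layer counts and dimensions are illustrative assumptions, not the MLCA-AVSR configuration.

```python
# Hedged sketch of fusing two streams at multiple encoder depths.
import torch
import torch.nn as nn


class CrossFusionEncoder(nn.Module):
    def __init__(self, dim=256, n_layers=3, n_heads=4):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.visual_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.cross_av = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.cross_va = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_layers))

    def forward(self, a, v):
        # a: (batch, Ta, dim) audio; v: (batch, Tv, dim) visual.
        for la, lv, ca, cv in zip(self.audio_layers, self.visual_layers,
                                  self.cross_av, self.cross_va):
            a, v = la(a), lv(v)
            # Fuse at this depth: each stream attends to the other.
            a = a + ca(a, v, v)[0]
            v = v + cv(v, a, a)[0]
        return a, v


# Usage with random features:
a, v = torch.randn(2, 40, 256), torch.randn(2, 25, 256)
fa, fv = CrossFusionEncoder()(a, v)
print(fa.shape, fv.shape)  # (2, 40, 256) and (2, 25, 256)
```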
- One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems [0.0]
We propose a unified model for solving Single-Choice Abstract visual Reasoning tasks.
The proposed model relies on the SCAR-Aware dynamic Layer (SAL), which adapts its weights to the structure of the problem; a hedged sketch of the weight-adaptation idea follows this entry.
Experiments show that SAL-based models, in general, effectively solve diverse tasks, and their performance is on par with state-of-the-art task-specific baselines.
arXiv Detail & Related papers (2023-12-15T18:15:20Z)
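One common way to realize a layer that "adapts its weights to the structure of the problem" is a small hypernetwork that generates the layer's weight matrix from a structure descriptor. The sketch below illustrates that general idea only; it is not the paper's actual SAL design.

```python
# Hedged sketch of a structure-conditioned dynamic layer (hypernetwork).
import torch
import torch.nn as nn


class DynamicLayer(nn.Module):
    def __init__(self, dim=64, struct_dim=8):
        super().__init__()
        # Hypernetwork: maps a problem-structure descriptor to weights.
        self.hyper = nn.Linear(struct_dim, dim * dim)
        self.dim = dim

    def forward(self, x, structure):
        # x: (batch, dim); structure: (batch, struct_dim), e.g. a vector
        # encoding the number of panels/choices in a reasoning task.
        w = self.hyper(structure).view(-1, self.dim, self.dim)
        return torch.bmm(x.unsqueeze(1), w).squeeze(1)  # per-sample weights


# Usage with random inputs:
x, s = torch.randn(2, 64), torch.randn(2, 8)
print(DynamicLayer()(x, s).shape)  # torch.Size([2, 64])
```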
- Multimodal Dialogue State Tracking By QA Approach with Data Augmentation [16.436557991074068]
This paper interprets the Audio-Video Scene-Aware Dialogue (AVSD) task from an open-domain Question Answering (QA) point of view.
The proposed QA system uses a common encoder-decoder framework with multimodal fusion and attention.
Our experiments show that our model and techniques bring significant improvements over the baseline model on the DSTC7-AVSD dataset.
arXiv Detail & Related papers (2020-07-20T06:23:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.