Towards Event Extraction from Speech with Contextual Clues
- URL: http://arxiv.org/abs/2401.15385v1
- Date: Sat, 27 Jan 2024 11:07:19 GMT
- Title: Towards Event Extraction from Speech with Contextual Clues
- Authors: Jingqi Kang, Tongtong Wu, Jinming Zhao, Guitao Wang, Guilin Qi,
Yuan-Fang Li, Gholamreza Haffari
- Abstract summary: We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
- Score: 61.164413398231254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While text-based event extraction has been an active research area and has
seen successful application in many domains, extracting semantic events from
speech directly is an under-explored problem. In this paper, we introduce the
Speech Event Extraction (SpeechEE) task and construct three synthetic training
sets and one human-spoken test set. Compared to event extraction from text,
SpeechEE poses greater challenges mainly due to complex speech signals that are
continuous and have no word boundaries. Additionally, unlike perceptible sound
events, semantic events are more subtle and require a deeper understanding. To
tackle these challenges, we introduce a sequence-to-structure generation
paradigm that can produce events from speech signals in an end-to-end manner,
together with a conditioned generation method that utilizes speech recognition
transcripts as the contextual clue. We further propose to represent events with
a flat format to make outputs more natural language-like. Our experimental
results show that our method brings significant improvements on all datasets,
achieving a maximum F1 gain of 10.7%. The code and datasets are released on
https://github.com/jodie-kang/SpeechEE.
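To make the abstract's two main ideas concrete, the sketch below illustrates (a) a flat, natural-language-like linearization of an event record and (b) an ASR transcript used as a contextual clue that conditions the decoder. This is a hypothetical illustration, not the authors' implementation: the event schema, separator tokens, and prompt wording are assumptions, and the real format is defined in the repository linked above.

```python
# Minimal sketch (not the authors' released code) of a flat event
# linearization and a transcript-conditioned decoder context. The event
# schema, separators, and prompt wording are illustrative assumptions;
# the actual format is defined in https://github.com/jodie-kang/SpeechEE.

def linearize_event(event_type, trigger, arguments):
    """Render one event record as a flat string instead of a nested structure.

    arguments: list of (role, span) pairs, e.g. [("Attacker", "the rebels")].
    """
    parts = [f"event type is {event_type}", f"trigger is {trigger}"]
    parts += [f"{role} is {span}" for role, span in arguments]
    return "; ".join(parts)


def build_decoder_context(asr_transcript):
    """Prepend the ASR transcript as a contextual clue for the text decoder.

    In the end-to-end sequence-to-structure setting, the speech signal itself
    is consumed by a speech encoder; this text only conditions the decoder.
    """
    return f"transcript: {asr_transcript} events:"


if __name__ == "__main__":
    events = [
        {
            "type": "Conflict.Attack",
            "trigger": "bombed",
            "arguments": [("Attacker", "the rebels"), ("Place", "the capital")],
        }
    ]
    context = build_decoder_context("the rebels bombed the capital yesterday")
    target = " | ".join(
        linearize_event(e["type"], e["trigger"], e["arguments"]) for e in events
    )
    print(context)
    print(target)
```

In a full system, (context, target) pairs like these would supervise a speech-encoder/text-decoder model end to end; the transcript serves only as auxiliary conditioning text, since the audio itself is the primary input.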
Related papers
- Scaling Speech-Text Pre-training with Synthetic Interleaved Data [31.77653849518526]
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction.
Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data.
We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora.
arXiv Detail & Related papers (2024-11-26T17:19:09Z)
- Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.
This paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Few-Shot Spoken Language Understanding via Joint Speech-Text Models [18.193191170754744]
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations.
We leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks.
By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data.
arXiv Detail & Related papers (2023-10-09T17:59:21Z)
- GRASS: Unified Generation Model for Speech-to-Semantic Tasks [7.044414457214718]
We introduce a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data.
Our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more.
To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
arXiv Detail & Related papers (2023-09-06T06:44:26Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extended to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Rich Event Modeling for Script Event Prediction [60.67635412135682]
We propose the Rich Event Prediction (REP) framework for script event prediction.
REP contains an event extractor to extract rich event information from texts.
The core component of the predictor is a transformer-based event encoder to flexibly deal with an arbitrary number of arguments.
arXiv Detail & Related papers (2022-12-16T05:17:59Z)
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)