Developing Speech Processing Pipelines for Police Accountability
- URL: http://arxiv.org/abs/2306.06086v1
- Date: Fri, 9 Jun 2023 17:48:58 GMT
- Title: Developing Speech Processing Pipelines for Police Accountability
- Authors: Anjalie Field, Prateek Verma, Nay San, Jennifer L. Eberhardt, Dan Jurafsky
- Abstract summary: Police body-worn cameras have the potential to improve accountability and transparency in policing. Yet in practice, they result in millions of hours of footage that is never reviewed.
We investigate the potential of large pre-trained speech models for facilitating reviews, focusing on ASR and officer speech detection in footage from traffic stops.
Our proposed pipeline includes training data alignment and filtering, fine-tuning with resource constraints, and combining officer speech detection with ASR for a fully automated approach.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Police body-worn cameras have the potential to improve accountability and
transparency in policing. Yet in practice, they result in millions of hours of
footage that is never reviewed. We investigate the potential of large
pre-trained speech models for facilitating reviews, focusing on ASR and officer
speech detection in footage from traffic stops. Our proposed pipeline includes
training data alignment and filtering, fine-tuning with resource constraints,
and combining officer speech detection with ASR for a fully automated approach.
We find that (1) fine-tuning strongly improves ASR performance on officer
speech (WER=12-13%), (2) ASR on officer speech is much more accurate than on
community member speech (WER=43.55-49.07%), (3) domain-specific tasks like
officer speech detection and diarization remain challenging. Our work offers
practical applications for reviewing body camera footage and general guidance
for adapting pre-trained speech models to noisy multi-speaker domains.
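The proposed pipeline maps naturally onto off-the-shelf components. Below is a minimal sketch assuming a public wav2vec2 checkpoint as a stand-in for the paper's fine-tuned model; `is_officer_segment` is a hypothetical placeholder for the officer speech detection step, which the paper implements with a trained classifier.

```python
# Minimal sketch of the fully automated pipeline: run ASR over body-camera
# audio, then keep only the segments a classifier attributes to the officer.
# The checkpoint is a public stand-in, not the paper's fine-tuned model.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-960h-lv60-self",  # stand-in checkpoint
    chunk_length_s=30,                               # handle long-form footage
    return_timestamps="word",
)

def is_officer_segment(chunk) -> bool:
    """Hypothetical officer-speech detector; the paper trains a dedicated
    classifier for this step."""
    raise NotImplementedError

def transcribe_officer_speech(wav_path: str):
    output = asr(wav_path)
    # Filter word-level chunks down to officer speech only.
    return [c for c in output["chunks"] if is_officer_segment(c)]
```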
Related papers
- Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM Approach [11.469965123352287]
This study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data.
Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft.
This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices.
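The drafting step can be pictured as assembling speaker-tagged, noisy ASR turns into a structured prompt for the LLM. A schematic sketch follows; the `Turn` structure and the prompt fields are illustrative, not the system's actual design.

```python
# Illustrative only: assemble noisy, speaker-tagged ASR output into a
# structured prompt for an LLM report drafter. Field names are invented.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # e.g. "OFFICER_1", "DRIVER"
    text: str      # noisy ASR hypothesis
    start_s: float

def build_draft_prompt(turns: list[Turn]) -> str:
    dialogue = "\n".join(f"[{t.start_s:7.1f}s] {t.speaker}: {t.text}" for t in turns)
    return (
        "Draft a police incident report from the transcript below.\n"
        "Extract: location, involved parties, stated reason for the stop,\n"
        "and actions taken. Flag low-confidence passages instead of guessing.\n\n"
        f"Transcript:\n{dialogue}\n"
    )
```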
arXiv Detail & Related papers (2025-02-11T16:27:28Z)
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
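One common recipe for internalizing an explicit intermediate step is to truncate it progressively from the training target; the schematic below illustrates that general idea under our own assumptions, not the paper's exact schedule.

```python
# Schematic of implicit-CoT-style internalization: the training target
# begins as "<transcript> <response>" and the transcript span is truncated
# on a schedule until the model maps speech to the response directly.
# The schedule below is illustrative, not the paper's recipe.
def make_target(transcript_tokens, response_tokens, epoch, total_epochs):
    frac_kept = max(0.0, 1.0 - epoch / (0.8 * total_epochs))
    n_keep = int(len(transcript_tokens) * frac_kept)
    return transcript_tokens[:n_keep] + response_tokens

# Early epochs supervise the full transcript; by ~80% of training the
# transcript is gone, so ASR becomes implicit in the model's hidden states.
```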
arXiv Detail & Related papers (2024-09-25T20:59:12Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
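The standard MLLM interface here is to project frozen audio and video encoder outputs into the LLM's embedding space as prefix tokens. A minimal torch sketch, with placeholder dimensions rather than Llama-AVSR's actual configuration:

```python
import torch
import torch.nn as nn

class AVProjector(nn.Module):
    """Maps frozen audio/video encoder features into LLM embedding space,
    the usual MLLM interface. Dimensions are placeholders, not
    Llama-AVSR's actual configuration."""
    def __init__(self, d_audio=1024, d_video=768, d_llm=4096):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_llm)
        self.video_proj = nn.Linear(d_video, d_llm)

    def forward(self, audio_feats, video_feats):
        # Concatenate projected modalities along the sequence axis; the LLM
        # then attends over them as ordinary prefix tokens.
        return torch.cat(
            [self.audio_proj(audio_feats), self.video_proj(video_feats)], dim=1
        )
```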
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
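The bridge runs speech features through an unsupervised ASR model to obtain pseudo-text for an ordinary textual pre-trained model. A sketch, with `unsupervised_asr` as a stub for a wav2vec-U-style decoder and BERT as an assumed text encoder:

```python
# Sketch of the bridge: speech -> unsupervised ASR -> pseudo-text -> text
# pre-trained model. `unsupervised_asr` is a stub (no one-line public API
# exists for it); the text side is an ordinary pre-trained encoder.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_model = AutoModel.from_pretrained("bert-base-uncased")

def unsupervised_asr(speech_features) -> str:
    """Stub: a wav2vec-U-style model trained without paired speech/text."""
    raise NotImplementedError

def bridge(speech_features):
    pseudo_text = unsupervised_asr(speech_features)
    inputs = tokenizer(pseudo_text, return_tensors="pt")
    return text_model(**inputs).last_hidden_state
```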
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
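Structurally, speech and text both feed a shared unit encoder that the text decoder attends to. A skeleton in torch, with all layer sizes illustrative rather than SpeechUT's:

```python
import torch.nn as nn

class SpeechUTSkeleton(nn.Module):
    """Structural sketch only: speech is encoded, passed through a shared
    unit encoder, and a text decoder cross-attends to the result. Layer
    sizes and counts are illustrative, not SpeechUT's."""
    def __init__(self, n_units=1000, d=768):
        super().__init__()
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
        self.unit_embed = nn.Embedding(n_units, d)  # discrete-unit path for text
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=6)

    def forward(self, speech_feats, text_embeds):
        units = self.unit_encoder(self.speech_encoder(speech_feats))
        return self.text_decoder(text_embeds, units)
```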
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
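At its core this is confidence-gated pseudo-labeling at run time. A schematic sketch; `decode_with_confidence`, `loss`, and the threshold are hypothetical stand-ins, not the paper's API:

```python
# Schematic of confidence-gated run-time adaptation: enhanced audio whose
# ASR hypothesis scores above a threshold is reused as training data.
# All model methods here are hypothetical stand-ins.
CONF_THRESHOLD = 0.9  # illustrative value

def adapt_step(noisy_audio, enhancer, asr, optimizer):
    enhanced = enhancer(noisy_audio)
    hypothesis, confidence = asr.decode_with_confidence(enhanced)  # hypothetical API
    if confidence < CONF_THRESHOLD:
        return  # skip unreliable pseudo-labels
    loss = asr.loss(enhanced, hypothesis)  # hypothetical API
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```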
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
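Rather than filtering the waveform, the model predicts clean discrete codec codes from fused audio-visual features and re-synthesizes speech from them. A structural sketch with placeholder dimensions:

```python
import torch.nn as nn

class AVCodeGenerator(nn.Module):
    """Sketch of enhancement-by-re-synthesis: fuse audio-visual features
    and classify, per frame, the discrete codes of a neural speech codec.
    A separate codec decoder (not shown) turns codes back into audio.
    All dimensions are placeholders."""
    def __init__(self, d_av=512, n_codes=1024):
        super().__init__()
        self.fuse = nn.GRU(d_av, 256, batch_first=True)
        self.to_codes = nn.Linear(256, n_codes)  # logits over the codec codebook

    def forward(self, av_feats):
        h, _ = self.fuse(av_feats)
        return self.to_codes(h).argmax(-1)  # predicted clean code sequence
```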
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Do We Still Need Automatic Speech Recognition for Spoken Language Understanding? [14.575551366682872]
We show that learned speech features are superior to ASR transcripts on three classification tasks.
We highlight the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words as key to better performance.
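The feature branch of this comparison is easy to reproduce with public checkpoints: mean-pool wav2vec 2.0 hidden states into an utterance vector and fit any linear classifier on top. A sketch, assuming a checkpoint and pooling of our own choosing rather than the paper's exact setup:

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative feature branch: mean-pooled wav2vec 2.0 hidden states as a
# fixed utterance vector for a linear classifier.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def utterance_embedding(waveform: np.ndarray, sr: int = 16000) -> torch.Tensor:
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    # Mean-pool over time: no transcript, hence no out-of-vocabulary failures.
    return hidden.mean(dim=1).squeeze(0)
```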
arXiv Detail & Related papers (2021-11-29T15:13:36Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
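WavLM checkpoints are available through the transformers library; a minimal feature-extraction example (the checkpoint choice is ours, and any downstream head attaches on top):

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

# Minimal WavLM feature extraction via a public checkpoint; any downstream
# head (ASR, diarization, speaker ID) attaches on top of these features.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = np.random.randn(16000).astype("float32")  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = wavlm(**inputs).last_hidden_state  # shape (1, frames, 768)
```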
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
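A typical way to plug pretrained representations into an E2E-ASR model is as a frozen frontend under a CTC head. A minimal sketch, with checkpoint and freezing choices that are ours rather than the paper's:

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLFrontendCTC(nn.Module):
    """Minimal sketch: frozen SSL features feeding a linear CTC head.
    Checkpoint and freezing choice are illustrative; the paper surveys
    many representations and E2E architectures."""
    def __init__(self, vocab_size=32):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        for p in self.ssl.parameters():
            p.requires_grad = False  # freeze the pretrained frontend
        self.head = nn.Linear(self.ssl.config.hidden_size, vocab_size)

    def forward(self, input_values):
        feats = self.ssl(input_values).last_hidden_state
        return self.head(feats).log_softmax(-1)  # per-frame CTC log-probs
```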
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.