Abstract: End-to-end speech recognition generally uses hand-engineered acoustic
features as input and excludes the feature extraction module from its joint
optimization. To extract learnable and adaptive features and mitigate
information loss, we propose a new encoder that adopts globally attentive
locally recurrent (GALR) networks and directly takes raw waveform as input. We
observe improved ASR performance and robustness by applying GALR over different
window lengths to aggregate fine-grained temporal information into multi-scale
acoustic features. Experiments are conducted on the AISHELL-2 benchmark dataset
and on two large-scale Mandarin speech corpora of 5,000 and 21,000 hours.
With faster inference and comparable model size, our proposed multi-scale GALR
waveform encoder achieved consistent relative character error rate reductions
(CERRs) of 7.9% to 28.1% over strong baselines, including Conformer and
TDNN-Conformer. In particular, our approach demonstrated greater robustness
than traditional handcrafted features and outperformed the baseline
MFCC-based TDNN-Conformer model by a 15.2% CERR on a music-mixed real-world
speech test set.