Abstract: Previous studies demonstrated that a dynamic phone-informed compression of
the input audio is beneficial for speech translation (ST). However, they
required a dedicated model for phone recognition and did not test this solution
for direct ST, in which a single model translates the input audio into the
target language without intermediate representations. In this work, we propose
the first method able to perform a dynamic compression of the input in direct ST
models. In particular, we exploit Connectionist Temporal Classification
(CTC) to compress the input sequence according to its phonetic characteristics.
Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement
over a strong baseline on two language pairs (English-Italian and
English-German), while also reducing the memory footprint by more than 10%.
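
The compression idea can be illustrated with a minimal sketch. It assumes that each encoder frame already has a per-frame CTC prediction (for example, the argmax label) and that consecutive frames with the same prediction are merged by averaging. The abstract does not state the merging rule, so averaging is an assumption here, and the names `ctc_compress`, `frames`, and `labels` are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_compress(frames: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> List[List[float]]:
    """Merge each run of consecutive frames that share a CTC label.

    Assumption: merging is done by averaging the frame vectors; other
    choices (e.g. weighted averaging) would fit the same scheme.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    compressed: List[List[float]] = []
    start = 0
    for _, run in groupby(labels):
        length = len(list(run))                 # size of this run of equal labels
        segment = frames[start:start + length]  # frames belonging to the run
        dim = len(segment[0])
        compressed.append(
            [sum(f[d] for f in segment) / length for d in range(dim)]
        )
        start += length
    return compressed


# Toy input: six 2-dimensional frames with per-frame CTC predictions.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
labels = [7, 7, 0, 3, 3, 3]
compressed = ctc_compress(frames, labels)  # six frames become three vectors
```

In this toy case the sequence shrinks from six frames to three, which is the kind of length reduction that would lower the memory needed by the layers that follow.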