MAAS: Multi-modal Assignation for Active Speaker Detection
- URL: http://arxiv.org/abs/2101.03682v1
- Date: Mon, 11 Jan 2021 02:57:25 GMT
- Title: MAAS: Multi-modal Assignation for Active Speaker Detection
- Authors: Juan León-Alcázar, Fabian Caba Heilbron, Ali Thabet, and Bernard
Ghanem
- Abstract summary: We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem.
- Score: 59.08836580733918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active speaker detection requires a solid integration of multi-modal cues.
While individual modalities can approximate a solution, accurate predictions
can only be achieved by explicitly fusing the audio and visual features and
modeling their temporal progression. Despite its inherent multi-modal nature,
current methods still focus on modeling and fusing short-term audio-visual
features for individual speakers, often at the frame level. In this paper we
present a novel approach to active speaker detection that directly addresses
the multi-modal nature of the problem, and provides a straightforward strategy
where independent visual features from potential speakers in the scene are
assigned to a previously detected speech event. Our experiments show that a
small graph data structure built from a single frame allows us to approximate
an instantaneous audio-visual assignment problem. Moreover, the temporal extension
of this initial graph achieves a new state-of-the-art on the AVA-ActiveSpeaker
dataset with an mAP of 88.8%.
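
The assignment idea in the abstract can be made concrete with a toy graph construction. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes one audio node per frame (the detected speech event) connected to one visual node per candidate speaker, and chains matching node ids across consecutive frames as a stand-in for the temporal extension. All names (FrameGraph, build_frame_graph, extend_temporally) and the dummy 4-d features are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FrameGraph:
    """One audio node plus one visual node per candidate speaker in a frame."""
    audio_feat: List[float]                  # feature of the detected speech event
    visual_feats: List[List[float]]          # one feature vector per face crop
    edges: List[Tuple[int, int]] = field(default_factory=list)  # (src, dst) node ids

def build_frame_graph(audio_feat: List[float],
                      visual_feats: List[List[float]]) -> FrameGraph:
    """Single-frame assignment graph: node 0 is audio, nodes 1..V are speakers.

    Connecting every visual node to the audio node casts active speaker
    detection as deciding which visual node best explains the speech event.
    """
    g = FrameGraph(audio_feat, visual_feats)
    for v in range(1, len(visual_feats) + 1):
        g.edges.append((0, v))   # audio -> speaker v
        g.edges.append((v, 0))   # speaker v -> audio (undirected in effect)
    return g

def extend_temporally(graphs: List[FrameGraph]
                      ) -> List[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """Link matching node ids across consecutive frames.

    Returns cross-frame edges as ((frame_t, node), (frame_t+1, node)) pairs,
    a crude stand-in for the temporal extension described in the abstract.
    """
    temporal_edges = []
    for t in range(len(graphs) - 1):
        shared = min(len(graphs[t].visual_feats), len(graphs[t + 1].visual_feats))
        for n in range(shared + 1):  # +1 so the audio node (id 0) is linked too
            temporal_edges.append(((t, n), (t + 1, n)))
    return temporal_edges

# Toy usage: two frames, two candidate speakers each, 4-d dummy features.
frames = [build_frame_graph([0.1] * 4, [[0.2] * 4, [0.3] * 4]) for _ in range(2)]
print(len(frames[0].edges))            # 4 intra-frame edges (2 speakers x 2 directions)
print(len(extend_temporally(frames)))  # 3 temporal edges (audio + 2 speakers)
```

In the paper's full pipeline these graphs would presumably be consumed by a learned model that scores each visual node against the audio node; the sketch covers only the graph construction step that the abstract describes.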