Conference proceeding
Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
COMPUTER VISION - ECCV 2022, PT XXXV, Vol.13695, pp.371-387
Lecture Notes in Computer Science
01/01/2022
DOI: 10.1007/978-3-031-19833-5_22
Abstract
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available: https://github.com/SRA2/SPELL.
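The graph construction described in the abstract can be illustrated with a small sketch. It is not the SPELL code from the repository above. The `Detection` record, the `max_frame_gap` window, the per-node `score` values and the `propagate` averaging step are all hypothetical stand-ins chosen for illustration; the actual model classifies nodes with a learned graph network over audio-visual features. The sketch puts one node per person per frame, adds a spatial edge between people in the same frame, adds a temporal edge between nodes of the same person in nearby frames, and compares the resulting edge count with a fully connected graph.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Detection:
    frame: int       # frame index in the video
    person_id: str   # identity of the face track (assumed to be given)
    score: float     # hypothetical per-node speaking score in [0, 1]


def build_graph(dets, max_frame_gap=1):
    """Return (spatial, temporal) edge sets of index pairs (i, j) with i < j."""
    spatial, temporal = set(), set()
    for i, j in combinations(range(len(dets)), 2):
        a, b = dets[i], dets[j]
        if a.frame == b.frame:
            # Inter-person edge: two people visible in the same frame.
            spatial.add((i, j))
        elif a.person_id == b.person_id and abs(a.frame - b.frame) <= max_frame_gap:
            # Temporal edge: the same person in nearby frames (hypothetical window).
            temporal.add((i, j))
    return spatial, temporal


def fully_connected_edge_count(n):
    """Number of undirected edges if every node were linked to every other node."""
    return n * (n - 1) // 2


def propagate(dets, edges, steps=1):
    """Toy message passing (not SPELL's model): average each score with its neighbors'."""
    neighbors = {i: [] for i in range(len(dets))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    h = [d.score for d in dets]
    for _ in range(steps):
        h = [
            (h[i] + sum(h[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
            for i in range(len(dets))
        ]
    return h


# Two people (A, B) over three frames; A's frame-1 score is a noisy low value.
dets = [
    Detection(0, "A", 0.9), Detection(0, "B", 0.1),
    Detection(1, "A", 0.3), Detection(1, "B", 0.2),
    Detection(2, "A", 0.9), Detection(2, "B", 0.1),
]
spatial, temporal = build_graph(dets, max_frame_gap=1)
h = propagate(dets, temporal)       # smooth along temporal edges only
speaking = [x > 0.5 for x in h]     # node classification by thresholding

print(len(spatial), len(temporal), fully_connected_edge_count(len(dets)))
print(speaking)
```

Running it prints `3 4 15` and `[True, False, True, False, True, False]`: the sparse graph uses 7 edges where a fully connected one would need 15, and temporal context lifts person A's noisy frame-1 score from 0.3 to 0.7. With p people per frame and T frames, this construction has T·p(p−1)/2 spatial and p(T−1) temporal edges, which grows linearly in T, whereas a fully connected graph over the same pT nodes grows quadratically.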
Details
- Title
- Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
- Creators
- Kyle Min - Intel
- Sourya Roy - University of California, Riverside
- Subarna Tripathi - Intel
- Tanaya Guha - University of Glasgow
- Somdeb Majumdar - Intel
- Contributors
- S Avidan (Editor)
- G Brostow (Editor)
- M Cisse (Editor)
- G M Farinella (Editor)
- T Hassner (Editor)
- Resource Type
- Conference proceeding
- Publication Details
- COMPUTER VISION - ECCV 2022, PT XXXV, Vol.13695, pp.371-387
- Publisher
- Springer Nature
- Series
- Lecture Notes in Computer Science
- DOI
- 10.1007/978-3-031-19833-5_22
- ISSN
- 0302-9743
- eISSN
- 1611-3349
- Number of pages
- 17
- Language
- English
- Date published
- 01/01/2022
- Academic Unit
- Computer Science
- Record Identifier
- 9984446516102771