MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

1 UNC Charlotte 2 UNC Chapel Hill
Our paper's teaser image

Left: Temporal Action Detection poses unique challenges, including the need for models capable of processing long video sequences, the presence of both short and long actions with high intra-class variance, and the complexity of densely overlapping actions. Right: Our proposed method, MS-Temba, achieves state-of-the-art performance while being 5x more parameter-efficient than transformer-based approaches.

Abstract

Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL), requiring models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) detect dense, overlapping actions simultaneously. Existing CNN- and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. The state-space model (SSM) based Mamba offers powerful long-range modeling, but its naive application to TAD collapses fine-grained temporal structure and fails to address the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, learns discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on the densely labeled ADL benchmarks TSU and Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum and SumMe.
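To make the idea of dilated SSMs concrete, here is a minimal, hypothetical sketch in NumPy. It is not the paper's implementation: the real model uses learned Mamba selective scans, whereas this toy replaces them with a fixed exponential-decay recurrence, and the function names (`ssm_scan`, `dilated_ssm`, `ms_temba_block`) and the averaging-based fusion are our own illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, decay=0.9):
    """Toy linear state-space scan: h_t = decay * h_{t-1} + x_t.
    A stand-in for a learned Mamba selective scan over time.
    x has shape (T, C): T frames, C feature channels."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt
        out[t] = h
    return out

def dilated_ssm(x, dilation):
    """Run the scan over strided temporal views so each state update
    skips `dilation` frames, widening the effective receptive field,
    then write results back to their original positions."""
    T = x.shape[0]
    out = np.empty_like(x)
    for offset in range(dilation):
        idx = np.arange(offset, T, dilation)
        out[idx] = ssm_scan(x[idx])
    return out

def ms_temba_block(x, dilations=(1, 2, 4)):
    """Hypothetical multi-scale block: branches with different dilations
    capture short- and long-range structure; a final scan over their
    average stands in for the SSM-based Multi-scale Mamba Fuser."""
    branches = [dilated_ssm(x, d) for d in dilations]
    return ssm_scan(np.mean(branches, axis=0))
```

With `dilation=1` the branch reduces to an ordinary sequential scan; larger dilations trade temporal resolution for longer context, which is the intuition behind detecting both short and long actions.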

Qualitative Results

BibTeX


@article{sinha2025ms,
  title={MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection},
  author={Sinha, Arkaprava and Raj, Monish Soundar and Wang, Pu and Helmy, Ahmed and Das, Srijan},
  journal={arXiv preprint arXiv:2501.06138},
  year={2025}
}