Preset method

TSN

Introduction

TSN (Temporal Segment Network) builds on the two-stream method and introduces sparse temporal sampling and segment-level aggregation to balance the computational efficiency and the temporal modeling ability of convolutional neural networks on long videos. This design allows TSN to effectively capture long-range dynamic information.

Videos often contain thousands or even tens of thousands of frames, and processing every frame is both inefficient and computationally expensive. TSN therefore adopts a sparse sampling strategy: the video is divided into several segments, and one frame (or a short snippet) is randomly sampled from each segment. In the segment-level modeling and aggregation stage, each sampled frame is fed into a shared 2D convolutional neural network (e.g., ResNet or Inception) to extract per-segment features. Finally, the per-segment results are aggregated into a global, video-level representation, as sketched in the example below.
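The following PyTorch sketch illustrates the two ideas just described: sparse sampling of one frame per segment, and a shared 2D backbone whose per-segment outputs are averaged into a video-level prediction. It is a minimal illustration, not the framework's actual implementation; the names `sample_segment_indices`, `TSNStyleModel`, and the average consensus choice are assumptions for this example.

```python
import random
import torch
import torch.nn as nn


def sample_segment_indices(num_frames: int, num_segments: int = 3) -> list[int]:
    """Sparse sampling: split the frame range into equal segments and
    draw one random frame index from each segment."""
    seg_len = num_frames // num_segments
    return [i * seg_len + random.randrange(seg_len) for i in range(num_segments)]


class TSNStyleModel(nn.Module):
    """Illustrative TSN-style wrapper: a shared 2D backbone scores each
    sampled frame, and the per-segment scores are averaged (consensus)."""

    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone              # any 2D CNN mapping (N, C, H, W) -> (N, feature_dim)
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_segments, C, H, W)
        b, s = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # shared weights across segments
        scores = self.fc(feats).view(b, s, -1)        # (batch, num_segments, num_classes)
        return scores.mean(dim=1)                     # segment-level aggregation


if __name__ == "__main__":
    # Toy backbone standing in for ResNet/Inception, just to keep the sketch runnable.
    toy_backbone = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )
    model = TSNStyleModel(toy_backbone, feature_dim=16, num_classes=10)

    indices = sample_segment_indices(num_frames=300, num_segments=3)
    print("sampled frame indices:", indices)

    frames = torch.randn(2, 3, 3, 112, 112)           # (batch, segments, C, H, W)
    print("video-level scores:", model(frames).shape)  # -> torch.Size([2, 10])
```

In practice the consensus function can also be max pooling or a weighted average; simple averaging is the default reported in the paper and keeps the per-segment gradients balanced during training.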

Citation

@article{wang2018temporal,
  title={Temporal segment networks for action recognition in videos},
  author={Wang, Limin and Xiong, Yuanjun and Wang, Zhe and Qiao, Yu and Lin, Dahua and Tang, Xiaoou and Van Gool, Luc},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={41},
  number={11},
  pages={2740--2755},
  year={2018},
  publisher={IEEE}
}