Abstract: Given a long, untrimmed video containing multiple action instances and complex background content, temporal action detection must not only recognize the category of each action but also localize the start time and end time of each instance. To this end, a temporal action detection network based on two-stream convolutional networks is proposed. First, the two-stream convolutional networks extract a feature sequence from the video, and TAG (Temporal Actionness Grouping) is then applied to generate proposals. To construct high-quality proposals, each proposal is fed into a boundary regression network that refines its boundaries to bring them closer to the ground truth; the proposal is then extended into a three-segment feature design that incorporates context information, and finally a multi-layer perceptron classifies the action. Experimental results show that the proposed algorithm achieves competitive mAP on the THUMOS 2014 and ActivityNet v1.3 datasets.
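
The proposal-generation step described above can be illustrated with a minimal sketch of TAG-style actionness grouping: per-snippet actionness scores are thresholded, and consecutive above-threshold snippets are grouped into candidate proposals. This is not the authors' implementation; the threshold and minimum-length parameters below are assumptions for illustration.

```python
# Illustrative sketch (not the paper's code) of grouping per-snippet
# actionness scores into temporal proposals, in the spirit of TAG.

def group_proposals(actionness, threshold=0.5, min_len=2):
    """Group consecutive snippets whose actionness >= `threshold` into
    (start, end) proposals; `end` is exclusive. Groups shorter than
    `min_len` snippets are discarded. Both parameters are assumed values."""
    proposals = []
    start = None
    for i, score in enumerate(actionness):
        if score >= threshold and start is None:
            start = i                      # open a new group
        elif score < threshold and start is not None:
            if i - start >= min_len:
                proposals.append((start, i))
            start = None                   # close the group
    if start is not None and len(actionness) - start >= min_len:
        proposals.append((start, len(actionness)))
    return proposals

scores = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.65, 0.1]
print(group_proposals(scores))  # → [(1, 4), (5, 7)]
```

In the full pipeline, these coarse proposals would then be refined by the boundary regression network and expanded with context segments before classification.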