In this work, we focus on semi-supervised learning for
video action detection, which utilizes both labeled and
unlabeled data. We propose a simple end-to-end consistency-based approach which effectively utilizes the unlabeled data. Video action detection requires both action
class prediction and spatio-temporal localization
of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal
consistency. The presence of predominant background
and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To
address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency, and 2) gradient smoothness. Both constraints exploit the temporal continuity of actions in videos
and are found to be effective in utilizing unlabeled videos
for action detection. We demonstrate the effectiveness of
the proposed approach on two different action detection
benchmark datasets, UCF101-24 and JHMDB-21. In addition, we show the effectiveness of the proposed approach for video object segmentation on the YouTube-VOS dataset,
which demonstrates its generalization capability. The proposed approach achieves competitive performance using merely 20% of the annotations on UCF101-24 when compared with recent fully supervised methods. On UCF101-24, it improves the scores by +8.9% and +11% at 0.5 f-mAP
and v-mAP, respectively, compared to the supervised approach.
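For illustration, one possible instantiation of the two spatio-temporal regularizers is sketched below; this is an assumed formulation for clarity only, and the exact losses used in the method may differ. Given predicted localization maps $M_t \in [0,1]^{H \times W}$ for the $T$ frames of an unlabeled clip, temporal coherency can penalize differences between adjacent frames,
\[
\mathcal{L}_{\text{coh}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \lVert M_{t+1} - M_t \rVert_2^2 ,
\]
while gradient smoothness can penalize abrupt changes in the temporal gradient of the predictions,
\[
\mathcal{L}_{\text{smooth}} = \frac{1}{T-2} \sum_{t=2}^{T-1} \lVert (M_{t+1} - M_t) - (M_t - M_{t-1}) \rVert_1 .
\]
Both terms depend only on the model's predictions and therefore require no annotations, which is what allows them to be applied to unlabeled videos.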