Overview
VideoMAE is a self-supervised video pre-training framework built on masked autoencoders. It pairs an extremely high masking ratio (90%–95%) with a tube masking strategy to make the reconstruction task challenging, using a simple masked autoencoder with a plain ViT backbone. Because the encoder processes only the small fraction of visible tokens, pre-training is significantly faster than with contrastive learning methods (a reported 3.2x speedup). VideoMAE serves as a strong baseline for self-supervised video pre-training research: without extra data, vanilla ViT backbones achieve state-of-the-art results on Kinetics-400, Something-Something V2, UCF101, and HMDB51. The framework is implemented in PyTorch and is integrated into platforms such as MMAction2 and Hugging Face Transformers; support for action detection tasks is also provided.
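To illustrate the tube masking idea, here is a minimal sketch in NumPy (the function name, shapes, and parameters are illustrative assumptions, not the repository's actual API). The key property is that the same spatial positions are masked in every temporal slice, so a masked "tube" extends through time and temporal redundancy between nearby frames cannot be exploited to trivially recover it:

```python
import numpy as np

def tube_mask(num_frames: int, spatial_tokens: int,
              mask_ratio: float = 0.9, seed: int = 0) -> np.ndarray:
    """Sketch of tube masking: one random spatial mask shared across time.

    Returns a boolean array of shape (num_frames, spatial_tokens) where
    True marks a masked token. Names and signature are hypothetical.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * spatial_tokens))
    # Choose the masked spatial positions once...
    masked_positions = rng.choice(spatial_tokens, size=num_masked, replace=False)
    spatial_mask = np.zeros(spatial_tokens, dtype=bool)
    spatial_mask[masked_positions] = True
    # ...then broadcast the same mask over every temporal slice ("tube").
    return np.broadcast_to(spatial_mask, (num_frames, spatial_tokens)).copy()

# Example: 8 temporal slices of a 14x14 token grid (196 spatial tokens).
mask = tube_mask(num_frames=8, spatial_tokens=196, mask_ratio=0.9)
print(mask.shape)  # (8, 196)
```

With a 0.9 ratio, each temporal slice masks the same 176 of 196 spatial tokens; the encoder then only sees the remaining visible tokens, which is where the speedup over processing full videos comes from.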
