Overview
VideoMAE is a self-supervised video pre-training framework built on masked autoencoders. It pairs an extremely high masking ratio (90%–95%) with a tube masking strategy to make the reconstruction task challenging, using a simple masked autoencoder with a plain ViT backbone. Because the encoder processes only the small fraction of visible tokens, pre-training is significantly faster than with contrastive learning methods (a reported 3.2x speedup). VideoMAE serves as a strong baseline for self-supervised video pre-training research: without extra data, vanilla ViT backbones achieve state-of-the-art results on Kinetics-400, Something-Something V2, UCF101, and HMDB51. The framework is implemented in PyTorch and is integrated into platforms such as MMAction2 and Hugging Face Transformers; support for action detection tasks is also provided.
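To illustrate the tube masking idea, here is a minimal sketch in NumPy (the function name, shapes, and parameters are illustrative assumptions, not the repository's actual API). The key property is that the same spatial positions are masked in every temporal slice, so a masked "tube" extends through time and temporal redundancy between nearby frames cannot be exploited to trivially recover it:

```python
import numpy as np

def tube_mask(num_frames: int, spatial_tokens: int,
              mask_ratio: float = 0.9, seed: int = 0) -> np.ndarray:
    """Sketch of tube masking: one random spatial mask shared across time.

    Returns a boolean array of shape (num_frames, spatial_tokens) where
    True marks a masked token. Names and signature are hypothetical.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * spatial_tokens))
    # Choose the masked spatial positions once...
    masked_positions = rng.choice(spatial_tokens, size=num_masked, replace=False)
    spatial_mask = np.zeros(spatial_tokens, dtype=bool)
    spatial_mask[masked_positions] = True
    # ...then broadcast the same mask over every temporal slice ("tube").
    return np.broadcast_to(spatial_mask, (num_frames, spatial_tokens)).copy()

# Example: 8 temporal slices of a 14x14 token grid (196 spatial tokens).
mask = tube_mask(num_frames=8, spatial_tokens=196, mask_ratio=0.9)
print(mask.shape)  # (8, 196)
```

With a 0.9 ratio, each temporal slice masks the same 176 of 196 spatial tokens; the encoder then only sees the remaining visible tokens, which is where the speedup over processing full videos comes from.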
