Overview
The Vision Transformer (ViT) is a deep learning architecture that adapts the Transformer, originally designed for natural language processing, to computer vision tasks. ViT models split an image into fixed-size patches, treat each patch as a token, and feed the resulting sequence into a Transformer encoder. Because self-attention operates over the whole sequence, the model can capture global relationships between image regions, which enables state-of-the-art performance on image classification tasks.

This repository provides JAX/Flax implementations of ViT and MLP-Mixer models, pre-trained on the ImageNet and ImageNet-21k datasets. It includes code for fine-tuning these models, allowing users to adapt them to their own datasets and tasks. The models were originally trained in the Big Vision codebase, which offers advanced features such as multi-host training.
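The patch-to-token step described above can be illustrated with a short sketch. The snippet below is not the repository's implementation; it is a minimal NumPy example showing how an image is cut into non-overlapping patches and flattened into a token sequence, the form that ViT's learned linear projection then maps to the model dimension. The function name `patchify` is a hypothetical helper chosen for illustration.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    one row per patch, in raster order. This mirrors how ViT turns an
    image into a sequence of tokens before the linear embedding.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Reshape into a grid of patches, then move the grid axes to the front.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of
# dimension 768, as in the standard ViT-Base/16 configuration.
img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768)
```

In the actual model these flattened patches are linearly projected, a learnable class token is prepended, and position embeddings are added before the sequence enters the Transformer encoder.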
