Overview
The Vision Transformer (ViT) Large model is a transformer encoder pre-trained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes), both at a resolution of 224x224. The model splits each image into a sequence of fixed-size 16x16 patches, linearly embeds them, prepends a classification token ([CLS]), and adds positional embeddings before feeding the sequence to the transformer encoder. Because self-attention lets every patch attend to every other patch, the architecture captures global relationships within the image, making it well suited to a variety of downstream image classification tasks. The model weights were converted from JAX to PyTorch by Ross Wightman.
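The patching scheme above fixes the encoder's input sequence length. A quick sketch of the arithmetic, assuming a 3-channel RGB input:

```python
# Token-sequence arithmetic for the configuration stated above:
# 224x224 input images cut into 16x16 patches.
image_size = 224
patch_size = 16
channels = 3  # assumed RGB input

# The image is cut into a grid of non-overlapping patches.
patches_per_side = image_size // patch_size      # 14
num_patches = patches_per_side ** 2              # 196

# Each patch is flattened before the linear embedding.
patch_dim = patch_size * patch_size * channels   # 768 values per patch

# A [CLS] token is prepended, so the encoder sees one extra token.
sequence_length = num_patches + 1                # 197

print(patches_per_side, num_patches, patch_dim, sequence_length)
# → 14 196 768 197
```

The [CLS] token's final hidden state is what the classification head reads, which is why it is counted alongside the 196 patch tokens.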
Common tasks

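A minimal sketch of classification inference with the Hugging Face transformers library, assuming the `google/vit-large-patch16-224` checkpoint name. A small randomly initialized config stands in for the full model here so the sketch runs without downloading weights:

```python
import torch
from transformers import ViTConfig, ViTForImageClassification

# In practice you would load the published weights, e.g.:
#   model = ViTForImageClassification.from_pretrained("google/vit-large-patch16-224")
# A small random config is used here so the sketch runs offline.
config = ViTConfig(
    image_size=224,
    patch_size=16,
    hidden_size=64,        # toy size; the real ViT-Large is much wider
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    num_labels=1000,       # ImageNet-1k classes
)
model = ViTForImageClassification(config)
model.eval()

# Dummy batch standing in for a preprocessed 224x224 RGB image;
# with real weights, an image processor would produce this tensor.
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits

predicted_class = logits.argmax(-1).item()  # index into the 1,000 labels
print(logits.shape)  # torch.Size([1, 1000])
```

With pretrained weights, the predicted index maps to an ImageNet class label via the model's `config.id2label` table.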