At the embedding level
- 1D tokens: truncate a patch suffix along the sequence dim
- 2D tokens: truncate patch suffixes along the embed dim as well
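A minimal sketch of both embedding-level truncations, assuming tokens are laid out as (batch, num_patches, embed_dim); the shapes and names here are illustrative, not from any specific implementation:

```python
import numpy as np

# Hypothetical token tensor: (batch, num_patches, embed_dim)
tokens = np.random.randn(2, 196, 768)

# 1D elasticity: drop a suffix of patch tokens along the sequence dim
keep_patches = 128
tokens_1d = tokens[:, :keep_patches, :]          # (2, 128, 768)

# 2D elasticity: also truncate each token along the embed dim
keep_dim = 384
tokens_2d = tokens[:, :keep_patches, :keep_dim]  # (2, 128, 384)
```

Both are plain slicing, so the cost of choosing a smaller token budget at inference time is negligible; the training recipe is what makes the truncated prefixes usable.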
At the network level (embed_dim, depth, etc.)
- EA-ViT - Efficient Adaptation for Elastic Vision Transformer
- MatFormer - Nested Transformer for Elastic Inference
- HydraViT - Stacking Heads for a Scalable ViT
- Scala - Slicing Vision Transformer for Flexible Inference
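The common idea behind these network-level methods is that smaller sub-models reuse prefixes of the full model's weights. A toy sketch of a nested (MatFormer-style) FFN, where the hidden width is sliced at inference time; the function and dimensions are hypothetical:

```python
import numpy as np

# Full FFN weights; sub-models slice a prefix of the hidden dimension.
d_model, d_ff_full = 64, 256
W1 = np.random.randn(d_model, d_ff_full)
W2 = np.random.randn(d_ff_full, d_model)

def nested_ffn(x, d_ff):
    """Run the FFN using only the first d_ff hidden units."""
    h = np.maximum(x @ W1[:, :d_ff], 0.0)  # ReLU on the sliced hidden layer
    return h @ W2[:d_ff, :]               # matching slice of the output proj

x = np.random.randn(8, d_model)
out_full = nested_ffn(x, 256)  # full model
out_half = nested_ffn(x, 128)  # nested sub-model, same output shape
```

The same prefix-slicing pattern extends to attention heads (HydraViT) or whole-network width/depth (Scala, EA-ViT); training must jointly optimize the nested sub-models so the prefixes remain good on their own.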
At the network level there are also patch pruning and merging methods
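A toy token-merging step in the spirit of methods like ToMe, greedily averaging the most similar adjacent token pairs to shrink the sequence; real methods typically use bipartite matching over attention keys, so this is only an assumption-laden sketch:

```python
import numpy as np

def merge_tokens(x, r):
    """Average the r most similar adjacent token pairs.

    x: (num_tokens, embed_dim) -> (num_tokens - r, embed_dim).
    """
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = (xn[:-1] * xn[1:]).sum(-1)       # cosine sim of neighboring tokens
    chosen, used = [], set()
    for i in np.argsort(-sim):             # most similar pairs first
        if i in used or i + 1 in used:
            continue                       # chosen pairs must not overlap
        chosen.append(i)
        used.update((i, i + 1))
        if len(chosen) == r:
            break
    merge_at = set(chosen)
    out, i = [], 0
    while i < len(x):
        if i in merge_at:
            out.append((x[i] + x[i + 1]) / 2)  # replace the pair by its mean
            i += 2
        else:
            out.append(x[i])
            i += 1
    return np.stack(out)

tokens = np.random.randn(16, 32)
merged = merge_tokens(tokens, 4)  # shape (12, 32)
```

Unlike pruning, merging keeps (an average of) the information from dropped tokens, which tends to degrade accuracy more gracefully at high reduction rates.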