Hypothesis

ViT’s patchify stem (a single large-kernel, large-stride convolution) runs contrary to the usual design of early layers in CNNs. Maybe that’s the cause of the optimization issues?

Main idea

Replace the patchify convolution with a small stack of convolutional layers, and drop one transformer block to keep the comparison fair.
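
Quick sanity check on the swap: a rough sketch showing that a small conv stack can hand the transformer the same token grid as patchify. The layer settings here are my own assumptions for illustration (16x16 stride-16 patchify, four 3x3 stride-2 convs followed by a 1x1 conv, 224x224 input), not something copied from the paper, and the names `conv_out_size`, `stem_out_size`, `PATCHIFY_STEM` and `CONV_STEM` are hypothetical.

```python
# Sketch only: compares the spatial output of two stem designs.
# All layer settings below are assumed for illustration.

def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_out_size(size, layers):
    """Apply a list of (kernel, stride, padding) layers in order."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size


# Patchify: one non-overlapping 16x16 convolution with stride 16.
PATCHIFY_STEM = [(16, 16, 0)]

# Convolutional stem: four 3x3 stride-2 convs, then a 1x1 projection.
CONV_STEM = [(3, 2, 1)] * 4 + [(1, 1, 0)]

if __name__ == "__main__":
    for name, layers in [("patchify", PATCHIFY_STEM), ("conv stem", CONV_STEM)]:
        side = stem_out_size(224, layers)
        print(f"{name}: {side}x{side} grid, {side * side} tokens")
```

Under these assumptions both stems downsample by 16x overall, so the transformer blocks see the same sequence length and only the stem itself differs.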

Notes for myself:

  • Interesting experiments on optimizability; maybe take them into account in the Hessian analysis