Tweeted by @ak92501
Can Vision Transformers Perform Convolution?
abs: https://t.co/rsHhON89sV
A single ViT layer with image patches as the input can perform any convolution operation constructively, where the multi-head attention mechanism and the relative positional encoding play essential roles. pic.twitter.com/Qw1RqqEfjV
— AK (@ak92501) November 3, 2021
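The intuition behind the tweet can be seen in a toy setting. The sketch below is not the paper's exact construction (which works at sub-patch granularity inside a ViT layer); it is a simplified, token-level illustration assuming one attention head per kernel offset, where a relative positional bias is taken to its hard-attention limit so each head simply copies the token at one fixed offset. All names, shapes, and the zero-padding choice are illustrative assumptions.

```python
# Minimal NumPy sketch: multi-head "attention" with one-hot relative-position
# patterns, followed by an output projection, coincides with a KxK convolution
# over a grid of token embeddings.
import numpy as np

rng = np.random.default_rng(0)
H, W, d_in, d_out, K = 6, 6, 4, 5, 3                      # token grid, channels, kernel size
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # one head per offset
W_conv = rng.normal(size=(K * K, d_in, d_out))            # conv weights, one slice per offset
x = rng.normal(size=(H, W, d_in))                         # patch/token embeddings on a grid

# --- "Attention" path -----------------------------------------------------------
def hard_attention_head(x, offset):
    """Attention matrix is one-hot at the given relative offset (the limit of a
    softmax dominated by a relative positional bias); values use an identity W_V."""
    dy, dx = offset
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            ii, jj = i + dy, j + dx
            if 0 <= ii < H and 0 <= jj < W:                # zero padding at the border
                out[i, j] = x[ii, jj]
    return out

heads = [hard_attention_head(x, off) for off in offsets]
# The output projection mixes the concatenated heads with one weight slice per offset.
attn_out = sum(h @ W_conv[k] for k, h in enumerate(heads))

# --- Direct KxK convolution over the same grid ----------------------------------
conv_out = np.zeros((H, W, d_out))
for k, (dy, dx) in enumerate(offsets):
    for i in range(H):
        for j in range(W):
            ii, jj = i + dy, j + dx
            if 0 <= ii < H and 0 <= jj < W:
                conv_out[i, j] += x[ii, jj] @ W_conv[k]

print(np.allclose(attn_out, conv_out))                     # True: the two paths agree
```

The key design point the paper formalizes is exactly this pairing: relative positional encodings select which neighbors each head attends to, and the multi-head structure plus output projection supplies the per-offset linear maps that a convolution kernel would apply.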