Tweeted by @m__dehghani
The "recurrent inductive bias" of RNNs usually helps them be more data efficient, compared to vanilla Transformer. If you introduce such a bias to Transformers (like recurrence in depth in Universal Transformers), they generalize better on small datasets: https://t.co/gWzKXz8xRU
— Mostafa Dehghani (@m__dehghani) May 17, 2019
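To make "recurrence in depth" concrete, here is a minimal PyTorch sketch of the weight-tying idea behind the Universal Transformer: a single encoder layer whose weights are reused at every depth step, with a learned per-step embedding marking the recurrence step. This is an illustrative simplification, not the paper's full model; the class and parameter names (`DepthRecurrentEncoder`, `num_steps`, `step_embedding`) are assumptions, and the actual Universal Transformer also adds coordinate embeddings and an adaptive computation time (ACT) halting mechanism.

```python
import torch
import torch.nn as nn


class DepthRecurrentEncoder(nn.Module):
    """Sketch of depth recurrence: one shared Transformer layer
    applied repeatedly, instead of a stack of distinct layers."""

    def __init__(self, d_model=64, nhead=4, num_steps=6):
        super().__init__()
        # A single layer whose weights are shared across all depth steps.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        # Learned embedding telling the layer which recurrence step it is on
        # (a simplified stand-in for the paper's timestep embeddings).
        self.step_embedding = nn.Embedding(num_steps, d_model)
        self.num_steps = num_steps

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        for t in range(self.num_steps):
            x = x + self.step_embedding.weight[t]  # broadcast over batch/seq
            x = self.shared_layer(x)               # same weights every step
        return x


# Usage: a batch of 2 sequences, 10 tokens each, 64-d embeddings.
model = DepthRecurrentEncoder()
out = model(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

The design point the tweet highlights is visible in the loop: because the same parameters transform the representation at every step, the model is biased toward learning an iterative refinement procedure, much like an RNN unrolled over depth rather than time, which is one plausible reason for the better generalization on small datasets.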