what a simple yet effective idea! :)
— Kyunghyun Cho (@kchonyc) January 27, 2021
looking at it from the architectural depth perspective (https://t.co/DM4MvzqFQW by Zheng et al.), the depth (# of layers between a particular input at time t' and output at time t) is now (t-t') x L rather than (t-t') + L. https://t.co/GkkuMH7a5u pic.twitter.com/eXWeqoEKxx
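
A rough way to see the two scalings is to count layer applications along the longest path from one input to a later output. The sketch below is only an illustration under assumptions that are not in the tweets: L is read as the number of layers, the "stacked" wiring lets layer l at step t read layer l-1 at step t and layer l at step t-1, and the "feedback" wiring lets the bottom layer at step t read the top layer at step t-1. The function name `longest_path` and its parameters are hypothetical. Exact constants depend on how layers are counted; what matters is additive versus multiplicative growth.

```python
def longest_path(dt: int, num_layers: int, feedback: bool) -> int:
    """Count layer applications on the longest path from the input at
    step 0 to the top layer at step dt (dt plays the role of t - t').

    Assumed wirings (illustrative, not taken from the source):
      feedback=False: layer l at step t reads layer l at step t-1
      feedback=True:  layer l at step t reads the TOP layer at step t-1
    In both cases layer l also reads layer l-1 at the same step.
    """
    # depth[t][l] = longest path length ending at layer l of step t;
    # index 0 is the input position, which only counts as a source at step 0.
    depth = [[0] * (num_layers + 1) for _ in range(dt + 1)]
    for t in range(dt + 1):
        for l in range(1, num_layers + 1):
            candidates = []
            # Same-step edge from the layer below. At later steps the raw
            # input (l - 1 == 0) is a different input, so it is skipped.
            if t == 0 or l > 1:
                candidates.append(depth[t][l - 1])
            # Recurrent edge from the previous step.
            if t > 0:
                src = num_layers if feedback else l
                candidates.append(depth[t - 1][src])
            depth[t][l] = 1 + max(candidates)
    return depth[dt][num_layers]


if __name__ == "__main__":
    L = 4
    for dt in range(4):
        print(dt, longest_path(dt, L, feedback=False), longest_path(dt, L, feedback=True))
```

Under this counting convention the stacked wiring gives dt + L and the feedback wiring gives (dt + 1) * L, so the first grows additively and the second multiplicatively in the time gap, which is the contrast the tweet points at.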