Tweeted By @jekbradbury
Something like half the appendix of the DALL-E paper (https://t.co/fIBdsdA3lQ) describes work the authors had to do on GPUs that they wouldn't have had to do on TPUs:
- scaling fp16 mixed precision
- reducing gradient all-reduce comms w/ PowerSGD
- manual optimizer sharding
— James Bradbury (@jekbradbury) February 25, 2021
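
For readers unfamiliar with the first item, here is a minimal sketch of what "scaling" in fp16 mixed precision usually involves: the loss is multiplied by a large factor so small gradients do not underflow in fp16, and the factor is adjusted when overflows appear. This is generic dynamic loss scaling, not necessarily the scheme the paper uses; the class name `DynamicLossScaler` and its parameters are hypothetical and chosen only for illustration.

```python
import math


class DynamicLossScaler:
    """Hypothetical helper: generic dynamic loss scaling for fp16 training."""

    def __init__(self, init_scale=2.0**15, growth_interval=2000, factor=2.0):
        self.scale = init_scale                 # multiplier applied to the loss
        self.growth_interval = growth_interval  # clean steps before growing
        self.factor = factor                    # grow/shrink ratio (assumed 2x)
        self._good_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if any value is inf/NaN."""
        if any(not math.isfinite(g) for g in scaled_grads):
            return None
        return [g / self.scale for g in scaled_grads]

    def update(self, found_overflow):
        """Shrink the scale after an overflow, grow it after a clean streak."""
        if found_overflow:
            self.scale /= self.factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.factor
                self._good_steps = 0


# Usage: skip the optimizer step whenever unscale() reports an overflow.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
grads = scaler.unscale([2048.0, float("inf")])
scaler.update(found_overflow=grads is None)
```

In a real training loop the gradients would be tensors and the overflow check would run on the accelerator; plain Python lists are used here only to keep the sketch self-contained.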