Tweeted by @jekbradbury on 2021-02-25 (UTC)
research

Something like half the appendix of the DALL-E paper (https://t.co/fIBdsdA3lQ) describes work the authors had to do on GPUs that they wouldn't have had to do on TPUs:
- scaling fp16 mixed precision
- reducing gradient all-reduce comms w/ PowerSGD
- manual optimizer sharding
— James Bradbury (@jekbradbury) February 25, 2021
