Tweeted By @karpathy
The raw value of a loss (in a multitask setting) does not reflect how much your model "cares" about that component. E.g. an L1 loss can report arbitrarily large loss value based on loss scale but the gradient will always be \in {-1,1}. The grad magnitude is what actually matters.
— Andrej Karpathy (@karpathy) January 7, 2019
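
A minimal sketch of the point in the tweet, in plain Python. The helper names `l1_loss` and `l1_grad` and the sample numbers are made up for illustration and are not from the original post. The sketch computes a summed L1 loss and its (sub)gradient with respect to each prediction for two sets of residuals that differ by a factor of 1000. Note that it returns 0 where the residual is exactly zero, a case the tweet leaves out.

```python
def l1_loss(pred, target):
    """Summed absolute error between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def l1_grad(pred, target):
    """Subgradient of the summed L1 loss with respect to each prediction.

    Each entry is sign(pred - target): +1, -1, or 0 when the residual is exactly zero.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    return [sign(p - t) for p, t in zip(pred, target)]


# Hypothetical data: the same residual pattern at two very different scales.
target = [0.0, 0.0, 0.0]
small = [0.5, -0.25, 2.0]
large = [500.0, -250.0, 2000.0]

small_loss = l1_loss(small, target)
large_loss = l1_loss(large, target)
small_grad = l1_grad(small, target)
large_grad = l1_grad(large, target)

print("loss:", small_loss, "vs", large_loss)
print("grad:", small_grad, "vs", large_grad)
```

The reported loss differs by three orders of magnitude between the two cases, yet the gradient reaching the predictions is identical, so comparing raw loss values across tasks says little about how strongly each term pulls on the model.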