Gradient Descent Provably Optimizes Over-parameterized (single-hidden-layer ReLU) Neural Networks (trained with L2 loss, assuming random initialization and non-degenerate data): https://t.co/NUt74aCUf6
— Olivier Grisel (@ogrisel) October 7, 2018
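The linked paper (presumably Du, Zhai, Póczos, and Singh, 2018, arXiv:1810.02054) analyzes gradient descent on a wide one-hidden-layer ReLU network trained with squared loss from random initialization. Below is a minimal NumPy sketch of that training regime, not the paper's method: the sizes `n`, `d`, `m`, the fixed random ±1 output weights, the 1/√m output scaling, and the step size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n unit-norm inputs with random targets. Unit-norm, pairwise
# distinct inputs are one common way to model "non-degenerate data";
# the sizes here are illustrative, not the paper's.
n, d = 20, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

# Over-parameterized width: m far larger than n.
m = 2000
W = rng.normal(size=(m, d))          # Gaussian random init; trained
a = rng.choice([-1.0, 1.0], size=m)  # fixed random output signs (assumed)

def predict(W):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x), for each row x of X."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

lr = 0.1
for step in range(1001):
    resid = predict(W) - y                 # residuals, shape (n,)
    act = (X @ W.T > 0.0).astype(float)    # ReLU derivative mask, shape (n, m)
    # Gradient of the L2 loss 0.5 * ||f(X) - y||^2 with respect to W.
    grad = (act * resid[:, None] * a[None, :]).T @ X / np.sqrt(m)
    W = W - lr * grad
    if step % 200 == 0:
        print(f"step {step:4d}  loss {0.5 * np.sum(resid ** 2):.6f}")
```

Freezing random output signs and training only the hidden layer mirrors a setup often used in this line of analysis; when the width m is large, the hidden weights move very little from their random initialization while the training loss still decreases steadily toward zero.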