Towards understanding self-attention and batch normalization: optimization and margin dynamics in neural networks