# Why Does Gradient Clipping Speed Up Training? A Concise Analysis

$$\theta \leftarrow \theta-\eta \nabla_{\theta} f(\theta)$$

$$\theta \leftarrow \theta- \eta \nabla_{\theta} f(\theta)\times \min\left\{1, \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert}\right\}\label{eq:clip-1}$$

$$\theta \leftarrow \theta- \eta \nabla_{\theta} f(\theta)\times \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert+\gamma}\label{eq:clip-2}$$
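The two clipped updates above differ only in the scalar factor that multiplies the gradient step. A minimal sketch in plain Python follows; the function names `grad_norm`, `hard_clip_factor`, `soft_clip_factor`, and `clipped_step` are illustrative assumptions, not names from the original text.

```python
import math

def grad_norm(grad):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(g * g for g in grad))

def hard_clip_factor(norm, gamma):
    """min{1, gamma / ||g||}: the factor in the first clipped update."""
    return 1.0 if norm <= gamma else gamma / norm

def soft_clip_factor(norm, gamma):
    """gamma / (||g|| + gamma): the smooth factor in the second clipped update."""
    return gamma / (norm + gamma)

def clipped_step(theta, grad, eta, gamma, factor=hard_clip_factor):
    """One update theta <- theta - eta * factor * grad (hypothetical helper)."""
    scale = factor(grad_norm(grad), gamma)
    return [t - eta * scale * g for t, g in zip(theta, grad)]

# Example: a gradient of norm 5 with threshold gamma = 1.
theta = [0.0, 0.0]
grad = [3.0, 4.0]
hard_step = clipped_step(theta, grad, eta=1.0, gamma=1.0)
soft_step = clipped_step(theta, grad, eta=1.0, gamma=1.0, factor=soft_clip_factor)
```

With this gradient the hard factor is about 0.2 and the soft factor is about 1/6, so both variants shrink the step to roughly the size of $\gamma$.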

$$\frac{1}{2}\min\left\{1, \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert}\right\}\leq \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert+\gamma}\leq \min\left\{1, \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert}\right\}$$
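The sandwich inequality above says that the two scale factors never differ by more than a factor of 2, which suggests the two clipping schemes behave alike up to a constant. The following is a rough numeric spot check rather than a proof; the grid of norms and thresholds and the tolerance `tol` are arbitrary assumptions.

```python
def hard_clip_factor(norm, gamma):
    """min{1, gamma / ||g||}, with the factor taken as 1 at a zero gradient."""
    return min(1.0, gamma / norm) if norm > 0.0 else 1.0

def soft_clip_factor(norm, gamma):
    """gamma / (||g|| + gamma)."""
    return gamma / (norm + gamma)

def sandwich_holds(norm, gamma, tol=1e-12):
    """Check 0.5 * hard <= soft <= hard, allowing a small float tolerance."""
    hard = hard_clip_factor(norm, gamma)
    soft = soft_clip_factor(norm, gamma)
    return 0.5 * hard <= soft + tol and soft <= hard + tol

# Sweep gradient norms over many orders of magnitude for a few thresholds.
norms = [10.0 ** k for k in range(-6, 7)]
gammas = [0.01, 0.1, 1.0, 10.0]
all_ok = all(sandwich_holds(n, g) for n in norms for g in gammas)
print(all_ok)
```

The lower bound is tight at $\Vert \nabla_{\theta} f(\theta)\Vert = \gamma$, where the hard factor equals 1 and the soft factor equals 1/2.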

$$\Vert \nabla_{\theta} f(\theta + \Delta \theta) - \nabla_{\theta} f(\theta)\Vert\leq L\Vert \Delta\theta\Vert\label{eq:l-cond}$$

$$f(\theta+\Delta\theta) \leq f(\theta) + \left\langle \nabla_{\theta}f(\theta), \Delta\theta\right\rangle + \frac{1}{2}L \Vert \Delta\theta\Vert^2\label{eq:neq-1}$$

$$f(\theta+\Delta\theta) \leq f(\theta) + \left(\frac{1}{2}L\eta^2 - \eta\right) \Vert \nabla_{\theta}f(\theta)\Vert^2$$
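The bound above follows from substituting the gradient descent step $\Delta\theta = -\eta \nabla_{\theta} f(\theta)$ into the preceding inequality. Its coefficient $\frac{1}{2}L\eta^2 - \eta$ is negative exactly when $0 < \eta < 2/L$ and is smallest at $\eta = 1/L$. A small sketch on a toy one-dimensional quadratic illustrates this; the test function, its curvature `a`, and the chosen constants are assumptions made only for illustration.

```python
def descent_coefficient(L, eta):
    """Coefficient (L * eta^2 / 2 - eta) that multiplies ||grad||^2 in the bound."""
    return 0.5 * L * eta ** 2 - eta

def f(theta, a=1.0):
    """Toy objective 0.5 * a * theta^2; its gradient is L-Lipschitz for any L >= a."""
    return 0.5 * a * theta ** 2

def grad_f(theta, a=1.0):
    """Gradient of the toy objective."""
    return a * theta

L = 2.0      # a valid (not tight) Lipschitz constant for a = 1
theta = 3.0
eta = 1.0 / L

g = grad_f(theta)
new_value = f(theta - eta * g)                                  # loss after one step
upper_bound = f(theta) + descent_coefficient(L, eta) * g ** 2   # right-hand side

print(new_value <= upper_bound, descent_coefficient(L, 2.0 / L))
```

At $\eta = 2/L$ the coefficient is zero, so the bound no longer guarantees any decrease.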

$$\Vert \nabla_{\theta} f(\theta + \Delta \theta) - \nabla_{\theta} f(\theta)\Vert\leq \left(L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert\right)\Vert \Delta\theta\Vert$$

$$f(\theta+\Delta\theta) \leq f(\theta) + \left\langle \nabla_{\theta}f(\theta), \Delta\theta\right\rangle + \frac{1}{2}\left(L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert\right) \Vert \Delta\theta\Vert^2$$

$$f(\theta+\Delta\theta) \leq f(\theta) + \left(\frac{1}{2}\left(L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert\right)\eta^2 - \eta\right) \Vert \nabla_{\theta}f(\theta)\Vert^2$$

$$\eta<\frac{2}{L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert}$$

$$\eta = \frac{1}{L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert}$$
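This gradient-dependent step size has the same algebraic form as the second (smooth) clipped update at the top of the article: assuming $L_1 > 0$, setting the base learning rate to $1/L_0$ and the threshold to $\gamma = L_0/L_1$ gives $\frac{1}{L_0}\cdot\frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert+\gamma} = \frac{1}{L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert}$. The sketch below checks this identity numerically; the function names and the sample values of `L0` and `L1` are assumptions chosen for illustration.

```python
def adaptive_step(grad_norm, L0, L1):
    """Step size 1 / (L0 + L1 * ||g||) suggested by the relaxed smoothness bound."""
    return 1.0 / (L0 + L1 * grad_norm)

def soft_clipped_step(grad_norm, eta, gamma):
    """Effective step size eta * gamma / (||g|| + gamma) of the smooth clipping rule."""
    return eta * gamma / (grad_norm + gamma)

L0, L1 = 2.0, 0.5              # arbitrary positive constants
eta, gamma = 1.0 / L0, L0 / L1  # matching base learning rate and threshold

sample_norms = [0.0, 0.1, 1.0, 10.0, 1000.0]
max_gap = max(
    abs(adaptive_step(n, L0, L1) - soft_clipped_step(n, eta, gamma))
    for n in sample_norms
)
print(max_gap)
```

Under these assumptions the two step sizes coincide up to floating-point rounding, so the clipped update can be read as gradient descent with the step size that the relaxed condition recommends.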