5.10 Advanced: Deriving the Gradient Equation

In this section we derive the gradient of the cross-entropy loss function $L_{CE}$ for logistic regression. We begin by reviewing some basic calculus.

First, the derivative of $\ln(x)$:

$$ \frac{d}{dx} \ln(x) = \frac{1}{x} \tag{5.49} $$

Second, the (very elegant) derivative of the sigmoid function:

$$ \frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) \tag{5.50} $$
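
For reference, this identity can be checked directly from the standard definition $\sigma(z) = 1/(1 + e^{-z})$ (assumed here, since the definition is not restated in this section), using $1 - \sigma(z) = e^{-z}/(1 + e^{-z})$:

$$ \frac{d\sigma(z)}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z)) $$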

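Third, the chain rule of derivatives. Suppose we want the derivative of a composite function $f(x) = u(v(x))$. The derivative of $f(x)$ is the derivative of $u$ with respect to $v(x)$ times the derivative of $v(x)$ with respect to $x$: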

$$ \frac{df}{dx} = \frac{du}{dv} \cdot \frac{dv}{dx} \tag{5.51} $$
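
The three facts above can be combined in a small numerical check. The sketch below is illustrative only: it takes $u(v) = \ln(v)$ and $v(x) = \sigma(x)$, so that $f(x) = \ln \sigma(x)$, and compares the chain-rule derivative with a central finite difference. The function names, the sample points, and the step size `h` are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def f(x: float) -> float:
    """Composite function f(x) = u(v(x)) with u = ln and v = sigmoid."""
    return math.log(sigmoid(x))


def chain_rule_derivative(x: float) -> float:
    """df/dx = du/dv * dv/dx, using the derivatives of ln and of the sigmoid."""
    v = sigmoid(x)
    du_dv = 1.0 / v          # derivative of ln(v) with respect to v
    dv_dx = v * (1.0 - v)    # derivative of the sigmoid with respect to x
    return du_dv * dv_dx


def central_difference(func, x: float, h: float = 1e-5) -> float:
    """Numerical derivative estimate; h is an illustrative step size."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-3.0, 0.0, 2.0):
        analytic = chain_rule_derivative(x)
        numeric = central_difference(f, x)
        print(f"x={x:+.1f}  chain rule={analytic:.8f}  finite difference={numeric:.8f}")
```

The two columns should agree to several decimal places. Note that the product simplifies to $1 - \sigma(x)$, the same kind of cancellation that appears later in the derivation.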

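We now turn to the derivation itself. Start with the partial derivative of the loss with respect to a single weight $w_j$ (the same computation is needed for every weight and for the bias $b$):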

$$ \begin{aligned} \frac{\partial L_{CE}}{\partial w_j} &= \frac{\partial}{\partial w_j} \left( -[y \log \sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1 - y)\log(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b))] \right) \\ &= -\left[ \frac{\partial}{\partial w_j} y \log \sigma(\mathbf{w} \cdot \mathbf{x} + b) + \frac{\partial}{\partial w_j} (1 - y)\log(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)) \right] \end{aligned} \tag{5.52} $$

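Next, apply the chain rule together with the derivative of the logarithm (here $\log$ denotes the natural logarithm, so Eq. (5.49) applies):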

$$ \frac{\partial L_{CE}}{\partial w_j} = -\frac{y}{\sigma(\mathbf{w} \cdot \mathbf{x} + b)} \frac{\partial}{\partial w_j} \sigma(\mathbf{w} \cdot \mathbf{x} + b) - \frac{1 - y}{1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)} \frac{\partial}{\partial w_j} \left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right) \tag{5.53} $$

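Since $\frac{\partial}{\partial w_j}\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right) = -\frac{\partial}{\partial w_j}\sigma(\mathbf{w} \cdot \mathbf{x} + b)$, the two terms share a common factor. Rearranging: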

$$ \frac{\partial L_{CE}}{\partial w_j} = -\left[ \frac{y}{\sigma(\mathbf{w} \cdot \mathbf{x} + b)} - \frac{1 - y}{1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)} \right] \frac{\partial}{\partial w_j}\sigma(\mathbf{w} \cdot \mathbf{x} + b) \tag{5.54} $$
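
Writing $\sigma$ as shorthand for $\sigma(\mathbf{w} \cdot \mathbf{x} + b)$ (a notational convenience introduced only for this step), the bracketed difference can be combined over a common denominator:

$$ \frac{y}{\sigma} - \frac{1 - y}{1 - \sigma} = \frac{y(1 - \sigma) - (1 - y)\sigma}{\sigma(1 - \sigma)} = \frac{y - y\sigma - \sigma + y\sigma}{\sigma(1 - \sigma)} = \frac{y - \sigma}{\sigma(1 - \sigma)} $$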

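Now combine the bracketed terms over a common denominator, substitute the derivative of the sigmoid, and apply the chain rule once more, which yields Eq. (5.55):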

$$ \begin{aligned} \frac{\partial L_{CE}}{\partial w_j} &= -\left[ \frac{y - \sigma(\mathbf{w} \cdot \mathbf{x} + b)}{\sigma(\mathbf{w} \cdot \mathbf{x} + b)[1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)]} \right] \sigma(\mathbf{w} \cdot \mathbf{x} + b)[1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)] \frac{\partial (\mathbf{w} \cdot \mathbf{x} + b)}{\partial w_j} \\ &= -\left[ \frac{y - \sigma(\mathbf{w} \cdot \mathbf{x} + b)}{\sigma(\mathbf{w} \cdot \mathbf{x} + b)[1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)]} \right] \sigma(\mathbf{w} \cdot \mathbf{x} + b)[1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)] x_j \\ &= -[y - \sigma(\mathbf{w} \cdot \mathbf{x} + b)] x_j \\ &= [\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y] x_j \end{aligned} \tag{5.55} $$

This is exactly the gradient expression given in Eq. (5.30).
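
As a sanity check on the result, the analytic gradient $[\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y]\,x_j$ can be compared with a finite-difference estimate of the loss. The following is a minimal sketch, not a reference implementation: the weights, bias, input, and label are made-up illustrative values, and the function names are hypothetical. The bias gradient $\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y$ used below follows from the same derivation, because $\partial(\mathbf{w} \cdot \mathbf{x} + b)/\partial b = 1$.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy_loss(w, b, x, y):
    """L_CE for one example: -[y log p + (1 - y) log(1 - p)], p = sigmoid(w.x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def analytic_gradient(w, b, x, y):
    """Gradient from the derivation: (sigmoid(w.x + b) - y) * x_j, and (sigmoid - y) for b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    error = sigmoid(z) - y
    return [error * xj for xj in x], error


def numeric_gradient(w, b, x, y, h=1e-6):
    """Central finite differences on the loss; h is an illustrative step size."""
    grad_w = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        diff = cross_entropy_loss(w_plus, b, x, y) - cross_entropy_loss(w_minus, b, x, y)
        grad_w.append(diff / (2.0 * h))
    diff_b = cross_entropy_loss(w, b + h, x, y) - cross_entropy_loss(w, b - h, x, y)
    return grad_w, diff_b / (2.0 * h)


# Illustrative values only (not taken from the text).
w = [0.5, -1.2, 0.3]
b = 0.1
x = [3.0, 2.0, -1.0]
y = 1

if __name__ == "__main__":
    grad_w, grad_b = analytic_gradient(w, b, x, y)
    num_w, num_b = numeric_gradient(w, b, x, y)
    for j, (a, n) in enumerate(zip(grad_w, num_w)):
        print(f"dL/dw_{j}: analytic={a:.8f}  numeric={n:.8f}")
    print(f"dL/db:   analytic={grad_b:.8f}  numeric={num_b:.8f}")
```

If the derivation is correct, the analytic and numeric columns agree up to finite-difference error for either label value.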