Understanding Series: The Backpropagation Algorithm

(figure: backpropagation)

Parameterized models

Parameterized models: $\tilde{y}=G(x,w)$.

Simply put, a parameterized model is a function that depends on an input and on trainable parameters. The trainable parameters are shared across training samples, while the input differs from sample to sample. In most deep learning frameworks the parameters are implicit: they are not passed in when the function is called. If we compare the model to object-oriented programming, the parameters are, so to speak, "stored inside the function".

  • Variables (tensors, scalars, continuous variables, discrete variables)

    • $x$ is the input of the parameterized model
    • $\tilde{y}$ is the variable computed by the function; the function is deterministic, not stochastic
  • Deterministic function: $x\rightarrow G(x,w) \rightarrow \tilde{y}$

    • can have multiple inputs and multiple outputs
    • contains some implicit parameter variables $w$
  • Scalar-valued function: $\tilde{y}\rightarrow C(y,\tilde{y})\leftarrow y$

    • used to represent the cost function
    • has an implicit scalar output
    • computes a single value from several inputs (usually some notion of distance between those inputs)

==Deep learning, put plainly, is built around gradient-based methods.==

The parameter update rule of SGD (stochastic gradient descent):

  • Pick a sample index $p$ from $\lbrace 0, \cdots, P-1 \rbrace$ and update the parameters: $w \leftarrow w-\eta \frac{\partial L(x[p], y[p], w)}{\partial w}$
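
A minimal Python sketch of this update rule (the `loss_grad` helper returning $\partial L/\partial w$ for a single sample, the learning rate `eta`, and the fixed number of steps are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def sgd(w, X, Y, loss_grad, eta=0.01, steps=1000, seed=0):
    """Plain SGD: repeatedly pick a random index p in {0, ..., P-1} and update w."""
    rng = np.random.default_rng(seed)
    P = len(X)
    for _ in range(steps):
        p = rng.integers(P)                      # sample p uniformly from {0, ..., P-1}
        w = w - eta * loss_grad(w, X[p], Y[p])   # w <- w - eta * dL(x[p], y[p], w)/dw
    return w
```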

Notation summary

| notation | meaning |
| --- | --- |
| $a_i^l$ | output of a neuron |
| $a^l$ | output vector of a layer |
| $z_i^l$ | input of activation function |
| $z^l$ | input vector of activation function for a layer |
| $w_{ij}^l$ | a weight |
| $W^l$ | a weight matrix |
| $b_i^l$ | a bias |
| $b^l$ | a bias vector |

Layer output relation: from $a$ to $z$

(figure: back1)

First, $z^l=W^la^{l-1}+b^l$ (illustrated in the figure above); second, $z_i^l$ and $a_i^l$ are related by $a_i^l=\sigma(z_i^l)$, and therefore:

$$ \left[\begin{array}{c}a_{1}^{l} \\ a_{2}^{l} \\ \vdots \\ a_{i}^{l} \\ \vdots\end{array}\right]=\left[\begin{array}{c}\sigma\left(z_{1}^{l}\right) \\ \sigma\left(z_{2}^{l}\right) \\ \vdots \\ \sigma\left(z_{i}^{l}\right) \\ \vdots\end{array}\right]\quad\Rightarrow \quad a^l=\sigma(z^l)=\sigma(W^la^{l-1}+b^l) $$


back2

$$ y=f(x)=\sigma\left(W^{L} \cdots \sigma\left(W^{2} \sigma\left(W^{1} x+b^{1}\right)+b^{2}\right) \ldots+b^{L}\right) $$
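
A minimal NumPy sketch of this forward computation (the logistic sigmoid for $\sigma$ and the helper names are assumptions made for illustration):

```python
import numpy as np

def sigma(z):
    """Activation function; the logistic sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Apply a^l = sigma(W^l a^{l-1} + b^l) layer by layer, starting from a^0 = x."""
    a = x
    zs, activations = [], [a]        # cache z^l and a^l; they are reused by backprop
    for W, b in zip(weights, biases):
        z = W @ a + b                # z^l = W^l a^{l-1} + b^l
        a = sigma(z)                 # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return a, zs, activations        # a is the network output y = f(x)
```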


Loss function for training

  • A "Good" function: $f(x ; \theta) \sim \hat{y} \Rightarrow |\hat{y}-f(x ; \theta)| \approx 0$
  • Define an example loss function: $C(\theta)=\sum_{k} \left | \hat{y}_{k}-f\left(x_{k} ; \theta \right) \right |$
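
Translated directly into code (reusing the hypothetical `forward` sketch above; the element-wise absolute difference mirrors the formula, although a squared error or cross-entropy is more common in practice):

```python
def total_cost(X, Y_hat, weights, biases):
    """C(theta) = sum_k |y_hat_k - f(x_k; theta)|, summed over samples and components."""
    C = 0.0
    for x_k, y_hat_k in zip(X, Y_hat):
        y_k, _, _ = forward(x_k, weights, biases)
        C += np.sum(np.abs(y_hat_k - y_k))
    return C
```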

Gradient Descent for Neural Networks

(figure: back3)


Backpropagation

In a feedforward neural network:

  • forward propagation
    • from input $x$ to output $y$, information flows forward through the network
    • during training, forward propagation can continue onward until it produces a scalar cost $C(\theta)$
  • back-propagation
    • allows the information from the cost to flow backwards through the network, in order to compute the gradient
    • can be applied to any function

(figures: back4, back5)

$$ \frac{\partial C(\theta)}{\partial w_{i j}^{l}}=\frac{\partial C(\theta)}{\partial z_{i}^{l}} \frac{\partial z_{i}^{l}}{\partial w_{i j}^{l}} $$

$$ z^{l}=W^{l} a^{l-1}+b^{l} $$

$$ z_{i}^{l}=\sum_{j} w_{i j}^{l} a_{j}^{l-1}+b_{i}^{l} $$

$$ \frac{\partial z_{i}^{l}}{\partial w_{i j}^{l}}=a_{j}^{l-1} $$

$$ \frac{\partial z_{i}^{l}}{\partial w_{i j}^{l}}=\left\{\begin{array}{cl}a_{j}^{l-1} & , l>1 \\ x_{j} & , l=1\end{array}\right. $$
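
As a sanity check on these chain-rule expressions (an addition to the notes, not part of the original derivation), a central finite-difference approximation can be compared against the analytic gradient; `cost` here stands for any scalar function of a flat parameter vector:

```python
import numpy as np

def numerical_grad(cost, w, eps=1e-6):
    """Central-difference approximation of dC/dw, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (cost(w_plus) - cost(w_minus)) / (2 * eps)
    return g
```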


(figure: back6)

★★★★★ : $\delta_i^l$ is the gradient propagated at layer $l$. We simply define this hard-to-compute quantity $\partial{C(\theta)}/\partial{z_i^l}$ to be $\delta_i^l$.

Idea: from $L$ to $1$:

  • Initialization: compute $\delta^L$

(figures: back7, back8)

  • ★★★★★ : Compute $\delta^l$ based on $\delta^{l+1}$ (a derivation sketch follows the figures below)

(figures: back9, back10)
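
The figures above carry the derivation in the original notes; as a brief sketch of both steps in the notation defined earlier:

$$ \delta_i^L=\frac{\partial C(\theta)}{\partial z_i^L}=\frac{\partial C(\theta)}{\partial a_i^L}\,\frac{\partial a_i^L}{\partial z_i^L}=\frac{\partial C(\theta)}{\partial a_i^L}\,\sigma^{\prime}\left(z_i^L\right) $$

and, since $z_k^{l+1}=\sum_i w_{ki}^{l+1}\,\sigma\left(z_i^l\right)+b_k^{l+1}$,

$$ \delta_i^l=\frac{\partial C(\theta)}{\partial z_i^l}=\sum_k \frac{\partial C(\theta)}{\partial z_k^{l+1}}\,\frac{\partial z_k^{l+1}}{\partial z_i^l}=\sigma^{\prime}\left(z_i^l\right)\sum_k w_{ki}^{l+1}\,\delta_k^{l+1} $$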


(figures: back11, back15)


$$ \frac{\partial C(\theta)}{\partial w_{i j}^{l}}={\color{red}{\frac{\partial C(\theta)}{\partial z_{i}^{l}}}} \frac{\partial z_{i}^{l}}{\partial w_{i j}^{l}} $$

★★★★ Important

$$ \begin{array}{l} \delta^{L}=\sigma^{\prime}\left(z^{L}\right) \odot \nabla C(y) \\ \delta^{l}=\sigma^{\prime}\left(z^{l}\right) \odot\left(W^{l+1}\right)^{T} \delta^{l+1} \end{array} $$

(figure: back11)
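
Combining this recursion with $\partial C(\theta)/\partial w_{ij}^l=\delta_i^l\,a_j^{l-1}$ gives the full backward pass. Below is a compact NumPy sketch; it reuses the hypothetical `forward`/`sigma` helpers from earlier and assumes a squared-error cost so that $\nabla C(y)=y-\hat{y}$ (all of these choices are illustrative, not prescribed by the notes):

```python
def sigma_prime(z):
    """Derivative of the logistic sigmoid."""
    s = sigma(z)
    return s * (1.0 - s)

def backprop(x, y_hat, weights, biases):
    """Return dC/dW^l and dC/db^l for every layer, for a single sample (x, y_hat)."""
    y, zs, activations = forward(x, weights, biases)

    # Initialization: delta^L = sigma'(z^L) ⊙ ∇C(y); for C = 1/2 ||y - y_hat||^2 this is y - y_hat
    delta = sigma_prime(zs[-1]) * (y - y_hat)

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in reversed(range(len(weights))):           # from layer L down to layer 1
        grads_W[l] = np.outer(delta, activations[l])  # dC/dW^l = delta^l (a^{l-1})^T
        grads_b[l] = delta                            # dC/db^l = delta^l
        if l > 0:
            # delta^l = sigma'(z^l) ⊙ (W^{l+1})^T delta^{l+1}
            delta = sigma_prime(zs[l - 1]) * (weights[l].T @ delta)
    return grads_W, grads_b
```

Plugging these per-sample gradients into the SGD sketch at the top of the notes completes the training loop.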


Summary

$\frac{\partial C(\theta)}{\partial w_{i j}^{l}}=\frac{\partial C(\theta)}{\partial z_{i}^{l}} {\color{red}\frac{\partial z_{i}^{l}}{\partial w_{i j}^{l}}}$ (the red factor is obtained from the forward pass)  $\frac{\partial C(\theta)}{\partial w_{i j}^{l}}={\color{red}\frac{\partial C(\theta)}{\partial z_{i}^{l}}} \frac{\partial z_{i}^{l}}{\partial w_{i j}^{l}}$ (the red factor is obtained from the backward pass)
(figures: back13, back14)

Vector-Jacobian product

PyTorch: automatic differentiation (autograd).
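
A small PyTorch sketch: `backward()` and `torch.autograd.grad` never materialize the full Jacobian; given an upstream vector $v$ they compute the vector-Jacobian product $v^{\top}J$ (the shapes and functions below are arbitrary examples):

```python
import torch

x = torch.randn(3, requires_grad=True)
W = torch.randn(2, 3)
y = torch.tanh(W @ x)              # a vector-valued function of x

v = torch.tensor([1.0, -1.0])      # upstream vector, e.g. dC/dy
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v)   # computes v^T (dy/dx)
print(vjp)                         # same shape as x
```

For a scalar cost, calling `C.backward()` is the special case $v=1$, which is exactly the backward pass derived above.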


Other notes

  1. Backpropagation is not limited to models built from stacked layers; it can be applied to any directed acyclic graph (DAG) of modules, provided the modules satisfy a partial order.

