Back Propagation
Simple Example: \(f(x,y,z)=(x+y)z\)
1. Forward Pass: compute outputs from left to right (from inputs to outputs):
\(q=x+y\), \(f=qz\)
2. Backward Pass: compute derivatives.
Want: \(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\)
Order: \(\frac{\partial f}{\partial f}\rightarrow \frac{\partial f}{\partial z}\rightarrow \frac{\partial f}{\partial q}\rightarrow \frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial y}\rightarrow \frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial x}\)
[Downstream gradient] = [Local gradient] * [Upstream gradient]
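A minimal numeric sketch of this example (the concrete input values are made up):

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                    # q = 3
f = q * z                    # f = -12

# Backward pass: [downstream] = [local] * [upstream]
df_df = 1.0
df_dz = q * df_df            # df/dz = q = 3
df_dq = z * df_df            # df/dq = z = -4
df_dy = 1.0 * df_dq          # dq/dy = 1, so df/dy = -4
df_dx = 1.0 * df_dq          # dq/dx = 1, so df/dx = -4
```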
sigmoid function: \(\sigma(x)=\frac{1}{1+e^{-x}}\)
\(\frac{d\sigma(x)}{dx}=(1-\sigma(x))\sigma(x)\)
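One way to see this identity:

\[
\frac{d\sigma(x)}{dx}
=\frac{e^{-x}}{(1+e^{-x})^{2}}
=\frac{1}{1+e^{-x}}\cdot\frac{(1+e^{-x})-1}{1+e^{-x}}
=\sigma(x)\bigl(1-\sigma(x)\bigr)
\]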
add: gradient distributor. An addition node's local derivatives are all 1, so every input connected by the "+" receives the same, unchanged upstream gradient.
copy: gradient adder. When one value is copied to several branches, the gradients flowing back from those branches are summed.
multiply: swap multiplier. Each input's gradient is the upstream gradient times the other input's value.
max: gradient router. The input that achieved the max receives the full upstream gradient, and the other input receives 0 (see the sketch below).
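A tiny sketch of the max gate as a gradient router (the values are made up):

```python
a, b = 2.0, -1.0
m = max(a, b)                          # forward: m = 2.0

grad_m = 5.0                           # made-up upstream gradient
grad_a = grad_m if a >= b else 0.0     # a won the max, so it gets the full gradient
grad_b = grad_m if b > a else 0.0      # b lost, so it gets 0
```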
Flat code, written by "reverse thinking": walk the forward pass in reverse to get the backward pass:
```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(w0, x0, w1, x1, w2):
    # Forward pass
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = sigmoid(s3)

    # Backward pass (the forward pass in reverse order)
    grad_L = 1.0
    grad_s3 = grad_L * (1 - L) * L   # sigmoid has the special local gradient (1 - sigma) * sigma
    grad_w2 = grad_s3                # add gate: gradient distributor
    grad_s2 = grad_s3                # add gate: gradient distributor
    grad_s0 = grad_s2                # add gate: gradient distributor
    grad_s1 = grad_s2                # add gate: gradient distributor
    grad_w1 = grad_s1 * x1           # multiply gate: swap the multiplier
    grad_x1 = grad_s1 * w1           # multiply gate: swap the multiplier
    grad_w0 = grad_s0 * x0           # multiply gate: swap the multiplier
    grad_x0 = grad_s0 * w0           # multiply gate: swap the multiplier
    return L, (grad_w0, grad_x0, grad_w1, grad_x1, grad_w2)
```
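A quick way to trust such flat backward code is a numerical gradient check; a minimal sketch (the input values are arbitrary):

```python
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0      # made-up inputs
L, grads = f(w0, x0, w1, x1, w2)

eps = 1e-5
num_dw0 = (f(w0 + eps, x0, w1, x1, w2)[0] - f(w0 - eps, x0, w1, x1, w2)[0]) / (2 * eps)
print(grads[0], num_dw0)    # analytic and numerical dL/dw0 should agree closely
```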
\(y \in \mathbb{R},x \in \mathbb{R}^M,\frac{dy}{dx}\in \mathbb{R}^M\)
\(y \in \mathbb{R}^N,x \in \mathbb{R}^M,\frac{dy}{dx}\in \mathbb{R}^{M\times N}\): Jacobian matrix
4D input \(x=[1,-2,3,-3]\) \(\rightarrow\) \(f(x)=\max(0,x)\) (elementwise) \(\rightarrow\) 4D output \(y=[1,0,3,0]\)
4D upstream gradient \(\frac{dL}{dy}=[4,-1,5,9]\) \(\rightarrow\) downstream gradient \(\frac{dy}{dx}\frac{dL}{dy}=[4,0,5,0]\)
\(\frac{dy}{dx}=\begin{bmatrix}1&0&0&0 \\0&0&0&0 \\0&0&1&0\\0&0&0&0 \end{bmatrix}\) (diagonal entry is 1 where the input is positive, 0 where it is negative)
The Jacobian is sparse: all off-diagonal entries are zero. Explicitly forming and multiplying a huge Jacobian wastes memory and compute, because almost every element is zero. Never form the Jacobian explicitly; use implicit multiplication instead.
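A NumPy sketch of the ReLU example above, comparing the explicit Jacobian multiply with the implicit version:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -3.0])
dL_dy = np.array([4.0, -1.0, 5.0, 9.0])

# Explicit (wasteful): materialize the full Jacobian of y = max(0, x)
J = np.diag((x > 0).astype(float))
dL_dx_explicit = J @ dL_dy             # [4, 0, 5, 0]

# Implicit (what frameworks actually do): never build J at all
dL_dx_implicit = dL_dy * (x > 0)       # same result with O(N) work instead of O(N^2)

assert np.allclose(dL_dx_explicit, dL_dx_implicit)
```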
\(y=xw\) (\(x:[N\times D]\), \(w:[D\times M]\), \(y:[N \times M]\))
\(\frac{dL}{dx_{i,j}}=\frac{dy}{dx_{i,j}}\cdot\frac{dL}{dy}=w_{j,:}\cdot\left(\frac{dL}{dy}\right)_{i,:}\)
\(\frac{dL}{dx}=\frac{dL}{dy}w^T\) (How to remember? Use the shape!)
\(\frac{dL}{dw}=x^T\frac{dL}{dy}\)
Hint: \(\cdot\) denotes an inner product, while juxtaposition (a blank space) denotes matrix multiplication.
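A minimal NumPy sketch with made-up shapes, checking that the formulas above produce gradients of the right shape:

```python
import numpy as np

N, D, M = 3, 4, 5                 # arbitrary sizes for illustration
x = np.random.randn(N, D)
w = np.random.randn(D, M)
y = x @ w                         # forward: [N, M]

dL_dy = np.random.randn(N, M)     # pretend upstream gradient
dL_dx = dL_dy @ w.T               # [N, M] @ [M, D] -> [N, D], matches the shape of x
dL_dw = x.T @ dL_dy               # [D, N] @ [N, M] -> [D, M], matches the shape of w
```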
Backward mode (backprop): a vector input and a scalar output (e.g. a loss), so one backward pass yields derivatives with respect to every input.
Forward mode: a scalar input and a vector output, so one forward pass yields derivatives of every output.
Compute Higher-Order Derivatives (cool!):
\(x_0 \xrightarrow{\;f_1\;} x_1 \xrightarrow{\;f_2\;} L \xrightarrow{\;f_2'\;} \frac{\partial L}{\partial x_1} \xrightarrow{\;f_1'\;} \frac{\partial L}{\partial x_0} \xrightarrow{\;\cdot\, v\;} \frac{\partial L}{\partial x_0}\cdot v\)
We want \(\frac{\partial^2L}{\partial x_0^2}\); in practice it is enough to compute the Hessian-vector product \(\frac{\partial^2L}{\partial x_0^2}v\), and, surprisingly, \(\frac{\partial^2L}{\partial x_0^2}v=\frac{\partial}{\partial x_0}\left[\frac{\partial L}{\partial x_0}\cdot v\right]\)
(\(v\) is independent of \(x_0\)).
So treat the extended graph above, whose final scalar output is \(\frac{\partial L}{\partial x_0}\cdot v\), as an ordinary computation graph and run backprop on it to get the answer (remember: backprop computes the derivatives of a scalar output with respect to the inputs).
Re-analysis: write it out backwards one more time, i.e. apply the same flat, reverse-order backward pass to this extended graph.
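A minimal sketch of this trick using PyTorch autograd (PyTorch is not used elsewhere in these notes; the loss and shapes here are made up):

```python
import torch

# Made-up setup: x0 is a 3-vector, v is a fixed vector independent of x0.
x0 = torch.randn(3, requires_grad=True)
v = torch.randn(3)

L = torch.sigmoid(x0).sum()              # some scalar loss L(x0), for illustration only

# First backward pass: dL/dx0, keeping the graph so we can differentiate through it again.
g, = torch.autograd.grad(L, x0, create_graph=True)

# Second backward pass through the scalar (dL/dx0 . v) gives the Hessian-vector product.
hvp, = torch.autograd.grad(g @ v, x0)
print(hvp)                               # equals (d^2 L / d x0^2) v
```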