Linear Classification: \(f(x_i,W) = Wx_i\)
Matrix multiply: flatten \(x_i\) into a one-dimensional vector; \(W\) is the weight matrix.
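A minimal sketch of this forward pass in NumPy (the shapes are assumptions for illustration: a 32×32×3 image flattened to 3072 dimensions, with 10 classes):

```python
import numpy as np

# Hypothetical shapes: 3072-dim flattened image, 10 classes.
x = np.random.randn(3072)              # flattened input image
W = np.random.randn(10, 3072) * 0.01   # one row of weights per class

scores = W @ x                         # f(x_i, W) = W x_i
print(scores.shape)                    # (10,) -- one score per class
```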
Multiclass SVM Loss¶
Let \(f(x_i,W)\) be the class scores; then the SVM loss has the form: \(L_i = \sum_{j\neq y_i}\max(0,s_j-s_{y_i}+1)\)
\(s_{y_i}\) is the correct label's score, while \(s_j\) is a wrong label's score. When \(s_j\) is larger than \(s_{y_i} - 1\), that class contributes to the loss, so \(L_i\) is greater than \(0\).
Characteristics: 1. If \(s_{y_i}\) is changed a little bit, the loss does not change, because after the change \(s_{y_i}\) is still at least 1 greater than every wrong label's score, so each hinge term stays at 0.
Minimum possible loss: 0; maximum: \(+\infty\).
When all scores are small random values (\(s_j \approx s_{y_i}\)), the loss is \(C - 1\), where \(C\) stands for the number of categories. This makes a useful sanity check at initialization.
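A short sketch of this loss for a single example, verifying the \(C - 1\) sanity check (the function name and shapes here are illustrative assumptions):

```python
import numpy as np

def svm_loss_single(scores, y):
    """Multiclass SVM (hinge) loss for one example.

    scores: 1-D array of class scores s_j; y: index of the correct class.
    """
    margins = np.maximum(0, scores - scores[y] + 1)  # margin of 1
    margins[y] = 0                                   # exclude j == y_i
    return margins.sum()

# Sanity check: with small random scores, each of the C - 1 wrong
# classes contributes roughly max(0, 0 + 1) = 1, so loss ≈ C - 1.
C = 10
scores = np.random.randn(C) * 0.001
print(svm_loss_single(scores, y=3))  # ≈ 9.0
```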
Regularization¶
\(L(W)=\frac{1}{N}\sum_{i=1}^NL_i(f(x_i,W),y_i)+\lambda R(W)\)
The most common regularization is the L2 norm: \(R(W)=\sum_k\sum_l W_{k,l}^2\)
Why do we need it?

- To express preferences among models beyond "minimize the training error", allowing people to integrate the wisdom and knowledge they have already obtained.
- To avoid overfitting.
Example: \(x = [1,1,1,1] \newline w_1=[1,0,0,0] \newline w_2=[0.25,0.25,0.25,0.25]\)
It’s obvious that \(w_1^\mathrm T \cdot x = w_2^\mathrm T\cdot x = 1\)
L2 regularization prefers the more balanced weight vector, which is \(w_2\) in this example. It "spreads out the weights", encouraging the classifier to take as many input dimensions into account as possible.
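A quick numeric check of this example in NumPy (values taken directly from above):

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1 @ x, w2 @ x)                # 1.0 1.0  -- identical scores
print(np.sum(w1**2), np.sum(w2**2))  # 1.0 0.25 -- L2 penalty prefers w2
```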
Prefer simpler models: Occam's Razor suggests that among models which explain the data equally well, the simpler one should be preferred.
Cross Entropy Loss¶
Softmax function:
| class | unnormalized log-prob (logit) | \(e^s\) | normalized probability |
|---|---|---|---|
| cat | 3.2 | 24.5 | 0.13 |
| car | 5.1 | 164.0 | 0.87 |
| frog | -1.7 | 0.18 | 0.00 |
unnormalized log-probabilities (logits) --exp→ unnormalized probabilities --normalize→ probabilities
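The table can be reproduced in NumPy (subtracting the max logit before exponentiating is a standard numerical-stability trick, not part of the pipeline above):

```python
import numpy as np

logits = np.array([3.2, 5.1, -1.7])   # cat, car, frog
exps = np.exp(logits - logits.max())  # exp step (shifted for stability)
probs = exps / exps.sum()             # normalize step
print(probs.round(2))                 # [0.13 0.87 0.  ]
```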
\(L_i = -\ln P(Y = y_i \mid X = x_i) = -\ln\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\) (maximum likelihood estimation)
Minimum possible loss: 0 (it can only approach 0 but never truly reach it); maximum: \(+\infty\).
When all scores are small random values, every class gets probability \(\approx 1/C\), so the loss is \(-\ln(1/C) = \ln C\), where \(C\) stands for the number of categories.
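A sketch of the cross-entropy loss for one example, verifying the \(\ln C\) sanity check (function name and shapes are illustrative assumptions):

```python
import numpy as np

def cross_entropy_single(logits, y):
    """Cross-entropy loss L_i = -ln softmax(logits)[y] for one example."""
    shifted = logits - logits.max()                   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

# Sanity check: small random scores give near-uniform probabilities,
# so the loss should be close to ln C.
C = 10
logits = np.random.randn(C) * 0.001
print(cross_entropy_single(logits, y=0), np.log(C))  # both ≈ 2.303
```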