Neural Networks
A note for the Coursera course "Advanced Learning Algorithms"
Neural networks intuition
Neural networks model
The superscript (e.g. in \(w^{[3]}\)) indicates which layer a quantity belongs to, and the subscript indexes the neuron within that layer; each neuron holds two parameters, \(w\) and \(b\). Every layer except the output layer is called a hidden layer. Layer 3 has three neurons. \(g()\) is called the activation function; here it is a sigmoid function.
Sigmoid function: \(S(x) = \frac{1}{1+e^{-x}}\)
Forward propagation
Use a 3-layer neural network to do handwritten digit recognition. Each layer is a dense layer.
Here x is fed in as a column vector; each neuron takes the dot product of its weights with x and adds b, producing one number. So \(a^{[1]}\) is a column vector with 25 entries.
Use TensorFlow to implement it:
x = np.array([[1],[2],[3]])  # a column vector
Define a layer: layer_1 = Dense(units=3, activation='sigmoid')
Get the output of the layer: a1 = layer_1(x)
Use the Sequential function to let TensorFlow build a model for you:
model = Sequential([layer_1,layer_2])
Then use model.fit(x,y) to train the model, and model.predict(x_new) to make predictions.
Define a model:
model = Sequential([layer_1, layer_2])
Use a loss function:
model.compile(loss=BinaryCrossentropy())
After fitting, make a prediction. The input to predict is an array, so a single example must be reshaped to be two-dimensional:
prediction = model.predict(X[0].reshape(1,400))  # a zero
To reduce numerical round-off error, compile with
model.compile(loss=BinaryCrossentropy(from_logits=True))
This lets TensorFlow rearrange the floating-point computation more accurately; the output layer then uses a linear activation, and probabilities are recovered by applying the sigmoid to the predictions afterwards.
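A minimal sketch of this pattern, assuming a small binary-classification network (the layer sizes here are illustrative, not from the note):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='linear'),   # outputs raw logits, not probabilities
])
model.compile(loss=BinaryCrossentropy(from_logits=True))
# After training, convert logits back to probabilities explicitly:
# probs = tf.nn.sigmoid(model.predict(x_new))
```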
Vectorisation
Originally the calculation can be done in a for loop, but using matrix multiplication makes it much faster and simpler.
The for-loop version processes one neuron at a time: it fetches that neuron's parameters, multiplies them with the input, and adds the bias b.
The dense function, sketched below, loops over the units:
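A reconstruction of the loop version, assuming (as in the course labs) that W stores one column of weights per neuron:

```python
import numpy as np

def my_dense(a_in, W, b, g):
    # W has shape (n_in, units); b has shape (units,)
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:, j]                 # parameters of neuron j
        z = np.dot(w, a_in) + b[j]  # weighted sum plus bias
        a_out[j] = g(z)             # apply the activation function
    return a_out
```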
The vectorized implementation replaces the loop with a single matrix multiplication, sketched below:
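A reconstruction assuming A_in stacks the examples as rows, so the whole batch is handled at once:

```python
import numpy as np

def my_dense_v(A_in, W, b, g):
    # A_in: (m, n_in), W: (n_in, units), b: (1, units) or (units,)
    Z = np.matmul(A_in, W) + b   # broadcasting adds b to every row
    return g(Z)                  # element-wise activation
```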
Matrix multiplication
A dot product is just element-wise multiplication followed by a sum, and almost all of the operations here reduce to dot products.
\(\mathbf{XW}\) is a matrix-matrix operation with dimensions \((m,j_1)(j_1,j_2)\) which results in a matrix with dimension \((m,j_2)\). To that, we add a vector \(\mathbf{b}\) with dimension \((1,j_2)\). \(\mathbf{b}\) must be expanded to be a \((m,j_2)\) matrix for this element-wise operation to make sense. This expansion is accomplished for you by NumPy broadcasting.
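A small demo of the shapes described above (the values are arbitrary):

```python
import numpy as np

X = np.random.rand(4, 3)   # (m, j1) = (4, 3)
W = np.random.rand(3, 2)   # (j1, j2) = (3, 2)
b = np.random.rand(1, 2)   # (1, j2); NumPy broadcasts it to (4, 2)

Z = np.matmul(X, W) + b
print(Z.shape)             # (4, 2) = (m, j2)
```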
Neural network training
Create the model
Define how many layers, how many units in each layer, and their activation functions
Select loss and cost functions
Select a loss function such as BinaryCrossentropy() from keras.losses
Train the model using gradient descent
repeat \[ \begin{align*} w_j^{[l]} &= w_j^{[l]} - \alpha \frac{\partial}{\partial w_j} J(\hat{w}, b) \\ b_j^{[l]} &= b_j^{[l]} - \alpha \frac{\partial}{\partial b_j} J(\hat{w}, b) \end{align*} \]
TensorFlow can do this for you; just call
model.fit(X,y,epochs=100)
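A sketch tying the three steps together; the toy data here is made up just so the snippet runs end to end:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# Toy data: label is 1 when the two features sum to more than 1
X = np.random.rand(100, 2)
y = (X.sum(axis=1) > 1).astype(int)

# Step 1: create the model
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=1, activation='sigmoid'),
])
# Step 2: select the loss function
model.compile(loss=BinaryCrossentropy())
# Step 3: train with gradient descent (TensorFlow handles the updates)
model.fit(X, y, epochs=100)
```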
Activation Functions
Sigmoid
Best suited to binary classification problems
The sigmoid activation function is defined as:
\[ S(x) = \frac{1}{1 + e^{-x}} \]
where:
- \(S(x)\) is the output of the sigmoid function,
- \(e\) is the base of the natural logarithm,
- \(x\) is the input to the function.
ReLU
Better suited to regression problems where the output is non-negative
The ReLU (Rectified Linear Unit) activation function is defined as:
\[ g(z) = \max(0, z) \]
where:
- \(g(z)\) is the output of the ReLU function,
- \(z\) is the input to the function.
Linear function
Better suited to regression problems where the output can be any real number
\(g(z) = z\)
What if we don't use activation function?
The network collapses into a linear model, equivalent to plain linear regression, as the derivation below shows.
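Composing two layers without activation functions (scalar case for simplicity) makes this explicit:

\[ a^{[2]} = w^{[2]}\left(w^{[1]}x + b^{[1]}\right) + b^{[2]} = \left(w^{[2]}w^{[1]}\right)x + \left(w^{[2]}b^{[1]} + b^{[2]}\right) = wx + b \]

No matter how many layers are stacked, the result is still a linear function of x.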
Multiclass Classification
Softmax algorithm
For the original logistic regression, the sigmoid output can be written as \[ a_1 = g(z) = \frac{1}{1+e^{-z}} = P(y = 1|x) \]
Softmax generalizes this to multiple classes, with the formula \[ a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} = P(y = 1|x) \] Here each \(a_j\) is the probability that \(y\) takes that particular value.
In general it can be written as: \[ z_j = \mathbf{w}_j \cdot \mathbf{x} + b_j \qquad j = 1, \dots, N \] \[ a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} = P(y = j|\mathbf{x}) \]
Cost function for softmax
\[ a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} = P(y = 1|\mathbf{x}) \] \[ \vdots \] \[ a_N = \frac{e^{z_N}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} = P(y = N|\mathbf{x}) \] \[ \text{loss}(a_1, \dots, a_N, y) = \begin{cases} -\log a_1 & \text{if } y = 1 \\ -\log a_2 & \text{if } y = 2 \\ \vdots \\ -\log a_N & \text{if } y = N \end{cases} \]
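These formulas translate directly into NumPy; the function name my_softmax is mine:

```python
import numpy as np

def my_softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    ez = np.exp(z - np.max(z))
    return ez / ez.sum()

a = my_softmax(np.array([1.0, 2.0, 3.0, 4.0]))
print(a, a.sum())   # four probabilities summing to 1
```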
How to add Softmax into neural network
Change the output layer from 1 unit to N units (one per class) and give it the softmax activation, for example:
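A sketch of the output layer for 10-class digit recognition (the unit count is an assumption):

```python
from tensorflow.keras.layers import Dense

out_layer = Dense(units=10, activation='softmax')
```

In practice the course prefers the numerically stable variant: a linear output layer combined with from_logits=True in the loss, as shown in the Adam example below.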
Multi-label classification
In this kind of problem, y can be a vector. For example, when detecting whether an image contains three different kinds of objects, y has three entries.
Solution 1 is to treat it as 3 separate problems and build 3 neural networks.
Solution 2 is to build one neural network with three outputs, so \(a^{[3]}\) has three elements, as sketched below.
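A sketch of solution 2, with one sigmoid unit per label (the hidden-layer sizes are assumptions):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid'),  # one independent probability per label
])
```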
Optimization algorithm
Adam Algorithm
Adaptive Moment Estimation
If \(w\) or \(b\) keeps moving in the same direction, Adam increases the learning rate \(\alpha\).
If \(w\) or \(b\) keeps oscillating, Adam reduces \(\alpha\).
To use it, compile the model like this:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
Complex model example
Below, compose a three-layer model:
- Dense layer with 120 units, relu activation
- Dense layer with 40 units, relu activation
- Dense layer with 6 units and a linear activation (not softmax)

Compile using
- loss with SparseCategoricalCrossentropy, remember to use from_logits=True
- Adam optimizer with learning rate of 0.01.
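A sketch matching the spec above; this is my reconstruction of the lab cell, not the original solution:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# UNQ_C3
model = Sequential([
    Dense(120, activation='relu'),
    Dense(40, activation='relu'),
    Dense(6, activation='linear'),   # outputs logits; softmax is folded into the loss
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
```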
Reconstruct your complex model, but this time include regularization. Below, compose a three-layer model:
- Dense layer with 120 units, relu activation, kernel_regularizer=tf.keras.regularizers.l2(0.1)
- Dense layer with 40 units, relu activation, kernel_regularizer=tf.keras.regularizers.l2(0.1)
- Dense layer with 6 units and a linear activation.

Compile using
- loss with SparseCategoricalCrossentropy, remember to use from_logits=True
- Adam optimizer with learning rate of 0.01.
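A sketch of the regularized variant; again my reconstruction of the lab cell, not the original solution:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# UNQ_C5
model_r = Sequential([
    Dense(120, activation='relu',
          kernel_regularizer=tf.keras.regularizers.l2(0.1)),
    Dense(40, activation='relu',
          kernel_regularizer=tf.keras.regularizers.l2(0.1)),
    Dense(6, activation='linear'),
])
model_r.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
```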
Evaluating a model
Bias and variance
Calculate \(J_{train}\) and \(J_{CV}\) and compare the results.
CV means cross-validation
If both are high, the model has high bias (underfitting); if \(J_{train}\) is low but \(J_{CV}\) is high, it has high variance (overfitting).
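A sketch of computing the two errors for a regression model, assuming a trained model and splits named X_train, y_train, X_cv, y_cv (the names are mine):

```python
import numpy as np

def j_mse(y, yhat):
    # Squared-error cost with the 1/(2m) convention used below
    return np.mean((yhat - y) ** 2) / 2

J_train = j_mse(y_train, model.predict(X_train))
J_cv = j_mse(y_cv, model.predict(X_cv))
```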
Regularization and bias/variance
\(J(\hat{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\hat{w},b}(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2\)
The cost function contains a parameter lambda that controls the strength of regularization. When lambda is large, regularization is strong and the model tends to have high bias; when lambda is small, the model overfits, leading to high variance.
How to choose a good lambda?
Start from 0, then try 0.01, 0.02, and so on, gradually increasing. Compute \(J_{CV}\) for every candidate and pick the lambda whose \(J_{CV}\) is smallest, as sketched below.
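A sketch of that sweep using scikit-learn's Ridge regression, whose alpha parameter plays the role of lambda (the data names are placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16]   # illustrative grid
j_cv = []
for lam in lambdas:
    m = Ridge(alpha=lam).fit(X_train, y_train)
    j_cv.append(np.mean((m.predict(X_cv) - y_cv) ** 2) / 2)
best_lambda = lambdas[int(np.argmin(j_cv))]
```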
Baseline level of performance
Use a few reference numbers to help judge whether an error value is high or low.
Suppose human-level error is 10.6%, training-set error is 10.8%, and cross-validation error is 14.8%. Then the training error can be considered low, but the cross-validation error is high, which suggests high variance.
If the training-set error is also high, the problem is likely high bias.
Learning curve
Plot how the error changes as the training set grows: the training-set error typically rises and the cross-validation error falls.
If the model has high bias, adding more data will not help it.
If it has high variance, adding more data raises \(J_{train}\) toward human-level performance while lowering \(J_{CV}\).
Example recipe
If the model performs poorly on the training set → use a bigger network. If it does well on the training set but poorly on the validation set → get more data. The goal is good performance on the validation set.
Decision tree model
Decision tree
Used when the features take discrete values, for example ear shape or face size.
The node at the top is the root node, the ones in the middle are decision nodes, and the ones at the bottom are leaf nodes.
Building a decision tree:
- Starting from the root node, split the training set into two parts
- On each side, choose a feature to further separate the data
- Stop when a stopping condition is triggered
- Turn the final groups into leaf nodes labeled with the decision result
Question: when do we stop splitting?
- When all examples in a node have the same label
- When the maximum depth is reached
- When the purity is already high enough
- When the number of examples in a node is too small
Measure of impurity
Use the entropy function to measure purity. Here \(p_1\) is the fraction of examples belonging to one class, and the entropy curve plots the entropy value against \(p_1\).
\(p_0 = 1 - p_1\)
\(H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)\)
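A direct translation of the formula into Python, with the usual convention that \(0 \log_2 0 = 0\):

```python
import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0                # by convention, a pure node has zero entropy
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.5))               # 1.0, the maximum impurity
```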
Choosing a split
To choose which feature to split on, use the information gain criterion.
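Information gain is the reduction in entropy achieved by a split, weighting each branch by its share of the examples:

\[ \text{Information gain} = H\left(p_1^{\text{root}}\right) - \left( w^{\text{left}} H\left(p_1^{\text{left}}\right) + w^{\text{right}} H\left(p_1^{\text{right}}\right) \right) \]

where \(w^{\text{left}}\) and \(w^{\text{right}}\) are the fractions of examples sent to each branch.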
Use one-hot encoding to encode features
If a feature can take k different values, create k binary features to represent it. This turns a multi-valued feature into a string of 0/1 values, for example:
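A small illustration using pandas; the feature name and its values are made up:

```python
import pandas as pd

df = pd.DataFrame({'ear_shape': ['pointy', 'floppy', 'oval', 'pointy']})
print(pd.get_dummies(df, columns=['ear_shape']))
# -> three 0/1 columns: ear_shape_floppy, ear_shape_oval, ear_shape_pointy
```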
Using continuous-valued features
Pick a threshold: examples whose value is above it go to one branch (1), and those below it go to the other (0). The threshold is usually chosen by computing the information gain of every candidate value and taking the best one, as sketched below.
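A sketch of that search, reusing the entropy function defined earlier; the feature values and labels are a toy example:

```python
import numpy as np

def info_gain_for_threshold(x, y, t):
    left, right = y[x <= t], y[x > t]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    h = lambda s: entropy(np.mean(s))     # entropy of a branch's label fraction
    w_left = len(left) / len(y)
    return h(y) - (w_left * h(left) + (1 - w_left) * h(right))

x = np.array([7.2, 8.4, 9.2, 10.2, 10.8, 11.5])   # e.g. animal weight
y = np.array([1, 1, 0, 1, 0, 0])                  # labels (1 = cat)
candidates = (x[:-1] + x[1:]) / 2                 # midpoints between sorted values
best_t = max(candidates, key=lambda t: info_gain_for_threshold(x, y, t))
print(best_t)
```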