Machine Learning Notes 02: Multivariate Linear Regression, Gradient Descent, and the Normal Equation

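"Machine Learning Notes 01" covered univariate linear regression and gradient descent. This post extends that material to linear regression with multiple variables (features), gradient descent for multiple variables, the normal equation, and some practical points to watch for along the way.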

Univariate linear regression

First, let's review univariate linear regression. Given a training set such as:

Size ($feet^2$)	Price ($\$1000$)
2104	460
1416	232
1534	315
852	178
…	…

our hypothesis function is $h_\theta(x)=\theta_0+\theta_1 x$

Multivariate linear regression

Next comes linear regression with multiple features/variables. Taking house-price prediction as the example again, suppose the prediction depends on four factors: Size, Number of bedrooms, Number of floors, and Age of house. Suppose the training set is:

Size ($feet^2$)	Number of bedrooms	Number of floors	Age of house (years)	Price ($\$1000$)
2104	5	1	43	460
1416	3	2	40	232
1534	3	2	30	315
852	2	1	36	178
…	…	…	…	…

Notation:

Symbol	Meaning
$n$	number of features (4 in the table above)
$x^{(i)}$	input (features) of the $i^{th}$ training example (for example, $x^{(2)}$ is the second row of the table above)
$x_j^{(i)}$	value of feature $j$ in the $i^{th}$ training example (for example, $x_2^{(3)}$ is the value 3 in the third row, second column of the table above)
$m$	number of training examples (4 rows are shown in the table above)

1. Hypothesis function

Since this is linear regression, the hypothesis is a linear function of the features:

$h_\theta(x)=\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_3+...+\theta_n x_n$

or, equivalently,

$h_\theta(x)=\theta_0 x_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_3+...+\theta_n x_n$

where $x_0$ is always 1, so the two forms are the same function.
For convenience, write

$X=\left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \\ ... \\ x_n \end{matrix} \right];\quad \theta=\left[\begin{matrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ ... \\ \theta_n \end{matrix}\right]$

so that

$\begin{aligned} h_\theta(x) &= \theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_3+...+\theta_n x_n \\ &= \left[\begin{matrix} \theta_0 & \theta_1 & \theta_2 & ... & \theta_n \end{matrix}\right] \left[\begin{matrix} x_0 \\ x_1 \\ x_2 \\ ... \\ x_n \end{matrix} \right] \\ &= \theta^{T}X \end{aligned}$

Here $\theta^T$ is a $1\times(n+1)$ matrix and $X$ is an $(n+1)\times1$ matrix. That is all for the hypothesis function; next, gradient descent with multiple variables.
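As a small illustration of $h_\theta(x)=\theta^{T}X$, the sketch below computes the hypothesis as a dot product after prepending $x_0=1$. The function name `hypothesis` and the parameter values in `theta` are made up for illustration; only the feature values come from the first row of the training table.

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T X, where X is `features` with x_0 = 1 prepended."""
    x = [1.0] + list(features)  # prepend the constant feature x_0 = 1
    if len(theta) != len(x):
        raise ValueError("theta must have n + 1 entries for n features")
    return sum(t * xj for t, xj in zip(theta, x))


# First training example from the table: size, bedrooms, floors, age.
theta = [80.0, 0.1, 10.0, 5.0, -2.0]  # made-up parameters, for illustration only
prediction = hypothesis(theta, [2104, 5, 1, 43])
print(prediction)
```

With $n=4$ features, `theta` has $n+1=5$ entries, matching the $1\times(n+1)$ shape of $\theta^T$.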

2. Gradient descent for multiple variables

1. Hypothesis function:

$\begin{aligned} h_\theta(x) &= \theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_3+...+\theta_n x_n \\ &= \theta^{T}X \end{aligned}$

2. Cost function:

$J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$

$J(\theta)=\frac{1}{2m} \sum_{i=1}^m (\theta^Tx^{(i)}-y^{(i)})^2$

$J(\theta)=\frac{1}{2m} \sum_{i=1}^m \left(\left(\sum_{j=0}^n\theta_j x_j^{(i)}\right)-y^{(i)}\right)^2$

Note: the three forms above are equivalent.
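The cost function can be sketched in a few lines of plain Python. This follows the third form above (an explicit sum over $j$ inside a sum over $i$); the function name `cost` and the tiny data set are assumptions made for illustration, not part of the original notes.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (theta^T x^(i) - y^(i))^2.

    Each row of X is one training example and already starts with x_0 = 1.
    """
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(t * xj for t, xj in zip(theta, x_i))
        total += (prediction - y_i) ** 2
    return total / (2 * m)


# Tiny made-up data set generated by y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
print(cost([1.0, 2.0], X, y))  # exact fit, so the cost is 0.0
print(cost([0.0, 0.0], X, y))  # (1 + 9 + 25) / (2 * 3)
```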

Univariate linear regression applied gradient descent only to $\theta_0$ and $\theta_1$. With multiple variables, we take the partial derivative with respect to every $\theta_j$. The algorithm has the following form:

Repeat until convergence: {

$\theta_j = \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

} (Notice: simultaneously update $\theta_j$ for every $j=0,1,2,...,n$)

Note: within one iteration, every $\theta_j$ must be updated simultaneously. For example, after updating $\theta_1$ you must not use the new $\theta_1$ to update $\theta_2$; the update of $\theta_2$ in this iteration should use the $\theta_1$ produced by the previous iteration.

The update rule

$\theta_j = \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

is equivalent to

$\theta_0 = \theta_0-\alpha\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$

$\theta_1 = \theta_1-\alpha\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_1^{(i)}$

$\theta_2 = \theta_2-\alpha\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_2^{(i)}$

$...$

$\theta_n = \theta_n-\alpha\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_n^{(i)}$

After enough iterations, the $\theta_j$ converge to (approximately) their optimal values. Gradient descent can be very slow, however, particularly when the features take values on very different scales. The next section looks at the cause and the remedy.
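The per-parameter updates can be sketched as follows. This is a minimal plain-Python version under stated assumptions: the name `gradient_descent`, the starting point $\theta=0$, the fixed iteration count, and the toy data are all illustrative choices. The point to notice is that the errors are computed from the old $\theta$ and the whole new vector is built before it replaces the old one, which is the simultaneous update.

```python
def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression.

    Each row of X already starts with x_0 = 1. Returns the fitted theta.
    """
    m, n_plus_1 = len(y), len(X[0])
    theta = [0.0] * n_plus_1
    for _ in range(iterations):
        # Errors h_theta(x^(i)) - y^(i), all computed from the *old* theta.
        errors = [sum(t * xj for t, xj in zip(theta, x_i)) - y_i
                  for x_i, y_i in zip(X, y)]
        # Build the whole new vector first: this is the simultaneous update.
        theta = [theta[j] - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
                 for j in range(n_plus_1)]
    return theta


# Tiny made-up data set generated by y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
print(theta)  # approaches [1.0, 2.0]
```

Updating `theta[j]` in place inside the loop over `j` would instead mix old and new values, which is exactly the mistake the note above warns against.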

3. Feature scaling

Consider this example: we are still predicting house prices, but each training example has two features:

	Size ($feet^2$)	Number of bedrooms
Range	0-2000	1-5

Running gradient descent directly on these features can be very slow. To see why, consider the contour plot of $J(\theta)$.

Suppose we consider only $\theta_1$ and $\theta_2$ in the hypothesis, and let

$h_\theta(x)=\theta_1 x_1+\theta_2 x_2$

Then

$J(\theta_1,\theta_2)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$

This function shows that, because $x_1$ has a large range, even a tiny change in $\theta_1$ can change $\theta_1 x_1$ a lot, whereas because $x_2$ has a small range, a change of the same size in $\theta_2$ does not change $\theta_2 x_2$ much. This is why the contours of $J(\theta)$ are very elongated ellipses, on which gradient descent makes slow progress. We need a way to fix this, and the method is simple: feature scaling.

Feature scaling, as the name suggests, rescales the range of each feature. We use the following formulas (continuing with the example above):

$x_1=\frac{Size(feet^2)}{2000} \Longrightarrow 0\le x_1\le 1$

$x_2=\frac{Number Of Bedrooms}{5}\Longrightarrow 0 \le x_2 \le 1$

After scaling every feature in this way, we get the following contour plot:

[Figure: contour plot of $J(\theta)$ after feature scaling]

The contours after scaling are much closer to circles than to very flat ellipses, so gradient descent runs faster.
In addition, we generally want every feature to fall roughly in the range -1 to 1, and features whose range does not deviate much from this need no scaling (although you may scale them anyway if you prefer). For example:

Original range	Needs feature scaling?
$0\le x_1\le3$	no
$-2\le x_2\le0.5$	no
$-100\le x_3\le100$	yes
$-0.0001\le x_4\le0.0001$	yes

There is no precise boundary between "yes" and "no" here; use your judgment.
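A minimal sketch of this kind of scaling is shown below. Note one assumption that differs slightly from the formulas above: instead of dividing by a known upper bound such as 2000 or 5, the hypothetical helper `scale_by_max` divides by the largest absolute value actually present in the column, which also puts every value between -1 and 1.

```python
def scale_by_max(column):
    """Divide every value by the largest absolute value in the column."""
    largest = max(abs(v) for v in column)
    if largest == 0:
        return list(column)  # all zeros: nothing to scale
    return [v / largest for v in column]


# Size and bedroom columns from the training table above.
sizes = [2104, 1416, 1534, 852]
bedrooms = [5, 3, 3, 2]
print(scale_by_max(sizes))
print(scale_by_max(bedrooms))  # [1.0, 0.6, 0.6, 0.4]
```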

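4. Mean normalization

As mentioned above, we want every feature to fall roughly in the range -1 to 1, and features whose range does not deviate much from this need no scaling.

Mean normalization replaces $x_j^{(i)}$ with $x_j^{(i)}-\mu_j$ so that each feature is roughly centered on 0 (typically within -1 to 1). Note that $x_0^{(i)}=1$ is not normalized. Here $\mu_j$ is the mean of feature $x_j$ over the training set. The mean normalization formula can be summarized as:

$x_j^{(i)}=\frac{x_j^{(i)}-\mu_j}{S^{(j)}}$

where $\mu_j$ is as defined above and $S^{(j)}$ is the range of the $j^{th}$ feature (maximum minus minimum).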
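The formula can be sketched as below, taking $S^{(j)}$ to be the maximum minus the minimum of the column. The helper name `mean_normalize` and the handling of a constant column are assumptions for illustration; the sample values are the "Age of house" column from the training table.

```python
def mean_normalize(column):
    """Return (x - mu) / S for each x, where S = max - min of the column."""
    mu = sum(column) / len(column)
    spread = max(column) - min(column)
    if spread == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(v - mu) / spread for v in column]


ages = [43, 40, 30, 36]  # "Age of house" column from the table above
normalized = mean_normalize(ages)
print(normalized)
```

After normalization the values average to zero and span a total width of 1, so they sit comfortably inside the -1 to 1 target range.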

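5. Learning rate $\alpha$

There are two things to keep in mind about the learning rate $\alpha$:

1. Make sure gradient descent is working correctly:

First, consider how fast gradient descent converges, using a plot of the error against the number of iterations.

In the plot, the error is still large after 100 iterations and still considerable after 200, but it is fairly small after 300 and satisfactory after 400. The interesting part is between 300 and 400: the error after 300 iterations is already acceptable, yet another 100 iterations improve it only slightly. We can therefore set a threshold on the decrease, for example stopping once $J(\theta)$ drops by less than $10^{-3}$ in one iteration, to avoid unnecessary computation. In practice, however, a good threshold is usually hard to choose.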
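One way to read the threshold idea is as an automatic convergence test: stop as soon as a single iteration lowers $J(\theta)$ by less than some $\epsilon$. The sketch below is only an illustration under that reading; the names `descend_until_converged`, `epsilon`, and `max_iterations`, and the toy data, are assumptions rather than anything defined in these notes.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    errors = [sum(t * xj for t, xj in zip(theta, x_i)) - y_i for x_i, y_i in zip(X, y)]
    return sum(e * e for e in errors) / (2 * len(y))


def descend_until_converged(X, y, alpha, epsilon=1e-3, max_iterations=10000):
    """Stop as soon as one iteration lowers J(theta) by less than epsilon."""
    m = len(y)
    theta = [0.0] * len(X[0])
    previous = cost(theta, X, y)
    for iteration in range(1, max_iterations + 1):
        errors = [sum(t * xj for t, xj in zip(theta, x_i)) - y_i
                  for x_i, y_i in zip(X, y)]
        theta = [theta[j] - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
                 for j in range(len(theta))]
        current = cost(theta, X, y)
        # Also triggers if J went up, which signals that alpha is too large.
        if previous - current < epsilon:
            return theta, iteration
        previous = current
    return theta, max_iterations


# Tiny made-up data set generated by y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta_loose, steps_loose = descend_until_converged(X, y, alpha=0.1, epsilon=1e-3)
theta_tight, steps_tight = descend_until_converged(X, y, alpha=0.1, epsilon=1e-9)
print(steps_loose, steps_tight)
```

A tighter `epsilon` costs more iterations for a smaller final error, which is the trade-off described above.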

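Second, consider the cases where gradient descent does not work properly (see the figure).

All three cases shown are caused by a learning rate $\alpha$ that is too large; gradually decrease $\alpha$ until gradient descent works properly. For why an overly large $\alpha$ causes this, see "Machine Learning Notes 01". What is certain is that if $\alpha$ is small enough, $J(\theta)$ decreases on every iteration, but if $\alpha$ is too small, gradient descent becomes very slow. In summary: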

Causes	Results
If $\alpha$ is too small	Slow convergence
If $\alpha$ is too large	$J(\theta)$ may not decrease on every iteration; may not converge

2. How to choose a suitable $\alpha$:

The choice of $\alpha$ determines how fast gradient descent converges. A common approach is to try values within a rough range of your choosing, with successive values about three times apart:

$...,0.001,0.003,0.01,0.03,0.1,0.3,1,...$

Then pick the best learning rate by experiment, for example 0.03.
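The trial-and-error procedure can be sketched by running a fixed number of iterations for each candidate $\alpha$ and comparing how $J(\theta)$ evolves. Everything here (the helper names `cost` and `cost_history`, the 50-iteration budget, and the toy data) is an assumption for illustration; which $\alpha$ works best depends entirely on the data and on how the features are scaled.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    errors = [sum(t * xj for t, xj in zip(theta, x_i)) - y_i for x_i, y_i in zip(X, y)]
    return sum(e * e for e in errors) / (2 * len(y))


def cost_history(X, y, alpha, iterations):
    """Run gradient descent from theta = 0 and record J(theta) after every step."""
    m = len(y)
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(iterations):
        errors = [sum(t * xj for t, xj in zip(theta, x_i)) - y_i
                  for x_i, y_i in zip(X, y)]
        theta = [theta[j] - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
                 for j in range(len(theta))]
        history.append(cost(theta, X, y))
    return history


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # made-up data, x_0 = 1 in every row
y = [1.0, 3.0, 5.0]
final_cost = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    history = cost_history(X, y, alpha, iterations=50)
    final_cost[alpha] = history[-1]
    print(alpha, history[0], history[-1])
```

On this toy data the largest candidate makes $J(\theta)$ grow instead of shrink, while the smallest ones barely move it in 50 iterations, matching the two rows of the summary table above.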

6. Features and polynomial regression

This part is some extra material on choosing the hypothesis function. Here is an example, again with house-price prediction.

Suppose the hypothesis function is:

$h_\theta(x)=\theta_0+\theta_1\times frontage+\theta_2\times depth$

We can instead define a single new feature

$x=Area=frontage\times depth$

so that

$h_\theta(x)=\theta_0+\theta_1 x$

Suppose the training data are distributed as follows:

[Figure: training data with the candidate fitted curves]

We might choose a higher-degree function, such as a quadratic (red):

$h_\theta(x)=\theta_0+\theta_1 x+\theta_2 x^2$

As the figure shows, however, the tail of the quadratic clearly departs from the trend of the data: a larger house should not get cheaper.
Now consider a cubic (green):

$h_\theta(x)=\theta_0+\theta_1 x+\theta_2 x^2+\theta_3 x^3$

This is a more reasonable choice. Another function that also fits well is:

$h_\theta(x)=\theta_0+\theta_1 x+\theta_2 \sqrt x$

These are some candidate hypothesis functions for this example.

We also need to pay attention to feature scaling. Suppose the hypothesis function is:

$\begin{aligned} h_\theta(x)&=\theta_0+\theta_1 x+\theta_2 x^2+\theta_3 x^3\\ &=\theta_0+\theta_1 (size)+\theta_2 (size)^2+\theta_3 (size)^3 \end{aligned}$

The ranges of the features are then:

Feature	Range
$x_1=(size)$	$1$ to $1000$
$x_2=(size)^2$	$1$ to $10^6$
$x_3=(size)^3$	$1$ to $10^9$

When applying feature scaling, be sure to divide each feature by its own range.
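To make the last point concrete, the sketch below builds the three polynomial features from a single size value and divides each by its own range (1000, $10^6$, and $10^9$ when the maximum size is 1000). The function name `polynomial_features` and its `max_size` parameter are hypothetical, introduced only for this example.

```python
def polynomial_features(size, max_size):
    """Return [x_1, x_2, x_3] = [size, size^2, size^3], each divided by its own range."""
    return [
        size / max_size,            # x_1 has range up to max_size
        size ** 2 / max_size ** 2,  # x_2 has range up to max_size^2
        size ** 3 / max_size ** 3,  # x_3 has range up to max_size^3
    ]


print(polynomial_features(500, 1000))  # [0.5, 0.25, 0.125]
```

Dividing all three features by 1000 instead would leave $x_3$ with values up to $10^6$, which defeats the purpose of scaling.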

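7. Normal equation

Because of length constraints, and because the normal equation involves a lot of material, this section is only a placeholder for now; it may get a post of its own.

That is an overview of multivariate linear regression. I hope it helps.
If you find any mistakes, corrections are welcome, either in a comment or by e-mail: artprog@163.com
-------- Please credit the source when reposting --------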