非线性最小二乘问题的求解方法

1. 非线性最小二乘问题的定义
2. 最速下降法
3. 牛顿法
4. 高斯牛顿法（Gauss Newton）
5. 列文伯格-马尔夸特法 (Levenberg-Marquardt)

希望朋友们阅读后能够留下一些提高的建议呀，哈哈哈！

1. 非线性最小二乘问题的定义

对于形如（1）的函数，希望寻找一个局部最优的(x^*)，使得(F(x))达到局部极小值(F(x^*)) 。

[egin{equation} F(mathbf{x})=frac{1}{2} sum_{i=1}^{m}left(f_{i}(mathbf{x}) ight)^{2} end{equation} ]

其中，(f_{i} : mathbf{R}^{n} mapsto mathbf{R}, i=1, ldots, m)，即 (x in R^n,f_i(x)in R)。

局部极小值：存在(delta>0)，对任意满足(left|mathbf{x}-mathbf{x}^{*} ight|<delta) 的(x)，都有(Fleft(mathbf{x}^{*} ight) leq F(mathbf{x}))。

这里举三个简单的例子：

(xin R,m=1,f_1(x)=x+1)，则(F(x)=frac{1} {2}（x+1）^2)，局部极小值在(x^*=-1)处取得。
(xin R , m=2,f_1(x)=x+1,f_2(x)=exp(3x^2+2x+1))，则(F(x)=frac{1} {2}（x+1）^2+exp(3x^2+2x+1))，此时就不容易计算出局部最优(x^*)的位置了。
(xin R^3, m=1,f_1(x)=x^Tx)，则(F(x)=frac {1} {2}(x^Tx)^2)

事实上，(f_i)也可以将(x)映射到(R^m)空间中，因为(f_i(x)^2=f_i(x)^Tf_i(x) in R)，最终计算出来的值总是一个实数。对于简单的最小二乘问题，如1，可以用求导取极值的方法得到局部极小值的位置，然而复杂的、高维的，如2和3，就只能采取一些迭代的策略来求解局部极小值了。

注意，是局部极小值而非全局最小值！对于凸函数而言是可以得到全局最小值的。

2. 最速下降法

假设函数（1）是可导并且光滑的，则可以对函数（1）在(x)处进行二阶泰勒展开为（2）式

[egin{equation} F(mathbf{x}+Delta mathbf{x})=F(mathbf{x})+Delta mathbf{x}^T mathbf{J} +frac{1}{2} Delta mathbf{x}^{ op} mathbf{H} Delta mathbf{x}+Oleft(|Delta mathbf{x}|^{3} ight) end{equation} ]

其中 (J=left[egin{array}{c}{frac{partial J(mathbf{x})}{partial x_{1}}} \ {vdots} \ {frac{partial J(mathbf{x})}{partial x_{n}}}end{array} ight])，(H=left[egin{array}{ccc}{frac{partial^{2} H(mathbf{x})}{partial x_{1} partial x_{1}}} & {cdots} & {frac{partial^{2} H(mathbf{x})}{partial x_{1} partial x_{n}}} \ {vdots} & {ddots} & {vdots} \ {frac{partial^{2} H(mathbf{x})}{partial x_{n} partial x_{1}}} & {cdots} & {frac{partial^{2} H(mathbf{x})}{partial x_{n} partial x_{n}}}end{array} ight])，(J)和(H)分别(F)对变量(x)的一阶导和二阶导。

忽略公式（2）二阶以上的导数，可以将其写成二次多项式的形式

[egin{equation} F(mathbf{x}+Delta mathbf{x}) approx F (mathbf{x})+Delta mathbf{x}^T mathbf{J} +frac{1}{2} Delta mathbf{x}^{ op} mathbf{H} Delta mathbf{x} end{equation} ]

当$x=x^* $时，我们可以得出一些结论

(J= mathbf{0})，我们称此点为驻点。（原因，假设其不为0，则必定存在使(F(x))下降的(Delta x)）
(H)为对称矩阵，且(H)为正定矩阵。（原因同样是保证不会存在使(F(x))下降的(Delta x)）

那么，如何寻找(Delta x)使得(F(x+Delta x))保持下降呢？最速下降法给出了一个解决方法，首先是忽略掉一阶以上的导数，则公式（3）可以重写为

[egin{equation} F(mathbf{x}+Delta mathbf{x}) approx F (mathbf{x})+Delta mathbf{x}^T mathbf{J} end{equation} ]

而

[egin{equation} frac{partial F(mathbf{x}+Delta mathbf{x})}{partial Delta mathbf{x} } approx mathbf{J} end{equation} ]

最速下降法的迭代更新方式为

[egin{equation} Delta x = lambda J end{equation} ]

则

[egin{equation} F(mathbf{x}+Delta mathbf{x}) approx F (mathbf{x})-lambda mathbf{J}^T mathbf{J} end{equation} ]

此方法最速体现在负梯度方向是局部下降最快的梯度方向。

3. 牛顿法

牛顿法利用了（3）式的二阶导数，收敛速度为二阶收敛，比最速下降法的一阶收敛要快。

对（3）式的(Delta x)求导可得

[egin{equation} frac{partial F(mathbf{x}+Delta mathbf{x})}{partial Delta mathbf{x} } approx mathbf{J} + mathbf{H}Delta mathbf{x} end{equation} ]

令上式等于0，即可计算出每次迭代的(Delta mathbf{x})步长

[egin{equation} Delta mathbf{x} = -mathbf{H}^{-1}mathbf{J} end{equation} ]

然而，(mathbf{H})可能不存在逆矩阵，因此可以加上阻尼因子(mu>0)

[egin{equation} Delta mathbf{x} = -(mathbf{H}+mu mathbf{I})^{-1}mathbf{J} end{equation} ]

此法保证(mathbf{H}+mu mathbf{I})一定可逆，同时阻尼因子是对(Delta x)的大小做了限制。因为最优化的式子变成了

[egin{equation} mathop {min }limits_{Delta x} F(mathbf{x}+Delta mathbf{x})+frac{1}{2}mu Delta mathbf{x}^{ op} Delta mathbf{x} approx F (mathbf{x})+Delta mathbf{x}^T mathbf{J} +frac{1}{2} Delta mathbf{x}^{ op} mathbf{H} Delta mathbf{x} + frac{1}{2}mu Delta mathbf{x}^{ op} Delta mathbf{x} end{equation} ]

对(Delta x)求导很容易得到（10）式的结论。

当然，阻尼高斯方法也存在一定的缺陷:(mathbf{H})矩阵不好计算，可能会花费更多的时间。

4. 高斯牛顿法（Gauss Newton）

前面提到的方法没有利用到最小二乘问题的形式，我想此方法之所以前面有高斯二字就是因为最小二乘形式的应用吧。

为了使公式看起来尽量简洁，我们将公式（1）写成矩阵形式

[egin{equation} F({x})=frac{1}{2}f(x)^2 end{equation} ]

其中 (mathbf{f}(mathbf{x})=left[egin{array}{c}{f_{1}(mathbf{x})} \ {vdots} \ {f_{m}(mathbf{x})}end{array} ight])，这也是我此前说(f_i)也可以将(x)映射到(R^m)空间中的原因，这种堆叠方式其实与前述的映射在原理上是一致的。

记

[egin{equation} J_{i}(mathbf{x})=frac{partial f_{i}(mathbf{x})}{partial mathbf{x}}=left[frac{partial f_{i}(mathbf{x})}{partial x_{1}} cdots frac{partial f_{i}(mathbf{x})}{partial x_{n}} ight] end{equation} ]

[egin{equation} J(mathbf{x})=left[egin{array}{cccc}{frac{partial f_{1}(mathbf{x})}{partial mathbf{x}}} & {cdots} & {frac{partial f_{m}(mathbf{x})}{partial mathbf{x}}}end{array} ight]=left[egin{array}{ccc}{frac{partial f_{1}(mathbf{x})}{partial x_{1}}} & {cdots} & {frac{partial f_{m}(mathbf{x})}{partial x_{n}}} \ {vdots} & {ddots} & {vdots} \ {frac{partial f_{1}(mathbf{x})}{partial x_{1}}} & {cdots} & {frac{partial f_{m}(mathbf{x})}{partial x_{n}}}end{array} ight] end{equation} ]

则(J_i(mathbf{x}))是一个 $ 1 imes n $ 向量，而(J(x))是一个 $ n imes n $ 矩阵。

将(f(x))进行一阶泰勒展开

[egin{equation} l(Delta mathbf{x}) =f(mathbf{x})+mathbf{J}Delta mathbf{x}approx f(x+Delta mathbf{x}) end{equation} ]

则

[egin{equation} F(x) approx L(Delta mathbf{x}) = frac{1}{2} l(Delta mathbf{x})^Tl(Delta mathbf{x}) =frac{1}{2} (f(mathbf{x})+mathbf{J}Delta mathbf{x})^T(f(mathbf{x})+mathbf{J}Delta mathbf{x}) \ =frac{1}{2}f(x)^2 + f(x)^TJ Delta x + Delta x^T J^TJDelta mathbf{x} end{equation} ]

对(L(Delta x))求导并令其为0，则

[egin{equation} frac{partial L(Delta mathbf{x})}{partial Delta mathbf{x}} = f(x)^TJ + Delta x^T J^TJ \ = J^Tf(x) + J^TJDelta x = 0 end{equation} ]

可得

[egin{equation} Delta x = -（J^TJ)^{-1}J^Tf(x) end{equation} ]

高斯牛顿法有效的避免了Hessian矩阵的计算，可以加快运算速度。

5. 列文伯格-马尔夸特法 (Levenberg-Marquardt)

相当于添加阻尼因子，目的是使步长不要太大，起到限制步长的作用。

[egin{equation} mathop {min }limits_{Delta x} {F(mathbf{x}+Delta mathbf{x})+frac{1}{2}mu Delta mathbf{x}^{ op} Delta mathbf{x}} approx frac{1}{2}f(x)^2 + f(x)^TJ Delta x + Delta x^T J^TJDelta mathbf{x} + frac{1}{2}mu Delta mathbf{x}^{ op} Delta mathbf{x} end{equation} ]

对(Delta x)求导之后得到

[egin{equation} f(x)^TJ +J^TJDelta x+mu Delta x = 0 end{equation} ]

[egin{equation} \ Delta x = -(J^TJ+mu I)^{-1}f(x)^TJ end{equation} ]

当然，这里的步长都有一定的更新策略，而基本的方法就是上面这些内容。

LM方法在高斯牛顿方法的基础上添加了正则化约束，能够权衡步长与非线性程度之间的利弊。当迭代点附近的非线性程度比较高时，倾向于增大阻尼因子而减小步长，此时接近最速下降方法；而当系统近乎线性时，减小阻尼因子而增大步长，此时接近高斯牛顿方法。同时阻尼因子的引入也保证了((J^TJ+mu I)^{-1})有意义。

参考文献

[1] methods for non-linear least squares problems.

转载请获得作者同意。