<h1>
<center>
CRF及求解算法L-BFGS,Line Search原理及代码分析
</center> </h1>
<center>
wangpeng(qqlantian@126.com)
</center>
<center>
Last updated on 2017-2-17
</center>
<p>由于博客园对markdown支持不完善(或者我不太会用),一些公式和引用展示不正常或不好看,建议下载pdf版阅读https://github.com/AlexPengW/resource/tree/master/pdf</p>
<h2>Abstract</h2>
<p> 本文首先介绍<strong>CRF</strong>[^CRF]模型及用来提高CRF求解和预测效率的<strong>Forward-Backward</strong> 算法,然后介绍经典的无约束优化算法<strong>Line Search</strong>[^Numerical optimization],<strong>DFP</strong>[^3],<strong>BFGS</strong>,<strong>L-BFGS</strong>[^4], <strong>IIS</strong>[^5],给出一些必要性的证明,最后结合[CRF++代码][crfpp]讲解CRF和L-BFGS的代码实现。</p>
<p>[TOC]</p>
<h2>CRF模型</h2>
<h3>CRF模型概述</h3>
<blockquote>
<p>CRFs are a type of discriminative undirected probabilistic graphical model. It is used to encode known relationships between observations and construct consistent interpretations. It is often used for labeling or parsing of sequential data, such as natural language text or biological sequences and in computer vision. —— Wikipedia</p>
</blockquote>
<p> 本文主要描述Linear-chain CRF,<br> 记<span class="math inline">\(P(y|x,\lambda )\)</span>为CRF模型定义的条件概率, 即给定观测序列<span class="math inline">\(x\)</span>时,任意标注序列<span class="math inline">\(y\)</span>的条件概率,则<span class="math inline">\(P(y|x,\lambda )\)</span>定义如下:<br> \begin{equation}<br> \begin{cases}<br> P\left(y|x,\lambda \right)&=\dfrac{1}{Z(x)}exp\left(\sum_{i,k}\lambda_kt_k\left(y_{i-1},y_i,x,i\right)+\sum_{i,l}\mu_ls_l\left(y_i,x,i\right)\right) \\<br> Z\left(x\right)&=\sum_yexp\left(\sum_{i,k}\lambda_kt_k\left(y_{i-1},y_i,x,i\right)+\sum_{i,l}\mu_ls_l\left(y_i,x,i\right)\right)<br> \end{cases}<br> \end{equation}<br> 其中<span class="math inline">\(s_l(y_i,x,i)\)</span>为状态特征函数,<span class="math inline">\(t_k(y_{i-1},y_i,x,i)\)</span>为转移特征函数。特征函数由用户定义,[CRF++代码][crfpp]里面用Unigram模板提取状态特征,用Bigram模板提取转移特征,对应的特征函数返回0或1。<span class="math inline">\(\mu_l\)</span>和<span class="math inline">\(\lambda_k\)</span>分别是对应的特征权值,即待求解的模型参数。<br> 上面的概率公式具体来历可以参考统计学习方法[^LiHang]书中的最大熵模型,或者这篇博客最大熵模型[^MME_zl],这里不再叙述。<br> 为了简化公式,记<span class="math inline">\(s(y_i,x,i)=s(y_{i-1},y_i,x,i)\)</span>, <span class="math inline"> \(f(y_{i-1},y_i,x,i)\)</span>或者是个状态函数<span class="math inline">\(s(y_{i-1},y_i,x,i)\)</span>或者是个转移函数<span class="math inline">\(t(y_{i-1},y_i,x,i)\)</span>,记<span class="math inline">\(N\)</span>为某个样本序列长度,不同样本序列长度可能不同,记<span class="math inline">\(L\)</span>为标记序列<span class="math inline">\(y\)</span>在任意位置可能取到的标记个数,记<span class="math inline">\(T\)</span>为特征数量。则<span class="math inline">\(P(y|x,\lambda )\)</span>也可以定义如下:<br> \begin{equation}\begin{cases}<br> P\left(y|x,\lambda \right)&=\dfrac{1}{Z(x)}exp\left(\sum_j\lambda_jF_j\left(y,x\right)\right)\\<br> F_j\left(y,x\right)&=\sum_{i=1}^Nf_j\left(y_{i-1}, y_i,x,i\right)\\<br> Z(x)&=\sum_yexp\left(\sum_j\lambda_jF_j\left(y,x\right)\right)<br> \end{cases}\end{equation}</p>
<h3>CRF模型求解</h3>
<p> 给定训练样本<span class="math inline">\(\{(x^m,y^m)\}, m =\{1,...,M\}\)</span>, 则CRF模型的对数似然函数为:<br> \begin{equation}<br> \mathcal{L}(\lambda)=\sum_mlogP\left(y^m|x^m,\lambda\right)=\sum_m\left(\sum_j\lambda_jF_j\left(x^m,y^m\right)-logZ\left(x^m\right)\right)<br> \end{equation}<br> 可以证明上面的函数是关于<span class="math inline">\(\lambda\)</span>的凹函数,我只试了一个不优美的证明方法,所以就不列出来了,简单来说就是利用<strong>[定理A function is convex if and only if it is convex when restricted to any line that intersects its domain,这个定理也容易证明]</strong>将高维上的证明转化为一维上的证明,然后证明一维函数二阶导数小于0即可。通常我们求解时都会在后面减去<span class="math inline">\(L_2\)</span>正则化项<span class="math inline">\(\dfrac{\sum_j\lambda_j^2}{2C}\)</span>,减去<span class="math inline">\(L_2\)</span>正则化项后就是严格凹函数(strictly concave),由严格凹函数性质知<span class="math inline">\(\mathcal{L}(\lambda)\)</span>有唯一一个全局最大值,且最大值处导出为<span class="math inline">\(0\)</span>。<br> 将对数似然对<span class="math inline">\(\lambda_j\)</span>求导得:<br> \begin{equation}<br> \begin{cases}<br> \dfrac{\partial\mathcal{L} |
|