I. GBDT Formula Derivation
1. The first base function:
$$F_0(X)=\frac{1}{2}\log\frac{1+\overline{y}}{1-\overline{y}} \tag{1.1}$$
For the loss function $L(y, F)=\log(1+e^{-2yF}),\ y\in\{-1, 1\}$, we want the value of $F$ that minimizes the loss. Take the first derivative:
$$L^{'}=\frac{-2ye^{-2yF}}{1+e^{-2yF}} \tag{1.2}$$
Let
$$\begin{cases}
count(y=+1)=m_1\\
count(y=-1)=m_2
\end{cases} \tag{1.3}$$
Then we have
$$\begin{cases}
m_1+m_2=n\\
\frac{m_1-m_2}{n}=\overline{y}
\end{cases} \tag{1.4}$$
that is,
$$\begin{cases}
m_1=\frac{n}{2}(1+\overline{y})\\
m_2=\frac{n}{2}(1-\overline{y})
\end{cases} \tag{1.5}$$
Setting $\sum L^{'}=0$, we get
$$\sum L^{'}=\sum_{y=1}L^{'}+\sum_{y=-1}L^{'}=0 \tag{1.6}$$
Substituting $(1.5)$ into $(1.6)$ gives
$$\begin{aligned}
\sum L^{'}&=\sum_{y=1}L^{'}+\sum_{y=-1}L^{'}\\
&=\frac{n}{2}(1+\overline{y})\cdot\frac{-2e^{-2F}}{1+e^{-2F}} + \frac{n}{2}(1-\overline{y})\cdot\frac{2e^{2F}}{1+e^{2F}}\\
&=\frac{n}{2}(1+\overline{y})\cdot\frac{-2}{1+e^{2F}} + \frac{n}{2}(1-\overline{y})\cdot\frac{2e^{2F}}{1+e^{2F}}\\
&=\frac{n}{1+e^{2F}}\left[-(1+\overline{y}) + e^{2F}(1-\overline{y})\right]
\end{aligned} \tag{1.7}$$
Setting $(1.7)$ to zero yields $F_0=\frac{1}{2}\log\frac{1+\overline{y}}{1-\overline{y}}$. When $F<F_0$, $\sum L^{'}<0$; when $F>F_0$, $\sum L^{'}>0$. Hence $F_0$ is the minimum point. QED.
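As a quick sanity check, here is a minimal Python sketch (toy labels are made up) comparing the closed-form $F_0$ from $(1.1)$ against a direct numerical minimization of the summed loss:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical toy labels y in {-1, +1}; any mix works as long as y_bar != +-1.
y = np.array([1, 1, 1, -1, -1, 1, -1, 1])
y_bar = y.mean()

# Closed-form initial score from Eq. (1.1).
F0_closed = 0.5 * np.log((1 + y_bar) / (1 - y_bar))

# Direct numerical minimization of sum(log(1 + exp(-2*y*F))) over F.
total_loss = lambda F: np.sum(np.log1p(np.exp(-2 * y * F)))
F0_numeric = minimize_scalar(total_loss).x

print(F0_closed, F0_numeric)  # the two values agree closely
```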
2. Optimal leaf-node values:
$$\gamma_{jm}=\sum_{x_{i}\in R_{jm}}\widetilde{y}_i \bigg/ \sum_{x_{i}\in R_{jm}}|\widetilde{y}_i|(2-|\widetilde{y}_i|),\quad j=1,\dots,J \tag{1.8}$$
From Eq. $(23)$ of reference $[1]$, the objective function to optimize is:
$$\gamma_{jm}=\mathop{arg\,min}\limits_{\gamma}\sum\limits_{x_i\in R_{jm}}\log(1+e^{-2y_i(F_{m-1}(x_i)+\gamma)}) \tag{1.9}$$
From Eq. $(22)$ of reference $[1]$, the negative gradient of the loss is:
$$\widetilde{y}_i=-\left[\frac{\partial{L(y_i,F(x_i))}}{\partial{F(x_i)}}\right]_{F(x)=F_{m-1}(x)}=\frac{2y_i}{1+e^{2y_iF_{m-1}(x_i)}} \tag{1.10}$$
Then
$$e^{2y_iF_{m-1}(x_i)}=\frac{2y_i}{\widetilde{y}_i}-1 \tag{1.11}$$
Taking a second-order Taylor expansion of $(1.9)$ around $F_{m-1}$ gives
$$\gamma_{jm}=\mathop{arg\,min}\limits_{\gamma}\sum\limits_{x_i\in R_{jm}}\left[L_{F_{m-1}}+L_{F_{m-1}}^{'}\,\gamma+\frac{1}{2}L_{F_{m-1}}^{''}\,\gamma^{2}\right] \tag{1.12}$$
where
$$\begin{cases}
L_{F_{m-1}}=\log(1+e^{-2y_iF_{m-1}(x_i)})\\
L_{F_{m-1}}^{'}=-\widetilde{y}_i\\
L_{F_{m-1}}^{''}=e^{2y_iF_{m-1}(x_i)}\,\widetilde{y}_i^{2}
\end{cases} \tag{1.13}$$
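The derivatives in $(1.13)$ can be verified numerically; the sketch below (arbitrary test point, made-up numbers) checks them against central finite differences:

```python
import numpy as np

# Finite-difference check of Eq. (1.13) at an arbitrary point (y = 1, F = 0.3);
# the numbers are illustrative only.
y, F, eps = 1.0, 0.3, 1e-5
L = lambda F: np.log1p(np.exp(-2 * y * F))

y_tilde = 2 * y / (1 + np.exp(2 * y * F))        # pseudo-response, Eq. (1.10)
L1_closed = -y_tilde                             # L'  from Eq. (1.13)
L2_closed = np.exp(2 * y * F) * y_tilde ** 2     # L'' from Eq. (1.13)

L1_fd = (L(F + eps) - L(F - eps)) / (2 * eps)            # central difference
L2_fd = (L(F + eps) - 2 * L(F) + L(F - eps)) / eps ** 2

print(L1_closed, L1_fd)  # match to ~1e-10
print(L2_closed, L2_fd)  # match to ~1e-5
```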
Since $L_{F_{m-1}}^{''}>0$ in $(1.12)$, a global minimum exists. The point where the derivative of $(1.12)$ with respect to $\gamma$ equals zero is the minimum, i.e.,
$$\begin{aligned}
&\sum\limits_{x_i\in R_{jm}}\left[L_{F_{m-1}}^{'}+L_{F_{m-1}}^{''}\,\gamma_{jm}\right]=0\\
\Longrightarrow\quad &\gamma_{jm}=-\frac{\sum\limits_{x_i\in R_{jm}}L_{F_{m-1}}^{'}}{\sum\limits_{x_i\in R_{jm}}L_{F_{m-1}}^{''}}\\
\Longrightarrow\quad &\gamma_{jm}=\frac{\sum\limits_{x_i\in R_{jm}}\widetilde{y}_i}{\sum\limits_{x_i\in R_{jm}}\left(\frac{2y_i}{\widetilde{y}_i}-1\right)\widetilde{y}_i^{2}}\\
\Longrightarrow\quad &\gamma_{jm}=\frac{\sum\limits_{x_i\in R_{jm}}\widetilde{y}_i}{\sum\limits_{x_i\in R_{jm}}\left(2y_i-\widetilde{y}_i\right)\widetilde{y}_i}
\end{aligned} \tag{1.14}$$
From $(1.10)$: when $y_i=1$, $\widetilde{y}_i\in(0,2)$; when $y_i=-1$, $\widetilde{y}_i\in(-2,0)$. In both cases $y_i\widetilde{y}_i=|\widetilde{y}_i|$, so $(2y_i-\widetilde{y}_i)\widetilde{y}_i = 2|\widetilde{y}_i|-|\widetilde{y}_i|^2$ and $(1.14)$ simplifies to
$$\gamma_{jm}=\frac{\sum\limits_{x_i\in R_{jm}}\widetilde{y}_i}{\sum\limits_{x_i\in R_{jm}}(2-|\widetilde{y}_i|)\,|\widetilde{y}_i|}$$
which matches $(1.8)$. QED.
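A toy numerical sketch (labels and previous scores are made up) comparing the closed form $(1.8)$ with the exact minimizer of $(1.9)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy check of the leaf value (1.8); labels and previous scores are made up.
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])        # labels falling in leaf R_jm
F_prev = np.array([0.2, -0.1, 0.4, -0.3, 0.1])   # F_{m-1}(x_i), hypothetical

# Pseudo-responses from Eq. (1.10).
y_tilde = 2 * y / (1 + np.exp(2 * y * F_prev))

# Single Newton step, Eq. (1.8).
gamma_newton = y_tilde.sum() / (np.abs(y_tilde) * (2 - np.abs(y_tilde))).sum()

# Exact minimizer of Eq. (1.9), for comparison.
obj = lambda g: np.sum(np.log1p(np.exp(-2 * y * (F_prev + g))))
gamma_exact = minimize_scalar(obj).x

print(gamma_newton, gamma_exact)  # close but not identical
```

Note that $(1.8)$ is a single Newton–Raphson step for $(1.9)$ rather than its exact minimizer, which is why the two printed values agree only approximately.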
II. XGBoost Formula Derivation
1. Taylor expansion of the XGBoost objective:
$$\begin{aligned}
L^{(t)}&=\sum_{i=1}^{n} l(y_i,\hat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t) \\
&\approx\sum_{i=1}^{n}\left[l(y_i,\hat{y}_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)
\end{aligned} \tag{2.1}$$
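To make the approximation concrete, here is a one-sample numerical sketch (made-up numbers, using the logitloss of Eq. $(2.3)$ below as $l$) comparing the exact loss with its second-order expansion:

```python
import numpy as np

# One-sample check of the expansion in (2.1), using the logit loss of Eq. (2.3)
# below as the concrete l; all numbers are illustrative.
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
l = lambda s, y: y * np.log1p(np.exp(-s)) + (1 - y) * np.log1p(np.exp(s))

y, s_prev, f = 1.0, 0.4, 0.05    # label, previous prediction, new-tree output f_t(x_i)
p = sigmoid(s_prev)
g, h = p - y, p * (1 - p)        # first/second derivatives of l at s_prev (cf. Eq. (2.4))

exact = l(s_prev + f, y)
taylor = l(s_prev, y) + g * f + 0.5 * h * f ** 2
print(exact, taylor)             # agree to O(f^3) for small f
```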
Since $l(y_i,\hat{y}_i^{(t-1)})$ depends only on the previous round's predictions, it is a constant in the current iteration and can be dropped from the loss. The objective to minimize in this round is therefore:
$$\begin{aligned}
\tilde{L}^{(t)}&=\sum_{i=1}^{n}\left[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)\right] + \Omega(f_t) \\
&=\sum_{i=1}^{n}\left[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T}w_j^2 \\
&=\sum_{j=1}^{T}\left[\Big(\sum_{i\in I_j}g_i\Big)w_j + \frac{1}{2}\Big(\sum_{i\in I_j}h_i + \lambda\Big)w_j^2\right] + \gamma T
\end{aligned} \tag{2.2}$$
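Each $w_j$ appears in $(2.2)$ only through its own quadratic $G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2$ with $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$, so setting the derivative to zero gives the familiar optimal leaf weight $w_j^*=-G_j/(H_j+\lambda)$ (see $[2]$). A minimal numeric sketch with made-up $G$, $H$, $\lambda$:

```python
import numpy as np

# Per-leaf piece of Eq. (2.2): G*w + 0.5*(H + lam)*w^2  (the gamma*T term does
# not depend on w). G, H, lam are made-up values for one leaf.
G, H, lam = 1.7, 3.2, 1.0

w = np.linspace(-2, 2, 100001)
obj = G * w + 0.5 * (H + lam) * w ** 2

w_star = -G / (H + lam)           # closed form from d(obj)/dw = 0
print(w_star, w[np.argmin(obj)])  # grid minimum agrees to ~1e-4
```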
2. The logitloss function
① First form:
$$L = y\,\log(1+ e^{-\hat{y}}) + (1-y)\,\log(1+ e^{\hat{y}}) \tag{2.3}$$
$$\begin{cases}
L^{'}=\dfrac{-y+(1-y)\,e^{\hat y}}{1+e^{\hat{y}}}\\
L^{''}=\dfrac{e^{\hat{y}}}{(1+e^{\hat{y}})^2}
\end{cases} \tag{2.4}$$
Since $\hat{y}\in(0,1)$, we have $L^{''}\in\left(\frac{e}{(1+e)^2},\ 0.25\right)$; more generally $L^{''}\le 0.25$ for any $\hat{y}$, so this form's second derivative is bounded.
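This gradient/Hessian pair is what one would supply to a custom binary-logistic objective in a boosting library (which typically expects per-sample $g$ and $h$ on the raw score). A hand-rolled sketch with made-up inputs, simplifying $(2.4)$ via $L^{'}=\sigma(\hat y)-y$ and $L^{''}=\sigma(\hat y)(1-\sigma(\hat y))$:

```python
import numpy as np

def logitloss_grad_hess(y_hat, y):
    """Gradient and Hessian of Eq. (2.3) w.r.t. the raw score y_hat.

    Eq. (2.4) simplifies to L' = sigmoid(y_hat) - y, L'' = sigmoid * (1 - sigmoid).
    """
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

# Made-up raw scores and 0/1 labels.
g, h = logitloss_grad_hess(np.array([-1.2, 0.0, 2.5]), np.array([0.0, 1.0, 1.0]))
print(g)  # gradient per sample
print(h)  # Hessian per sample; never exceeds 0.25
```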
② Second form:
$$L = -\left[y\,\log(\hat{y}) + (1-y)\,\log(1-\hat{y})\right] \tag{2.5}$$
$$\begin{cases}
L^{'} = \dfrac{-y}{\hat{y}} + \dfrac{1-y}{1-\hat{y}}\\
L^{''} = \dfrac{y}{\hat{y}^2} + \dfrac{1-y}{(1-\hat{y})^2}
\end{cases} \tag{2.6}$$
Since $\hat{y}\in(0,1)$, we have $L^{''}\in(1, +\infty)$, i.e., this form's second derivative is unbounded.
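To see the practical difference, the sketch below (made-up probabilities, $y=1$) evaluates both second derivatives: form ①'s stays bounded by $0.25$, while form ②'s blows up for confident mistakes, which is one reason boosting implementations typically fit on the raw score rather than the probability.

```python
import numpy as np

# Second derivatives of the two forms for y = 1, at made-up probabilities p.
p = np.array([0.001, 0.01, 0.5, 0.99])

h_form1 = p * (1 - p)   # form (1) at the raw score logit(p): bounded by 0.25
h_form2 = 1.0 / p ** 2  # form (2) with y = 1: L'' = 1 / p^2
print(h_form1)          # [~0.001, ~0.0099, 0.25, ~0.0099]
print(h_form2)          # [1e6, 1e4, 4.0, ~1.02] -- unbounded as p -> 0
```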
References
[1] Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine[J]. The Annals of Statistics, 2001, 29(5):1189-1232.
[2] Tianqi Chen, Carlos Guestrin. XGBoost: A Scalable Tree Boosting System[C]. KDD '16.
[3] Guolin Ke, Qi Meng, Thomas Finley, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree[C]. NIPS 2017.
[4] 李航. 统计学习方法[M]. 138-154.
[5] XGBoost入门系列第一讲 (Introduction to XGBoost, Part 1). https://zhuanlan.zhihu.com/p/27816315