These are my study notes and summary; if you spot any mistakes, corrections are very welcome.
This post covers basic usage of two Kaggle heavyweights, the xgboost and lightgbm libraries, along with example code.
First, a quick recap of the boosting principle and the algorithms derived from it: AdaBoost, GBDT, and the stronger XGBoost that followed. If you need a refresher, see my earlier article "ML course: decision trees, random forests, GBDT, and XGBoost (with code implementations)", plus the related ensemble material in "ML course: model ensembling and tuning, with example code". Recap over, on to the main content.
XGBoost:
Short for eXtreme Gradient Boosting (source code here: xgboost), it is a library implementing a scalable, portable, distributed GBDT algorithm, originally developed by Tianqi Chen's team and now maintained by many contributors. It can be used from C++, Python, R, Julia, Java, and Scala, and runs on Hadoop.
xgboost is faster to train for several reasons:
- Parallelization: construction of a single tree can be parallelized across all CPU cores.
- Distributed Computing: very large models can be trained on a cluster.
- Out-of-Core Computing: very large datasets that do not fit in memory can still be processed.
- Cache Optimization of data structures and algorithms: the hardware is used more effectively.
The figure below compares XGBoost with other gradient boosting and bagged decision tree implementations:

Another strength of xgboost is its predictive performance, which these competition winners demonstrate:
- Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup 2016 competition. Link to the arxiv paper.
- Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition. Link to the Kaggle interview.
- Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.
The most commonly used parts of XGBoost:
As with sklearn, the library's documentation has a few sections you will use constantly:
- XGBoost Tutorials: worked examples showing how to use the library.

- XGBoost Parameters: the parameters to tune, grouped into general parameters, booster parameters, and task parameters.

- Python API Reference: the various API interfaces.

- Advanced usage: get the source code on GitHub and modify the relevant parts; for example, we can define a custom loss function and evaluation metric:
#!/usr/bin/python
# note: the raw data must first be converted to the .train/.test format
import numpy as np
import xgboost as xgb
###
# advanced: customized loss function
#
print('start running example to use customized objective function')
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
# note: for a customized objective function, we leave objective as default
# note: what we get in prediction is then the margin value
# you must know what you are doing
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# user-defined objective function: given predictions, return the gradient
# and second-order gradient (this is log-likelihood loss)
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess  # grad and hess are the first and second derivatives

# user-defined evaluation function: returns a pair (metric_name, result)
# NOTE: with a customized loss function, the default prediction value is the margin.
# This may make built-in evaluation metrics misbehave: for logistic loss the
# prediction is the score before the logistic transformation, while the built-in
# evaluation error assumes input after the transformation. Keep this in mind
# when customizing; you may need a customized evaluation function as well.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # the metric name must not contain a colon (:) or a space
    # preds are margins (before the logistic transformation, cutoff at 0)
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with the customized objective; we can also train step by step --
# simply look at the implementation of train in xgboost's Python package
bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
Related links:
End-to-end xgboost mini-project: https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/
xgboost sklearn API: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
xgboost API: https://xgboost.readthedocs.io/en/latest/
GitHub source: https://github.com/dmlc/xgboost
LightGBM:
Like XGBoost, LightGBM is a tool library, open-sourced by Microsoft. The difference is that it trains faster, especially on large datasets, and supports more algorithms.
The most commonly used parts of LightGBM:
- Tutorials: yes, the documentation is available in Chinese -- happy now?

- Parameters: all the parameters: core parameters, learning control parameters, objective parameters, and so on.

- API: you didn't really think the whole thing was in Chinese, did you?

Finally, back to the example code: you are welcome to follow my GitHub.
To be continued...