These are my study notes and summary; if you spot any mistakes, corrections are very welcome.
This post covers basic usage of two Kaggle heavyweights, the xgboost and lightgbm libraries, along with example code.
First, a quick recap of the boosting principle and the algorithms derived from it: AdaBoost, GBDT, and the stronger XGBoost that followed. If you need a refresher, see my earlier article "ML course: decision trees, random forests, GBDT, and XGBoost (with code implementations)", plus the related ensemble material in "ML course: model ensembling and tuning, with example code". Recap over, on to the main content.
XGBoost:
Short for eXtreme Gradient Boosting (source code here: xgboost), it is a library implementing a scalable, portable, distributed GBDT algorithm, originally developed by Tianqi Chen's team and now maintained by many contributors. It can be used from C++, Python, R, Julia, Java, and Scala, and runs on Hadoop.
xgboost is faster to train for several reasons:
- Parallelization: construction of a single tree can be parallelized across all CPU cores.
- Distributed Computing: very large models can be trained on a cluster.
- Out-of-Core Computing: very large datasets that do not fit in memory can still be processed.
- Cache Optimization of data structures and algorithms: the hardware is used more effectively.
The figure below compares XGBoost with other gradient boosting and bagged decision tree implementations:

Another strength of xgboost is its predictive performance, which these competition winners demonstrate:
- Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup 2016 competition. Link to the arxiv paper.
- Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition. Link to the Kaggle interview.
- Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.
The most commonly used parts of XGBoost:
As with sklearn, the library's documentation has a few sections you will use constantly:
- XGBoost Tutorials: worked examples showing how to use the library.

- XGBoost Parameters: the parameters to tune, grouped into general parameters, booster parameters, and task parameters.

- Python API Reference: the various API interfaces.

- Advanced usage: get the source code on GitHub and modify the relevant parts; for example, we can define a custom loss function and evaluation metric:
#!/usr/bin/python
# note: the raw data must first be converted to the .train/.test format
import numpy as np
import xgboost as xgb
###
# advanced: customized loss function
#
print('start running example to use customized objective function')
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
# note: for a customized objective function, we leave objective as default
# note: what we get in prediction is then the margin value
# you must know what you are doing
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# user-defined objective function: given predictions, return the gradient
# and second-order gradient (this is log-likelihood loss)
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess  # grad and hess are the first and second derivatives

# user-defined evaluation function: returns a pair (metric_name, result)
# NOTE: with a customized loss function, the default prediction value is the margin.
# This may make built-in evaluation metrics misbehave: for logistic loss the
# prediction is the score before the logistic transformation, while the built-in
# evaluation error assumes input after the transformation. Keep this in mind
# when customizing; you may need a customized evaluation function as well.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # the metric name must not contain a colon (:) or a space
    # preds are margins (before the logistic transformation, cutoff at 0)
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with the customized objective; we can also train step by step --
# simply look at the implementation of train in xgboost's Python package
bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
Related links:
End-to-end xgboost mini-project: https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/
xgboost sklearn API: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
xgboost API: https://xgboost.readthedocs.io/en/latest/
GitHub source: https://github.com/dmlc/xgboost
LightGBM:
Like XGBoost, LightGBM is a tool library, open-sourced by Microsoft. The difference is that it trains faster, especially on large datasets, and supports more algorithms.
The most commonly used parts of LightGBM:
- Tutorials: yes, the documentation is available in Chinese -- happy now?

- Parameters: all the parameters: core parameters, learning control parameters, objective parameters, and so on.

- API: you didn't really think the whole thing was in Chinese, did you?

Finally, back to the example code: you are welcome to follow my GitHub.
To be continued...