Advanced tools: XGBoost/LightGBM and hands-on modeling

Anonymous technical user   2020-12-27 01:22

XGBoost

  • General parameters
  1. booster [default=gbtree]: gbtree or gblinear
  • Booster parameters
  1. eta [default=0.3, can be viewed as the learning rate]

    Step-size shrinkage used in updates to prevent overfitting. After each boosting round, the algorithm obtains the weights of the new features; eta shrinks these weights to make the boosting process more conservative. Default is 0.3.
    Range: [0, 1]

  2. gamma [default=0]

    Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
    Range: [0, ∞]

  3. max_depth [default=6]

    Maximum depth of a tree.
    Range: [1, ∞]

  4. min_child_weight [default=1]

    Minimum sum of instance weights needed in a child node. If a partition step during tree growth produces a leaf node whose instance-weight sum is below min_child_weight, that partition is abandoned. In linear regression mode, this simply corresponds to a minimum number of instances per node. The larger the value, the more conservative the algorithm.
    Range: [0, ∞]

  5. subsample [default=1]

    Subsample ratio of the training instances. Setting it to 0.5 means XGBoost randomly samples half of the data to grow each tree, which helps prevent overfitting.
    Range: (0, 1]

  6. colsample_bytree [default=1]

    Subsample ratio of columns (features) used when constructing each tree.
    Range: (0, 1]

  7. lambda [default=1, alias: reg_lambda]

    L2 regularization term on the weights.

  8. scale_pos_weight [default=1]

    When classes are highly imbalanced, setting this to a positive value can make the algorithm converge faster.
    A typical value to consider: sum(negative cases) / sum(positive cases). See the Higgs Kaggle competition demo for examples: R, py1, py2, py3

  • Task parameters
  1. objective [default=reg:linear] Defines the loss function to be minimized. The most common values are:

    "reg:linear" -- linear regression (renamed reg:squarederror in later XGBoost versions)
    "reg:logistic" -- logistic regression
    "binary:logistic" -- logistic regression for binary classification, returns the predicted probability (not the class)
    "binary:logitraw" -- outputs the raw score before the logistic transformation
    "count:poisson" -- Poisson regression for count data, outputs the mean of the Poisson distribution
    (max_delta_step is set to 0.7 by default in Poisson regression, used to safeguard optimization)
    "multi:softmax" -- multiclass classification with the softmax objective; you must also set num_class (the number of classes)
    "multi:softprob" -- outputs an ndata * nclass probability matrix
    "rank:pairwise" -- ranking by minimizing the pairwise loss
    "reg:gamma" -- gamma regression with log-link; outputs the mean of the gamma distribution. Useful, e.g., for modeling insurance claim severity, or any outcome that might be gamma-distributed
    "reg:tweedie" -- Tweedie regression with log-link. Useful, e.g., for modeling total loss in insurance, or any outcome that might be Tweedie-distributed.

  2. base_score [default=0.5]

    The initial prediction score of all instances (global bias).
    With a sufficient number of iterations, changing this value has little effect.

  3. eval_metric [default: chosen automatically according to the objective]

    The available options are:
    a. "rmse": root mean squared error
    b. "mae": mean absolute error
    c. "logloss": negative log-likelihood
    d. "error": binary classification error rate
    e. "error@t": error rate with a user-specified threshold t (instead of 0.5)
    f. "merror": multiclass classification error rate, computed as #(wrong cases)/#(all cases)
    g. "mlogloss": multiclass log loss
    h. "auc": area under the ROC curve, for ranking evaluation
    i. "ndcg": Normalized Discounted Cumulative Gain
    j. "map": mean average precision
    k. "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
    l. "ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP score a list without any positive samples as 1; adding "-" to the metric makes XGBoost score it as 0 instead, to be consistent under some conditions
    m. "poisson-nloglik": negative log-likelihood for Poisson regression
    n. "gamma-nloglik": negative log-likelihood for gamma regression
    o. "gamma-deviance": residual deviance for gamma regression
    p. "tweedie-nloglik": negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)

  4. seed [ default=0 ]

    random number seed.

  • Using the native API

      #!/usr/bin/python
      import numpy as np
      import xgboost as xgb
      ###
      # advanced: customized loss function
      #
      print('start running example to use a customized objective function')
      
      dtrain = xgb.DMatrix('../data/agaricus.txt.train')
      dtest = xgb.DMatrix('../data/agaricus.txt.test')
      
      # note: for customized objective function, we leave objective as default
      # note: what we are getting is margin value in prediction
      # you must know what you are doing
      param = {'max_depth': 2, 'eta': 1, 'silent': 1}
      watchlist = [(dtest, 'eval'), (dtrain, 'train')]
      num_round = 2
      
      # user define objective function, given prediction, return gradient and second order gradient
      # this is log likelihood loss
      def logregobj(preds, dtrain):
          labels = dtrain.get_label()
          preds = 1.0 / (1.0 + np.exp(-preds))
          grad = preds - labels
          hess = preds * (1.0 - preds)
          return grad, hess
      
      # user defined evaluation function, return a pair metric_name, result
      # NOTE: when you do customized loss function, the default prediction value is margin
      # this may make builtin evaluation metric not function properly
      # for example, we are doing logistic loss, the prediction is score before logistic transformation
      # the builtin evaluation error assumes input is after logistic transformation
      # Take this in mind when you use the customization, and maybe you need write customized evaluation function
      def evalerror(preds, dtrain):
          labels = dtrain.get_label()
          # return a pair metric_name, result. The metric name must not contain a colon (:) or a space
          # since preds are margin(before logistic transformation, cutoff at 0)
          return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)
      
      # training with customized objective, we can also do step by step training
      # simply look at xgboost.py's implementation of train
      bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
    
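As the comments in the example warn, with a custom objective the predictions returned are raw margin scores, not probabilities. For the logistic-loss setup above, the sigmoid used inside logregobj recovers probabilities from margins; a small standalone sketch:

```python
import numpy as np

def margin_to_prob(margin):
    """Map raw margin scores to probabilities via the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-np.asarray(margin, dtype=float)))

# A margin of 0 corresponds to probability 0.5, the usual decision boundary
# (which is why evalerror above thresholds at preds > 0.0)
print(margin_to_prob([-2.0, 0.0, 2.0]))
```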

LightGBM

Reference: http://lightgbm.apachecn.org/#/docs/6

  • Example

      # coding: utf-8
      import lightgbm as lgb
      import pandas as pd
      from sklearn.metrics import mean_squared_error
      
      
      # Load the datasets
      print('Load data...')
      df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
      df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')
      
      # Set up the training and test sets
      y_train = df_train[0].values
      y_test = df_test[0].values
      X_train = df_train.drop(0, axis=1).values
      X_test = df_test.drop(0, axis=1).values
      
      # Build LightGBM Dataset objects
      lgb_train = lgb.Dataset(X_train, y_train)
      lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
      
      # Settle on a set of parameters
      params = {
          'task': 'train',
          'boosting_type': 'gbdt',
          'objective': 'regression',
          'metric': {'l2', 'auc'},
          'num_leaves': 31,
          'learning_rate': 0.05,
          'feature_fraction': 0.9,
          'bagging_fraction': 0.8,
          'bagging_freq': 5,
          'verbose': 0
      }
      
      print('Start training...')
      # Train
      gbm = lgb.train(params,
                      lgb_train,
                      num_boost_round=20,
                      valid_sets=lgb_eval,
                      early_stopping_rounds=5)
      
      # Save the model
      print('Saving model...')
      # Save the model to a file
      gbm.save_model('model.txt')
      
      print('Start predicting...')
      # Predict
      y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
      # Evaluate
      print('The RMSE of the prediction is:')
      print(mean_squared_error(y_test, y_pred) ** 0.5)
    
  • Simple usage (sklearn-style API)

      # coding: utf-8
      import lightgbm as lgb
      import pandas as pd
      from sklearn.metrics import mean_squared_error
      
      # Load the data
      print('Load data...')
      df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
      df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')
      
      # Extract features and labels
      y_train = df_train[0].values
      y_test = df_test[0].values
      X_train = df_train.drop(0, axis=1).values
      X_test = df_test.drop(0, axis=1).values
      
      print('Start training...')
      # Initialize an LGBMRegressor directly
      # This LightGBM regressor behaves essentially like any other sklearn regressor
      gbm = lgb.LGBMRegressor(objective='regression',
                              num_leaves=31,
                              learning_rate=0.05,
                              n_estimators=20)
      
      # Fit with the fit function
      gbm.fit(X_train, y_train,
              eval_set=[(X_test, y_test)],
              eval_metric='l1',
              early_stopping_rounds=5)
      
      # Predict
      print('Start predicting...')
      y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
      # Evaluate the predictions
      print('The RMSE of the prediction is:')
      print(mean_squared_error(y_test, y_pred) ** 0.5)
    
  • Plotting

      # coding: utf-8
      import lightgbm as lgb
      import pandas as pd
      
      try:
          import matplotlib.pyplot as plt
      except ImportError:
          raise ImportError('You need to install matplotlib for plotting.')
      
      # Load the datasets
      print('Load data...')
      df_train = pd.read_csv('./data/regression.train.txt', header=None, sep='\t')
      df_test = pd.read_csv('./data/regression.test.txt', header=None, sep='\t')
      
      # Extract features and labels
      y_train = df_train[0].values
      y_test = df_test[0].values
      X_train = df_train.drop(0, axis=1).values
      X_test = df_test.drop(0, axis=1).values
      
      # Build LightGBM Dataset objects
      lgb_train = lgb.Dataset(X_train, y_train)
      lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
      
      # Set the parameters
      params = {
          'num_leaves': 5,
          'metric': ('l1', 'l2'),
          'verbose': 0
      }
      
      evals_result = {}  # to record eval results for plotting
      
      print('Start training...')
      # Train
      gbm = lgb.train(params,
                      lgb_train,
                      num_boost_round=100,
                      valid_sets=[lgb_train, lgb_test],
                      feature_name=['f' + str(i + 1) for i in range(28)],
                      categorical_feature=[21],
                      evals_result=evals_result,
                      verbose_eval=10)
      
      print('Plotting metrics recorded during training...')
      ax = lgb.plot_metric(evals_result, metric='l1')
      plt.show()
      
      print('Plotting feature importances...')
      ax = lgb.plot_importance(gbm, max_num_features=10)
      plt.show()
      
      print('Plotting the 84th tree...')
      ax = lgb.plot_tree(gbm, tree_index=83, figsize=(20, 8), show_info=['split_gain'])
      plt.show()
      
      # print('Plotting the 84th tree with graphviz...')
      # graph = lgb.create_tree_digraph(gbm, tree_index=83, name='Tree84')
      # graph.render(view=True)
    

When using these tools, a parameter cheat sheet is a handy companion.
