
Machine Learning and Artificial Intelligence Tech Notes - Chapter 12: Machine Learning Frameworks

Chapter 12: Machine Learning Frameworks



12. Machine Learning Frameworks

12.1 General Machine Learning Frameworks

12.1.1 A Brief Look at Typical Scenarios

Machine learning is applied in almost every area of daily life; the most typical scenarios include:

12.1.2 System Workflow

A typical modeling system workflow is as follows:


Taking advertising and recommender systems as examples:

12.1.3 System Architecture

A typical machine learning framework looks like this:



12.2 Introduction to the NNI AutoML Framework

12.2.1 An Overview of AutoML


The figure above is taken from the paper "Taking the Human out of Learning Applications: A Survey on Automated Machine Learning".
A classic machine learning project boils down to a few steps: define the problem, collect data, extract features, select a model, train and evaluate it, and deploy it online. The hope behind AutoML tools is that the feature extraction, model selection, and training/evaluation steps can be handled end to end by a single framework; among these, feature extraction is the most demanding in importance, complexity, and difficulty. The toy sketch below illustrates the steps being automated.
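As a toy illustration (not from the survey above), the model-selection and hyperparameter-search steps collapse into one scored loop over candidate models; the candidate lists here are arbitrary assumptions, and AutoML frameworks replace this brute-force loop with smarter search strategies:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  # stand-in data for the "collect data / extract features" steps
  X, y = make_classification(n_samples=500, n_features=20, random_state=0)

  # candidate models x hyperparameters = the search space
  candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 1.0, 10.0)]
  candidates += [RandomForestClassifier(n_estimators=n, random_state=0) for n in (50, 200)]

  # "train and evaluate" every candidate; keep the best by cross-validated score
  best = max(candidates, key=lambda m: cross_val_score(m, X, y, cv=3).mean())
  print(best)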
AutoML can be formally defined as follows:

Its overall framework is roughly as follows:


For classic machine learning problems:

For deep learning problems, the most classic approach is to search for an optimal network architecture via Neural Architecture Search (NAS); there is a good collection of resources on this topic.

12.2.2 Introduction to the NNI Framework

NNI (Neural Network Intelligence) is Microsoft's open-source AutoML toolkit. An NNI experiment is built from a few core concepts:
1. Experiment configuration: each experiment is described by a YAML file (here, config.yml); a typical example:

  authorName: zhanglei
  experimentName: auto-catboost
  # maximum number of trials running concurrently
  trialConcurrency: 10
  # maximum duration of the experiment
  maxExecDuration: 1h
  # maximum number of trials
  maxTrialNum: 1000
  # training platform; options: local, remote, pai
  trainingServicePlatform: local
  # definition of the parameter/architecture search space
  searchSpacePath: search_space.json
  # if false, the search space must be defined in the JSON file above;
  # if true, the search space is defined in code via annotations, e.g.:
  # ......
  # """@nni.variable(nni.choice(0.1, 0.5), name=dropout_rate)"""
  # dropout_rate = 0.5
  # ......
  # meaning the variable dropout_rate can take one of two values: 0.1 or 0.5
  useAnnotation: false
  tuner:
    # search strategy for parameters/architectures; options include:
    # TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner, SMAC, etc.
    # (some require separate installation)
    builtinTunerName: TPE
    classArgs:
      # whether to maximize or minimize the objective: maximize, minimize
      optimize_mode: maximize
  trial:
    # directory of the trial code, the command to run it, and GPU configuration
    command: python3 catboost_trainer.py
    codeDir: .
    gpuNum: 0

2. Search Space: the definition of the search space, either via a JSON file or via annotations in the code. A typical example:

  {
    "num_leaves": {"_type": "randint", "_value": [20, 150]},
    "learning_rate": {"_type": "choice", "_value": [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]},
    "bagging_fraction": {"_type": "uniform", "_value": [0.5, 1.0]},
    "feature_fraction": {"_type": "uniform", "_value": [0.5, 1.0]},
    "reg_alpha": {"_type": "choice", "_value": [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]},
    "reg_lambda": {"_type": "choice", "_value": [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]},
    "lambda_l1": {"_type": "uniform", "_value": [0, 10]},
    "lambda_l2": {"_type": "uniform", "_value": [0, 10]},
    "bagging_freq": {"_type": "choice", "_value": [1, 2, 4, 8, 10]}
  }

A _type of choice means the parameter is drawn from the candidate values listed in _value;
a _type of randint means the parameter is an integer between the lower and upper bounds given in _value;
a _type of uniform means the parameter is drawn from a uniform distribution between the bounds given in _value.
Beyond these, quniform, loguniform, qloguniform, normal, qnormal, lognormal, and qlognormal distributions are also available, as the sampling sketch below illustrates.
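A minimal Python sketch (an illustration, not NNI's actual implementation) of drawing one random sample from a search space of this form; it assumes randint's upper bound is exclusive, following NNI's documentation:

  import random

  # a subset of the search space above
  space = {
      "num_leaves": {"_type": "randint", "_value": [20, 150]},
      "learning_rate": {"_type": "choice", "_value": [0.01, 0.05, 0.1]},
      "bagging_fraction": {"_type": "uniform", "_value": [0.5, 1.0]},
  }

  def sample(space):
      params = {}
      for name, spec in space.items():
          t, v = spec["_type"], spec["_value"]
          if t == "choice":
              params[name] = random.choice(v)                # pick one candidate
          elif t == "randint":
              params[name] = random.randint(v[0], v[1] - 1)  # integer in [low, high)
          elif t == "uniform":
              params[name] = random.uniform(v[0], v[1])      # uniform draw in [low, high]
      return params

  print(sample(space))  # e.g. {'num_leaves': 97, 'learning_rate': 0.05, ...}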
3. Tuner: the search strategy for parameters or architectures; it generates a parameter set for each Trial. Besides the built-in tuner algorithms, a custom tuner can also be defined, for example:

  from nni.tuner import Tuner

  # a custom tuner must inherit from the Tuner base class
  class CustomizedTuner(Tuner):
      def __init__(self, ...):
          ...

      def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
          '''
          Receives the final metric of one trial; value can be a dict
          (which must contain the default key) or a single number.
          parameter_id: int
          parameters: generated by 'generate_parameters()'
          '''
          # your implementation
          ...

      def generate_parameters(self, parameter_id, **kwargs):
          '''
          Generates the parameters for one trial, stored in serialized form.
          parameter_id: int
          '''
          # your implementation
          return your_parameters
      ...

To use it, specify it under the tuner section of the configuration file, for example:

  tuner:
    # directory of the tuner code
    codeDir: /home/abc/mytuner
    # file and class name of the custom tuner
    classFileName: my_customized_tuner.py
    className: CustomizedTuner
    # constructor arguments of the custom tuner
    classArgs:
      arg1: value1

4. Trial: one attempt at training a model. A trial initializes the model with the parameters generated by the Tuner, trains it, and reports the final metric. Stripped of everything else, a trial script only has to make two NNI calls, as the sketch right below shows; a full CatBoost AutoML example then follows.
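A minimal trial sketch (train_and_eval is a hypothetical placeholder for your own training code):

  import nni

  def train_and_eval(params):
      # hypothetical stand-in: train a model with `params` and return its metric
      return 0.5

  if __name__ == '__main__':
      params = nni.get_next_parameter()  # parameter set produced by the tuner
      metric = train_and_eval(params)
      nni.report_final_result(metric)    # the value the tuner optimizes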
1) Define the CatBoost model class:

  # coding=UTF-8
  """
  class CatBoostModel
  """
  import gc
  import numpy as np
  import pandas as pd
  import catboost as cb
  from catboost import Pool
  from sklearn.model_selection import train_test_split
  from sklearn.model_selection import StratifiedShuffleSplit
  from sklearn.preprocessing import LabelEncoder

  class CatBoostModel():
      def __init__(self, **kwargs):
          assert kwargs['catboost_params']
          assert kwargs['eval_ratio']
          assert kwargs['early_stopping_rounds']
          assert kwargs['num_boost_round']
          assert kwargs['cat_features']
          assert kwargs['all_features']
          self.catboost_params = kwargs['catboost_params']
          self.eval_ratio = kwargs['eval_ratio']
          self.early_stopping_rounds = kwargs['early_stopping_rounds']
          self.num_boost_round = kwargs['num_boost_round']
          self.cat_features = kwargs['cat_features']
          self.all_features = kwargs['all_features']
          self.selected_features_ = None
          self.X = None
          self.y = None
          self.model = None

      def fit(self, X, y, **kwargs):
          """
          Fit the model on the training data.

          Parameters
          ----------
          X : array-like numpy matrix
              The training input samples, of shape [n_samples, n_features].
          y : array-like numpy matrix
              The target values (class labels in classification, real numbers
              in regression), of shape [n_samples].
          catboost_params : dict
              Parameters of catboost.
          eval_ratio : float
              The ratio used to split the eval data from self.X.
          early_stopping_rounds : int
              The early-stopping setting of catboost.
          num_boost_round : int
              num_boost_round of catboost.
          """
          self.X = X
          self.y = y
          # hold out eval_ratio of the data for early stopping
          X_train, X_eval, y_train, y_eval = train_test_split(self.X,
                                                              self.y,
                                                              test_size=self.eval_ratio,
                                                              random_state=41)
          catboost_train = Pool(data=X_train, label=y_train, cat_features=self.cat_features, feature_names=self.all_features)
          catboost_eval = Pool(data=X_eval, label=y_eval, cat_features=self.cat_features, feature_names=self.all_features)
          self.model = cb.train(params=self.catboost_params,
                                pool=catboost_train,
                                num_boost_round=self.num_boost_round,
                                eval_set=catboost_eval,
                                early_stopping_rounds=self.early_stopping_rounds)
          self.feature_importance = self.get_fea_importance(self.model, self.all_features)

      def get_selected_features(self, topk):
          """
          Return the indices of the topk most important features.
          """
          assert topk > 0
          self.selected_features_ = self.feature_importance['Importance'].values.argsort()[-topk:][::-1]
          return self.selected_features_

      def predict(self, X, ntree_end=0):
          return self.model.predict(X, ntree_end=ntree_end)

      def get_fea_importance(self, clf, columns):
          importances = clf.feature_importances_
          indices = np.argsort(importances)[::-1]
          importance_list = []
          for f in range(len(columns)):
              importance_list.append((columns[indices[f]], importances[indices[f]]))
              print("%2d) %-*s %f" % (f + 1, 30, columns[indices[f]], importances[indices[f]]))
          print("another feature importances with prettified=True\n")
          print(clf.get_feature_importance(prettified=True))
          importance_df = pd.DataFrame(importance_list, columns=['Features', 'Importance'])
          return importance_df

      def train_test_split(self, X, y, test_size, random_state=2020):
          # stratified split that preserves the label distribution
          sss = list(StratifiedShuffleSplit(
              n_splits=1, test_size=test_size, random_state=random_state).split(X, y))
          X_train = np.take(X, sss[0][0], axis=0)
          X_eval = np.take(X, sss[0][1], axis=0)
          y_train = np.take(y, sss[0][0], axis=0)
          y_eval = np.take(y, sss[0][1], axis=0)
          return [X_train, X_eval, y_train, y_eval]

      def catboost_model_train(self,
                               df,
                               finetune=None,
                               target_name='Label',
                               id_index='Id'):
          df = df.loc[df[target_name].isnull() == False]
          feature_name = [i for i in df.columns if i not in [target_name, id_index]]
          for i in feature_name:
              if i in self.cat_features:
                  # keep low-cardinality categorical features as categories,
                  # label-encode the high-cardinality ones
                  if df[i].fillna('na').nunique() < 12:
                      df.loc[:, i] = df.loc[:, i].fillna('na').astype('category')
                  else:
                      df.loc[:, i] = LabelEncoder().fit_transform(df.loc[:, i].fillna('na').astype(str))
                  if not isinstance(df.loc[0, i], (str, int)):
                      df.loc[:, i] = df.loc[:, i].astype(str)
          X_train, X_eval, y_train, y_eval = self.train_test_split(df[feature_name],
                                                                   df[target_name].values,
                                                                   self.eval_ratio,
                                                                   41)
          del df
          gc.collect()
          catboost_train = Pool(data=X_train, label=y_train, cat_features=self.cat_features, feature_names=self.all_features)
          catboost_eval = Pool(data=X_eval, label=y_eval, cat_features=self.cat_features, feature_names=self.all_features)
          self.model = cb.train(params=self.catboost_params,
                                init_model=finetune,
                                pool=catboost_train,
                                num_boost_round=self.num_boost_round,
                                eval_set=catboost_eval,
                                verbose_eval=50,
                                plot=True,
                                early_stopping_rounds=self.early_stopping_rounds)
          self.feature_importance = self.get_fea_importance(self.model, self.all_features)
          metrics = self.model.eval_metrics(data=catboost_eval, metrics=['AUC'], plot=True)
          print('AUC values:{}'.format(np.array(metrics['AUC'])))
          return self.feature_importance, metrics, self.model

2) Define the model training entry point:

  # coding=UTF-8
  import logging
  import os
  import os.path
  import gc
  import pandas as pd
  import nni
  from sklearn.metrics import roc_auc_score
  from models.auto_catboost.catboost_model import CatBoostModel
  from tools.feature_utils import write_feature_importance
  from feature_engineering.feature_data_processing.dataset_formater import read_columns2list
  from tools.feature_utils import name2feature, get_default_parameters, cat_fea_cleaner
  from tools.CONST import *

  logger = logging.getLogger('auto_catboost')

  def trainer_and_tester_run(feature_file_name,
                             train_file_name,
                             test_file_name_list,
                             feature_imp_name):
      '''
      Train the CatBoost model in batches.
      '''
      fea = read_columns2list(feature_file_name, 1)
      cat_fea = [item for item in fea if item.startswith('C')]
      chunker = pd.read_csv(train_file_name,
                            sep="\t",
                            chunksize=10000000,
                            low_memory=False,
                            header=0,
                            usecols=[ColumnType.TARGET_NAME] + fea)
      # fetch the parameter set from the tuner
      RECEIVED_PARAMS = nni.get_next_parameter()
      logger.debug(RECEIVED_PARAMS)
      PARAMS = get_default_parameters('catboost')
      PARAMS.update(RECEIVED_PARAMS)
      logger.debug(PARAMS)
      cb = CatBoostModel(catboost_params=PARAMS,
                         eval_ratio=0.33,
                         early_stopping_rounds=20,
                         cat_features=cat_fea,
                         all_features=fea,
                         num_boost_round=1000)
      logger.debug("The training process is starting...")
      clf = None
      # the data is too large to fit in memory, so train chunk by chunk
      for df in chunker:
          df = cat_fea_cleaner(df, ColumnType.TARGET_NAME, ColumnType.ID_INDEX, cat_fea)
          feature_imp, val_score, clf = \
              cb.catboost_model_train(df,
                                      clf,
                                      target_name=ColumnType.TARGET_NAME,
                                      id_index=ColumnType.ID_INDEX)
          logger.info(feature_imp)
          logger.info(val_score)
          write_feature_importance(feature_imp,
                                   feature_file_name,
                                   feature_imp_name, False)
          del df
          gc.collect()
      logger.debug("The training process is ended.")
      if len(test_file_name_list) == 0:
          logger.debug("No testing file is found.")
          return
      av_auc = 0
      for fname in test_file_name_list:
          av_auc = av_auc + inference(clf, fea, cat_fea, fname)
      av_auc = av_auc / len(test_file_name_list)
      nni.report_final_result(av_auc)

  def inference(clf, fea, cat_fea, test_file_name):
      '''
      Run CatBoost model prediction on a test file.
      '''
      if not os.path.exists(test_file_name):
          logger.error("the file {0} does not exist.".format(test_file_name))
          return 0
      logger.debug("The testing process is starting...")
      try:
          df = pd.read_csv(test_file_name,
                           sep="\t",
                           header=0,
                           usecols=[ColumnType.TARGET_NAME] + fea)
          df = cat_fea_cleaner(df, ColumnType.TARGET_NAME, ColumnType.ID_INDEX, cat_fea)
          y_pred = clf.predict(df[fea])
          auc = roc_auc_score(df[ColumnType.TARGET_NAME].values, y_pred)
          print("{0}'s auc of prediction:{1}".format(os.path.split(test_file_name)[1], auc))
          del df
          gc.collect()
          logger.debug("The inference process is ended.")
          return auc
      except ValueError:
          logger.error("inference error with file:{0}".format(test_file_name))
          return 0

  def run_offline():
      '''
      Offline model training.
      '''
      base_dir = '/home/liyiran/PycharmProjects/DeepRisk/data/fresh.car/'
      train_file_name = base_dir + 'tt'
      test_file_name_list = [base_dir + 'outer_test_2019-01.tsv']
      feature_file_name = base_dir + 'features.dict'
      feature_imp_name = base_dir + 'features.imp'
      trainer_and_tester_run(feature_file_name, train_file_name, test_file_name_list, feature_imp_name)

  if __name__ == '__main__':
      run_offline()

5. Assessor: applies an early-stopping strategy to judge whether a trial should be terminated before it finishes training. Like the Tuner, the Assessor can be a built-in one or customized; a sketch follows.
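A minimal custom-assessor sketch, assuming NNI v1's nni.assessor API and plain numeric intermediate results; the 10-step warm-up and the 0.6 threshold are arbitrary illustrative choices:

  from nni.assessor import Assessor, AssessResult

  class CustomizedAssessor(Assessor):
      def assess_trial(self, trial_job_id, trial_history):
          # trial_history holds the intermediate results reported so far;
          # returning Bad tells NNI to stop this trial early
          if len(trial_history) >= 10 and max(trial_history) < 0.6:
              return AssessResult.Bad
          return AssessResult.Good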

Once all components are in place, the experiment is managed from the command line with nnictl, e.g. stopping any running experiment and starting a new one with the WebUI on port 8070:

  nnictl stop
  nnictl create --config models/auto_catboost/config.yml -p 8070

1. After launch, the console shows an overview of the experiment, including the WebUI address:


2. The WebUI home page shows the current status of the experiment, including its parameters, running time, the current best model, and the top-10 best trials.




3. The detail page shows the progress of the hyperparameter search, plus each trial's running time, execution logs, and parameters. One drawback is that viewing logs is inconvenient: the log path has to be copied and opened on the host machine. Debugging is also somewhat awkward.






All in all, NNI is an excellent AutoML tool with fairly complete documentation, including a Chinese version. This chapter is only a starting point; I hope the framework keeps improving, especially on automated feature engineering, and that readers will contribute to it as well.
