qijiezhao/anti-fraud-prediction

Project: CreditHC@Anti-fraud-prediction

Authors: 吕新建, 吕海利, 赵祈杰, 董兴华 (Xinjian Lv, Haili Lv, Qijie Zhao, Xinghua Dong)

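item                usage

code/               code and run commands
config/             configuration files: hyperparameters and training/cross-validation settings
feat_range/         saved information for each feature column
log/                run logs
tmp_results/        saved temporary variables
tmp_models/         saved temporary trained models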

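The files in code/ are described below:

code file                 usage

main.py                   entry point that runs the pipeline
config.py                 reads the parameters from the configuration file
classifier.py             model definitions: model parameters, grid search ranges, cross-validation output, etc.
data_file_preprocess.py   preprocesses the given data file
feat_engineer.py          analyzes the features after training finishes
deep_model.py             (to be updated)
log.py                    creates the log file
information_value.py      WOE encoding
utils.py                  feature transformation and feature engineering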
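
information_value.py is described only as the WOE encoding file, and its contents are not shown here. As a rough illustration of the general technique, the sketch below computes weight of evidence (WOE) and information value (IV) for one categorical feature using only the standard library. The function name `woe_iv`, the smoothing count `eps`, and the sign convention (WOE = ln(share of non-fraud / share of fraud)) are assumptions made for this example, not the project's API.

```python
import math
from collections import Counter

def woe_iv(values, labels, eps=0.5):
    """Return ({category: woe}, iv) for one categorical feature.

    labels: 1 = fraud (event), 0 = non-fraud (non-event).
    eps is a hypothetical smoothing count that avoids log(0) when a
    category contains only one class.
    """
    events = Counter(v for v, y in zip(values, labels) if y == 1)
    non_events = Counter(v for v, y in zip(values, labels) if y == 0)
    categories = set(events) | set(non_events)
    total_event = sum(events.values()) + eps * len(categories)
    total_non_event = sum(non_events.values()) + eps * len(categories)

    woe, iv = {}, 0.0
    for c in categories:
        p_event = (events[c] + eps) / total_event
        p_non_event = (non_events[c] + eps) / total_non_event
        woe[c] = math.log(p_non_event / p_event)
        iv += (p_non_event - p_event) * woe[c]
    return woe, iv

# Toy data: category "a" is mostly non-fraud, category "b" is mostly fraud.
values = ["a", "a", "a", "b", "b", "b"]
labels = [0, 0, 1, 1, 1, 0]
woe, iv = woe_iv(values, labels)
```

Under this convention a positive WOE marks a category with relatively fewer fraud cases, and a larger IV suggests a more informative feature.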

Before the first run, prepare the data. path/to/data is the given data file. The steps are:

(pwd: the project root directory)
-> mkdir data
-> cp path/to/data data/
-> cd code
-> python data_file_preprocess.py --data_file=../data/xxx.csv

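The configuration file is config/model.config. Its parameters are described below:

parameter             description
file_path:            path to the data file
is_sample:            a fraction such as 0.1 samples 10% of the data; None disables sampling
is_check_feat:        whether to check if each variable is continuous; now read from the config, default True
trans_numerical_type: transformation method for numerical data
trans_discrete_type:  transformation method for categorical data
classifier:           model to use, e.g. xgboost, randomforest
if_grid_search:       whether to run grid search
if_save_feat_range:   whether to save the transformed values during feature transformation
if_train_all:         whether to run one train/test pass after grid search and cross validation
metrics:              evaluation metric used for grid search / cross validation: f1, recall, precision, roc_auc, etc.
if_deep:              whether to use a deep model; default no, the code is not fully tested
delete_feat_post:     useless features to delete from pre-loan data
delete_feat_pre:      useless features to delete from post-loan data
if_scale:             whether to apply feature smoothing during feature engineering
if_cross_validation:  whether to run cross validation
data_types:           all data types that appear among the data features
month_data:           English month names, used to parse time information in the data
time_feat_type_1/2/3: all time features that appear in the data: timestamps plus two other time formats
onehot_feat:          names of the features to one-hot encode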
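
To make the is_sample setting concrete, here is a minimal sketch of how a fraction-or-None value could be applied to a pandas DataFrame. The helper name `maybe_sample` and the `seed` argument are hypothetical; the project's own sampling code is not shown in this README and may differ.

```python
import pandas as pd

def maybe_sample(dframe, is_sample=None, seed=0):
    """Return a random fraction of the rows, or the frame unchanged if is_sample is None."""
    if is_sample is None:
        return dframe
    # frac=0.1 keeps 10% of the rows; random_state makes the draw repeatable
    return dframe.sample(frac=is_sample, random_state=seed)

# Toy frame with 100 rows; 'isfraud_a' is the target column name used later in this README.
dframe = pd.DataFrame({"feat": range(100), "isfraud_a": [0, 1] * 50})
sampled = maybe_sample(dframe, is_sample=0.1)
unsampled = maybe_sample(dframe, is_sample=None)
```

Sampling a fraction is mainly useful for grid search, where many parameter combinations are trained and a smaller data set shortens each run.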

Once the configuration file is set up, just run main.py.

Reference commands:

# grid search
-> python main.py --if_train_all=False --if_grid_search=True --if_cross_validation=False --is_sample=0.2 --classifier=xgboost

# cross_validation
-> python main.py --if_train_all=False --if_grid_search=False --if_cross_validation=True --classifier=xgboost

# train_all
-> python main.py --if_train_all=True --if_grid_search=False --if_cross_validation=False --classifier=xgboost
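
The commands above pass booleans as text (for example `--if_train_all=False`). How main.py parses them is not shown in this README. With argparse, `type=bool` would turn the non-empty string "False" into True, so an explicit converter is one way to handle it. The sketch below assumes hypothetical helpers `str2bool`, `optional_float` and `build_parser`, and hypothetical defaults.

```python
import argparse

def str2bool(text):
    """Convert 'True'/'False' style text to a bool; reject anything else."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % text)

def optional_float(text):
    """Convert 'None' to None and anything else to a float."""
    return None if text.strip().lower() == "none" else float(text)

def build_parser():
    # Option names mirror the reference commands; defaults are assumptions.
    parser = argparse.ArgumentParser(description="hypothetical main.py options")
    parser.add_argument("--if_train_all", type=str2bool, default=False)
    parser.add_argument("--if_grid_search", type=str2bool, default=False)
    parser.add_argument("--if_cross_validation", type=str2bool, default=False)
    parser.add_argument("--is_sample", type=optional_float, default=None)
    parser.add_argument("--classifier", default="xgboost")
    return parser

# Same flags as the grid search reference command.
args = build_parser().parse_args([
    "--if_train_all=False", "--if_grid_search=True",
    "--if_cross_validation=False", "--is_sample=0.2", "--classifier=xgboost",
])
```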

API usage: for the server side to call this framework (training), follow these steps:

# example code, assuming the data file has already been processed by data_file_preprocess

(pwd=/anti_fraud_prediction)

import numpy as np
import pandas as pd

from code.utils import *
from code.log import *
from code.classifier import *
from code.config import financial_forecasting

file_path='xxx/xx/x'  # placeholder path to the preprocessed csv
dframe=pd.read_csv(file_path)

# is_sample is assumed to be exposed by the config object like the other parameters
is_sample=financial_forecasting.is_sample
trans_param={'trans_numerical_type':financial_forecasting.trans_numerical_type,
             'trans_discrete_type':financial_forecasting.trans_discrete_type,
             'if_save_feat_range':financial_forecasting.if_save_feat_range}
dframe,dcol_names,dcol_types,target_name,len_feat,len_samples=get_cont_attribute(dframe,is_sample)
data_feats=Transform_feat(dframe,target_name,dcol_names,dcol_types,len_feat,len_samples,trans_param)

model_classify('xgboost',data_feats,np.asarray(dframe[target_name]),'average_precision')

In production, call the trained model to make predictions:

Assume the test sample is one row, or several rows, of the csv file used for training.

(pwd:/code)

# -*- coding: utf-8 -*-
from utils import *
import pandas as pd
from classifier import *

#--------------------------------------------init required path of models--------------------------------------------#
data_path='../data/SJZH_JJ_RH_BR_TD_DH_new_filtered.csv'
trans_m_path='../tmp_results/tmp_trans_opes.m'
model_path='../tmp_models/tmp_xgb.m'

#-----------------------------------------------load the trained model-----------------------------------------------#
model=xgbt()
model.Load_model(model_path)
trans_m=Test_trans(trans_m_path)

#-----------------------------------------------predict a single sample----------------------------------------------#
x=100 # index of an arbitrarily chosen sample
data1=pd.read_csv(data_path).loc[x]
data1=data1.drop('isfraud_a')
log.l.info('the tested data shape is {}'.format(data1.shape))
trans_data1=trans_m.trans_single_load(data1).reshape(1,-1) # a single sample must be reshaped to 1 x len

pre_out_1=model.Predict(trans_data1)

#-----------------------------------------------predict multiple samples----------------------------------------------#
x_start=100;num=20 # select the samples from (x_start) to (x_start + num)
data2=pd.read_csv(data_path).loc[x_start:x_start+num]
data2=data2.drop('isfraud_a',axis=1)
log.l.info('the tested data shape is {}'.format(data2.shape))
trans_data2=trans_m.trans_multiple_load(data2,start=x_start)

pre_out_2=model.Predict(trans_data2)

wrapper_output(pre_out_2,t=10)

Items still to be completed, and where to add them:

1, data_file_preprocess.py

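The processed file should not contain non-numeric categorical data. If the input data file contains values such as Sichuan/Shandong or China/USA, add code here that converts them to numeric codes such as 0, 1, 2, 3, and put the feature name into the model.config file. config.py then reads that parameter, and utils.py performs the one-hot encoding.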
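
The two steps described above (integer-coding string categories during preprocessing, then one-hot encoding the configured features) could look roughly like the following sketch. It uses pandas `factorize` and `get_dummies` rather than the project's own helpers, and the column names and the `onehot_feat` value are made up for illustration.

```python
import pandas as pd

dframe = pd.DataFrame({
    "province": ["Sichuan", "Shandong", "Sichuan", "Hubei"],
    "amount": [100, 250, 80, 40],
})

# Step 1 (preprocessing): replace each distinct string with an integer code.
# factorize assigns codes in order of first appearance.
codes, categories = pd.factorize(dframe["province"])
dframe["province"] = codes

# Step 2 (feature engineering): one-hot encode the columns named in the config.
onehot_feat = ["province"]  # hypothetical value of the onehot_feat parameter
encoded = pd.get_dummies(dframe, columns=onehot_feat)
```

Keeping `categories` around is what allows the same string-to-code mapping to be reapplied to new data at prediction time.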

2, utils.py

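a) This file contains every step of feature engineering. To add a processing method for an individual feature, add it at the "TBA-to be added" marker in Transform_numerical(); at that point the specified features have already been saved, had their NaN values replaced, and been one-hot encoded. Alternatively, add a method for object-type features in Transform_discrete() (none is planned for now, because data_file_preprocess already handles object-type features). For processing that applies to all features, such as WOE or data smoothing, add it at the "TBA-to be added" marker in Transform_feat() (the feature matrix is out_data).

b) Storage of the feature transformations: currently a dictionary is used, where the key is the feature name and the value is a tuple (operation_name, operation). Six kinds of operation are defined: 'nothing', 'to_constant', 'time_1', 'time_2', 'time_3', 'onehot'. See the code for the details of each operation.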
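
The dictionary layout in b) can be pictured with the sketch below. Only the shape (feature name mapped to an (operation_name, operation) tuple) and the operation names come from this README; the payloads, the meaning assumed for 'to_constant' (fill missing values with a stored constant), and the helper `apply_operation` are hypothetical. The three time operations are left out because their formats are not specified here.

```python
import math

# feature name -> (operation_name, operation); payloads are hypothetical
trans_opes = {
    "age":      ("nothing", None),
    "income":   ("to_constant", 0.0),      # constant used for missing values
    "province": ("onehot", [0, 1, 2]),     # known integer codes
}

def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

def apply_operation(name, value, opes):
    """Transform one raw value into a list of numeric features."""
    op_name, op = opes[name]
    if op_name == "nothing":
        return [float(value)]
    if op_name == "to_constant":
        return [op if is_missing(value) else float(value)]
    if op_name == "onehot":
        return [1.0 if value == code else 0.0 for code in op]
    raise ValueError("unsupported operation: %s" % op_name)

# Replay the stored operations on one new sample.
sample = {"age": 31, "income": float("nan"), "province": 1}
row = []
for name in ("age", "income", "province"):
    row.extend(apply_operation(name, sample[name], trans_opes))
```

Storing the operations per feature is what lets a single new sample be transformed at prediction time exactly as the training data was.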

3, classifier.py

For example, after grid search, simply fill the resulting parameter values into that model's __init__().
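
The workflow of searching a parameter grid and then writing the winners back as constructor defaults can be sketched with the standard library alone. `ToyModel`, `toy_score` and the grid values are stand-ins invented for this example; the real model classes and the real cross-validated metric live in classifier.py and are not reproduced here.

```python
from itertools import product

class ToyModel:
    """Stand-in for a model class in classifier.py (hypothetical)."""
    def __init__(self, max_depth=3, learning_rate=0.1):
        # These defaults are where the grid search winners would be written back.
        self.max_depth = max_depth
        self.learning_rate = learning_rate

def toy_score(model):
    """Placeholder for a cross-validated metric; peaks at depth 5, rate 0.05."""
    return -abs(model.max_depth - 5) - abs(model.learning_rate - 0.05)

param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.05, 0.1]}

# Exhaustively try every combination and keep the best-scoring one.
best_score, best_params = float("-inf"), None
names = sorted(param_grid)
for combo in product(*(param_grid[n] for n in names)):
    params = dict(zip(names, combo))
    score = toy_score(ToyModel(**params))
    if score > best_score:
        best_score, best_params = score, params

# The winning values are then used as the model's constructor arguments.
tuned = ToyModel(**best_params)
```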

Tool scripts:

# read .npy data
import numpy as np
data=np.load('xxx.npy')

# read .m model,  including trans_model and classify_model
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
model=joblib.load('xxx.m')

# debugging script method:
from IPython import embed
embed()  # place in code to open an interactive shell at that point
# or run `ipython` in a terminal

TBA