
The Ultimate Beginner's Guide to Winning Classification Hackathons with a Data Science Use Case

Author: VETRIVEL_PS | Compiled by: Flin | Source: analyticsvidhya

Overview

  • This is the second part of my blog series on Analytics Vidhya: the ultimate beginner's guide to finishing in the top 10% of a machine learning hackathon.

  • If you follow the simple steps listed in this article, winning a classification hackathon becomes relatively straightforward.

  • Keep learning, experiment consistently, and follow your intuition along with the knowledge you accumulate over time.

  • Starting as a hackathon beginner only a few months ago, I recently became a Kaggle Expert and one of the top 5 contributors in Analytics Vidhya's JanataHack hackathon series.

  • I am here to share my knowledge and guide beginners through a binary classification use case so they can compete with top hackers.

Let's delve into binary classification with the insurance cross-sell use case from Analytics Vidhya's JanataHack hackathon series, and experiment hands-on.

Link to the Cross-Sell Hackathon: https://datahack.analyticsvidhya.com/contest/janatahack-cross-sell-prediction/#ProblemStatement

Our client is an insurance company that provides health insurance to its customers. They need our help to build a model that predicts whether the policyholders (customers) from the past year will also be interested in the company's vehicle insurance.

An insurance policy is an arrangement by which a company guarantees compensation for specified loss, damage, illness, or death in return for a specified premium. The premium is the amount the customer pays to the insurance company on a regular basis.

For example, we may pay an annual premium of 5,000 rupees for health insurance with 200,000 rupees of coverage, so that if we fall ill that year and need hospitalization, the insurance company bears hospitalization costs of up to 200,000 rupees. Now, if you wonder how the company can cover such a high hospitalization cost when it only collects a 5,000-rupee premium, that is where the concept of probability comes into play.

For example, there may be 100 customers like us, each paying a 5,000-rupee premium (500,000 rupees in total), but only a few of them (say 2-3 people) get hospitalized that year. In this way, everyone shares the risk of everyone else.
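To make the risk-pooling arithmetic concrete, here is a tiny illustrative calculation in Python (the premium and coverage figures come from the example above; the claim count of 2 is an assumption within the stated 2-3 range):

# Illustrative risk-pooling arithmetic (numbers from the example above)
premium_per_customer = 5_000       # rupees paid per customer per year
coverage_per_claim   = 200_000     # maximum hospitalization cover in rupees
n_customers          = 100
n_claims             = 2           # assumed: 2-3 customers hospitalized in a year

premiums_collected = n_customers * premium_per_customer   # 500,000 rupees
max_payout         = n_claims * coverage_per_claim        # 400,000 rupees
print(premiums_collected, max_payout)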

Just like health insurance, vehicle insurance requires the customer to pay a certain premium to the insurance provider every year, so that in case the vehicle meets with an accident, the insurance provider compensates the customer (the "sum assured").

Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company, because it can then plan its communication strategy accordingly to reach out to those customers and optimize its business model and revenue.

Sharing my data science hackathon approach: how to finish in the top 10% among 20,000+ data enthusiasts

In Part 1, we learned 10 steps to repeat, optimize, and improve, which gives you a good foundation to get started.

Now that you have started practicing, let's test our skills on the insurance use case. Don't worry: with a few weeks of practice you will become good at handling any kind of hackathon (with tabular data). Stay enthusiastic, stay curious, and keep hacking through this data science journey!

10 simple steps to learn, practice, and win classification hackathons

1. Understand the problem statement and import packages and datasets

2. Perform EDA (exploratory data analysis): get to know the dataset. Explore the train and test data and understand what each column/feature means. Check whether the target column in the dataset is imbalanced

3. Check for duplicate rows in the training data

4. Fill/impute missing values: continuous - mean/median/any specific value | categorical - "Others"/forward fill/backward fill

5. Feature engineering: feature selection - select the most important existing features | feature creation - create new features from existing ones

6. Split the training data into features (independent variables) and target (dependent variable)

7. Data encoding - target encoding, one-hot encoding | Data scaling - MinMaxScaler, StandardScaler, RobustScaler

8. Create a baseline machine learning model for the binary classification problem

9. Improve the evaluation metric "ROC_AUC" with K-fold cross-validation and averaging, and predict the target "Response"

10. Submit the results, check the leaderboard, and improve "ROC_AUC"

See the GitHub link for the complete working Python code and outputs for learning and practice. Just make your own changes; it will be kept updated!

1. Understand the problem statement and import packages and datasets

Dataset description

| Variable | Description |
|---|---|
| id | Unique ID of the customer |
| Gender | Gender of the customer |
| Age | Age of the customer |
| Driving_License | 0: the customer does not have a driving license; 1: the customer already has a driving license |
| Region_Code | Unique code for the region of the customer |
| Previously_Insured | 1: the customer already has vehicle insurance; 0: the customer does not have vehicle insurance |
| Vehicle_Age | Age of the vehicle |
| Vehicle_Damage | 1: the customer had their vehicle damaged in the past; 0: the customer did not have their vehicle damaged in the past |
| Annual_Premium | Amount of premium the customer needs to pay for the year |
| Policy_Sales_Channel | Anonymized code of the channel used to reach the customer, i.e. different agents, by mail, by phone, in person, etc. |
| Vintage | Number of days the customer has been associated with the company |
| Response | 1: the customer is interested; 0: the customer is not interested |

Now, to predict whether a customer would be interested in vehicle insurance, we have information about demographics (gender, age, region code), the vehicle (vehicle age, damage), the policy (premium, sourcing channel), and so on.

Evaluation metric used to compare the performance of machine learning models across the hackathon

Here, we will use ROC_AUC as the evaluation metric.

The Receiver Operating Characteristic (ROC) curve is an evaluation measure for binary classification problems. It is a probability curve that plots the TPR (true positive rate) against the FPR (false positive rate) at various thresholds, essentially separating the "signal" from the "noise". The Area Under the Curve (AUC) measures a classifier's ability to distinguish between classes and is used as a summary of the ROC curve.

The higher the AUC, the better the model is at distinguishing between the positive and negative classes.

  • When AUC = 1, the classifier distinguishes all positive and negative class points correctly. If AUC were 0, however, the classifier would predict all negatives as positives and all positives as negatives.

  • When 0.5 < AUC < 1, there is a high chance that the classifier can distinguish positive class values from negative class values, because it detects more true positives and true negatives than false negatives and false positives.

  • When AUC = 0.5, the classifier cannot distinguish positive from negative class points, which means it predicts either a random class or a constant class for all data points.

  • Cross-sell: the training data contains 3,81,109 examples and the test data contains 1,27,037 examples. The data is heavily imbalanced: only 12.2% of the training examples (46,709 out of 3,81,109 customers) have a positive response.
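As a quick illustration of the metric (a toy example, not part of the competition code), roc_auc_score takes the true labels and the predicted probabilities of the positive class:

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.9]
print(roc_auc_score(y_true, y_prob))   # 1.0 here, since every positive outranks every negative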

Let's start by importing the required Python packages.

# Import Required Python Packages
# Scientific and Data Manipulation Libraries
import numpy as np
import pandas as pd

# Data Viz & Regular Expression Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn Pre-Processing Libraries
from sklearn.preprocessing import *

# Garbage Collection Libraries
import gc

# Boosting Algorithm Libraries
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Model Evaluation Metric & Cross Validation Libraries
from sklearn.metrics import roc_auc_score, auc, roc_curve
from sklearn.model_selection import StratifiedKFold, KFold

# Setting SEED to Reproduce Same Results even with "GPU"
seed_value = 1994
import os
os.environ['PYTHONHASHSEED'] = str(seed_value)
import random
random.seed(seed_value)
import numpy as np
np.random.seed(seed_value)
SEED=seed_value

  1. Scientific and data manipulation libraries: NumPy for handling numerical data and Pandas for handling tabular data.

  2. Data visualization libraries: Matplotlib and Seaborn for visualizing single or multiple variables.

  3. Data preprocessing, machine learning, and metrics libraries (scikit-learn): for preprocessing the data through encoding and scaling, and for evaluating models with metrics such as the ROC_AUC score.

  4. Boosting algorithms: XGBoost, CatBoost, and LightGBM tree-based classifier models for binary and multi-class classification.

  5. Setting the SEED: used to set the seed so that the same results can be reproduced every time.

2. Perform EDA (exploratory data analysis): get to know the dataset

# Loading data from train, test and submission csv files
train = pd.read_csv('../input/avcrosssellhackathon/train.csv')
test = pd.read_csv('../input/avcrosssellhackathon/test.csv')
sub = pd.read_csv('../input/avcrosssellhackathon/sample_submission.csv')

# Python Method 1 : Displays Data Information
def display_data_information(data, data_types, df):
    data.info()
    print("\n")
    for VARIABLE in data_types:
        data_type = data.select_dtypes(include=[VARIABLE]).dtypes
        if len(data_type) > 0:
            print(str(len(data_type)) + " " + VARIABLE + " Features\n" + str(data_type) + "\n")

# Display Data Information of "train" :
data_types = ["float32", "float64", "int32", "int64", "object", "category", "datetime64[ns]"]
display_data_information(train, data_types, "train")

# Display Data Information of "test" :
display_data_information(test, data_types, "test")

# Python Method 2 : Displays Data Head (Top Rows) and Tail (Bottom Rows) of the Dataframe (Table) :
def display_head_tail(data, head_rows, tail_rows):
    display("Data Head & Tail :")
    display(data.head(head_rows).append(data.tail(tail_rows)))
#     return True

# Displays Data Head (Top Rows) and Tail (Bottom Rows) of the Dataframe (Table)
# Pass Dataframe as "train", No. of Rows in Head = 3 and No. of Rows in Tail = 2 :
display_head_tail(train, head_rows=3, tail_rows=2)

# Python Method 3 : Displays Data Description using Statistics :
def display_data_description(data, numeric_data_types, categorical_data_types):
    print("Data Description :")
    display(data.describe(include=numeric_data_types))
    print("")
    display(data.describe(include=categorical_data_types))

# Display Data Description of "train" :
display_data_description(train, data_types[0:4], data_types[4:7])

# Display Data Description of "test" :
display_data_description(test, data_types[0:4], data_types[4:7])

Reading CSV data files: pandas' read_csv method reads a CSV file and converts it into a table-like data structure called a DataFrame. We therefore create 3 DataFrames for train, test, and submission.

Applying head and tail to the data: used to view the first 3 and last 2 rows to get an overview of the data.

Applying info to the data: used to display information about the DataFrame's columns, data types, and memory usage.

Applying describe to the data: used to display descriptive statistics of the columns, such as count, unique values, mean, minimum, maximum, and so on.

3. Check for duplicate rows in the training data

# Removes Data Duplicates while Retaining the First one
def remove_duplicate(data):
    data.drop_duplicates(keep="first", inplace=True)
    return "Checked Duplicates"

# Removes Duplicates from train data
remove_duplicate(train)

Check the training data for duplicates: duplicate rows are dropped while the first occurrence is kept. No duplicates were found in the training data.

4. Fill/impute missing values: continuous - mean/median/any specific value | categorical - "Others"/forward fill/backward fill

There are no missing values in the data, as the quick check sketched below confirms.
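As a sanity check, here is a minimal sketch (not part of the original notebook) that counts missing values per column and shows, in comments, how continuous and categorical columns could be imputed if any values were missing:

# Count missing values per column in train and test
print(train.isnull().sum())
print(test.isnull().sum())

# Hypothetical imputation, only needed if missing values existed:
# continuous column  -> fill with the median
# train['Annual_Premium'] = train['Annual_Premium'].fillna(train['Annual_Premium'].median())
# categorical column -> forward fill
# train['Vehicle_Age'] = train['Vehicle_Age'].fillna(method='ffill')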

5. Feature Engineering

# Check train data for Values of each Column - Short Form
for i in train:
    print(f'column {i} unique values {train[i].unique()}')

# Binary Classification Problem - Target has ONLY 2 Categories
# Target - Response has 2 Values of Customers 1 & 0
# Combine train and test data into single DataFrame - combine_set
combine_set = pd.concat([train, test], axis=0)

# Converting object to int type :
combine_set['Vehicle_Age'] = combine_set['Vehicle_Age'].replace({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
combine_set['Gender'] = combine_set['Gender'].replace({'Male': 1, 'Female': 0})
combine_set['Vehicle_Damage'] = combine_set['Vehicle_Damage'].replace({'Yes': 1, 'No': 0})
sns.heatmap(combine_set.corr())

# HOLD - CV - 0.8589 - BEST EVER
combine_set['Vehicle_Damage_per_Vehicle_Age'] = combine_set.groupby(['Region_Code', 'Age'])['Vehicle_Damage'].transform('sum')
# Score - 0.858657 (This Feature + Removed Scale_Pos_weight in LGBM) | Rank - 20
combine_set['Customer_Term_in_Years'] = combine_set['Vintage'] / 365
# combine_set['Customer_Term'] = (combine_set['Vintage'] / 365).astype('str')
# Score - 0.85855 | Rank - 20
combine_set['Vehicle_Damage_per_Policy_Sales_Channel'] = combine_set.groupby(['Region_Code', 'Policy_Sales_Channel'])['Vehicle_Damage'].transform('sum')
# Score - 0.858527 | Rank - 22
combine_set['Vehicle_Damage_per_Vehicle_Age'] = combine_set.groupby(['Region_Code', 'Vehicle_Age'])['Vehicle_Damage'].transform('sum')
# Score - 0.858510 | Rank - 23
combine_set["RANK"] = combine_set.groupby("id")['id'].rank(method="first", ascending=True)
combine_set["RANK_avg"] = combine_set.groupby("id")['id'].rank(method="average", ascending=True)
combine_set["RANK_max"] = combine_set.groupby("id")['id'].rank(method="max", ascending=True)
combine_set["RANK_min"] = combine_set.groupby("id")['id'].rank(method="min", ascending=True)
combine_set["RANK_DIFF"] = combine_set['RANK_max'] - combine_set['RANK_min']
# Score - 0.85838 | Rank - 15
combine_set['Vehicle_Damage_per_Vehicle_Age'] = combine_set.groupby(['Region_Code'])['Vehicle_Damage'].transform('sum')

# Data is left Skewed as we can see from below distplot
sns.distplot(combine_set['Annual_Premium'])

# Log-transform to reduce the skew in Annual_Premium
combine_set['Annual_Premium'] = np.log(combine_set['Annual_Premium'])
sns.distplot(combine_set['Annual_Premium'])

# Getting back Train and Test after Preprocessing :
train = combine_set[combine_set['Response'].isnull() == False]
test = combine_set[combine_set['Response'].isnull() == True].drop(['Response'], axis=1)
train.columns

6. Split the training data into features (independent variables) and target (dependent variable)

# Split the Train data into predictors and target :
predictor_train = train.drop(['Response','id'],axis=1)
target_train    = train['Response']
predictor_train.head()

# Get the Test data by dropping 'id' :
predictor_test = test.drop(['id'],axis=1)

7. Data encoding: target encoding

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf

    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)

# Score - 0.85857 | Rank -
# Note: the target comes from target_train, since 'Response' was dropped from predictor_train above
tr_g, te_g = target_encode(predictor_train["Vehicle_Damage"],
                           predictor_test["Vehicle_Damage"],
                           target=target_train,
                           min_samples_leaf=200,
                           smoothing=20,
                           noise_level=0.02)
predictor_train['Vehicle_Damage_me'] = tr_g
predictor_test['Vehicle_Damage_me'] = te_g

8. Create a baseline machine learning model for the binary classification problem

# Baseline Model Without Hyperparameters :
Classifiers = {'0.XGBoost' : XGBClassifier(),
               '1.CatBoost' : CatBoostClassifier(),
               '2.LightGBM' : LGBMClassifier()
              }

# Fine Tuned Model With Hyperparameters :
Classifiers = {'0.XGBoost' : XGBClassifier(eval_metric='auc',
                                           # GPU PARAMETERS #
                                           tree_method='gpu_hist',
                                           gpu_id=0,
                                           # GPU PARAMETERS #
                                           random_state=294,
                                           learning_rate=0.15,
                                           max_depth=4,
                                           n_estimators=494,
                                           objective='binary:logistic'),
               '1.CatBoost' : CatBoostClassifier(eval_metric='AUC',
                                                 # GPU PARAMETERS #
                                                 task_type='GPU',
                                                 devices="0",
                                                 # GPU PARAMETERS #
                                                 learning_rate=0.15,
                                                 n_estimators=494,
                                                 max_depth=7),
                                                 # scale_pos_weight=2
               '2.LightGBM' : LGBMClassifier(metric='auc',
                                             # GPU PARAMETERS #
                                             device="gpu",
                                             gpu_device_id=0,
                                             max_bin=63,
                                             gpu_platform_id=1,
                                             # GPU PARAMETERS #
                                             n_estimators=50000,
                                             bagging_fraction=0.95,
                                             subsample_freq=2,
                                             objective="binary",
                                             min_samples_leaf=2,
                                             importance_type="gain",
                                             verbosity=-1,
                                             random_state=294,
                                             num_leaves=300,
                                             boosting_type='gbdt',
                                             learning_rate=0.15,
                                             max_depth=4,
                                             # scale_pos_weight=2, # Score - 0.85865 | Rank - 18
                                             n_jobs=-1)
              }

9. Improve the evaluation metric "ROC_AUC" with K-fold cross-validation and averaging, and predict the target "Response"

# LightGBM Model
kf = KFold(n_splits=10, shuffle=True)
preds_1 = list()
y_pred_1 = []
rocauc_score = []

for i, (train_idx, val_idx) in enumerate(kf.split(predictor_train)):
    X_train, y_train = predictor_train.iloc[train_idx, :], target_train.iloc[train_idx]
    X_val, y_val = predictor_train.iloc[val_idx, :], target_train.iloc[val_idx]
    print('\nFold: {}\n'.format(i + 1))
    lg = LGBMClassifier(metric='auc',
                        # GPU PARAMETERS #
                        device="gpu",
                        gpu_device_id=0,
                        max_bin=63,
                        gpu_platform_id=1,
                        # GPU PARAMETERS #
                        n_estimators=50000,
                        bagging_fraction=0.95,
                        subsample_freq=2,
                        objective="binary",
                        min_samples_leaf=2,
                        importance_type="gain",
                        verbosity=-1,
                        random_state=294,
                        num_leaves=300,
                        boosting_type='gbdt',
                        learning_rate=0.15,
                        max_depth=4,
                        # scale_pos_weight=2, # Score - 0.85865 | Rank - 18
                        n_jobs=-1)
    lg.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_val, y_val)],
           early_stopping_rounds=100,
           verbose=100)
    roc_auc = roc_auc_score(y_val, lg.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
    preds_1.append(lg.predict_proba(predictor_test[predictor_test.columns])[:, 1])

# Average the 10 fold predictions to get the final test prediction
y_pred_final_1 = np.mean(preds_1, axis=0)
sub['Response'] = y_pred_final_1
Blend_model_1 = sub.copy()

print('ROC_AUC - CV Score: {}'.format((sum(rocauc_score) / 10)), '\n')
print("Score : ", rocauc_score)

# Download and Show Submission File :
display("sample_submission", sub)
sub_file_name_1 = "S1. LGBM_GPU_TargetEnc_Vehicle_Damage_me_1994SEED_NoScaler.csv"
sub.to_csv(sub_file_name_1, index=False)
sub.head(5)

# CatBoost Model
kf = KFold(n_splits=10, shuffle=True)
preds_2 = list()
y_pred_2 = []
rocauc_score = []

for i, (train_idx, val_idx) in enumerate(kf.split(predictor_train)):
    X_train, y_train = predictor_train.iloc[train_idx, :], target_train.iloc[train_idx]
    X_val, y_val = predictor_train.iloc[val_idx, :], target_train.iloc[val_idx]
    print('\nFold: {}\n'.format(i + 1))
    cb = CatBoostClassifier(eval_metric='AUC',
                            # GPU PARAMETERS #
                            task_type='GPU',
                            devices="0",
                            # GPU PARAMETERS #
                            learning_rate=0.15,
                            n_estimators=494,
                            max_depth=7)
                            # scale_pos_weight=2
    cb.fit(X_train, y_train,
           eval_set=[(X_val, y_val)],
           early_stopping_rounds=100,
           verbose=100)
    roc_auc = roc_auc_score(y_val, cb.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
    preds_2.append(cb.predict_proba(predictor_test[predictor_test.columns])[:, 1])

# Average the 10 fold predictions to get the final test prediction
y_pred_final_2 = np.mean(preds_2, axis=0)
sub['Response'] = y_pred_final_2

print('ROC_AUC - CV Score: {}'.format((sum(rocauc_score) / 10)), '\n')
print("Score : ", rocauc_score)

# Download and Show Submission File :
display("sample_submission", sub)
sub_file_name_2 = "S2. CB_GPU_TargetEnc_Vehicle_Damage_me_1994SEED_LGBM_NoScaler_MyStyle.csv"
sub.to_csv(sub_file_name_2, index=False)
Blend_model_2 = sub.copy()
sub.head(5)

# XGBoost Model
kf = KFold(n_splits=10, shuffle=True)
preds_3 = list()
y_pred_3 = []
rocauc_score = []

for i, (train_idx, val_idx) in enumerate(kf.split(predictor_train)):
    X_train, y_train = predictor_train.iloc[train_idx, :], target_train.iloc[train_idx]
    X_val, y_val = predictor_train.iloc[val_idx, :], target_train.iloc[val_idx]
    print('\nFold: {}\n'.format(i + 1))
    xg = XGBClassifier(eval_metric='auc',
                       # GPU PARAMETERS #
                       tree_method='gpu_hist',
                       gpu_id=0,
                       # GPU PARAMETERS #
                       random_state=294,
                       learning_rate=0.15,
                       max_depth=4,
                       n_estimators=494,
                       objective='binary:logistic')
    xg.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_val, y_val)],
           early_stopping_rounds=100,
           verbose=100)
    roc_auc = roc_auc_score(y_val, xg.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
    preds_3.append(xg.predict_proba(predictor_test[predictor_test.columns])[:, 1])

# Average the 10 fold predictions to get the final test prediction
y_pred_final_3 = np.mean(preds_3, axis=0)
sub['Response'] = y_pred_final_3

print('ROC_AUC - CV Score: {}'.format((sum(rocauc_score) / 10)), '\n')
print("Score : ", rocauc_score)

# Download and Show Submission File :
display("sample_submission", sub)
sub_file_name_3 = "S3. XGB_GPU_TargetEnc_Vehicle_Damage_me_1994SEED_LGBM_NoScaler.csv"
sub.to_csv(sub_file_name_3, index=False)
Blend_model_3 = sub.copy()
sub.head(5)

10. Submit the results, check the leaderboard, and improve "ROC_AUC"

one = Blend_model_2['id'].copy()
Blend_model_1.drop("id", axis=1, inplace=True)
Blend_model_2.drop("id", axis=1, inplace=True)
Blend_model_3.drop("id", axis=1, inplace=True)
Blend = (Blend_model_1 + Blend_model_2 + Blend_model_3)/3
id_df = pd.DataFrame(one, columns=['id'])
id_df.info()
Blend = pd.concat([ id_df,Blend], axis=1)
Blend.info()
Blend.to_csv('S4. Blend of 3 Models - LGBM_CB_XGB.csv',index=False)
display("S4. Blend of 3 Models : ",Blend.head())

K-fold cross-validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k, which refers to the number of groups a given data sample is split into; hence the procedure is often called k-fold cross-validation.
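To make the idea concrete, here is a minimal, self-contained sketch (using a small, untuned LGBMClassifier rather than the tuned models above) that scores a classifier with k-fold cross-validation and averages the out-of-fold ROC_AUC:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cv_roc_auc(X, y, n_splits=5, seed=1994):
    """Return the mean out-of-fold ROC_AUC of a small LightGBM model."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=seed)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        val_pred = model.predict_proba(X.iloc[val_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[val_idx], val_pred))
    return np.mean(scores)

# Usage (assuming predictor_train / target_train from step 6):
# print(cv_roc_auc(predictor_train, target_train))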

Early stopping
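Early stopping halts boosting when the validation metric (here AUC) has not improved for a given number of rounds, which is what keeps the very large n_estimators values above from overfitting. Below is a minimal sketch mirroring the fit calls in step 9; it assumes X_train, y_train, X_val, y_val from one fold split and the LGBMClassifier import above, and the parameter values are illustrative:

# Early-stopping sketch (illustrative, mirrors the fit calls in step 9)
model = LGBMClassifier(metric='auc', n_estimators=50000, learning_rate=0.15, random_state=294)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],      # validation data monitored for AUC
          early_stopping_rounds=100,      # stop if AUC does not improve for 100 rounds
          verbose=100)
print("Best iteration:", model.best_iteration_)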

How to make the 3 machine learning models run faster on GPU

1. GPU parameters in LIGHTGBM

To use the LightGBM GPU model, "Internet" must be enabled; then run all of the following code:


# Keep "Internet" switched ON in the Settings pane on the right side of the Kaggle kernel

# Cell 1:

!rm -r /opt/conda/lib/python3.6/site-packages/lightgbm

!git clone --recursive https://github.com/Microsoft/LightGBM

# Cell 2:

!apt-get install -y -qq libboost-all-dev

# Cell 3:

%%bash

cd LightGBM

rm -r build

mkdir build

cd build

cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..

make -j$(nproc)

# Cell 4:

!cd LightGBM/python-package/;python3 setup.py install --precompile

# Cell 5:

!mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

!rm -r LightGBM


  1. device = "gpu"

  2. gpu_device_id =0

  3. max_bin = 63

  4. gpu_platform_id=1

How to achieve good speedup on GPU

  1. You may want to run a few datasets that are known to have good GPU speedup (e.g. Higgs, epsilon, Bosch) to make sure your setup is correct. If you have multiple GPUs, make sure gpu_platform_id and gpu_device_id are set to use the desired GPU. Also make sure your system is idle (especially on a shared computer) to get accurate performance measurements.

  2. GPU works best on large and dense datasets. If the dataset is too small, computing on the GPU is inefficient because the data-transfer overhead can be significant. If you have categorical features, use the categorical_column option and feed them directly into LightGBM; do not convert them to one-hot variables (see the short sketch after this list).

  3. To make better use of the GPU, use a smaller number of bins. Setting max_bin = 63 is recommended, because it usually does not noticeably affect training accuracy on large datasets, yet GPU training is significantly faster than with the default bin size of 255. For some datasets, even 15 bins are enough (max_bin = 15); using 15 bins will maximize GPU performance. Make sure to check the run log and verify that the desired number of bins is used.

  4. Use single-precision training (gpu_use_dp = false) whenever possible, because most GPUs (especially NVIDIA consumer GPUs) have poor double-precision performance.
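As referenced in point 2, here is a minimal sketch (not from the original notebook) of passing raw categorical columns straight to LightGBM through the scikit-learn API's categorical_feature argument instead of one-hot encoding them; it assumes the raw train DataFrame before the integer mapping done in step 5:

# Hypothetical sketch: feed categorical columns directly to LightGBM
cat_cols = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']   # assumed categorical columns of this dataset
X = train.drop(['Response', 'id'], axis=1).copy()
X[cat_cols] = X[cat_cols].astype('category')             # LightGBM expects the 'category' dtype
y = train['Response']

model = LGBMClassifier(n_estimators=200, random_state=294)
model.fit(X, y, categorical_feature=cat_cols)            # no one-hot encoding needed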

2. GPU parameters in CATBOOST

  1. task_type='GPU'

  2. devices="0"


| Parameter | Description |
|---|---|
| task_type (CatBoost fit) | The type of processing unit used for training. Possible values: (1) CPU, (2) GPU |
| devices (CatBoostClassifier fit / CatBoostRegressor fit) | IDs of the GPU devices used for training (indices start from zero). Format: (1) a single device, e.g. devices='3'; (2) multiple devices, e.g. devices='0:1:3'; (3) a range of devices, e.g. devices='0-3' |

3. GPU parameters in XGBOOST

  1. tree_method ='gpu_hist'

  2. gpu_id = 0

Usage

Specify the tree_method parameter as one of the following algorithms.

Algorithm

| tree_method | Description |
|---|---|
| gpu_hist | Equivalent to the XGBoost fast histogram algorithm, but much faster and uses considerably less memory. Note: it can run very slowly on GPUs older than the Pascal architecture. |

Supported parameters

Parameters supported by gpu_hist: subsample, sampling_method, colsample_bytree, colsample_bylevel, max_bin, gamma, gpu_id, predictor, grow_policy, monotone_constraints, interaction_constraints, single_precision_histogram.

Cross-Sell hackathon summary

"10 things" that worked in this AV Cross-Sell hackathon competition:

  1. The 2 best features: target encoding of Vehicle_Damage, and the sum of Vehicle_Damage grouped by Region_Code (based on feature importance) - gave a big improvement on both CV (10-fold cross-validation) and LB (public leaderboard).

  2. Domain-based feature: frequency encoding of Vehicle_Age - improved the score. LB score: 0.85838 | LB rank: 15.

  3. RANK features (an idea from past hackathon solutions): a big boost. LB score: 0.858510 | LB rank: 23.

  4. Dropping the "id" column: a big boost.

  5. Domain-based feature: Vehicle_Damage per Vehicle_Age and Region_Code - slightly better. LB score: 0.858527 | LB rank: 22.

  6. Removing the skew in Annual_Premium (log transform): a big boost. LB score: 0.85855 | LB rank: 20.

  7. Domain-based feature: Vehicle_Damage per Region_Code and Policy_Sales_Channel (based on feature importance) - slightly better. LB score: 0.85856 | LB rank: 20.

  8. Tuning the hyperparameters and using 10-fold CV for all 3 models gave a robust strategy and the best results, with early stopping rounds = 50 or 100. Scale_pos_weight did not work here.

  9. Domain-based feature: customer term in years, because the other features are also measured in years and the insurance response depends on the number of years. LB score: 0.858657 | LB rank: 18.

  10. Ensembling/blending all 3 best single models (LightGBM, CatBoost, and XGBoost) gave the best score.

5 things that did NOT work

  1. Features that did not help: [sum of Vehicle_Damage by Age, sum of Vehicle_Damage by Previously_Insured, count of Vehicle_Damage by Region_Code, max of Vehicle_Damage by Region_Code, min of Vehicle_Damage by Region_Code, frequency encoding of old vehicles, frequency encoding of Vehicle_Age, monthly EMI = Annual_Premium / 12, sum of Vehicle_Damage grouped by Policy_Sales_Channel, sum of Vehicle_Damage by Vehicle_Age, sum of Vehicle_Damage by Driving_License]

  2. Dropping the Driving_License column as unrelated to the response.

  3. One-hot encoding/dummy encoding of all features.

  4. None of the 3 scaling methods worked better than the unscaled data; StandardScaler gave the best LB score among them. StandardScaler - 0.8581 | MinMaxScaler - 0.8580 | RobustScaler - 0.8444. (A short sketch of the three scalers follows this list.)

  5. Dropping duplicates from train and test based on Region_Code did not work at all.
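For reference on point 4, here is a minimal sketch (not the author's exact code) of applying the three scalers that were tried; it assumes the predictor_train and predictor_test frames from step 6:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Fit each scaler on the training features only, then transform train and test
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled_train = scaler.fit_transform(predictor_train)
    scaled_test  = scaler.transform(predictor_test)
    print(type(scaler).__name__, scaled_train.shape, scaled_test.shape)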

End of Part 2! (Series to be continued)

If you found this article helpful, please share it with data science beginners to help them get started with hackathons, because it explains many steps such as domain-knowledge-based feature engineering, cross-validation, early stopping, running 3 machine learning models on GPU, averaging/blending multiple models, and finally summarizing "what worked and what did not" - that last step saves a lot of time and effort and sharpens our awareness for future hackathons.

Thank you very much for reading !

Link to the original article: https://www.analyticsvidhya.com/blog/2020/10/ultimate-beginners-guide-to-win-classification-hackathons-with-a-data-science-use-case/

