Higgs-Boson Analysis and Classification
At the ATLAS detector at CERN, protons are accelerated to very high energies along a circular trajectory in both directions and collide head-on, producing hundreds of particles per second. The goal of the Challenge is to improve the classification procedure that produces the selection region. This is an interesting dataset to work on, and one that also lends itself to self-supervised approaches such as VIME.
- Goal
- Defining two functions that are used to get a report of our datasets
- Loading and Reading Datasets
- Exploratory Data Analysis
- Let us look at the Class Ratio in our dataset.
- Univariate Analysis
- Finding distribution of each Feature
- Bivariate Analysis
- Getting the Data Ready
- A) Dropping Highly Correlated Features
- B) Log Transformation of the features
- C) SMOTE Technique
- Utility Functions to make Model building and Cross-validation easier.
- Setting up Baseline Model
- Working on Ensemble-Tree Models
- Conclusion
The discovery of the Higgs particle was announced on 4th July 2012. In 2013, the Nobel Prize was conferred upon two scientists, Francois Englert and Peter Higgs, for their contributions towards its discovery. A characteristic property of the Higgs boson is its decay into other particles through different processes. At the ATLAS detector at CERN, protons are accelerated to very high energies along a circular trajectory in both directions and collide head-on, producing hundreds of particles per second.
These events are categorized as either background or signal events. Background events consist of decays of particles that have already been discovered in previous experiments. Signal events are decays of exotic particles: a region in feature space that is not explained by the background processes. The significance of these signal events is analyzed using different statistical tests.
If the probability that an event was not produced by a background process is well below a threshold, a new particle is considered to have been discovered. The ATLAS experiment observed a signal of the Higgs boson decaying into two tau particles, although it was buried in a significant amount of noise.
Goal
The goal of the Challenge is to improve the classification procedure that produces the selection region. The objective is a function of the weights of selected events. The prefix-less variables EventId, Weight and Label have a special role and should not be used as input to the classifier.
The variables prefixed with PRI (for PRImitives) are “raw” quantities about the bunch collision as measured by the detector, essentially the momenta of particles. Variables prefixed with DER (for DERived) are quantities computed from the primitive features.
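Since the prefixes encode each feature's provenance, either group can be selected programmatically; a small sketch using a handful of illustrative column names from the dataset description:

```python
import pandas as pd

# Toy frame with the same prefix convention as the challenge data
df = pd.DataFrame(columns=['DER_mass_MMC', 'DER_pt_h', 'PRI_tau_pt', 'PRI_met', 'Label'])

# filter(regex=...) selects columns whose names match the pattern
primitives = df.filter(regex='^PRI_').columns.tolist()
derived = df.filter(regex='^DER_').columns.tolist()
print(primitives)  # ['PRI_tau_pt', 'PRI_met']
print(derived)     # ['DER_mass_MMC', 'DER_pt_h']
```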
import pandas as pd
from IPython.display import display, HTML

def pretty_print(df):
    return display(HTML(df.to_html().replace("\\n", "<br>")))

def tbl_report(tbl, cols=None, card=10):
    print("Table Shape", tbl.shape)
    dtypes = tbl.dtypes
    nulls, uniques, numuniques, vcs = [], [], [], []
    for col in dtypes.index:
        nulls.append(tbl[col].isnull().sum())
        uniqs = tbl[col].unique()
        uniquenums = uniqs.shape[0]
        if uniquenums < card:  # low cardinality: show the full value counts
            valcounts = tbl[col].value_counts(dropna=False)
            vc = "\n".join("{}:{}".format(k, v) for k, v in valcounts.items())
        else:
            vc = 'HC'  # high cardinality
        uniques.append(uniqs)
        numuniques.append(uniquenums)
        vcs.append(vc)
    df = pd.concat([dtypes,
                    pd.Series(nulls, index=dtypes.index),
                    pd.Series(uniques, index=dtypes.index),
                    pd.Series(numuniques, index=dtypes.index),
                    pd.Series(vcs, index=dtypes.index)], axis=1)
    df.columns = ['dtype', 'nulls', 'uniques', 'num_uniques', 'value_counts']
    if cols:
        return pretty_print(df[cols])
    return pretty_print(df)
Loading and Reading Datasets
The data consists of simulated signal and background events in a 30-dimensional feature space. Each event is assigned an ID and a weight, as explained before. The 30 features are real-valued and capture different kinematic properties of the event and the particles involved, including estimated particle mass, invariant mass of the hadronic tau and lepton, vector sum of the transverse momentum of the hadronic tau, centrality of the azimuthal angle, pseudo-rapidity of the leptons, the number of jets and their properties, etc. The training data consists of 250,000 events and the test data of 550,000 events; the test data is not accompanied by weights. Each training event is marked with one of two labels: 's' for signal and 'b' for background.
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle kernels list --user Sakzsee --sort-by dateRun
!kaggle competitions download -c higgs-boson
!unzip -q train.csv.zip -d .
!unzip -q test.csv.zip -d .
!ls
# Loading train and test sets (pandas can read the zipped CSVs directly)
train = pd.read_csv("training.zip")
test = pd.read_csv("test.zip")
# EventId is an identifier - making it the index in both sets
train.set_index('EventId', inplace=True)
test.set_index('EventId', inplace=True)
#Looking at top 5 rows in train
train.head()
#Looking at training set info
tbl_report(train, cols=['dtype', 'nulls', 'num_uniques', 'value_counts'])
#Looking at the numerical descriptions
train.describe()
#Looking at top 5 rows in test
test.head()
#Looking at test info
tbl_report(test, cols=['dtype', 'nulls', 'num_uniques', 'value_counts'])
#Looking at statistical description of test
test.describe()
train.shape, test.shape
#Splitting into X and y
X = train.drop(['Label','Weight'], axis = 1)
y = pd.factorize(train['Label'])[0]
print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))
# Let's see the count of each class in our label
plt.figure(figsize=(10, 8))
ax = sns.countplot(train['Label'])
for p in ax.patches:
    # get_height() returns a float, so cast before the {:d} format
    ax.annotate('{:d}'.format(int(p.get_height())), (p.get_x() + 0.3, p.get_height() + 5))
Looks like we do indeed have an imbalanced dataset. Considering this, accuracy would not be a good metric of performance. F1 score would be a better fit.
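To see why accuracy misleads here, consider a toy label set with roughly the same 2:1 background-to-signal ratio, scored against a useless classifier that always predicts the majority class (a sketch; the exact ratio is illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: roughly 2:1 background (0) to signal (1)
y_true = np.array([0] * 660 + [1] * 340)
# A classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.66 -- looks deceptively decent
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 -- exposes the failure on the minority class
```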
# Plotting the distribution of each feature
fig = plt.figure(figsize=(30, 40))
for i in range(train.shape[1] - 1):
    ax = fig.add_subplot(8, 4, i + 1)
    ax = sns.distplot(train.iloc[:, i], color='dodgerblue')
    ax.set_title("Feature " + train.columns[i] + " distribution")
fig.tight_layout()
From the above, we see a lot of features that are skewed in nature. Hence, we will be log transforming them later.
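One way to quantify "skewed in nature" is the sample skewness that pandas computes per column; a sketch on synthetic stand-in columns (not the actual features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: an exponential (right-skewed) column, as many
# momentum-like features are, and a roughly symmetric one
df = pd.DataFrame({
    'skewed_pt': rng.exponential(scale=50.0, size=10_000),
    'symmetric': rng.normal(loc=0.0, scale=1.0, size=10_000),
})

# Series.skew() gives sample skewness; values well above ~1 flag
# candidates for a log transform
print(df.skew().round(2))
```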
# Plotting the distribution of each feature per class
fig = plt.figure(figsize=(30, 40))
for i in range(train.shape[1] - 2):
    ax = fig.add_subplot(6, 5, i + 1)
    ax = sns.distplot(train[train['Label'] == 's'].iloc[:, i], label="Class S", color="blue")
    ax = sns.distplot(train[train['Label'] == 'b'].iloc[:, i], label="Class B", color="grey")
    ax.set_title("Feature " + train.columns[i] + " distribution per class")
    ax.legend()
fig.tight_layout()
fig = plt.figure(figsize=(30, 40))
for i in range(train.shape[1] - 2):
    ax = fig.add_subplot(6, 5, i + 1)
    ax = sns.violinplot(train.iloc[:, i], train['Label'])
    ax.set_title("Feature " + train.columns[i] + " distribution per class")
fig.tight_layout()
train.corr().style.background_gradient(cmap='Blues')
# List to store the pairs of highly correlated features as tuples
feat = []
# Correlation threshold of 0.9
threshold = 0.9
correlation = X.corr()
for i in X.columns:
    temp = correlation[i]
    # Features whose absolute correlation with i exceeds the threshold
    corr_features = temp[(abs(temp) > threshold) & (temp.index != i)].index.values
    # Record each correlated pair only once, regardless of order
    for j in corr_features:
        features = (i, j)
        if not any(set(x) == set(features) for x in feat):
            feat.append(features)

print("The highly correlated features are given below")
for i in feat:
    corr = correlation[i[0]][i[1]]
    print('Features ' + i[0] + ' and ' + i[1] + ' are correlated with a correlation index of ' + str(np.round(corr, 2)))
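As an aside, the pair-deduplication loop above can be written more compactly by scanning only the upper triangle of the correlation matrix, which yields each pair exactly once by construction (a sketch, not the notebook's original logic):

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return each highly correlated column pair exactly once."""
    corr = df.corr().abs()
    # Mask out the diagonal and lower triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    stacked = upper.stack()  # drops the masked NaNs
    return [(a, b, round(v, 2)) for (a, b), v in stacked.items() if v > threshold]
```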
A) Dropping Highly Correlated Features
Highly correlated features carry largely redundant information, so we drop any feature that appears in more than two of the correlated pairs found above.
# Count how many correlated pairs each feature appears in
count = {}
for i, j in feat:
    count[i] = count.get(i, 0) + 1
    count[j] = count.get(j, 0) + 1
# Drop features that appear in more than two correlated pairs
for k, v in count.items():
    if v > 2:
        X.drop(k, axis=1, inplace=True)
X.info()
B) Log Transformation of the features
The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.
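A quick demonstration of that claim on synthetic log-normal data; np.log1p is the log(1 + x) variant used later in the notebook, which tolerates zeros:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Log-normal data is heavily right-skewed by construction
x = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=10_000))

print(round(x.skew(), 2))            # strongly positive skew
print(round(np.log1p(x).skew(), 2))  # much closer to 0 after the transform
```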
log_features = ['DER_mass_transverse_met_lep', 'DER_mass_vis',
                'DER_pt_h', 'DER_pt_tot', 'DER_sum_pt',
                'DER_pt_ratio_lep_tau', 'PRI_tau_pt',
                'PRI_lep_pt',
                'PRI_met', 'PRI_met_sumet']
# Apply log(1 + x) to the skewed features; note that this keeps only
# these ten features for the rest of the analysis
X = X[log_features].applymap(lambda x: np.log(x + 1))
for i in log_features:
    X[i].hist()
    plt.title("After log transformation for " + i)
    plt.show()
C) SMOTE Technique
The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.
One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
from collections import Counter
from imblearn.over_sampling import SMOTE

counter = Counter(y)
print('Before', counter)
# Oversampling the minority class using SMOTE
smt = SMOTE()
X_sm, y_sm = smt.fit_resample(X, y)
counter = Counter(y_sm)
print('After', counter)
#Creating a Resampled Dataframe
X_sm_df = pd.DataFrame(X_sm, columns=X.columns)
y_sm_df = pd.DataFrame(y_sm, columns = ['Label'])
temp_df = pd.concat([X_sm_df,y_sm_df],axis=1)
f, axes = plt.subplots(figsize=(10, 4), dpi=100)
plt.subplot(121)
sns.despine()
sns.distplot(temp_df[temp_df['Label']==0]['PRI_met_sumet'],label='After Resampling',color='red')
sns.distplot(np.log(train[train['Label']=='s']['PRI_met_sumet']),label='Before Resampling',color='blue')
plt.title('Distribution of Class "s" with SMOTE', fontsize=14);
plt.legend();
plt.subplot(122)
sns.despine()
sns.distplot(temp_df[temp_df['Label']==1]['PRI_met_sumet'],label='After Resampling',color='red')
sns.distplot(np.log(train[train['Label']=='b']['PRI_met_sumet']),label='Before Resampling',color='blue')
plt.title('Distribution of Class "b" with SMOTE', fontsize=14);
plt.legend();
Comparing the before and after curves, the resampled distributions keep the shape of the originals, confirming that the upsampling has taken effect.
def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None, oob_func=False):
    if (not oob_func) and score_func:
        print("SCORE FUNC", score_func)
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    elif oob_func:
        # Use the Random Forest out-of-bag score instead of cross-validation
        print("OOB_Score")
        results = {}
        estimators = {}
        # parameters is expected to hold n_estimators, max_features, max_depth in that order
        for n_est, mf, md in product(*parameters.values()):
            params = (n_est, mf, md)
            clf = RandomForestClassifier(random_state=2017, n_estimators=n_est, max_features=mf,
                                         max_depth=md, oob_score=True, n_jobs=-1)
            clf.fit(X, y)
            results[params] = clf.oob_score_
            estimators[params] = clf
        outparams = max(results, key=results.get)
        print("Best Params: ", outparams)
        best_estimator = estimators[outparams]
        print("Training Score: ", best_estimator.score(X, y))
        print("OOB Score: ", best_estimator.oob_score_)
        return best_estimator
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    print("BEST", gs.best_params_, gs.best_score_)
    best = gs.best_estimator_
    return best
def do_classify(clf, parameters, indf, y, score_func, n_folds=5, n_jobs=1, oob_func=False):
    X = indf
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8, random_state=2017)
    if oob_func:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, oob_func=True)
    else:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % training_accuracy)
    print("Accuracy on test data: %0.2f" % test_accuracy)
    print(confusion_matrix(ytest, clf.predict(Xtest)))
    print("########################################################")
    plot_confusion_matrix(clf, Xtest, ytest, cmap="Blues")
    return clf, Xtrain, ytrain, Xtest, ytest
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0, initial=False):
    if not ax:
        ax = plt.gca()
    if proba:
        fpr, tpr, thresholds = roc_curve(ytest, clf.predict_proba(xtest)[:, 1])
    else:
        fpr, tpr, thresholds = roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l = fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', lw=2, alpha=0.4,
                label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', lw=2, alpha=0.4,
                label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    # Annotate decision thresholds along the curve
    # from https://gist.github.com/podshumok/c1d1c9394335d86255b8
    label_kwargs = {'bbox': dict(boxstyle='round,pad=0.3', alpha=0.2)}
    for k in range(0, fpr.shape[0], labe):
        threshold = str(np.round(thresholds[k], 2))
        ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax
def make_pr(name, clf, ytest, xtest, ax=None):
    if not ax:
        ax = plt.gca()
    scores = clf.predict_proba(xtest)[:, 1]
    precision, recall, _ = precision_recall_curve(ytest, scores)
    ax.plot(recall, precision, '*-',
            label="Precision-Recall Curve for %s (area = %0.2f)" % (name, average_precision_score(ytest, scores)))
    ax.set_xlim([-0.01, 1.01])
    ax.set_ylim([-0.01, 1.01])
    ax.set_xlabel('Recall', fontsize=12)
    ax.set_ylabel('Precision', fontsize=12)
    ax.grid()
    plt.tight_layout()
    plt.legend()
    return ax
def calibration_plot(name, clf, xtest, ytest):
    fig, ax = plt.subplots(2, 1, figsize=(8, 15))
    # Fraction of positives vs mean predicted value per bin
    fop, mpv = calibration_curve(ytest, clf.predict_proba(xtest)[:, 1], n_bins=20)
    ax[0].plot(mpv, fop, marker='*', label="Calibration curve for %s" % name)
    ax[0].plot([0, 1], [0, 1])
    # Histogram of the predicted probabilities
    ax[1].hist(clf.predict_proba(xtest)[:, 1], range=(0, 1), bins=20)
    fig.legend()
def p_importance(model, cols, fi, fistd=0):
    return pd.DataFrame({'features': cols, 'importance': fi, 'importance_std': fistd}
                        ).sort_values('importance', ascending=False)

def plot_perm_importance(name, model, Xtest, ytest, last=False, number=10):
    imp = permutation_importance(model, Xtest, ytest)
    xgf_df = p_importance(imp, Xtest.columns, imp['importances_mean'], imp['importances_std'])
    fig, ax = plt.subplots(figsize=(17, 10))
    if last:
        sns.barplot(data=xgf_df[-number:], x='features', y='importance', label='%s_importances' % name, ax=ax)
    else:
        sns.barplot(data=xgf_df[:number], x='features', y='importance', label='%s_importances' % name, ax=ax)
    plt.xticks(rotation=45)
    plt.title("Bar plot of Importances for %s" % name)
    return xgf_df
Setting up Baseline Model
When trying to develop a scientific understanding of the world, most fields start with broad strokes before exploring important details. In Physics for example, we start with simple models (Newtonian physics) and progressively dive into more complex ones (Relativity) as we learn which of our initial assumptions were wrong. This allows us to solve problems efficiently, by reasoning at the simplest useful level.
Fundamentally, a baseline is a model that is both simple to set up and has a reasonable chance of providing decent results. Experimenting with them is usually quick and low cost, since implementations are widely available in popular packages.
In our case, we shall take Logistic Regression as our baseline model to set the line on which we will further improve model performance.
# set up standardization
ss = StandardScaler()
# one-hot encoding
oh = OneHotEncoder()
cont_vars = X.columns.to_list()
cat_vars = []
# continuous variables need to be standardized
cont_pipe = Pipeline([("scale", ss)])
# categorical variables need to be one-hot encoded
cat_pipe = Pipeline([('onehot', oh)])
# combine both into a transformer
transformers = [('cont', cont_pipe, cont_vars), ('cat', cat_pipe, cat_vars)]
# apply the transformer to the relevant columns; the rest pass through unchanged
ct = ColumnTransformer(transformers=transformers, remainder="passthrough")
# create a pipeline so that we are not leaking data from validation to train in the individual folds
pipe = Pipeline(steps=[('ct', ct), ('model', LogisticRegression(max_iter=10000, penalty='l2'))])
# in the param grid we don't use C but model__C, matching the step name in the pipeline
paramgrid = dict(model__C=[1000, 100, 10])
# Now we train our model. do_classify takes care of splitting the data and picking
# up the target variable. We score using the AUC on the validation sets.
lr, Xtrain, ytrain, Xtest, ytest = do_classify(pipe, paramgrid, X, y, score_func='roc_auc')
print(classification_report(ytest,lr.predict(Xtest)))
fig,ax=plt.subplots(1,2,figsize=(10,5))
make_roc('logistic', lr, ytest , Xtest, ax=ax[0],labe=1000, initial = False)
make_pr('logistic', lr, ytest, Xtest,ax=ax[1]);
calibration_plot("Logistic Regression Model",lr, Xtest, ytest)
From the above, the Logistic Regression model achieves an accuracy of around 71%, with an average F1 score of about 0.70. The data fed into it is still imbalanced, which may explain the modest accuracy.
Setting up Pipeline for Decision Tree Classifier
The Decision Tree algorithm belongs to the family of supervised learning algorithms and can be used for both regression and classification problems.
The goal of using a Decision Tree is to create a model that predicts the class or value of the target variable by learning simple decision rules inferred from prior (training) data.
To predict a class label for a record, we start from the root of the tree, compare the value of the root attribute with the record's attribute and, on the basis of that comparison, follow the branch corresponding to that value and jump to the next node.
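That root-to-leaf traversal can be made concrete by printing the decision rules a fitted tree has learned; a toy example on synthetic data, not the challenge features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# export_text shows the threshold comparison made at each internal node
rules = export_text(tree, feature_names=[f'f{i}' for i in range(4)])
print(rules)
```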
# Grid-search a Decision Tree on the resampled data
dt = DecisionTreeClassifier(random_state=142)
paramgrid_dt = {'max_depth': range(1, 9), 'min_samples_leaf': range(3, 5), 'criterion': ['gini']}
dt, Xtrain, ytrain, Xtest,ytest = do_classify(dt, paramgrid_dt,X_sm,y_sm,'roc_auc',n_folds=5,n_jobs=-1)
print(classification_report(ytest,dt.predict(Xtest)))
colors = [None, None,['red','blue'],]
dt_viz = dtreeviz(dt, X_sm,y_sm,
feature_names = X_sm.columns,
target_name = 'Label', class_names= ['Yes','No'],orientation = 'TD',
colors={'classes':colors},
label_fontsize=12,
ticks_fontsize=10,
)
dt_viz.save("DecisionTree.svg")
from IPython.display import SVG, display
display(SVG("DecisionTree.svg"))
dimp = permutation_importance(dt,Xtest,ytest)
ddf = p_importance(dt,list(X_sm.columns),dimp['importances_mean'],dimp['importances_std']).iloc[:10]
#Plotting Feature Importance Graphs
fig,ax=plt.subplots(figsize=(17,10))
sns.barplot(data=ddf,x='features',y='importance',label='Decision_importances',ax=ax)
plt.xticks(rotation='45')
plt.title("Bar plot of Importances for Decision Tree Model");
Working on Ensemble-Tree Models
Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner.
Let's talk about a few techniques for ensembling decision trees:
- Bagging
- Boosting
Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several subsets of the training data, chosen randomly with replacement, and train a decision tree on each subset. As a result, we end up with an ensemble of different models. The average of the predictions from the different trees is more robust than a single decision tree.
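The bootstrap-and-vote idea can be sketched directly with numpy before reaching for sklearn's BaggingClassifier (illustrative toy data, not the challenge features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample (drawn with replacement) ...
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# ... then aggregate by majority vote over the ensemble
votes = np.mean([t.predict(X_toy) for t in trees], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print((ensemble_pred == y_toy).mean())  # training accuracy of the bagged vote
```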
Trying out Boosting Models
Gradient Boosting is an extension of the boosting method: Gradient Boosting = Gradient Descent + Boosting. It uses the gradient descent algorithm, which can optimize any differentiable loss function. An ensemble of trees is built one by one, and the individual trees are summed sequentially; each new tree tries to recover the loss (the difference between the actual and predicted values).
Advantages of the Gradient Boosting technique:
- Supports different loss functions.
- Works well with interactions.
Disadvantages of the Gradient Boosting technique:
- Prone to over-fitting.
- Requires careful tuning of different hyper-parameters.
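The models fitted below stop at bagging, but fitting a gradient boosting model follows the same sklearn pattern; a minimal sketch on toy data (the hyperparameter values are illustrative, not tuned for the challenge):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=1000, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

# Trees are added sequentially, each fit to the gradient of the loss;
# a small learning_rate guards against the over-fitting noted above
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(Xtr, ytr)
print(round(gb.score(Xte, yte), 2))
```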
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
X_train,X_test,y_train,y_test = train_test_split(X_sm , y_sm ,train_size=0.8)
max_depth = 20
n_estimators = 50
# Bag 50 decision trees, each with maximum depth max_depth
basemodel = DecisionTreeClassifier(max_depth=max_depth, random_state=142)
bagging = BaggingClassifier(base_estimator=basemodel,
                            n_estimators=n_estimators)
# Fit the model on the training set
bagging.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
# We make predictions on the validation set
bag_predictions = bagging.predict(X_test)
# compute the accuracy on the validation set
acc_bag = round(accuracy_score(bag_predictions, y_test),2)
print(f'For Bagging, the accuracy on the validation set is {acc_bag}')
print(classification_report(y_test,bagging.predict(X_test)))
# Define a Random Forest classifier with random_state as above
# Set the maximum depth to be max_depth and use 50 estimators
random_forest = RandomForestClassifier(max_depth=max_depth,
random_state=142,
n_estimators=n_estimators,
max_features=8)
# Fit the model on the training set
random_forest.fit(X_train, y_train)
# We make predictions on the validation set
rf_predictions = random_forest.predict(X_test)
# compute the accuracy on the validation set
acc_rf = round(accuracy_score(rf_predictions, y_test),2)
print(f'For Random Forest, the accuracy on the validation set is {acc_rf}')
# Reducing the max_depth for visualization
max_depth = 3
random_forest = RandomForestClassifier(max_depth=max_depth, random_state=142, n_estimators=n_estimators,max_features = 8)
# Fit the model on the training set
random_forest.fit(X_train, y_train)
# Selecting the first tree from the forest for visualization
forest1 = random_forest.estimators_[0]
vizC = dtreeviz(forest1, X_sm.iloc[:,:11],y_sm,
feature_names = X_sm.columns[:11],
target_name = 'Signal/Background', class_names= ['No','Yes']
,orientation = 'TD',
colors={'classes':colors},
label_fontsize=14,
ticks_fontsize=10,
scale=1.1
)
vizC.save("RandomForestClassifier1.svg")
Plotting and Comparing ROC Curves
In Machine Learning, performance measurement is an essential task, and for classification problems we can rely on the AUC-ROC curve. When we need to check or visualize the performance of a classifier, we use the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve. It is one of the most important evaluation metrics for assessing a classification model's performance, and is also written as AUROC (Area Under the Receiver Operating Characteristic).
Interpretation of ROC Curves
An excellent model has an AUC close to 1, meaning it separates the classes well. A poor model has an AUC close to 0, meaning its predictions are inverted: it predicts 0s as 1s and 1s as 0s. An AUC of 0.5 means the model has no class-separation capacity at all.
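These three regimes can be verified directly with sklearn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0 -- perfect separation
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5 -- no separation
print(roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1]))  # 0.0 -- perfectly inverted
```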
fig,ax=plt.subplots(1,2,figsize=(10,5))
make_roc('Decision Tree on balanced dataset', dt,ytest ,Xtest, ax=ax[0],labe=5000, initial = False)
make_roc('Logistic Regression', lr,ytest ,Xtest, ax=ax[0],labe=5000, initial = False)
make_pr('Decision Tree on balanced dataset',dt,ytest,Xtest,ax=ax[1])
make_pr('Logistic Regression',lr,ytest,Xtest,ax=ax[1])
fig,ax=plt.subplots(1,2,figsize=(10,5))
make_roc('Random Forest on balanced dataset', random_forest,ytest ,Xtest, ax=ax[0],labe=5000, initial = False)
make_roc('Bagging Classifier on balanced dataset', bagging,ytest ,Xtest, ax=ax[0],labe=5000, initial = False)
make_pr('Random Forest on balanced dataset', random_forest, ytest, Xtest, ax=ax[1])
make_pr('Bagging Classifier on balanced dataset', bagging, ytest, Xtest, ax=ax[1])
from sklearn.inspection import plot_partial_dependence
#Partial Dependence for DT
fig, axes = plt.subplots(10, 1, figsize = (5, 20))
plot_partial_dependence(dt,Xtest,Xtest.columns.to_list(),ax=axes)
fig.tight_layout()
#Partial Dependence for Random Forest Classifier
fig, axes = plt.subplots(10, 1, figsize = (5, 20))
plot_partial_dependence(random_forest,Xtest,Xtest.columns.to_list(),ax=axes)
fig.tight_layout()
#Partial Dependence for Bagging Classifier
fig, axes = plt.subplots(10, 1, figsize = (5, 20))
plot_partial_dependence(bagging,Xtest,Xtest.columns.to_list(),ax=axes)
fig.tight_layout()
Conclusion
These are the accuracies and F1 scores for the various classifiers used:
- Logistic Regression - 71% Accuracy and 0.70 F1 Score
- Decision Tree Classifier - 78% Accuracy and 0.78 F1 Score
- Bagging Classifier - 84% Accuracy and 0.84 F1 Score
- Random Forest Classifier - 84% Accuracy and 0.84 F1 Score
Clearly, the ensemble models performed best. However, we would like to further pre-process the data and fine-tune the hyperparameters to improve the classification. One of the ideas we wanted to implement was self-supervised learning, which we have done in a notebook called VIME_implementation.ipynb; it contains our attempt at implementing the VIME algorithm on our dataset.