%% Cell type:markdown id: tags:
# Scikit-Learn Introduction
A number of Python packages provide implementations of machine learning algorithms.
**Scikit-Learn** is one of the most popular.
* it provides many of the common ML algorithms
* well-designed, uniform API (programming interface)
* standardized and largely streamlined setup of the different models
→ easy to switch
* good documentation
The first example is based on the **Iris dataset**. It was introduced by the famous statistician
Ronald Fisher in 1936 and has been used ever since as an instructive use case for classification.
The data consists of
* measurements of the length and width of both sepals and petals
* the classification into Iris species
%% Cell type:code id: tags:
``` python
# the usual setup:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%% Cell type:code id: tags:
``` python
# seaborn provides an easy way to load the iris dataset as a pandas DataFrame
import seaborn as sns
iris = sns.load_dataset('iris')
%% Cell type:code id: tags:
``` python
%% Cell type:markdown id: tags:
## Data visualization
The first step should always be some investigation of the data properties, e.g.
* basic statistical properties
* visualization of distributions
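
A minimal sketch of the first of these checks. It assumes the Iris data is taken from `sklearn.datasets.load_iris` instead of seaborn, so no download is needed; the column names therefore differ from the `iris` DataFrame used in this notebook, and `df` and `feature_means` are names chosen only for this illustration.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the data as pandas objects
iris_sk = load_iris(as_frame=True)
df = iris_sk.frame                      # 4 feature columns + integer 'target'
df["species"] = df["target"].map(dict(enumerate(iris_sk.target_names)))

# basic statistical properties of all numerical columns
print(df.describe())

# mean of each feature per species: a first hint at how well the classes separate
feature_means = df.groupby("species")[iris_sk.feature_names].mean()
print(feature_means)
```

Large differences between the per-species means (compared to the spread of a feature) already indicate which features are useful for the classification.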
%% Cell type:code id: tags:
``` python
# basic statistics with pandas
%% Cell type:code id: tags:
``` python
# distribution of single feature
%% Cell type:code id: tags:
``` python
# combined plot of 2 features
sns.jointplot(data=iris,x='sepal_length',y='sepal_width', hue='species')
%% Cell type:code id: tags:
``` python
# combined plot matrix of all features in dataframe
# will provide scatter plot of all combinations of numerical columns in dataframe
# target (=species) can be given and will cause different colors
sns.pairplot(iris, hue='species', diag_kind='hist', height=2.0)
%% Cell type:markdown id: tags:
## Data preparation
For **supervised learning** with sklearn the first step is always to split the data into
* a table/matrix of **features**
* a list of **targets**
Then split the data into a **train** and a **test** sample:
* use the `train_test_split` function from sklearn
* by default 75% of the entries are used for training and 25% for testing
* the fractions can be specified as parameters
* entries are selected at random
→ the initial order does not matter
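
A short sketch of how the split can be controlled. It assumes the Iris data from `sklearn.datasets.load_iris` as plain arrays; the variable names and the chosen values (`test_size=0.3`, `random_state=42`) are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)   # 150 samples, 4 features

# keep 30% for testing, fix the random seed for reproducibility,
# and preserve the class fractions in both samples (stratify)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(Xtr.shape, Xte.shape)
```

Fixing `random_state` makes the otherwise random selection repeatable, which is convenient when comparing several models on the same split.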
%% Cell type:code id: tags:
``` python
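# feature matrix: all columns except the species label
X = iris.drop(columns='species')
%% Cell type:code id: tags:
``` python
# target: the species label
Y = iris['species']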
%% Cell type:code id: tags:
``` python
# break-up in train & test sample
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y)
%% Cell type:code id: tags:
``` python
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
%% Cell type:markdown id: tags:
## Fit kNN model, apply it and make predictions
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
%% Cell type:code id: tags:
``` python
# fit the model to the training data
knn.fit(X_train, y_train)
%% Cell type:code id: tags:
``` python
# create dummy iris
#X_new = np.array([[5, 4.9, 4, 1.2]])
# recent sklearn versions expect the same data type (DataFrame with column names) as used for training
X_new = pd.DataFrame(np.array([[5, 4.9, 4, 1.2]]), columns=X.columns)
# 2D format required: n_rows x n_columns, here (1, 4)
X_new.shape
%% Cell type:code id: tags:
``` python
knn.predict(X_new) # apply model to new data point
%% Cell type:markdown id: tags:
### Test/evaluate model
%% Cell type:code id: tags:
``` python
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
%% Cell type:code id: tags:
``` python
%% Cell type:code id: tags:
``` python
# use the sklearn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))
%% Cell type:markdown id: tags:
Further useful checks are the **classification report** and the **confusion matrix**.
They give detailed information on mis-classifications:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
%% Cell type:markdown id: tags:
(The meaning of `recall` etc. will be explained in a bit.)
Another intuitive measure is the confusion matrix:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
labels = np.unique(y_test)
mat = confusion_matrix(y_test, y_pred, labels=labels)
print (labels, '\n', mat)
%% Cell type:markdown id: tags:
**Repeat with different settings for the number of neighbors**
**The accuracy is usually high for the Iris data:**
as the scatter plots suggested, there is a rather clear separation between the species.
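
One possible way to do this scan is sketched below. It assumes the Iris data from `sklearn.datasets.load_iris` and a fixed `random_state`, so the numbers will differ from those of the randomly split sample above; the list of `k` values is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 9, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    scores[k] = clf.score(Xte, yte)      # mean accuracy on the test sample
    print(f"n_neighbors={k:2d}: test accuracy = {scores[k]:.3f}")
```

With only a few dozen test entries, a single mis-classified flower already changes the accuracy by a few percent, so small differences between settings should not be over-interpreted.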
%% Cell type:markdown id: tags:
## Measure Quality of Classification
The above `classification_report` presented several parameters which are useful to quantify how well the classification works.
For these we need to introduce the following terms (assuming a classification with two classes *P* and *N*)
* $t_p = $ true-positive: number of cases with predicted *P* and correct *P*
* $t_n = $ true-negative: number of cases with predicted *N* and correct *N*
* $f_p = $ false-positive: number of cases with predicted *P* and correct *N*
* $f_n = $ false-negative: number of cases with predicted *N* and correct *P*
![Confusion matrix](./figures/wikipedia_confusion_matrix.png "More details: see Wikipedia article on confusion matrix")
Based on these, the parameters in the `classification_report` are defined as:
* `precision` (or `purity`): $ t_p / ( t_p + f_p ) $ , i.e. fraction of cases classified as *P* which are true *P*
* `recall` (or `efficiency`): $ t_p / ( t_p + f_n ) $ , i.e. fraction of true *P* which are classified as *P*
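* `f1-score` : harmonic mean of `precision` and `recall`, i.e. $ 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / ( \mathrm{precision} + \mathrm{recall} ) $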
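
The definitions above can be checked by hand on a small example. The labels below are made up for illustration (1 = class *P*, 0 = class *N*); the results are compared with the corresponding functions from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical toy labels: 1 = class P, 0 = class N
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_hat == 1) & (y_true == 1))   # predicted P, truly P
fp = np.sum((y_hat == 1) & (y_true == 0))   # predicted P, truly N
fn = np.sum((y_hat == 0) & (y_true == 1))   # predicted N, truly P

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"by hand: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"sklearn: precision={precision_score(y_true, y_hat):.3f} "
      f"recall={recall_score(y_true, y_hat):.3f} f1={f1_score(y_true, y_hat):.3f}")
```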
%% Cell type:markdown id: tags:
## Test further simple models
### Gaussian Naive Bayes
Also a conceptually simple model:
* The basic assumption is that, for each category (here: *Iris species*), every variable follows a Gaussian distribution.
* In training, the model determines the parameters of these Gaussians.
* For classification, it then simply calculates the probability for a new Iris data point to be of species `i`, based on the Gaussian probability density of each variable $x$:
$$ P(x \mid i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right) $$
* where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the respective variable for species `i`
(We'll look at why it's called "Bayes" in a bit more detail [here](http://localhost:8888/notebooks/Higgs-Gaussian.ipynb#GaussianNB).)
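
The training and classification steps described above can be sketched by hand with a few lines of NumPy. This is an illustration under assumptions, not the sklearn implementation: it uses the Iris data from `sklearn.datasets.load_iris`, multiplies the per-variable densities (the "naive" independence assumption) and weights each species with its fraction in the training sample. The result is compared with the predictions of `GaussianNB`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

classes = np.unique(ytr)
# "training": per-class mean and standard deviation of every variable, plus class fractions
mu = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])     # shape (3, 4)
sigma = np.array([Xtr[ytr == c].std(axis=0) for c in classes])   # shape (3, 4)
prior = np.array([np.mean(ytr == c) for c in classes])           # shape (3,)

def log_gauss(x, mu, sigma):
    """Logarithm of the Gaussian density, evaluated element-wise."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# variables are treated as independent, so their log-densities simply add up
log_post = np.array([
    np.log(prior[i]) + log_gauss(Xte, mu[i], sigma[i]).sum(axis=1)
    for i in range(len(classes))
])                                          # shape (3, n_test)
y_manual = classes[log_post.argmax(axis=0)]  # most probable species per test entry

y_sklearn = GaussianNB().fit(Xtr, ytr).predict(Xte)
agreement = np.mean(y_manual == y_sklearn)
print(f"agreement with GaussianNB: {agreement:.3f}")
```

Working with logarithms avoids numerical underflow when many small densities are multiplied.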
%% Cell type:code id: tags:
``` python
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
y_gnb = model.predict(X_test) # 4. predict on new data
%% Cell type:code id: tags:
``` python
# use the sklearn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_gnb)
print("Test set score: {:.3f}".format(score))
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
# argument order is (true labels, predicted labels); swapping them exchanges precision and recall
print(classification_report(y_test, y_gnb))
%% Cell type:code id: tags:
``` python
mat = confusion_matrix(y_test, y_gnb, labels=labels)
print (labels,'\n', mat)
%% Cell type:markdown id: tags:
### Logistic Regression
This method is similar to standard linear regression. However, it can be used for discrete dependent variables, i.e. for classification use cases.
It is a rather simple, linear model:
* logistic function: $f(x) = \frac{1}{1+\exp(-x)}$, $f: (-\infty,\infty) \to (0,1)$
* model: $y_i = f(x_i \cdot \beta) + \epsilon_i$
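
A small numerical illustration of the logistic function and of how it turns a linear combination into a class probability. The coefficients `b0` and `b1` are made-up numbers for a hypothetical one-feature model, not fitted values.

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = logistic(x)
print(np.round(p, 3))

# hypothetical one-feature model with intercept b0 and slope b1
b0, b1 = -3.0, 1.5
x_feature = np.array([1.0, 2.0, 3.0])
prob = logistic(b0 + b1 * x_feature)   # probability of class 1
label = (prob >= 0.5).astype(int)      # decide for class 1 if the probability is at least 50%
print(np.round(prob, 3), label)
```

The decision boundary lies where the linear term is zero, here at `x_feature = 2`, which is why the model is called linear despite the non-linear logistic function.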
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression # 1. choose model class
model = LogisticRegression(max_iter=500) # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
y_lr = model.predict(X_test) # 4. predict on new data
%% Cell type:code id: tags:
``` python
# use the sklearn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_lr)
print("Test set score: {:.3f}".format(score))
%% Cell type:markdown id: tags:
### Support Vector Machine
SVM is another standard ML method, conceptually related to logistic regression and kNN:
* It looks for a line/plane separating the classes.
* The decision is based on neighbouring elements:
    * the margin to the few closest elements (= support vectors) is maximized
* A ‘linear SVM’ works only for linearly separable classes.
* The ‘kernel SVM’ variant also handles non-linear cases.
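
The role of the support vectors can be made visible with a short sketch. It assumes the Iris data from `sklearn.datasets.load_iris` and default `SVC` settings apart from the kernel; the fitted model exposes the selected training points as `support_vectors_`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for kernel in ("linear", "rbf"):
    svm = SVC(kernel=kernel).fit(Xtr, ytr)
    n_sv = svm.support_vectors_.shape[0]        # training points that define the margin
    results[kernel] = (n_sv, svm.score(Xte, yte))
    print(f"{kernel:6s}: {n_sv} support vectors out of {len(Xtr)} training points, "
          f"test accuracy {results[kernel][1]:.3f}")
```

Only the support vectors enter the decision function; all other training points could be removed without changing the fitted model.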
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC
%% Cell type:code id: tags:
``` python
model = SVC(kernel='linear', gamma="auto", random_state=42) # 2. instantiate model
#model = SVC(kernel='rbf', gamma="auto", random_state=42)
model.fit(X_train, y_train) # 3. fit model to data
y_svm = model.predict(X_test) # 4. predict on new data
%% Cell type:code id: tags:
``` python
# use the sklearn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_svm)
print("Test set score: {:.3f}".format(score))
%% Cell type:markdown id: tags:
### Probabilistic Classification
In general, models can be used not only to give a single classification as in the examples above, but also to provide a list of probabilities for the different possible outcomes:
%% Cell type:code id: tags:
``` python
model = LogisticRegression(max_iter=500) # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
yout = model.predict_proba(X_test) # 4. class probabilities for each test entry
print(yout[:5])
%% Cell type:code id: tags:
``` python
yout[y_lr != y_test]
%% Cell type:code id: tags:
``` python
%% Cell type:markdown id: tags:
Depending on the type of problem, this information can be used to distinguish clear cases from those with overlapping classifications.
It can also be used to adjust the trade-off between precision and recall.
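
A sketch of such a trade-off, under the assumption of a simplified two-class problem (*virginica* against the other two species) built from `sklearn.datasets.load_iris`. Instead of using `predict`, the probability from `predict_proba` is compared with a freely chosen threshold; the three threshold values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
y_bin = (y_demo == 2).astype(int)            # 1 = virginica, 0 = any other species
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_bin, random_state=0)

clf = LogisticRegression(max_iter=500).fit(Xtr, ytr)
p_virginica = clf.predict_proba(Xte)[:, 1]   # column order follows clf.classes_

recalls = []
for threshold in (0.2, 0.5, 0.8):
    y_thr = (p_virginica >= threshold).astype(int)
    prec = precision_score(yte, y_thr, zero_division=0)
    rec = recall_score(yte, y_thr)
    recalls.append(rec)
    print(f"threshold {threshold:.1f}: precision {prec:.3f}, recall {rec:.3f}")
```

Raising the threshold can only reduce the number of entries classified as positive, so the recall cannot increase, while the precision typically improves.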
%% Cell type:markdown id: tags:
## Classification for digit data
Another classic example case for ML is handwritten digits data.
A suitable dataset is included with sklearn. First we look into it:
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_digits
digits = load_digits()
%% Cell type:code id: tags:
``` python
%% Cell type:code id: tags:
``` python
%% Cell type:code id: tags:
``` python
%% Cell type:markdown id: tags:
The data comes in an sklearn-specific container, which basically holds a list of 8x8 pixel images.
We plot a subset:
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    # write the true label into the lower-left corner of each image
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
%% Cell type:markdown id: tags:
The plot shows each pixel image together with its label (in green).
* Some images are obvious
* Others seem difficult
%% Cell type:code id: tags:
``` python
# Look at data from 1st image --> 2D table resembles 0
print(digits.images[0])
%% Cell type:code id: tags:
``` python
%% Cell type:markdown id: tags:
## Image data with sklearn:
To use the data with sklearn as before we need a 2D structure: `samples x features`, i.e. each 8x8 image has to be transformed into a flat array of 64 values.
This is already provided in the dataset, as element `data`:
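
For image data that does not come with such a flattened version, the transformation can be done with `reshape`. The sketch below applies it to the digits images and checks that the result agrees with the `data` element; `digits_demo` and `flat` are names chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()
print(digits_demo.images.shape)   # (n_samples, 8, 8)
print(digits_demo.data.shape)     # (n_samples, 64)

# flatten each 8x8 image into one row of 64 pixel values
n_samples = len(digits_demo.images)
flat = digits_demo.images.reshape(n_samples, -1)

same = np.array_equal(flat, digits_demo.data)
print(same)
```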
%% Cell type:code id: tags:
``` python
print(digits.data[0])
%% Cell type:code id: tags:
``` python
# to use as before just re-label
X = digits.data
y = digits.target
%% Cell type:markdown id: tags:
### Digit classification
This is a multi-class classification problem:
* conceptually there is no big difference to binary classification
* the models discussed are flexible enough to handle this
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
%% Cell type:markdown id: tags:
#### First kNN:
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
%% Cell type:code id: tags:
``` python
# use the sklearn function for the score
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))
%% Cell type:markdown id: tags:
**Detailed classification report**
%% Cell type:code id: tags:
``` python
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))
%% Cell type:markdown id: tags:
**Check confusion matrix**
This is very informative for such a case.
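
Besides plotting the matrix, the most frequent confusions can also be extracted directly. The following sketch rebuilds the kNN classification in a self-contained way (same settings as above, `n_neighbors=7` and `random_state=0`) and lists the largest off-diagonal entries; showing the top three is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Xd, yd = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, random_state=0)
y_hat = KNeighborsClassifier(n_neighbors=7).fit(Xtr, ytr).predict(Xte)

cm = confusion_matrix(yte, y_hat)     # rows: true digit, columns: predicted digit
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)         # keep only the mis-classifications

# positions of the three largest off-diagonal entries
order = np.argsort(off_diag, axis=None)[::-1][:3]
for true_digit, pred_digit in zip(*np.unravel_index(order, off_diag.shape)):
    n = off_diag[true_digit, pred_digit]
    if n > 0:
        print(f"true {true_digit} predicted as {pred_digit}: {n} times")
```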
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
%% Cell type:markdown id: tags:
##### kNN performs really well!
#### Then Gaussian Naive Bayes:
%% Cell type:code id: tags:
``` python
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_model = model.predict(X_test)
%% Cell type:code id: tags:
``` python
score = accuracy_score(y_test, y_model)
print("Test set score: {:.3f}".format(score))
%% Cell type:code id: tags:
``` python
from sklearn import metrics
print(metrics.classification_report(y_test, y_model))
%% Cell type:code id: tags:
``` python
mat = confusion_matrix(y_test, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
%% Cell type:markdown id: tags:
##### GNB is significantly worse, with many more mis-identifications!
%% Cell type:code id: tags:
``` python
%% Cell type:markdown id: tags:
#### Exercise: ####
Also try the other models we discussed for classification, i.e. logistic regression and SVC
%% Cell type:code id: tags:
``` python
