%% Cell type:markdown id: tags:
# Scikit-Learn Introduction
A number of Python packages provide implementations of machine learning algorithms.
**[Scikit-Learn](http://scikit-learn.org)** is one of the most popular.
* it provides many of the common ML algorithms
* well-designed, uniform API (programming interface)
* standardized and largely streamlined setup of the different models
→ easy to switch (see the sketch below)
* good documentation
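A minimal sketch of this uniform API, with made-up toy data (`X_demo`, `y_demo`):
%% Cell type:code id: tags:
``` python
# sketch: the uniform sklearn estimator API -- the same pattern for every model
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_demo = np.array([[0.], [1.], [2.], [3.]])  # toy feature matrix (4 samples, 1 feature)
y_demo = np.array([0, 0, 1, 1])              # toy targets

model = KNeighborsClassifier(n_neighbors=1)  # 1. instantiate a model (any classifier works alike)
model.fit(X_demo, y_demo)                    # 2. fit the model to training data
model.predict([[1.4]])                       # 3. predict on new data -> array([0])
```
%% Cell type:markdown id: tags: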
The first example is based on the **[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)**. It was introduced by the famous statistician
Ronald Fisher in 1936 and has served ever since as an instructive use case for classification.
The data consists of
* measurements of length and width of both sepal and petal
* classification of Iris sub-species
%% Cell type:code id: tags:
``` python
# the usual setup:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
# seaborn provides an easy way to import the iris dataset as a pandas dataframe
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
```
%% Cell type:code id: tags:
``` python
iris
```
%% Cell type:markdown id: tags:
## Data visualization
The first step should always be some investigation of the data's properties, i.e.
* basic statistical properties
* visualization of distributions
%% Cell type:code id: tags:
``` python
# basic statistics with pandas
iris.describe()
```
%% Cell type:code id: tags:
``` python
# distribution of single feature
sns.histplot(data=iris,x='sepal_length',hue='species')
```
%% Cell type:code id: tags:
``` python
# combined plot of 2 features
sns.jointplot(data=iris,x='sepal_length',y='sepal_width', hue='species')
```
%% Cell type:code id: tags:
``` python
# combined plot matrix of all features in dataframe
#
# plots all pairwise combinations of the numerical columns in the dataframe;
# the target (= species) can be passed as hue and colors the points accordingly
sns.pairplot(iris, hue='species', diag_kind='hist', height=2.0)
```
%% Cell type:markdown id: tags:
## Data preparation
For **supervised learning** with sklearn, the first step is always to split the data into
* table/matrix of **features**
* list of **targets**
Then split the data into **train** and **test** samples:
* `train_test_split` function from sklearn
* by default 75% for training and 25% for test and validation
* can be specified as a parameter (see the sketch after this list)
* randomized selection of entries
→ initial order does not matter
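A sketch with a toy array, just to show the parameters (`test_size=0.25` is the default):
%% Cell type:code id: tags:
``` python
# sketch: explicit split fraction and reproducible shuffling (toy data)
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10)

# test_size sets the test fraction; random_state makes the shuffling reproducible
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
print(Xtr.shape, Xte.shape)  # (7, 2) (3, 2)
```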
%% Cell type:code id: tags:
``` python
# feature matrix
X=iris.loc[:,'sepal_length':'petal_width']
X.shape
```
%% Cell type:code id: tags:
``` python
# target
Y=iris.species
Y.shape
```
%% Cell type:code id: tags:
``` python
# break-up in train & test sample
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y)
```
%% Cell type:code id: tags:
``` python
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
```
%% Cell type:markdown id: tags:
## Fit kNN model, apply it, and make predictions
The k-nearest-neighbours (kNN) classifier assigns to a new point the majority class among its `k` closest training points.
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
```
%% Cell type:code id: tags:
``` python
knn.fit(X_train, y_train)
#knn.fit(X, Y)
```
%% Cell type:code id: tags:
``` python
# create a dummy iris
#X_new = np.array([[5, 4.9, 4, 1.2]])
# recent sklearn versions expect the same datatype (and column names) as in training
X_new = pd.DataFrame(np.array([[5, 4.9, 4, 1.2]]), columns=X.columns)
# 2D format required: n_rows x n_columns (here 1x4)
X_new.shape
```
%% Cell type:code id: tags:
``` python
knn.predict(X_new) # apply model to new data point
```
%% Cell type:markdown id: tags:
### Test/evaluate the model
%% Cell type:code id: tags:
``` python
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
```
%% Cell type:code id: tags:
``` python
# element-wise comparison of true and predicted labels
y_test == y_pred
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
***
Further useful checks are the **classification report** and the **confusion matrix**;
they give detailed information on misclassifications:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```
%% Cell type:markdown id: tags:
(The meaning of `recall` etc. will be explained in a bit.)
Another intuitive measure is the confusion matrix:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
labels = np.unique(y_test)
mat = confusion_matrix(y_test, y_pred, labels=labels)
print (labels, '\n', mat)
```
%% Cell type:markdown id: tags:
***
**Repeat with different settings for the number of neighbors.**
**Accuracy is usually high for the Iris data:**
as the scatter plots suggested, there is a rather clear separation between the species.
%% Cell type:markdown id: tags:
***
## Measure Quality of Classification
The above `classification_report` presented several parameters which are useful to quantify how well the classification works.
For these we need to introduce the following terms (assuming a classification with two classes *P* and *N*):
* $t_p = $ true-positive: number of cases with predicted *P* and correct *P*
* $t_n = $ true-negative: number of cases with predicted *N* and correct *N*
* $f_p = $ false-positive: number of cases with predicted *P* and correct *N*
* $f_n = $ false-negative: number of cases with predicted *N* and correct *P*
![Confusion matrix](./figures/wikipedia_confusion_matrix.png "More details: see Wikipedia article on confusion matrix")
Based on these, the parameters in the `classification_report` are defined as:
* `precision` (or `purity`): $ t_p / ( t_p + f_p ) $ , i.e. fraction of cases classified as *P* which are true *P*
* `recall` (or `efficiency`): $ t_p / ( t_p + f_n ) $ , i.e. fraction of true *P* which are classified as *P*
* `f1-score` : harmonic mean of `precision` and `recall`: $F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
See https://en.wikipedia.org/wiki/Precision_and_recall for a more detailed discussion
***
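%% Cell type:markdown id: tags:
As a sketch, these quantities can be computed directly from the counts (the numbers below are made up):
%% Cell type:code id: tags:
``` python
# sketch: precision, recall and f1 from raw counts (numbers are made up)
tp, fp, fn = 40, 5, 10

precision = tp / (tp + fp)  # fraction of predicted P that are true P
recall = tp / (tp + fn)     # fraction of true P that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```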
%% Cell type:markdown id: tags:
***
## Test further simple models
### Gaussian Naive Bayes
Also a conceptually simple model
* basic assumption is that for each different category (*Iris-species*) the variables follow a Gaussian distribution.
* In training the model determines parameters of these Gaussians
* For classification, simply calculate the probability that a given new Iris observation is of species `i`, based on the Gaussian probability density:
$$ P(x) = \frac{1}{\sigma_i \sqrt{2\pi}} \; e^{-\left( x - \mu_i \right)^2 / \left( 2\sigma_i^2 \right)} $$
* where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the respective variable for species `i`
(We'll look at why it's called "Bayes" in a bit more detail [here](http://localhost:8888/notebooks/Higgs-Gaussian.ipynb#GaussianNB).)
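To illustrate the formula, a sketch that evaluates the per-class Gaussian density for a single feature by hand (the value `x = 5.0` is made up):
%% Cell type:code id: tags:
``` python
# sketch: per-class Gaussian density for one feature, evaluated by hand
x = 5.0  # hypothetical sepal_length value
for species, grp in iris.groupby('species'):
    mu = grp['sepal_length'].mean()
    sig = grp['sepal_length'].std()
    p = np.exp(-(x - mu)**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))
    print(f"{species:12s} mu={mu:.2f} sigma={sig:.2f} P(x)={p:.3f}")
```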
%% Cell type:code id: tags:
``` python
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
y_gnb = model.predict(X_test) # 4. predict on new data
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_gnb)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_gnb))
```
%% Cell type:code id: tags:
``` python
mat = confusion_matrix(y_test, y_gnb, labels=labels)
print (labels,'\n', mat)
```
%% Cell type:markdown id: tags:
***
### Logistic Regression
This method is similar to standard linear regression, but it can be used for discrete dependent variables, i.e. for classification use cases.
It is a rather simple, linear model:
* logistic function: $f(x) = \frac{1}{1+\exp(-x)}$, $f(x): [-\infty,\infty] \to [0,1]$
* model: $y_i = f(x_i \cdot \beta) + \epsilon_i$
More info:
* https://en.wikipedia.org/wiki/Logistic_regression
* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
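A quick sketch of the logistic function itself:
%% Cell type:code id: tags:
``` python
# sketch: the logistic function maps any real number to (0, 1)
x = np.linspace(-6, 6, 200)
plt.plot(x, 1 / (1 + np.exp(-x)))
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('logistic function');
```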
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression # 1. choose model class
model = LogisticRegression(max_iter=500) # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
y_lr = model.predict(X_test) # 4. predict on new data
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_lr)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
### Support Vector Machine
SVM is another standard ML method, conceptually related to LR and kNN:
* look for a line/hyperplane separating the classes
* the decision is based on neighbouring elements:
* maximize the margin to the few closest elements (= support vectors)
* a 'linear SVM' works only for linear separation
* the 'kernel SVM' variant also handles non-linear cases
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC
```
%% Cell type:code id: tags:
``` python
model = SVC(kernel='linear',gamma = "auto", random_state = 42)
#model = SVC(kernel='rbf',gamma = "auto", random_state = 42)
model.fit(X_train, y_train) # 3. fit model to data
y_svm = model.predict(X_test) # 4. predict on new data
```
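%% Cell type:markdown id: tags:
The fitted model exposes the support vectors it selected; inspecting them is a quick sanity check:
%% Cell type:code id: tags:
``` python
# number of support vectors per class, and their feature values
print(model.n_support_)
print(model.support_vectors_.shape)
```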
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_svm)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
***
### Probabilistic Classification
In general, models cannot only give a single classification as in the examples above; one can also get a list of probabilities for the different possible outcomes:
%% Cell type:code id: tags:
``` python
model = LogisticRegression(max_iter=500) # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
yout = model.predict_proba(X_test)  # one probability per class and test sample
print(yout[:5])
```
%% Cell type:code id: tags:
``` python
# probabilities for the misclassified test samples
yout[y_lr != y_test]
```
%% Cell type:code id: tags:
``` python
# compare the first few predictions with the true labels
list(zip(y_lr[:5], y_test[:5]))
```
%% Cell type:markdown id: tags:
Depending on the type of problem, this information can be further used to distinguish clear cases from those with overlapping classifications,
or to adjust the trade-off between precision and recall.
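A sketch of the first option: treat a prediction as "clear" only when the largest class probability exceeds a threshold (the value 0.9 is arbitrary):
%% Cell type:code id: tags:
``` python
# sketch: flag only high-confidence predictions (threshold is arbitrary)
pmax = yout.max(axis=1)  # largest class probability per test sample
clear = pmax > 0.9
print(f"{clear.sum()} of {len(pmax)} test samples classified with p > 0.9")
```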
%% Cell type:markdown id: tags:
***
## Classification for digit data
Another classic example case for ML is handwritten-digit data.
A suitable dataset is included with sklearn; first we look into it:
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
```
%% Cell type:code id: tags:
``` python
type(digits)
```
%% Cell type:code id: tags:
``` python
digits?
```
%% Cell type:code id: tags:
``` python
print(digits.DESCR)
```
%% Cell type:markdown id: tags:
The data is an sklearn-specific container, basically a list of 8x8-pixel images.
We plot a subset:
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
```
%% Cell type:markdown id: tags:
The plot shows the pixel images together with their labels (in green).
* Some images are obvious
* Others seem difficult
%% Cell type:code id: tags:
``` python
# look at the raw data of the 1st image --> the 2D array resembles a 0
print(digits.images[0])
```
%% Cell type:code id: tags:
``` python
digits.images[0].shape
```
%% Cell type:markdown id: tags:
## Image data with sklearn
To use the data with sklearn as before we need a 2D structure, `samples x features`, i.e. each 8x8 image should be transformed into a flat 1x64 array.
This is already provided in the dataset, element `data`:
%% Cell type:code id: tags:
``` python
print (digits.data[0])
```
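%% Cell type:markdown id: tags:
A quick check that flattening the 8x8 image reproduces the corresponding `data` row:
%% Cell type:code id: tags:
``` python
# flattening the 8x8 image gives exactly the 1x64 data row
np.array_equal(digits.images[0].reshape(-1), digits.data[0])
```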
%% Cell type:code id: tags:
``` python
# to use as before just re-label
X = digits.data
y = digits.target
```
%% Cell type:markdown id: tags:
***
### Digit classification
A multi-class classification problem:
* conceptually no big difference to binary classification
* the models discussed are flexible enough to handle this (see the sketch after this list)
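As a sketch, a fitted sklearn classifier simply stores all classes it has seen:
%% Cell type:code id: tags:
``` python
# sketch: multi-class handling is transparent in sklearn
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier().fit(X, y)  # X, y = digit data from above
print(clf.classes_)                     # all 10 digit classes
print(clf.predict_proba(X[:1]).shape)   # one probability per class: (1, 10)
```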
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
%% Cell type:markdown id: tags:
#### First kNN:
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
***
**Detailed classification report**
%% Cell type:code id: tags:
``` python
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))
```
%% Cell type:markdown id: tags:
**Check the confusion matrix**
very informative for such a case:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
```
%% Cell type:markdown id: tags:
##### kNN performs really well!
***
#### Then Gaussian Naive Bayes:
%% Cell type:code id: tags:
``` python
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_model = model.predict(X_test)
```
%% Cell type:code id: tags:
``` python
score = accuracy_score(y_test, y_model)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:code id: tags:
``` python
from sklearn import metrics
print(metrics.classification_report(y_test, y_model))
```
%% Cell type:code id: tags:
``` python
mat = confusion_matrix(y_test, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
```
%% Cell type:markdown id: tags:
##### GNB is significantly worse, with many more misidentifications!
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
---
#### Exercise ####
Also try the other models we discussed for classification, i.e. logistic regression and SVC.
%% Cell type:code id: tags:
``` python
```