%% Cell type:markdown id: tags:
# Scikit-Learn Introduction
A number of Python packages provide implementations of machine learning algorithms.
**[Scikit-Learn](http://scikit-learn.org)** is one of the most popular.
* it provides many of the common ML algorithms
* well-designed, uniform API (programming interface)
* standardized and largely streamlined setup of the different models
→ easy to switch (see the sketch below)
* good documentation
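A minimal sketch of this uniform API, with made-up toy data (`X_demo`, `y_demo`):
%% Cell type:code id: tags:
``` python
# sketch: the uniform sklearn estimator API -- the same pattern for every model
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_demo = np.array([[0.], [1.], [2.], [3.]])  # toy feature matrix (4 samples, 1 feature)
y_demo = np.array([0, 0, 1, 1])              # toy targets

model = KNeighborsClassifier(n_neighbors=1)  # 1. instantiate a model (any classifier works alike)
model.fit(X_demo, y_demo)                    # 2. fit the model to training data
model.predict([[1.4]])                       # 3. predict on new data -> array([0])
```
%% Cell type:markdown id: tags: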
The first example is based on the **[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)**. It was introduced by the famous statistician
Ronald Fisher in 1936 and has served ever since as an instructive use case for classification.
The data consists of
* measurements of length and width of both sepal and petal
* classification of Iris sub-species
%% Cell type:code id: tags:
``` python
# the usual setup:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
# seaborn provides an easy way to import the iris dataset as a pandas dataframe
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
```
%% Cell type:code id: tags:
``` python
iris
```
%% Cell type:markdown id: tags:
## Data visualization
The first step should always be some investigation of the data's properties, i.e.
* basic statistical properties
* visualization of distributions
%% Cell type:code id: tags:
``` python
# basic statistics with pandas
iris.describe()
```
%% Cell type:code id: tags:
``` python
# distribution of single feature
sns.histplot(data=iris,x='sepal_length',hue='species')
```
%% Cell type:code id: tags:
``` python
# combined plot of 2 features
sns.jointplot(data=iris,x='sepal_length',y='sepal_width', hue='species')
```
%% Cell type:code id: tags:
``` python
# combined plot matrix of all features in dataframe
#
# plots all pairwise combinations of the numerical columns in the dataframe;
# the target (= species) can be passed as hue and colors the points accordingly
sns.pairplot(iris, hue='species', diag_kind='hist', height=2.0)
```
%% Cell type:markdown id: tags:
## Data preparation
For **supervised learning** with sklearn, the first step is always to split the data into
* table/matrix of **features**
* list of **targets**
Then split the data into **train** and **test** samples:
* `train_test_split` function from sklearn
* by default 75% for training and 25% for test and validation
* can be specified as a parameter (see the sketch after this list)
* randomized selection of entries
→ initial order does not matter
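A sketch with a toy array, just to show the parameters (`test_size=0.25` is the default):
%% Cell type:code id: tags:
``` python
# sketch: explicit split fraction and reproducible shuffling (toy data)
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10)

# test_size sets the test fraction; random_state makes the shuffling reproducible
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
print(Xtr.shape, Xte.shape)  # (7, 2) (3, 2)
```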
%% Cell type:code id: tags:
``` python
# feature matrix
X=iris.loc[:,'sepal_length':'petal_width']
X.shape
```
%% Cell type:code id: tags:
``` python
# target
Y=iris.species
Y.shape
```
%% Cell type:code id: tags:
``` python
# break-up in train & test sample
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y)
```
%% Cell type:code id: tags:
``` python
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
```
%% Cell type:markdown id: tags:
## Fit kNN model, apply it, and make predictions
The k-nearest-neighbours (kNN) classifier assigns to a new point the majority class among its `k` closest training points.
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
```
%% Cell type:code id: tags:
``` python
knn.fit(X_train, y_train)
#knn.fit(X, Y)
```
%% Cell type:code id: tags:
``` python
# create a dummy iris
#X_new = np.array([[5, 4.9, 4, 1.2]])
# recent sklearn versions expect the same datatype (and column names) as in training
X_new = pd.DataFrame(np.array([[5, 4.9, 4, 1.2]]), columns=X.columns)
# 2D format required: n_rows x n_columns (here 1x4)
X_new.shape
```
%% Cell type:code id: tags:
``` python
knn.predict(X_new) # apply model to new data point
```
%% Cell type:markdown id: tags:
### Test/evaluate the model
%% Cell type:code id: tags:
``` python
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
```
%% Cell type:code id: tags:
``` python
# element-wise comparison of true and predicted labels
y_test == y_pred
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
***
Further useful checks are the **classification report** and the **confusion matrix**;
they give detailed information on misclassifications:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```
%% Cell type:markdown id: tags:
(The meaning of `recall` etc. will be explained in a bit.)
Another intuitive measure is the confusion matrix:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
labels = np.unique(y_test)
mat = confusion_matrix(y_test, y_pred, labels=labels)
print (labels, '\n', mat)
```
%% Cell type:markdown id: tags:
***
**Repeat with different settings for the number of neighbors.**
**Accuracy is usually high for the Iris data:**
as the scatter plots suggested, there is a rather clear separation between the species.
%% Cell type:markdown id: tags:
***
## Measure Quality of Classification
The above `classification_report` presented several parameters which are useful to quantify how well the classification works.
For these we need to introduce the following terms (assuming a classification with two classes *P* and *N*):
* $t_p = $ true-positive: number of cases with predicted *P* and correct *P*
* $t_n = $ true-negative: number of cases with predicted *N* and correct *N*
* $f_p = $ false-positive: number of cases with predicted *P* and correct *N*
* $f_n = $ false-negative: number of cases with predicted *N* and correct *P*
![Confusion matrix](./figures/wikipedia_confusion_matrix.png "More details: see Wikipedia article on confusion matrix")
Based on these, the parameters in the `classification_report` are defined as:
* `precision` (or `purity`): $ t_p / ( t_p + f_p ) $ , i.e. fraction of cases classified as *P* which are true *P*
* `recall` (or `efficiency`): $ t_p / ( t_p + f_n ) $ , i.e. fraction of true *P* which are classified as *P*
* `f1-score` : harmonic mean of `precision` and `recall`: $F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
See https://en.wikipedia.org/wiki/Precision_and_recall for a more detailed discussion
***
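%% Cell type:markdown id: tags:
As a sketch, these quantities can be computed directly from the counts (the numbers below are made up):
%% Cell type:code id: tags:
``` python
# sketch: precision, recall and f1 from raw counts (numbers are made up)
tp, fp, fn = 40, 5, 10

precision = tp / (tp + fp)  # fraction of predicted P that are true P
recall = tp / (tp + fn)     # fraction of true P that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```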
%% Cell type:markdown id: tags:
***
## Test further simple models
### Gaussian Naive Bayes
Also a conceptually simple model
* basic assumption is that for each different category (*Iris-species*) the variables follow a Gaussian distribution.
* In training the model determines parameters of these Gaussians
* For classification, simply calculate the probability that a given new Iris observation is of species `i`, based on the Gaussian probability density:
$$ P(x) = \frac{1}{\sigma_i \sqrt{2\pi}} \; e^{-\left( x - \mu_i \right)^2 / \left( 2\sigma_i^2 \right)} $$
* where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the respective variable for species `i`
(We'll look at why it's called "Bayes" in a bit more detail [here](http://localhost:8888/notebooks/Higgs-Gaussian.ipynb#GaussianNB).)
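To illustrate the formula, a sketch that evaluates the per-class Gaussian density for a single feature by hand (the value `x = 5.0` is made up):
%% Cell type:code id: tags:
``` python
# sketch: per-class Gaussian density for one feature, evaluated by hand
x = 5.0  # hypothetical sepal_length value
for species, grp in iris.groupby('species'):
    mu = grp['sepal_length'].mean()
    sig = grp['sepal_length'].std()
    p = np.exp(-(x - mu)**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))
    print(f"{species:12s} mu={mu:.2f} sigma={sig:.2f} P(x)={p:.3f}")
```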
%% Cell type:code id: tags:
``` python
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
y_gnb = model.predict(X_test) # 4. predict on new data
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_gnb)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_gnb))
```
%% Cell type:code id: tags:
``` python
mat = confusion_matrix(y_test, y_gnb, labels=labels)
print (labels,'\n', mat)
```
%% Cell type:markdown id: tags:
***
### Logistic Regression
This method is similar to standard linear regression, but it can be used for discrete dependent variables, i.e. for classification use cases.
It is a rather simple, linear model:
* logistic function: $f(x) = \frac{1}{1+\exp(-x)}$, $f(x): [-\infty,\infty] \to [0,1]$
* model: $y_i = f(x_i \cdot \beta) + \epsilon_i$
More info:
* https://en.wikipedia.org/wiki/Logistic_regression
* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
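A quick sketch of the logistic function itself:
%% Cell type:code id: tags:
``` python
# sketch: the logistic function maps any real number to (0, 1)
x = np.linspace(-6, 6, 200)
plt.plot(x, 1 / (1 + np.exp(-x)))
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('logistic function');
```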
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression # 1. choose model class
model = LogisticRegression(max_iter=500) # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
y_lr = model.predict(X_test) # 4. predict on new data
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_lr)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
### Support Vector Machine
SVM is another standard ML method, conceptually related to LR and kNN:
* look for a line/hyperplane separating the classes
* the decision is based on neighbouring elements:
* maximize the margin to the few closest elements (= support vectors)
* a 'linear SVM' works only for linear separation
* the 'kernel SVM' variant also handles non-linear cases
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC
```
%% Cell type:code id: tags:
``` python
model = SVC(kernel='linear',gamma = "auto", random_state = 42)
#model = SVC(kernel='rbf',gamma = "auto", random_state = 42)
model.fit(X_train, y_train) # 3. fit model to data
y_svm = model.predict(X_test) # 4. predict on new data
```
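%% Cell type:markdown id: tags:
The fitted model exposes the support vectors it selected; inspecting them is a quick sanity check:
%% Cell type:code id: tags:
``` python
# number of support vectors per class, and their feature values
print(model.n_support_)
print(model.support_vectors_.shape)
```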
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_svm)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
***
### Probabilistic Classification
In general, models cannot only give a single classification as in the examples above; one can also get a list of probabilities for the different possible outcomes:
%% Cell type:code id: tags:
``` python
model = LogisticRegression(max_iter=500) # 2. instantiate model
model.fit(X_train, y_train) # 3. fit model to data
yout = model.predict_proba(X_test)  # one probability per class and test sample
print(yout[:5])
```
%% Cell type:code id: tags:
``` python
# probabilities for the misclassified test samples
yout[y_lr != y_test]
```
%% Cell type:code id: tags:
``` python
# compare the first few predictions with the true labels
list(zip(y_lr[:5], y_test[:5]))
```
%% Cell type:markdown id: tags:
Depending on the type of problem, this information can be further used to distinguish clear cases from those with overlapping classifications,
or to adjust the trade-off between precision and recall.
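A sketch of the first option: treat a prediction as "clear" only when the largest class probability exceeds a threshold (the value 0.9 is arbitrary):
%% Cell type:code id: tags:
``` python
# sketch: flag only high-confidence predictions (threshold is arbitrary)
pmax = yout.max(axis=1)  # largest class probability per test sample
clear = pmax > 0.9
print(f"{clear.sum()} of {len(pmax)} test samples classified with p > 0.9")
```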
%% Cell type:markdown id: tags:
***
## Classification for digit data
Another classic example case for ML is handwritten-digit data.
A suitable dataset is included with sklearn; first we look into it:
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
```
%% Cell type:code id: tags:
``` python
type(digits)
```
%% Cell type:code id: tags:
``` python
digits?
```
%% Cell type:code id: tags:
``` python
print(digits.DESCR)
```
%% Cell type:markdown id: tags:
The data is an sklearn-specific container, basically a list of 8x8-pixel images.
We plot a subset:
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
```
%% Cell type:markdown id: tags:
The plot shows the pixel images together with their labels (in green).
* Some images are obvious
* Others seem difficult
%% Cell type:code id: tags:
``` python
# look at the raw data of the 1st image --> the 2D array resembles a 0
print(digits.images[0])
```
%% Cell type:code id: tags:
``` python
digits.images[0].shape
```
%% Cell type:markdown id: tags:
## Image data with sklearn
To use the data with sklearn as before we need a 2D structure, `samples x features`, i.e. each 8x8 image should be transformed into a flat 1x64 array.
This is already provided in the dataset, element `data`:
%% Cell type:code id: tags:
``` python
print (digits.data[0])
```
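%% Cell type:markdown id: tags:
A quick check that flattening the 8x8 image reproduces the corresponding `data` row:
%% Cell type:code id: tags:
``` python
# flattening the 8x8 image gives exactly the 1x64 data row
np.array_equal(digits.images[0].reshape(-1), digits.data[0])
```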
%% Cell type:code id: tags:
``` python
# to use as before just re-label
X = digits.data
y = digits.target
```
%% Cell type:markdown id: tags:
***
### Digit classification
A multi-class classification problem:
* conceptually no big difference to binary classification
* the models discussed are flexible enough to handle this (see the sketch after this list)
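As a sketch, a fitted sklearn classifier simply stores all classes it has seen:
%% Cell type:code id: tags:
``` python
# sketch: multi-class handling is transparent in sklearn
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier().fit(X, y)  # X, y = digit data from above
print(clf.classes_)                     # all 10 digit classes
print(clf.predict_proba(X[:1]).shape)   # one probability per class: (1, 10)
```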
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```
%% Cell type:markdown id: tags:
#### First kNN:
%% Cell type:code id: tags:
``` python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
```
%% Cell type:code id: tags:
``` python
# use the scikit-learn function for the score
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:markdown id: tags:
***
**Detailed classification report**
%% Cell type:code id: tags:
``` python
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))
```
%% Cell type:markdown id: tags:
**Check the confusion matrix**
very informative for such a case:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
```
%% Cell type:markdown id: tags:
##### kNN performs really well!
***
#### Then Gaussian Naive Bayes:
%% Cell type:code id: tags:
``` python
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_model = model.predict(X_test)
```
%% Cell type:code id: tags:
``` python
score = accuracy_score(y_test, y_model)
print("Test set score: {:.3f}".format(score))
```
%% Cell type:code id: tags:
``` python
from sklearn import metrics
print(metrics.classification_report(y_test, y_model))
```
%% Cell type:code id: tags:
``` python
mat = confusion_matrix(y_test, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');
```
%% Cell type:markdown id: tags:
##### GNB is significantly worse, with many more misidentifications!
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
---
#### Exercise ####
Also try the other models we discussed for classification, i.e. logistic regression and SVC.
%% Cell type:code id: tags:
``` python
```