"<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
"<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#k-Nearest-Neighbors\" data-toc-modified-id=\"k-Nearest-Neighbors-1\"><span class=\"toc-item-num\">1 </span>k-Nearest Neighbors</a></span><ul class=\"toc-item\"><li><span><a href=\"#kNN-Model-summary\" data-toc-modified-id=\"kNN-Model-summary-1.1\"><span class=\"toc-item-num\">1.1 </span>kNN Model summary</a></span></li></ul></li><li><span><a href=\"#kNN-with-Scikit-learn\" data-toc-modified-id=\"kNN-with-Scikit-learn-2\"><span class=\"toc-item-num\">2 </span>kNN with Scikit-learn</a></span><ul class=\"toc-item\"><li><span><a href=\"#Starting-with-sklearn\" data-toc-modified-id=\"Starting-with-sklearn-2.1\"><span class=\"toc-item-num\">2.1 </span>Starting with sklearn</a></span><ul class=\"toc-item\"><li><span><a href=\"#Apply-model,-make-predictions\" data-toc-modified-id=\"Apply-model,-make-predictions-2.1.1\"><span class=\"toc-item-num\">2.1.1 </span>Apply model, make predictions</a></span></li><li><span><a href=\"#Test/evaluate-model\" data-toc-modified-id=\"Test/evaluate-model-2.1.2\"><span class=\"toc-item-num\">2.1.2 </span>Test/evaluate model</a></span></li></ul></li></ul></li></ul></div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## k-Nearest Neighbors\n",
"We use k-Nearest Neighbors (kNN) as detailed showcase for an ML model, i.e.\n",
"* we will not just show how to use it in one of the standard ML packages \n",
"* but discuss in some detail the implementation in Python functions\n",
"\n",
"kNN is conceptually simple:\n",
"* need a sample with known classifications\n",
"* for new data look at elements from known sample in **neighborhood**\n",
" * requires some metric to define **distance**\n",
"* classify according to **majority classification** of these neighbors\n",
" \n",
"\n",
"**Real world example -- elections** \n",
"Elections results, i.e. which party is most popular strongly varies between regions. So if you want to predict how a specific person votes then the place where a person lives and how the neighbors voted provides useful information.\n",
"\n",
"Examples from Bundestagswahl 2017:\n",
"* Wahlkreis Jachenau (Bad Tölz) ~62% CSU\n",
"* Wahlbezirk Nürnberg-4553 ~45% SPD\n",
"* though extreme cases, many \"Wahl-Bezirke\" rather balanced\n",
"\n",
"Of course other information might be more important to predict voting decision: \n",
"knn.predict(X_new) # apply model to new data point"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test/evaluate model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred = knn.predict(X_test)\n",
"print(\"Test set predictions:\\n {}\".format(y_pred))\n",
"print(list(y_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# test by hand\n",
"y_ok = y_pred == y_test\n",
"print(\"Test set score: {:.2f}\".format(np.mean(y_ok)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# or use equivalent scilearn function for it.\n",
"print(\"Test set score: {:.2f}\".format(knn.score(X_test, y_test)))"
]
},
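{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check, `sklearn.metrics` provides the same accuracy score plus more detailed views such as a confusion matrix. A minimal sketch, reusing the `y_test` and `y_pred` defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# same accuracy via sklearn.metrics, plus a confusion matrix\n",
"# (rows = true classes, columns = predicted classes)\n",
"from sklearn.metrics import accuracy_score, confusion_matrix\n",
"\n",
"print(\"Test set score: {:.2f}\".format(accuracy_score(y_test, y_pred)))\n",
"print(confusion_matrix(y_test, y_pred))"
]
},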
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}
%% Cell type:markdown id: tags:
## k-Nearest Neighbors
We use k-Nearest Neighbors (kNN) as a detailed showcase for an ML model, i.e.
* we will not just show how to use it in one of the standard ML packages
* but also discuss in some detail how to implement it in plain Python functions

kNN is conceptually simple:
* we need a sample with known classifications
* for a new data point, we look at elements of the known sample in its **neighborhood**
  * this requires some metric to define **distance**
* we classify according to the **majority classification** of these neighbors

**Real-world example -- elections**
Election results, i.e. which party is most popular, vary strongly between regions. So if you want to predict how a specific person votes, then the place where that person lives and how the neighbors voted provide useful information.
Examples from Bundestagswahl 2017:
* Wahlkreis Jachenau (Bad Tölz) ~62% CSU
* Wahlbezirk Nürnberg-4553 ~45% SPD
* these are extreme cases, though; many "Wahl-Bezirke" are rather balanced

Of course other information might be more important for predicting voting decisions:
*education, income, profession, hobbies, ...*

In the following we discuss an example kNN implementation adapted from the book *Data Science from Scratch*.

What's needed (a rough sketch follows the list):
* **toy data**:
  * artificial poll data of people's programming-language preference and geographic location (longitude vs latitude)
* **metric** for distance:
  * simply the geographical distance
* **list of neighbors** sorted by distance
* function to determine the *majority vote* of the `k` nearest neighbors
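
Before going through that implementation, here is a rough, self-contained sketch of these pieces, assuming Euclidean distance on (longitude, latitude) pairs and a simple majority vote; the function names and the tiny "poll" data below are only illustrative:

```python
# Minimal kNN sketch: distance metric, sorted neighbors, majority vote.
from collections import Counter
import math

def distance(p, q):
    """Euclidean distance between two (longitude, latitude) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def majority_vote(labels):
    """Return the most common label among the neighbors."""
    return Counter(labels).most_common(1)[0][0]

def knn_classify(k, labeled_points, new_point):
    """labeled_points is a list of ((longitude, latitude), label) pairs."""
    # sort the known sample by distance to the new point
    by_distance = sorted(labeled_points,
                         key=lambda point_label: distance(point_label[0], new_point))
    # take the labels of the k nearest neighbors and let them vote
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return majority_vote(k_nearest_labels)

# tiny made-up "poll": (longitude, latitude) -> preferred programming language
sample = [((11.6, 48.1), "Python"), ((11.5, 48.2), "Python"),
          ((13.4, 52.5), "R"), ((13.5, 52.4), "R")]
print(knn_classify(3, sample, (11.7, 48.0)))  # majority of the 3 nearest -> "Python"
```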