Commit f394ee1f authored by Guenter Duckeck
ref added

parent 96f8ea1d
%% Cell type:markdown id: tags:
<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#k-Nearest-Neighbors" data-toc-modified-id="k-Nearest-Neighbors-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>k-Nearest Neighbors</a></span><ul class="toc-item"><li><span><a href="#kNN-Model-summary" data-toc-modified-id="kNN-Model-summary-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>kNN Model summary</a></span></li></ul></li><li><span><a href="#kNN-with-Scikit-learn" data-toc-modified-id="kNN-with-Scikit-learn-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>kNN with Scikit-learn</a></span><ul class="toc-item"><li><span><a href="#Starting-with-sklearn" data-toc-modified-id="Starting-with-sklearn-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Starting with sklearn</a></span><ul class="toc-item"><li><span><a href="#Apply-model,-make-predictions" data-toc-modified-id="Apply-model,-make-predictions-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Apply model, make predictions</a></span></li><li><span><a href="#Test/evaluate-model" data-toc-modified-id="Test/evaluate-model-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Test/evaluate model</a></span></li></ul></li></ul></li></ul></div>
%% Cell type:markdown id: tags:
## k-Nearest Neighbors
We use k-Nearest Neighbors (kNN) as a detailed showcase for an ML model, i.e.
* we will not just show how to use it in one of the standard ML packages
* but also discuss the implementation in plain Python functions in some detail
kNN is conceptually simple:
* it needs a sample with known classifications
* for a new data point, look at the elements of the known sample in its **neighborhood**
* this requires some metric to define **distance**
* classify the new point according to the **majority classification** of these neighbors
**Real world example -- elections**
Election results, i.e. which party is most popular, vary strongly between regions. So if you want to predict how a specific person votes, the place where that person lives and how the neighbors voted provide useful information.
Examples from the 2017 German federal election (Bundestagswahl 2017):
* Wahlkreis Jachenau (Bad Tölz) ~62% CSU
* Wahlbezirk Nürnberg-4553 ~45% SPD
* these are extreme cases, though; many "Wahlbezirke" (polling districts) are rather balanced
Of course, other information might be more important for predicting a voting decision:
*education, income, profession, hobbies, ...*
In the following we discuss an example kNN implementation adapted from the book *Data Science from Scratch*.
What's needed:
* toy data:
  * artificial poll data of each person's programming language preference and geographic location (longitude vs latitude)
* *metric* for distance:
  * simply geographical distance
* *list of neighbors* sorted by distance
* function to determine the *majority vote* of the `k` nearest neighbors
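As a side note on the *metric*: the implementation further down compares raw longitude/latitude differences in degrees. A minimal sketch of an alternative, the great-circle (haversine) distance in kilometres, is shown here for comparison. The function names and the Earth radius constant are assumptions made for this illustration; they are not part of the notebook or the book.
%% Cell type:code id: tags:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def flat_distance_deg(lon1, lat1, lon2, lat2):
    """Euclidean distance in degrees, treating longitude and latitude as flat coordinates."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

# one degree of longitude covers fewer kilometres at 45 degrees north than at the equator,
# while the flat distance in degrees is the same in both cases
print(haversine_km(0, 0, 1, 0), haversine_km(0, 45, 1, 45))
print(flat_distance_deg(0, 0, 1, 0), flat_distance_deg(0, 45, 1, 45))
```
%% Cell type:markdown id: tags: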
***
The following example is taken from:
**[Data Science from Scratch](http://shop.oreilly.com/product/0636920033400.do)** by Joel Grus
%% Cell type:code id: tags:
``` python
# the usual setup:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:
``` python
# some artificial poll data of
# person's programming language preference
# and geographic location (longitude vs latitude)
cities = [(-86.75,33.5666666666667,'Python'),
(-88.25,30.6833333333333,'Python'),
(-112.016666666667,33.4333333333333,'Java'),
(-110.933333333333,32.1166666666667,'Java'),
(-92.2333333333333,34.7333333333333,'R'),
(-121.95,37.7,'R'),(-118.15,33.8166666666667,'Python'),
(-118.233333333333,34.05,'Java'),
(-122.316666666667,37.8166666666667,'R'),
(-117.6,34.05,'Python'),
(-116.533333333333,33.8166666666667,'Python'),
(-121.5,38.5166666666667,'R'),
(-117.166666666667,32.7333333333333,'R'),
(-122.383333333333,37.6166666666667,'R'),
(-121.933333333333,37.3666666666667,'R'),
(-122.016666666667,36.9833333333333,'Python'),
(-104.716666666667,38.8166666666667,'Python'),
(-104.866666666667,39.75,'Python'),
(-72.65,41.7333333333333,'R'),
(-75.6,39.6666666666667,'Python'),(-77.0333333333333,38.85,'Python'),
(-80.2666666666667,25.8,'Java'),(-81.3833333333333,28.55,'Java'),
(-82.5333333333333,27.9666666666667,'Java'),(-84.4333333333333,33.65,'Python'),
(-116.216666666667,43.5666666666667,'Python'),(-87.75,41.7833333333333,'Java'),
(-86.2833333333333,39.7333333333333,'Java'),(-93.65,41.5333333333333,'Java'),
(-97.4166666666667,37.65,'Java'),(-85.7333333333333,38.1833333333333,'Python'),
(-90.25,29.9833333333333,'Java'),(-70.3166666666667,43.65,'R'),
(-76.6666666666667,39.1833333333333,'R'),(-71.0333333333333,42.3666666666667,'R'),
(-72.5333333333333,42.2,'R'),(-83.0166666666667,42.4166666666667,'Python'),
(-84.6,42.7833333333333,'Python'),(-93.2166666666667,44.8833333333333,'Python'),
(-90.0833333333333,32.3166666666667,'Java'),(-94.5833333333333,39.1166666666667,'Java'),
(-90.3833333333333,38.75,'Python'),(-108.533333333333,45.8,'Python'),
(-95.9,41.3,'Python'),(-115.166666666667,36.0833333333333,'Java'),
(-71.4333333333333,42.9333333333333,'R'),(-74.1666666666667,40.7,'R'),
(-106.616666666667,35.05,'Python'),(-78.7333333333333,42.9333333333333,'R'),
(-73.9666666666667,40.7833333333333,'R'),(-80.9333333333333,35.2166666666667,'Python'),
(-78.7833333333333,35.8666666666667,'Python'),(-100.75,46.7666666666667,'Java'),
(-84.5166666666667,39.15,'Java'),(-81.85,41.4,'Java'),
(-82.8833333333333,40,'Java'),(-97.6,35.4,'Python'),
(-122.666666666667,45.5333333333333,'Python'),(-75.25,39.8833333333333,'Python'),
(-80.2166666666667,40.5,'Python'),(-71.4333333333333,41.7333333333333,'R'),
(-81.1166666666667,33.95,'R'),(-96.7333333333333,43.5666666666667,'Python'),
(-90,35.05,'R'),(-86.6833333333333,36.1166666666667,'R'),
(-97.7,30.3,'Python'),(-96.85,32.85,'Java'),
(-95.35,29.9666666666667,'Java'),(-98.4666666666667,29.5333333333333,'Java'),
(-111.966666666667,40.7666666666667,'Python'),(-73.15,44.4666666666667,'R'),
(-77.3333333333333,37.5,'Python'),(-122.3,47.5333333333333,'Python'),
(-89.3333333333333,43.1333333333333,'R'),(-104.816666666667,41.15,'Java')]
#cities = [([longitude, latitude], language) for longitude, latitude, language in cities]
```
%% Cell type:code id: tags:
``` python
# convert to Dataframe
cols=['long','lat','lang']
citdf=pd.DataFrame(cities,columns=cols)
```
%% Cell type:code id: tags:
``` python
citdf.describe()
```
%% Cell type:code id: tags:
``` python
langs=np.unique(citdf.lang) # get list of languages
print(langs)
```
%% Cell type:code id: tags:
``` python
citdf.lang
```
%% Cell type:code id: tags:
``` python
# plot distribution of data
# different color for each language
coldict= { "Java" : "r", "Python" : "b", "R" : "g" }
fig, ax = plt.subplots()
for l in langs:
    citdf[citdf.lang==l].plot.scatter('long','lat',ax=ax,
                                      c=coldict[l],label=l)
plt.legend()
plt.axis([-130,-60,15,55]) # set the axes
plt.title("Favorite Programming Languages");
```
%% Cell type:code id: tags:
``` python
from collections import Counter
a=['a','b','c','b','b']
Counter(a)
```
%% Cell type:code id: tags:
``` python
# helper functions for knn classification
from collections import Counter
def majority_vote(labels):
    """Return the most common label; assumes that labels are ordered from nearest to farthest."""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count
                       for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner                      # unique winner, so return it
    else:
        return majority_vote(labels[:-1])  # try again without the farthest

def knn_classify(k, citdf, new_point, exclude_first=True):
    """Classify new_point = (long, lat) by majority vote of its k nearest neighbors in citdf."""
    # order the labeled points from nearest to farthest
    x, y = new_point
    # squared Euclidean distance in degrees (gives the same ordering as the distance itself);
    # assign() works on a copy, so the caller's DataFrame is not modified
    dist = (citdf.long-x)**2 + (citdf.lat-y)**2
    citdfs = citdf.assign(dist=dist).sort_values(['dist'])  # sort by distance
    # find the labels for the k closest,
    # exclude index 0 --> identical point (only valid if new_point is part of citdf)
    if exclude_first:
        citdfs = citdfs[1:]
    k_nearest_labels = citdfs.lang[:k].tolist()
    # and let them vote
    return majority_vote(k_nearest_labels)
```
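%% Cell type:markdown id: tags:
To make the tie-breaking rule concrete, here is a small worked example. It is a sketch that uses a loop instead of recursion; the name `majority_vote_iterative` is hypothetical and introduced only for this illustration. With the labels `Python, Java, Java, Python` (nearest first), both languages have two votes, so the farthest label is dropped and the vote is repeated on the remaining three.
%% Cell type:code id: tags:

```python
from collections import Counter

def majority_vote_iterative(labels):
    """Labels must be ordered nearest to farthest; on a tie the farthest label is dropped."""
    labels = list(labels)
    while labels:
        counts = Counter(labels).most_common()
        top_count = counts[0][1]
        winners = [label for label, count in counts if count == top_count]
        if len(winners) == 1:
            return winners[0]   # unique winner
        labels.pop()            # tie: discard the farthest neighbor and recount
    raise ValueError("need at least one label")

# 2 x Python vs 2 x Java is a tie; without the farthest label, Java wins 2:1
print(majority_vote_iterative(["Python", "Java", "Java", "Python"]))
```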
%% Cell type:code id: tags:
``` python
# the new point is not part of the sample, so keep its nearest neighbor
knn_classify(1, citdf, (-100,35), exclude_first=False)
```
%% Cell type:code id: tags:
``` python
def test_knn():
    """Leave-one-out test: try several different values for k."""
    for k in [1, 3, 5, 7]:
        num_correct = 0
        # for location, actual_language in cities:
        for i, l in citdf.iterrows():
            location = (l['long'], l['lat'])
            predicted_language = knn_classify(k, citdf, location)
            if predicted_language == l['lang']:
                num_correct += 1
        print(k, "neighbor[s]:", num_correct,
              "correct out of", len(cities))
```
%% Cell type:code id: tags:
``` python
test_knn()
```
%% Cell type:code id: tags:
``` python
def classify_and_plot_grid(k=1):
    plots = { "Java" : ([], []), "Python" : ([], []), "R" : ([], []) }
    # markers = { "Java" : "o", "Python" : "s", "R" : "^" }
    markers = { "Java" : ".", "Python" : ".", "R" : "." }
    colors = { "Java" : "r", "Python" : "b", "R" : "g" }
    for longitude in range(-130, -60):
        for latitude in range(20, 55):
            pos = (longitude, latitude)
            predicted_language = knn_classify(k, citdf, pos, False)
            plots[predicted_language][0].append(longitude)
            plots[predicted_language][1].append(latitude)
    # create a scatter series for each language
    for language, (x, y) in plots.items():
        plt.scatter(x, y, color=colors[language], marker=markers[language],
                    label=language, alpha=0.7)
    # plot_state_borders(plt, color='black') # assume we have a function that does this
    plt.legend(loc=0) # let matplotlib choose the location
    plt.axis([-130,-60,20,55]) # set the axes
    plt.title(str(k) + "-Nearest Neighbor Programming Languages")
```
%% Cell type:code id: tags:
``` python
for i in (1,3,5):
    plt.figure()
    classify_and_plot_grid(i)
```
%% Cell type:markdown id: tags:
### kNN Model summary
* conceptually simple model
* though the distance metric is critical
* choosing it is not always as straightforward as in the case of geometrical data
* no real training needed
* also works for very localized or non-linear distributions
* no real model: every prediction goes back to the stored reference data
* evaluation can get slow for large reference data and/or high dimensions
***
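Why the distance metric is critical can be seen as soon as the features have different units. The following sketch uses made-up numbers (age in years, income in euros) and an assumed per-feature scale; none of these values come from the notebook's data. The nearest neighbor of the same query point changes once the features are brought to comparable scales.
%% Cell type:code id: tags:

```python
import numpy as np

def nearest_index(reference, query):
    """Index of the reference row with the smallest squared Euclidean distance to query."""
    d2 = ((reference - query) ** 2).sum(axis=1)
    return int(np.argmin(d2))

# hypothetical reference sample: columns are (age in years, income in euros)
reference = np.array([[25.0, 50000.0],   # person A
                      [60.0, 50500.0]])  # person B
query = np.array([27.0, 50400.0])

# assumed typical spread of each feature, used to rescale the columns
scale = np.array([10.0, 10000.0])

raw_nearest = nearest_index(reference, query)                     # income dominates
scaled_nearest = nearest_index(reference / scale, query / scale)  # age matters as well
print(raw_nearest, scaled_nearest)
```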
%% Cell type:markdown id: tags:
## kNN with Scikit-learn
In the following we will mainly use ML models and tools from the scikit-learn package.
As a quick example we take our language poll data and apply the kNN model from scikit-learn to it.
%% Cell type:code id: tags:
``` python
# extract only long and lat
X = citdf.loc[:,'long':'lat']
Y = citdf.lang
X.shape
```
%% Cell type:markdown id: tags:
### Starting with sklearn
The usual procedure:
- split the dataset into training and validation sets
- select and initialize an sklearn model
- do the training
- test/validate
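The first step, splitting the data, can be sketched by hand with a random permutation of the row indices. This is only an illustration of the idea, not scikit-learn's implementation; the function `simple_train_test_split`, its parameters and the demo arrays are assumptions made for this example.
%% Cell type:code id: tags:

```python
import numpy as np

def simple_train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices and cut them into a test part and a training part."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))                  # random order of all rows
    n_test = int(np.ceil(test_fraction * len(X)))    # this sketch rounds the test size up
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# hypothetical demo data: 10 rows with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array(list("aabbccddee"))
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```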
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
```
%% Cell type:code id: tags:
``` python
train_test_split?
```
%% Cell type:code id: tags:
``` python
# by default 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split( X, Y )
print(X_train.shape, X_test.shape)
```
%% Cell type:code id: tags:
``` python
# select kNN model
from sklearn.neighbors import KNeighborsClassifier
# initialize model: parameter n_neighbors
knn = KNeighborsClassifier(n_neighbors=3)
# do the training
knn.fit(X_train, y_train)
```
%% Cell type:markdown id: tags:
#### Apply model, make predictions
%% Cell type:code id: tags:
``` python
# create some coordinates
X_new = np.array([[-100,35]])
# 2D format required: nrows x ncolumns (1x2)
X_new.shape
```
%% Cell type:code id: tags:
``` python
knn.predict(X_new) # apply model to new data point
```
%% Cell type:markdown id: tags:
#### Test/evaluate model
%% Cell type:code id: tags:
``` python
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print(list(y_test))
```
%% Cell type:code id: tags:
``` python
# test by hand
y_ok = y_pred == y_test
print("Test set score: {:.2f}".format(np.mean(y_ok)))
```
%% Cell type:code id: tags:
``` python
# or use the equivalent scikit-learn function for it
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
```
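%% Cell type:markdown id: tags:
A single random split of such a small sample gives a rather noisy score. One common way to get a more stable estimate is cross-validation, sketched here with `cross_val_score` on artificial, well-separated clusters. The cluster centers, labels and sizes are made up for this illustration and are not the poll data from above.
%% Cell type:code id: tags:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# hypothetical toy data: three clusters of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[-120.0, 37.0], [-95.0, 32.0], [-75.0, 41.0]])
labels = np.array(["R", "Java", "Python"])
X_demo = np.vstack([c + rng.normal(scale=1.0, size=(20, 2)) for c in centers])
y_demo = np.repeat(labels, 20)

# 5-fold cross-validation for several values of k
mean_scores = {}
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()
    print(k, "neighbor[s]: mean accuracy {:.2f}".format(mean_scores[k]))
```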