"<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
"<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#k-Nearest-Neighbors\" data-toc-modified-id=\"k-Nearest-Neighbors-1\"><span class=\"toc-item-num\">1 </span>k-Nearest Neighbors</a></span><ul class=\"toc-item\"><li><span><a href=\"#kNN-Model-summary\" data-toc-modified-id=\"kNN-Model-summary-1.1\"><span class=\"toc-item-num\">1.1 </span>kNN Model summary</a></span></li></ul></li><li><span><a href=\"#kNN-with-Scikit-learn\" data-toc-modified-id=\"kNN-with-Scikit-learn-2\"><span class=\"toc-item-num\">2 </span>kNN with Scikit-learn</a></span><ul class=\"toc-item\"><li><span><a href=\"#Starting-with-sklearn\" data-toc-modified-id=\"Starting-with-sklearn-2.1\"><span class=\"toc-item-num\">2.1 </span>Starting with sklearn</a></span><ul class=\"toc-item\"><li><span><a href=\"#Apply-model,-make-predictions\" data-toc-modified-id=\"Apply-model,-make-predictions-2.1.1\"><span class=\"toc-item-num\">2.1.1 </span>Apply model, make predictions</a></span></li><li><span><a href=\"#Test/evaluate-model\" data-toc-modified-id=\"Test/evaluate-model-2.1.2\"><span class=\"toc-item-num\">2.1.2 </span>Test/evaluate model</a></span></li></ul></li></ul></li></ul></div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## k-Nearest Neighbors\n",
"We use k-Nearest Neighbors (kNN) as detailed showcase for an ML model, i.e.\n",
"* we will not just show how to use it in one of the standard ML packages \n",
"* but discuss in some detail the implementation in Python functions\n",
"\n",
"kNN is conceptually simple:\n",
"* need a sample with known classifications\n",
"* for new data look at elements from known sample in **neighborhood**\n",
" * requires some metric to define **distance**\n",
"* classify according to **majority classification** of these neighbors\n",
" \n",
"\n",
"**Real world example -- elections** \n",
"Elections results, i.e. which party is most popular strongly varies between regions. So if you want to predict how a specific person votes then the place where a person lives and how the neighbors voted provides useful information.\n",
"\n",
"Examples from Bundestagswahl 2017:\n",
"* Wahlkreis Jachenau (Bad Tölz) ~62% CSU\n",
"* Wahlbezirk Nürnberg-4553 ~45% SPD\n",
"* though extreme cases, many \"Wahl-Bezirke\" rather balanced\n",
"\n",
"Of course other information might be more important to predict voting decision: \n",
"knn.predict(X_new) # apply model to new data point"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Test/evaluate model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred = knn.predict(X_test)\n",
"print(\"Test set predictions:\\n {}\".format(y_pred))\n",
"print(list(y_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# test by hand\n",
"y_ok = y_pred == y_test\n",
"print(\"Test set score: {:.2f}\".format(np.mean(y_ok)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# or use equivalent scilearn function for it.\n",
"print(\"Test set score: {:.2f}\".format(knn.score(X_test, y_test)))"
]
},
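{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check, `sklearn.metrics` provides the same accuracy score plus more detailed views such as a confusion matrix. A minimal sketch, reusing the `y_test` and `y_pred` defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# same accuracy via sklearn.metrics, plus a confusion matrix\n",
"# (rows = true classes, columns = predicted classes)\n",
"from sklearn.metrics import accuracy_score, confusion_matrix\n",
"\n",
"print(\"Test set score: {:.2f}\".format(accuracy_score(y_test, y_pred)))\n",
"print(confusion_matrix(y_test, y_pred))"
]
},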
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}
%% Cell type:markdown id: tags:
## k-Nearest Neighbors
We use k-Nearest Neighbors (kNN) as a detailed showcase for an ML model, i.e.
* we will not just show how to use it in one of the standard ML packages
* but also discuss in some detail how to implement it in plain Python functions

kNN is conceptually simple:
* we need a sample with known classifications
* for a new data point, we look at elements of the known sample in its **neighborhood**
  * this requires some metric to define **distance**
* we classify according to the **majority classification** of these neighbors

**Real-world example -- elections**
Election results, i.e. which party is most popular, vary strongly between regions. So if you want to predict how a specific person votes, then the place where that person lives and how the neighbors voted provide useful information.
Examples from Bundestagswahl 2017:
* Wahlkreis Jachenau (Bad Tölz) ~62% CSU
* Wahlbezirk Nürnberg-4553 ~45% SPD
* these are extreme cases, though; many "Wahl-Bezirke" are rather balanced

Of course other information might be more important for predicting voting decisions:
*education, income, profession, hobbies, ...*

In the following we discuss an example kNN implementation adapted from the book *Data Science from Scratch*.

What's needed (a rough sketch follows the list):
* **toy data**:
  * artificial poll data of people's programming-language preference and geographic location (longitude vs latitude)
* **metric** for distance:
  * simply the geographical distance
* **list of neighbors** sorted by distance
* function to determine the *majority vote* of the `k` nearest neighbors
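
Before going through that implementation, here is a rough, self-contained sketch of these pieces, assuming Euclidean distance on (longitude, latitude) pairs and a simple majority vote; the function names and the tiny "poll" data below are only illustrative:

```python
# Minimal kNN sketch: distance metric, sorted neighbors, majority vote.
from collections import Counter
import math

def distance(p, q):
    """Euclidean distance between two (longitude, latitude) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def majority_vote(labels):
    """Return the most common label among the neighbors."""
    return Counter(labels).most_common(1)[0][0]

def knn_classify(k, labeled_points, new_point):
    """labeled_points is a list of ((longitude, latitude), label) pairs."""
    # sort the known sample by distance to the new point
    by_distance = sorted(labeled_points,
                         key=lambda point_label: distance(point_label[0], new_point))
    # take the labels of the k nearest neighbors and let them vote
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return majority_vote(k_nearest_labels)

# tiny made-up "poll": (longitude, latitude) -> preferred programming language
sample = [((11.6, 48.1), "Python"), ((11.5, 48.2), "Python"),
          ((13.4, 52.5), "R"), ((13.5, 52.4), "R")]
print(knn_classify(3, sample, (11.7, 48.0)))  # majority of the 3 nearest -> "Python"
```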