"A number of Python packages provide implementations of machine learning algorithms. \n",
"**[Scikit-Learn](http://scikit-learn.org)** is one of the most popular.\n",
"* it provides many of the common ML algorithms\n",
"* well-designed, uniform API (programming interface)\n",
" * standardized and largely streamlined setup of the different models \n",
" → easy to switch\n",
"* good documentation\n",
"\n",
"The first example is based on the **[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)**. It was introduced by the famous statistician\n",
"Ronald Fisher in 1936 and has been used ever since as an instructive use case for classification. \n",
"The data consists of\n",
"* measurements of length and width of both sepal and petal \n",
"* classification of Iris sub-species\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the usual setup: \n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# seaborn provides an easy way to import the iris dataset as a pandas dataframe\n",
"import seaborn as sns\n",
"iris = sns.load_dataset('iris')\n",
"iris.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data visualization\n",
"The first step should always be some investigation of the data's properties, i.e.\n",
"model.fit(X_train, y_train) # 3. fit model to data\n",
"y_svm = model.predict(X_test) # 4. predict on new data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use the sklearn function for the accuracy score\n",
"from sklearn.metrics import accuracy_score\n",
"score = accuracy_score(y_test, y_svm)\n",
"print(\"Test set score: {:.3f}\".format(score))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"### Probabilistic Classification\n",
"\n",
"In general, models cannot only be used to give a single classification as in the above examples; one can also obtain a list of probabilities for the different possible outcomes.\n",
"The plot shows the pixel images together with their labels (in green).\n",
"\n",
"* Some images are obvious\n",
"* Others seem difficult "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# look at the data of the 1st image --> the 2D table resembles a 0\n",
"print(digits.images[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"digits.images[0].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Image data with sklearn:\n",
"To use the data with sklearn as before, we need a 2D structure of shape `samples x features`, i.e. each 8x8 image should be flattened into a 1x64 array. \n",
"\n",
"This is already provided in the dataset, in the element `data`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print (digits.data[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# to use as before just re-label\n",
"X = digits.data\n",
"y = digits.target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"### Digit classification\n",
"\n",
"A multi-class classification problem\n",
"* conceptually not much different from binary classification\n",
"##### GNB is significantly worse, with many more misclassifications!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Exercise: ####\n",
"Also try the other models we discussed for classification, i.e. logistic regression and SVC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "261px"
},
"toc_section_display": true,
"toc_window_display": false
},
"toc-showtags": false
},
"nbformat": 4,
"nbformat_minor": 4
}
%% Cell type:markdown id: tags:
# Scikit-Learn Introduction
%% Cell type:markdown id: tags:
***
### Probabilistic Classification
In general, models cannot only be used to give a single classification as in the above examples; one can also obtain a list of probabilities for the different possible outcomes:
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=500)  # 2. instantiate model
model.fit(X_train, y_train)               # 3. fit model to data
yout = model.predict_proba(X_test)        # probabilities instead of labels
print(yout[:5])
```
%% Cell type:code id: tags:
``` python
y_lr = model.predict(X_test)  # predicted labels
yout[y_lr != y_test]          # probabilities of the misclassified samples
```
%% Cell type:code id: tags:
``` python
list(zip(y_lr[:5], y_test[:5]))
```
%% Cell type:markdown id: tags:
Depending on the type of problem, this information can be used to distinguish clear cases from those with overlapping classifications.
Or one can use it to adjust the trade-off between precision and recall.
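As a sketch of the second point: for a binary problem one can move the probability threshold applied to the `predict_proba` output instead of using the default 0.5. The snippet below is an illustration on a synthetic dataset (the data and variable names are made up for this sketch, not the iris split from above):
%% Cell type:code id: tags:

```python
# Sketch: adjust the precision/recall trade-off via the probability threshold
# (synthetic data for illustration only)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

for thresh in (0.3, 0.5, 0.7):
    y_pred = (proba >= thresh).astype(int)  # classify as 1 only above thresh
    print(f"threshold {thresh}: precision {precision_score(y_test, y_pred):.2f},"
          f" recall {recall_score(y_test, y_pred):.2f}")
```

Raising the threshold makes the classifier more conservative about predicting class 1, which typically raises precision and lowers recall.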
%% Cell type:markdown id: tags:
***
## Classification for digit data
Another classic example for ML is handwritten-digit data.
A suitable dataset is included with sklearn; first we look into it:
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_digits

digits = load_digits()
digits.images.shape
```
%% Cell type:code id: tags:
``` python
type(digits)
```
%% Cell type:code id: tags:
``` python
digits?
```
%% Cell type:code id: tags:
``` python
print(digits.DESCR)
```
%% Cell type:markdown id: tags:
The data comes in an sklearn-specific container, basically a list of 8x8 pixel images
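These 8x8 images can be flattened into 64-component feature vectors with a simple reshape; as a sketch, the consistency check below confirms that sklearn already ships this flattened view in the `data` element of the dataset:
%% Cell type:code id: tags:

```python
# Sketch: flatten the 8x8 digit images into 64-component feature vectors
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (n_samples, 8, 8)

flat = digits.images.reshape(len(digits.images), -1)
print(flat.shape)           # (n_samples, 64)

# sklearn already provides this flattened view as digits.data
print(np.array_equal(flat, digits.data))
```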