100% Accurate Mushroom Classification in Python

Apr 14, 2021

Mushrooms come in a wide variety of shapes, sizes and colors. Some are edible, while others should be kept far away from the dinner table. For people living in rural areas, telling the two apart is crucial to survival, or at least it might have been centuries ago. A skilled horticulturist might not even need a second glance to classify a mushroom. Let’s see if we can build a model that classifies mushrooms automatically from a number of variables.

Our data comes from a 1987 dataset, available on Kaggle, with 8,124 records.

import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore")

dataset = pd.read_csv('mushrooms.csv', na_values='?')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                5644 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring    8124 non-null   object
 15  stalk-color-below-ring    8124 non-null   object
 16  veil-type                 8124 non-null   object
 17  veil-color                8124 non-null   object
 18  ring-number               8124 non-null   object
 19  ring-type                 8124 non-null   object
 20  spore-print-color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB

There are 23 variables, including the class variable. The others describe the colour, shape and surface of the cap, gill, stalk and veil, as well as odor, bruising, population and habitat. Every column except ‘stalk-root’ (5,644 non-null values out of 8,124) is free of missing values.

# Missing values in each column
print("Missing values for each column:\n", dataset.isnull().sum())

Missing values for each column:
class                          0
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64

Just over 30% of the values in the ‘stalk-root’ column are missing (2,480 of 8,124). I decided to fill these in with the mode of the column, which is ‘b’.

# stalk-root value_counts
dataset['stalk-root'].value_counts()

b    3776
e    1120
c     556
r     192
Name: stalk-root, dtype: int64

# Fill 2480 NA in stalk-root with mode
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
dataset['stalk-root'] = dataset['stalk-root'].fillna(dataset['stalk-root'].mode()[0])
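
As a small illustration of this step, the sketch below computes the missing share and applies the same mode fill to a toy column. The toy values are made up for the example. It also shows one alternative that this article does not use: keeping the gap as its own "missing" category, which may be worth comparing when such a large share of a column is empty.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'stalk-root' column (values are illustrative only)
toy = pd.DataFrame(
    {"stalk-root": ["b", "b", "e", np.nan, "c", np.nan, "b", "r", "e", np.nan]}
)

# Share of missing values in the column
missing_share = toy["stalk-root"].isnull().mean()

# Option used in the article: fill gaps with the most frequent value
mode_value = toy["stalk-root"].mode()[0]
filled_mode = toy["stalk-root"].fillna(mode_value)

# Alternative (an assumption, not the article's choice): keep "missing" as a category
filled_flag = toy["stalk-root"].fillna("missing")

print(f"missing share: {missing_share:.0%}")
print(filled_mode.value_counts())
print(filled_flag.value_counts())
```

Either version leaves no nulls behind; they differ only in whether the model can see that the value was originally absent.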

With no further data cleaning to do, we can now visualise our variables. Since all of them are categorical, a simple for loop can create the countplots for us: once for each variable on its own, and once more with the class/target variable as the hue.

# Data Visualisation
# countplot of every variable
for col in dataset.columns:
    plt.figure()  # start a fresh figure for each plot
    sns_plot = sns.countplot(x=col, data=dataset)
    sns_plot.figure.savefig("{} countplot.png".format(col))

# countplot of every variable with hue = class/target
# (new figures again, so these are not drawn on top of the plots above)
for col in dataset.columns:
    plt.figure()
    sns_plot = sns.countplot(x=col, hue='class', data=dataset)
    sns_plot.figure.savefig("{} class countplot.png".format(col))

Reassuringly, the target variable is roughly balanced between poisonous and edible mushrooms.
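
The balance can also be checked numerically rather than by eye. The sketch below rebuilds a label column from the test-set class counts that appear in the classification report further down (852 edible, 773 poisonous); on the real data the equivalent call would be `dataset['class'].value_counts(normalize=True)`. The `balance_ratio` name is just an illustrative helper.

```python
import pandas as pd

# Labels rebuilt from the test-set support in the article's classification report
labels = pd.Series(["e"] * 852 + ["p"] * 773, name="class")

# Relative frequency of each class
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))

# Ratio of the smaller class to the larger one; 1.0 means perfectly balanced
balance_ratio = class_share.min() / class_share.max()
print(f"minority/majority ratio: {balance_ratio:.2f}")
```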

Some insights from the plots:

edible mushrooms are more likely to show bruising
edible mushrooms most often have no odor, whilst poisonous mushrooms most often have a foul odor
the stalk surface of edible mushrooms is usually smooth, whilst that of poisonous mushrooms is usually silky
the gill size of edible mushrooms is usually broad, whilst poisonous mushrooms show a mix of broad and narrow
the spore print color of edible mushrooms is usually black or brown, whilst that of poisonous mushrooms is usually chocolate or white
at first glance, the least useful variables for classifying mushrooms are: cap shape, cap surface, cap colour, gill attachment, gill spacing, stalk shape, ring number, veil type and veil color
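
Observations like these can be backed with numbers using a row-normalised crosstab. The sketch below uses a hypothetical eight-row sample, not the real data; on the full dataset the same call would be `pd.crosstab(dataset['odor'], dataset['class'], normalize='index')`.

```python
import pandas as pd

# Hypothetical mini-sample: 'n' = no odor, 'f' = foul; 'e' = edible, 'p' = poisonous
toy = pd.DataFrame({
    "odor":  ["n", "n", "n", "f", "f", "n", "f", "n"],
    "class": ["e", "e", "e", "p", "p", "p", "p", "e"],
})

# Row-normalised crosstab: for each odor value, the share of each class
odor_vs_class = pd.crosstab(toy["odor"], toy["class"], normalize="index")
print(odor_vs_class)
```

A row close to 0 or 1 marks a category that almost settles the class on its own, whereas rows near 0.5 correspond to the uninformative variables listed above.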

We can now move on to pre-processing the data to prepare it for modelling. First we create dummy variables for every variable except ‘class’ using get_dummies, then we map the class variable to a binary 1 for poisonous and 0 for edible:

# Preprocessing
# get_dummies for all except target variable
dummies_columns = dataset.columns[1:].tolist()
dataset = pd.get_dummies(dataset, columns=dummies_columns)

# Map target variable class: 1 = poisonous, 0 = edible
dataset['class'] = dataset['class'].map({"p": 1, "e": 0})
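
To make the encoding concrete, here is a minimal sketch on a three-row toy frame (values invented for illustration). Each (feature, category) pair becomes its own indicator column, which is how 22 categorical features expand into the 116 columns seen in the next step.

```python
import pandas as pd

# Toy frame with the same layout: target first, categorical features after
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["f", "n", "n"],
    "gill-size": ["n", "b", "b"],
})

# One indicator column per (feature, category) pair; 'class' is left untouched
encoded = pd.get_dummies(toy, columns=["odor", "gill-size"])

# Same target mapping as in the article: 1 = poisonous, 0 = edible
encoded["class"] = encoded["class"].map({"p": 1, "e": 0})

print(encoded.columns.tolist())
print(encoded)
```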

We now split the data into an 80% training set and a 20% test set:

# Data splitting
# X,y split
x = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values

# Training set and Test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print("Number transactions x_train dataset: ", x_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions x_test dataset: ", x_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions x_train dataset:  (6499, 116)
Number transactions y_train dataset:  (6499,)
Number transactions x_test dataset:  (1625, 116)
Number transactions y_test dataset:  (1625,)
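
The split above is purely random. A variation that the article does not use is a stratified split, which keeps the class proportions nearly identical in both subsets via the `stratify` argument of `train_test_split`. The sketch below uses random placeholder data with the same row count, so only the shapes carry over to the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the article's row count; features are random placeholders
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8124, 5))
y = rng.integers(0, 2, size=8124)

# stratify=y keeps the class proportions (nearly) identical in both subsets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

print(x_train.shape, x_test.shape)
print(y_train.mean(), y_test.mean())
```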

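The last step is feature scaling with StandardScaler. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into the model: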

# Feature Scaling
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
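
To show what the two calls do, the sketch below scales a tiny made-up matrix and reproduces the test-set result by hand. The arrays are illustrative; the point is that `transform` reuses the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny one-hot style columns (illustrative values)
x_train = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
x_test = np.array([[0.0, 1.0]])

sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)  # learns mean and std from the training rows
x_test_scaled = sc.transform(x_test)        # reuses those training statistics

# The same result by hand: (x - training mean) / training std
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print(sc.mean_)
print(x_test_scaled, manual)
```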

I’ll be using the Logistic Regression algorithm to create our model and classify our test data:

# Modelling
# Fitting Logistic Regression
lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
y_prob = lr.predict_proba(x_test)[:, 1]

# Classification report and scoring
print(classification_report(y_test, y_pred))
print(f'ROC AUC score: {roc_auc_score(y_test, y_prob)}')
print('Accuracy Score: ', accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       852
           1       1.00      1.00      1.00       773

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

ROC AUC score: 1.0
Accuracy Score:  1.0

On the test set, our model reaches 100% accuracy, an f1-score of 1.00 for both classes and a ROC AUC of 1.0. A perfect score!
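
A perfect score on a single split is worth double-checking. One option, which is not part of the original analysis, is k-fold cross-validation with the scaler placed inside a pipeline so that each fold is scaled on its own training part. The sketch below runs on synthetic data whose label is fully determined by one column, as a stand-in for the strongly predictive mushroom features; it says nothing about the real dataset's scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the one-hot matrix: the label copies one of the columns
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(500, 6)).astype(float)
y = x[:, 2].astype(int)

# Scaling lives inside the pipeline, so each fold is scaled using its own training part
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, x, y, cv=5, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

If the fold accuracies stay at or near 1.0 on the real data as well, the result is unlikely to be an accident of one particular split.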

Let’s visualise this with a confusion matrix:

# Visualising Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, cmap='Blues', annot=True, fmt='d', cbar=False, yticklabels=['Edible', 'Poisonous'],
            xticklabels=['Predicted Edible', 'Predicted Poisonous'])
plt.show()
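
The four cells of the matrix can also be read out directly. The sketch below uses made-up labels with a couple of deliberate mistakes, since a perfect model would leave the off-diagonal cells at zero. For mushrooms the cell to watch is the false negative: a poisonous mushroom predicted as edible.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = poisonous, 0 = edible
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# A false negative here is a poisonous mushroom predicted as edible
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print("recall for poisonous:", tp / (tp + fn))
```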

Our ROC curve likewise looks very appealing.

# ROC curve and its AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate, label='AUC = %0.3f' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend()
plt.show()
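
For intuition about the AUC value: it equals the share of (poisonous, edible) pairs in which the poisonous sample receives the higher score, so 1.0 means every poisonous mushroom outranks every edible one. The sketch below uses four made-up scores where one of the four pairs is ranked the wrong way round.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up scores: the positive sample at 0.35 ranks below the negative sample at 0.4
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)

# Three of the four positive/negative pairs are ordered correctly
print("AUC from the curve:", area)
print("AUC from roc_auc_score:", roc_auc_score(y_true, scores))
```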

We were very lucky to have an almost perfectly balanced dataset with enough data to create an accurate model. Had the dataset been unbalanced, we would have needed to look into sampling techniques, and perhaps into more complex algorithms than Logistic Regression. Our model, however, seems to be sufficient and could be used by an amateur ‘shroomer’.
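
For the unbalanced case, a simpler relative of the sampling techniques mentioned above is class weighting, available through the `class_weight` parameter of `LogisticRegression`. This is a swapped-in technique for illustration rather than resampling itself, and the data below is synthetic (950 majority points, 50 minority points). Recall is measured on the training data purely to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# 950 majority-class points and 50 minority-class points with overlapping distributions
x = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),
    rng.normal(1.5, 1.0, size=(50, 2)),
])
y = np.array([0] * 950 + [1] * 50)

# Unweighted model versus one that up-weights the rare class
plain = LogisticRegression().fit(x, y)
weighted = LogisticRegression(class_weight="balanced").fit(x, y)

# Recall on the minority class (training data, for brevity)
recall_plain = recall_score(y, plain.predict(x))
recall_weighted = recall_score(y, weighted.predict(x))
print(recall_plain, recall_weighted)
```

The weighted model typically trades some precision for higher recall on the rare class, which is the trade one would want if the rare class were the poisonous one.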

Written by Val