
Impute NaN values with mean of column Pandas Python

Incomplete data, or missing values, are a common issue in data analysis: both systems and humans often collect data with gaps. We can still run an analysis on data with missing values, but doing so means ignoring the quality of the data, and the missing values may lead to wrong results. The common approach to dealing with missing values is to drop every tuple (row) that contains one. The problem with dropping is that it may bias the results, especially when many rows contain NaN values, because in the end we have to discard a large number of tuples. Dropping is reasonable when the data has only a small number of missing values. When the data has a large number of missing values, we have to repair them instead.
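To make the cost of dropping concrete, here is a minimal sketch on a made-up toy data frame (the column names and values are my own assumptions, not part of the Boston data). Only 3 of the 15 cells are missing, yet dropping removes more than half of the rows because the gaps fall in different rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 rows, 3 columns, one missing cell in three different rows.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0],
    "c": [100.0, 200.0, 300.0, np.nan, 500.0],
})

# Count the missing cells over the whole frame.
missing_cells = int(toy.isnull().sum().sum())

# dropna() with default arguments drops every row that has at least one NaN.
complete_rows = toy.dropna()
rows_lost = len(toy) - len(complete_rows)

print(f"missing cells: {missing_cells} of {toy.size}")
print(f"rows lost by dropna(): {rows_lost} of {len(toy)}")
```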

Many imputation methods have been proposed for repairing missing values. The simplest is to replace each missing value with the mean, median, or mode. The mean can be taken over the whole data set or over each column of the data frame.
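The difference between the two kinds of mean can be sketched on a small made-up data frame (the names `rooms` and `price` are hypothetical). When the columns live on different scales, a single mean over the whole data is usually a poor fill value for either column, which is why the rest of this post works with the mean of each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two columns on very different scales.
toy = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "price": [100.0, np.nan, 300.0, 200.0],
})

# Mean of each column: toy.mean() is a Series indexed by column name,
# and fillna() uses the matching entry for each column.
by_column = toy.fillna(toy.mean())

# Mean of the whole data: one number computed over all non-missing cells.
overall_mean = np.nanmean(toy.to_numpy())
by_overall = toy.fillna(overall_mean)

print(by_column)
print(by_overall)
```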

In this experiment, we will use the Boston housing dataset. The Boston data frame has 506 rows and 14 columns. The dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University, and it has been used in many machine learning papers that address regression problems. The MEDV attribute is the target (dependent variable); the other attributes are independent variables. The dataset is available in the scikit-learn library, so we can import it directly. As usual, I am going to use a Python Jupyter notebook for this experiment. If you are not familiar with Jupyter Notebook, Pandas, NumPy, and other Python libraries, I have a couple of older posts that may be useful to you: 1) setup anaconda 2) understand python libraries for data science.

Let’s get started…


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

boston_bunch = load_boston()
# Independent variables (features)
dfx = pd.DataFrame(boston_bunch.data, columns=boston_bunch.feature_names)
# Dependent variable (MEDV, stored here under the name 'target')
dfy = pd.DataFrame(boston_bunch.target, columns=['target'])
boston = dfx.join(dfy)

We can use boston.head() to see the data and boston.shape to see its dimensions. The next step is to check the number of NaN values in the Boston dataset with the command below.


boston.isnull().sum()

The result shows that the Boston dataset has no NaN values. So how can we change some values to NaN? Is that possible?


# Change 20% of the values to NaN at random
import collections
import random
df = boston  # same object as boston, so the changes below also show up in boston
replaced = collections.defaultdict(set)  # row -> columns already set to NaN
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.2*len(ix)))  # number of cells to set to NaN
for row, col in ix:
    # keep at least one non-NaN value in every row
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break

Using the code above, we can replace some values (20%) in the Boston dataset with NaN. We can also change the percentage of NaN values by changing the factor .2 in the code above. The NaN values are placed completely at random with respect to the whole data. Now check again: are there any missing values in our boston dataset?
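If you want the percentage to be a parameter and the result to be repeatable, the same idea can be wrapped in a small function. This is only a sketch under my own assumptions: the function name `mask_random_cells` and its `seed` argument are hypothetical, it uses NumPy's random generator instead of the `random` module, and unlike the loop above it does not guarantee that every row keeps at least one value.

```python
import numpy as np
import pandas as pd

def mask_random_cells(frame, fraction, seed=0):
    """Return a copy of `frame` with `fraction` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    n_cells = frame.size
    n_missing = int(round(fraction * n_cells))
    # Pick distinct flat cell positions, then reshape them into a boolean mask.
    flat_positions = rng.choice(n_cells, size=n_missing, replace=False)
    cell_mask = np.zeros(n_cells, dtype=bool)
    cell_mask[flat_positions] = True
    cell_mask = cell_mask.reshape(frame.shape)
    # DataFrame.mask() puts NaN wherever the condition is True and returns a new frame.
    return frame.astype(float).mask(cell_mask)

# Hypothetical demo frame: 10 rows x 5 columns, no missing values.
demo = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5), columns=list("abcde"))
masked = mask_random_cells(demo, 0.2, seed=42)

print(masked.isnull().sum())
print("total NaN:", int(masked.isnull().sum().sum()))
```

Because the function returns a copy, the original frame stays untouched, which makes it easy to compare the imputed values with the true ones later.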


boston.isnull().sum()

The result shows that every column has around 20% NaN values. So how do we replace (impute) all those missing values with the mean of each column?


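# Fill NaN with the mean of each column in the boston dataset
# (assign back to boston, because apply() returns a new data frame)
boston = boston.apply(lambda x: x.fillna(x.mean()), axis=0)

Now use boston.head() to see the data. We have filled the missing values with the mean of each column. We can also impute the missing values with the median by replacing mean() with median(). For the mode, use x.mode()[0] instead, because mode() returns a Series rather than a single value.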
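As a small illustration of the median and mode variants, here is a sketch on a made-up data frame (the columns `x` and `y` are hypothetical). It calls fillna() on the whole frame rather than using apply(), which gives the same per-column behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value per column.
toy = pd.DataFrame({
    "x": [1.0, 1.0, 3.0, np.nan, 10.0],
    "y": [5.0, np.nan, 7.0, 7.0, 9.0],
})

# Median of each column: toy.median() is a Series indexed by column name.
median_filled = toy.fillna(toy.median())

# Mode of each column: toy.mode() returns a data frame because a column can have
# several modes, so we take its first row to get one value per column.
mode_filled = toy.fillna(toy.mode().iloc[0])

print(median_filled)
print(mode_filled)
```

The median is less sensitive than the mean to outliers such as the 10.0 in column `x`, and the mode is mostly useful for categorical or discrete columns.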

This imputation method is the simplest one; there are many more sophisticated algorithms (e.g., regression, Monte Carlo, etc.) that can be used for repairing missing values. Maybe I’ll post about them next time.

All code and results can be accessed through this link: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/2%20Impute%20NaN%20values%20with%20mean%20of%20each%20columns%20.ipynb

Thank you, and see you next time.
