Incomplete data, or missing values, is a common issue in data analysis: both automated systems and humans collect data with gaps in it. We can still analyze data that contains missing values, but doing so means ignoring data quality, and the missing values may distort the results. The most common approach is to drop every tuple that contains a missing value. The problem with dropping is that it can bias the results, especially when many rows contain NaN values, because we end up discarding a large portion of the data. Dropping works when only a small number of values are missing; when many values are missing, we have to repair them instead.
Many imputation methods have been proposed for repairing missing values. The simplest is to replace each missing value with the mean, median, or mode, computed either over the whole dataset or per column of the data frame.
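As a quick illustration before we get to the Boston data, here is a minimal sketch of these three strategies on a tiny, made-up data frame (the column names and values are hypothetical, chosen only so the arithmetic is easy to follow):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value per column
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 3.0],
                   "b": [10.0, np.nan, 10.0, 30.0]})

mean_filled = df.fillna(df.mean())          # per-column means: a -> 2.0, b -> ~16.67
median_filled = df.fillna(df.median())      # per-column medians: a -> 2.0, b -> 10.0
mode_filled = df.fillna(df.mode().iloc[0])  # first mode per column: a -> 1.0, b -> 10.0
```

Note that `mode()` can return several rows when values tie, which is why we take the first row with `.iloc[0]`.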
In this experiment, we will use the Boston housing dataset. The Boston data frame has 506 rows and 14 columns. The dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University, and it has been used in many machine learning papers that address regression problems. The MEDV attribute is the target (dependent variable); the others are independent variables. The dataset is available in the scikit-learn library, so we can import it directly. As usual, I am going to use a Python Jupyter notebook. If you are not familiar with Jupyter Notebook, Pandas, NumPy, and other Python libraries, I have a couple of older posts that may be useful for you: 1) setup anaconda 2) understand python libraries for data science.
Let’s get started…
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

boston_bunch = load_boston()
dfx = pd.DataFrame(boston_bunch.data, columns=boston_bunch.feature_names)  # independent variables
dfy = pd.DataFrame(boston_bunch.target, columns=['target'])  # dependent variable
boston = dfx.join(dfy)
We can use the command boston.head() to see the data, and boston.shape to see its dimensions. The next step is to check the number of NA values in the boston dataset using the command below.
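The typical pandas idiom for this check is isnull() followed by sum(). The sketch below uses a small stand-in frame (with made-up values and only two of the Boston columns) so it runs on its own:

```python
import numpy as np
import pandas as pd

# Small stand-in for the boston frame so the snippet is self-contained
boston = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3],
                       "MEDV": [24.0, 21.6, 34.7]})

print(boston.isnull().sum())        # NA count per column (all zeros here)
print(boston.isnull().sum().sum())  # total NA count in the whole frame
```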
The result shows that Boston dataset has no Na values. The question is how we create/change some values to NA? is that possible?
# Change some values (20%) to NaN randomly
import collections
import random

df = boston
replaced = collections.defaultdict(set)
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)

to_replace = int(round(.2 * len(ix)))
for row, col in ix:
    # keep at least one non-NaN value per row
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break
Using the code above, we replace roughly 20% of the values in the boston dataset with NA. We can change the percentage by editing the .2 factor in the code. The NA positions are chosen uniformly at random across the whole data frame. Now, check again: are there any missing values in our boston dataset?
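One way to verify the share of NaN per column is isnull().mean(), which gives the fraction of missing values directly. A sketch on a small made-up frame where exactly 1 of 4 values in each column is missing:

```python
import numpy as np
import pandas as pd

# Made-up frame: 1 of 4 values in each column is NaN (25%)
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0],
                   "y": [np.nan, 2.0, 3.0, 4.0]})

print(df.isnull().mean())  # fraction of NaN per column: 0.25 each
```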
The result shows that every column now has around 20% NaN values. So how do we replace (impute) all those missing values with the mean of each column?
# Fill NA with the mean() of each column in the boston dataset
# (assign back to boston so the imputed frame replaces the original)
boston = boston.apply(lambda x: x.fillna(x.mean()), axis=0)
Now, use the command boston.head() to see the data. The missing values have been repaired with the mean of each column. We can instead impute with the mode by replacing the function fillna(x.mean()) with fillna(x.mode()[0]).
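The same per-column lambda pattern works for the mode. Since mode() may return several values when there is a tie, we take the first with [0]; a sketch on a toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 1.0, np.nan, 3.0]})

# mode() may return multiple rows, so index the first value with [0]
df_mode = df.apply(lambda x: x.fillna(x.mode()[0]), axis=0)
```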
This imputation method is the simplest one; there are many more sophisticated approaches (e.g., regression-based imputation, Monte Carlo methods) that can be used for repairing missing values. Maybe I will cover those in a future post.
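As one example of a regression-based approach, scikit-learn ships an experimental IterativeImputer that models each feature as a function of the others. A minimal sketch (the toy array below is hypothetical, with the second column roughly twice the first):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy array where the second column is roughly twice the first
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.0], [4.0, np.nan]])

# The NaN is replaced with a regression-based estimate, not a column mean
X_imputed = IterativeImputer(random_state=0).fit_transform(X)
```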
All code and results can be accessed through this link: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/2%20Impute%20NaN%20values%20with%20mean%20of%20each%20columns%20.ipynb
Thank you, see you