Hi guys!
In the last post, I wrote about how to deal with missing values in a dataset. Honestly, that post is related to my PhD project. I will not explain the details of my project, but I need to replace a certain percentage (10, 20, …, 90%) of my dataset with NaN and then impute all of those NaN values. In that post, I experimented with the Boston dataset, which is quite small (13 dimensions and 506 rows/tuples).
I ran into problems when I experimented with a bigger dataset, for instance the Flights dataset. This dataset consists of 11 dimensions and almost one million rows (tuples), which is quite a large number. See the figure below:
Replacing 80% of the values with NaN in the Flights dataset using Pandas operations takes around 469 seconds. That is really slow. Moreover, in this case I only worked on 8 dimensions (only the numerical attributes).
I can guess a few possible reasons for the slow performance: 1) the code itself; 2) using Pandas for operations on this many rows; or 3) both.
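To make reasons 1) and 2) concrete, here is a rough sketch (not my actual code, just an assumed cell-by-cell Pandas pattern) of the kind of implementation that becomes this slow on a DataFrame with almost one million rows:

import random
import numpy as np
import pandas as pd

# Hypothetical sketch of a slow, cell-by-cell Pandas replacement.
# It assumes float columns (so they can hold NaN) and assigns one scalar
# at a time with .iat, which is exactly what hurts at this scale.
def dropout_pandas(df, percent):
    out = df.copy()
    n_rows, n_cols = out.shape
    # flat indices of the cells to blank out
    cells = random.sample(range(n_rows * n_cols), int(n_rows * n_cols * percent))
    for idx in cells:
        row, col = divmod(idx, n_cols)
        out.iat[row, col] = np.nan   # one assignment per cell: slow for ~1M rows
    return out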
I tried to find the answer and came across two posts comparing the performance of NumPy and Pandas, including when we should use each of them: ([1], [2])
After reading those posts, I decided to use NumPy instead of Pandas for my operations, because my dataset has a large number of tuples (almost one million).
This is how I implemented it. The function below replaces values with NaN:
import random
import numpy as np

def dropout(a, percent):
    # create a copy
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN
    np.put(mat, mask, [np.NaN] * len(mask))
    return mat
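For example, calling it on a small toy array (hypothetical data, just to show that percent is a fraction, e.g. 0.8 for 80%):

toy = np.arange(20, dtype=float).reshape(4, 5)
corrupted = dropout(toy, 0.8)
print(np.isnan(corrupted).mean())   # -> 0.8, i.e. 80% of the entries are now NaN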
The code below handles the missing-value imputation. It is based on a scikit-learn example (scikit-learn also provides a built-in imputer). I imputed numerical missing values with the column mean and categorical missing values with the most frequent value:
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        # one fill value per column: most frequent value for object columns,
        # column mean for everything else
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
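As a quick sanity check, here is how it behaves on a hypothetical toy DataFrame (the column names are made up for illustration):

df_toy = pd.DataFrame({
    'color': ['red', 'blue', None, 'red'],   # object dtype -> most frequent value
    'price': [10.0, np.nan, 30.0, 20.0],     # numeric dtype -> column mean
})
print(DataFrameImputer().fit_transform(df_toy))
# the missing 'color' becomes 'red', the missing 'price' becomes 20.0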
Here are the results:
As can be seen in the figure above, I converted the Pandas DataFrame to a NumPy array. Just use this command
data = df.values
and your DataFrame will be converted to a NumPy array. Then I ran the dropout function while all of the data was in NumPy array form. In the end, once the operations were finished, I converted the data back to a Pandas DataFrame.
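Putting the steps together, the whole round trip looks roughly like this (df is the Flights DataFrame, and 0.8 is just an example fraction):

data = df.values                                             # DataFrame -> NumPy array
corrupted = dropout(data, 0.8)                               # replace ~80% of values with NaN
df_corrupted = pd.DataFrame(corrupted, columns=df.columns)   # back to a DataFrame
df_imputed = DataFrameImputer().fit_transform(df_corrupted)
# Note: with mixed categorical/numerical columns, df.values gives an object
# array, so after the round trip every column has dtype object and is filled
# with its most frequent value.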
Using NumPy operations, replacing 80% of the data with NaN, including imputing all the NaN values with the most frequent values, takes only about 4 seconds. Moreover, in this case I worked on all 11 dimensions (categorical and numerical attributes).
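If you want to time this yourself, a minimal sketch (not my exact measurement code) looks like this:

import time

start = time.time()
corrupted = dropout(df.values, 0.8)
df_imputed = DataFrameImputer().fit_transform(pd.DataFrame(corrupted, columns=df.columns))
print("replacement + imputation took %.2f seconds" % (time.time() - start))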
From this post, I just want to share with you that your choice of tools matters. When we have to deal with a large number of tuples, we may want to choose NumPy instead of Pandas. However, the other important factor is the code itself: no library choice can make up for poorly optimized code!
See you again in the next post!