NumPy vs Pandas Performance

Hi guys!

In the last post, I wrote about how to deal with missing values in a dataset. Honestly, that post is related to my PhD project. I will not explain the details of my project, but I need to replace a certain percentage (10, 20, …, 90%) of the values in my dataset with NaN and then impute all those NaN values. In that post, I ran the experiment on the Boston dataset, which is quite small (13 dimensions and 506 rows/tuples).

I ran into problems when I repeated the experiment with a bigger dataset, the Flights dataset. It consists of 11 dimensions and almost one million rows (tuples), which is quite a large number. Let's see the figure below:

Replacing 80% of the values in the Flights dataset with NaN using Pandas operations takes around 469 seconds. It’s really slow. Moreover, in this case I only worked on 8 dimensions (the numerical attributes).

I can think of a few possible reasons for the slow performance: 1) the code itself; 2) using Pandas for a large number of operations; or 3) both.
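To get a feeling for reason 2, here is a minimal sketch of how the two styles could be timed against each other. This is not the code from the experiment above: the small synthetic frame, its column names, and the variables `slow`, `fast`, and `flat_idx` are assumptions made only for illustration, and the cell-by-cell loop is just one plausible way a slow Pandas version might look.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: 8 numeric columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((2_000, 8)), columns=[f"c{i}" for i in range(8)])

# Pick 80% of the cells as flat (row-major) positions in the table.
n_rows, n_cols = df.shape
n_missing = int(df.size * 0.8)
flat_idx = rng.choice(df.size, size=n_missing, replace=False)

# Approach 1: set the cells one at a time through Pandas.
slow = df.copy()
start = time.perf_counter()
for pos in flat_idx:
    row, col = divmod(int(pos), n_cols)
    slow.iat[row, col] = np.nan
pandas_seconds = time.perf_counter() - start

# Approach 2: set all cells at once on a NumPy copy of the data.
arr = df.to_numpy(copy=True)
start = time.perf_counter()
arr.flat[flat_idx] = np.nan
numpy_seconds = time.perf_counter() - start
fast = pd.DataFrame(arr, columns=df.columns, index=df.index)

print(f"cell-by-cell Pandas: {pandas_seconds:.4f} s")
print(f"vectorized NumPy:    {numpy_seconds:.4f} s")
```

Both approaches blank out exactly the same cells, so any difference in the printed times comes from how the assignment is done, not from what is assigned. The actual numbers will depend on the machine and the library versions.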

While looking for the answer, I found two posts that compare the performance of NumPy and Pandas and discuss when to use each ([1], [2]).

After reading those posts, I decided to use NumPy instead of Pandas for this operation, because my dataset has a large number of tuples (almost one million).

This is how I implemented it. The function below replaces a given fraction of the values with NaN:

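import random

import numpy as np

def dropout(a, percent):
    # work on a copy so the input array is left untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices of the cells to mask, sampled without replacement
    mask = random.sample(range(mat.size), prop)
    # replace the selected cells with NaN
    np.put(mat, mask, np.nan)
    return mat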
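The same idea can also be written with NumPy alone. The sketch below is a variant, not the function I used: the name `dropout_np`, the `seed` parameter, and the forced conversion to float (NaN cannot be stored in an integer array) are my assumptions.

```python
import numpy as np

def dropout_np(a, percent, seed=None):
    """Return a float copy of `a` with a fraction `percent` of its cells set to NaN."""
    mat = np.array(a, dtype=float)  # copy; float so that NaN can be stored
    n_missing = int(mat.size * percent)
    rng = np.random.default_rng(seed)
    # flat indices of the cells to blank out, drawn without replacement
    idx = rng.choice(mat.size, size=n_missing, replace=False)
    mat.flat[idx] = np.nan
    return mat

data = np.arange(20, dtype=float).reshape(5, 4)
masked = dropout_np(data, 0.8, seed=42)
print(int(np.isnan(masked).sum()))  # 16 of the 20 cells are now NaN
```

Passing a seed makes the masked positions reproducible, which helps when comparing imputation methods on the same missing pattern.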

The code below handles the missing value imputation. It is based on a scikit-learn example (scikit-learn also has a function for imputing missing values). I imputed all numerical missing values with the mean and all categorical missing values with the most frequent value:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.
        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        # one fill value per column: most frequent for object columns, mean otherwise
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # fillna matches the fill values to the columns by name
        return X.fillna(self.fill)
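To make the per-column rule concrete, here is a small sketch that applies it to a toy frame using Pandas only, without the scikit-learn wrapper. The column names `carrier` and `delay` and their values are made up for illustration and are not taken from the Flights dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # categorical column, kept as object dtype on purpose
    "carrier": pd.Series(["AA", "UA", np.nan, "AA"], dtype=object),
    # numerical column
    "delay": [10.0, np.nan, 30.0, 20.0],
})

# One fill value per column: most frequent value for object columns,
# mean for everything else.
fill = pd.Series(
    [df[c].value_counts().index[0] if df[c].dtype == np.dtype("O")
     else df[c].mean() for c in df],
    index=df.columns,
)

imputed = df.fillna(fill)
print(imputed)
```

Here the missing `carrier` becomes "AA" (the most frequent value) and the missing `delay` becomes 20.0 (the mean of 10, 30 and 20).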

Here are the results:

As the figure above shows, I first converted the Pandas data frame to a NumPy array. A single command does this:

data = df.values

I then ran the dropout function while all the data was in the form of a NumPy array. In the end, I converted the data back to a Pandas data frame once the operations had finished.
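One thing to watch in this round trip is the dtype. The sketch below uses a made-up two-column frame (the names `distance` and `carrier` are assumptions) to show that a frame with mixed column types turns into a single object array, and how the original column types could be restored afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "distance": [100.0, 250.0, 400.0],
    "carrier": pd.Series(["AA", "UA", "AA"], dtype=object),
})

# Mixed column types collapse into one object array.
# The explicit copy keeps the original frame untouched.
data = df.values.copy()
print(data.dtype)  # object

# Do the work on the NumPy side.
data[0, 0] = np.nan

# Rebuild the frame with the original labels, then restore each column's dtype.
back = pd.DataFrame(data, columns=df.columns, index=df.index)
back = back.astype(df.dtypes.to_dict())
print(back.dtypes)
```

Restoring the dtypes matters for the imputer: it decides between mean and most frequent value by looking at each column's dtype, so columns left as object would all be filled with their most frequent value.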

With NumPy, replacing 80% of the data with NaN and imputing all the NaN values with the most frequent values takes only 4 seconds. Moreover, in this case I worked on all 11 dimensions (categorical and numerical attributes).

With this post, I just want to share that your choice of tool matters. When dealing with a large number of tuples, consider using NumPy instead of Pandas. However, another important thing is that no one can write optimized code!!

See you again in the next post!