Numpy vs Pandas Performance

Hi guys!

In the last post, I wrote about how to deal with missing values in a dataset. Honestly, that post is related to my PhD project. I will not explain the details of my project, but I need to replace a certain percentage (10%, 20%, …, 90%) of my dataset with NaN and then impute all those NaN values. In that post, I experimented with the Boston dataset, a quite small dataset (13 dimensions and 506 rows/tuples).

I ran into problems when I did experiments with a bigger dataset, for instance, the Flights dataset. This dataset consists of 11 dimensions and almost one million rows (tuples), which is quite large. See the Figure below:

Replacing 80% of the values with NaN in the Flights dataset using Pandas operations takes around 469 seconds. It’s really slow. Moreover, in this case, I only worked on 8 dimensions (only the numerical attributes).

I can think of a few reasons for the slow performance: 1) the code itself; 2) using Pandas for operations over a large number of rows; or 3) both.

I tried to find the answer and found two posts comparing the performance of NumPy and Pandas, including when we should use each: ([1], [2])

After reading those posts, I decided to use NumPy instead of Pandas for my operations, since my dataset has a large number of tuples (almost one million).

This is how I implemented it. The function below replaces values with NaN:
import random
import numpy as np

def dropout(a, percent):
    # create a copy so the original array is untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask, sampled without replacement
    mask = random.sample(range(mat.size), prop)
    # replace the sampled positions with NaN
    np.put(mat, mask, np.nan)
    return mat
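As a quick sanity check, here is the function applied to a small float array (the snippet repeats the definition so it runs on its own; the seed is only there to make the sampling reproducible):

```python
import random
import numpy as np

random.seed(0)  # reproducible sampling for this demo

def dropout(a, percent):
    # copy so the original array is untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask, sampled without replacement
    mask = random.sample(range(mat.size), prop)
    np.put(mat, mask, np.nan)
    return mat

a = np.arange(100, dtype=float)
b = dropout(a, 0.8)
print(np.isnan(b).sum())   # 80 of the 100 values are now NaN
print(np.isnan(a).sum())   # the original array is untouched: 0
```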

The code below is for missing-value imputation. It is based on a scikit-learn example (scikit-learn has a built-in function for imputing missing values). I imputed all numerical missing values with the column mean and all categorical missing values with the most frequent value:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in the column. Columns of other types are imputed with the mean
    of the column.
    """
    def fit(self, X, y=None):
        self.fill = pd.Series(
            [X[c].value_counts().index[0]
             if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
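Here is the imputer on a toy DataFrame (I drop the `TransformerMixin` base class just to keep the snippet self-contained; the fit/transform logic is the same, and the column names are made up for the demo):

```python
import numpy as np
import pandas as pd

class DataFrameImputer:
    """Minimal stand-in for the imputer above, without the sklearn base class."""
    def fit(self, X, y=None):
        # most frequent value for object columns, mean for the rest
        self.fill = pd.Series(
            [X[c].value_counts().index[0]
             if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

df = pd.DataFrame({'color': ['red', 'red', None, 'blue'],
                   'size':  [1.0, np.nan, 3.0, 2.0]})
out = DataFrameImputer().fit(df).transform(df)
print(out['color'].tolist())  # ['red', 'red', 'red', 'blue']
print(out['size'].tolist())   # [1.0, 2.0, 3.0, 2.0]
```

The missing `color` gets the most frequent value ('red'), and the missing `size` gets the column mean (2.0).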

Here are the results:

As the Figure above shows, I converted the Pandas DataFrame to a NumPy array. A single command does the conversion:

data = df.values

Then I ran the dropout function while all the data was in NumPy array form. At the end, I converted the data back to a Pandas DataFrame after the operations finished.
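Put together, the round trip looks like this (a sketch on a toy DataFrame; the column names are hypothetical, not the real Flights schema):

```python
import random
import numpy as np
import pandas as pd

def dropout(a, percent):
    # same dropout function as above
    mat = a.copy()
    prop = int(mat.size * percent)
    mask = random.sample(range(mat.size), prop)
    np.put(mat, mask, np.nan)
    return mat

df = pd.DataFrame({'dep_delay': [1.0, 2.0, 3.0, 4.0],
                   'arr_delay': [5.0, 6.0, 7.0, 8.0]})

data = df.values                               # DataFrame -> NumPy array
data = dropout(data, 0.5)                      # fast NumPy-side masking
df2 = pd.DataFrame(data, columns=df.columns)   # back to a DataFrame

print(df2.isna().sum().sum())                  # 4 of the 8 values are NaN
```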

Using NumPy operations, replacing 80% of the data with NaN and imputing all the NaN values takes only 4 seconds. Moreover, in this case, I worked on all 11 dimensions (categorical and numerical attributes).

With this post, I just want to share with you that your choice of tools matters. When we have to deal with a large number of tuples, we may consider choosing NumPy instead of Pandas. However, another important factor is how well the code itself is optimized!

See you again in the next post!

Python for Data Science

I have spent two years processing and manipulating data using R, mostly for my research project. Before, I had only heard of Python and never tried it for my work. But now, after using Python, I have really fallen in love with this language. Python is very simple, and it is known as one of the easiest languages to learn. The reason I previously used R was that it is supported by tons of open-source libraries for scientific analysis. Now, with the popularity of Python, I can easily find all the libraries I need in Python, and all of them are open source as well.

There are core libraries that you must know when you start doing data analytics with Python:

  1. NumPy, which stands for Numerical Python. Python is different from R: R was designed with scientists in mind, while Python is a general-purpose programming language. That is why Python needs a library to handle numerical things such as complex arrays and matrices. Repo project link: https://github.com/numpy/numpy
  2. SciPy, a library for scientific computing; it handles things such as statistical computation, linear algebra, optimization, etc. Repo project link: https://github.com/scipy/scipy
  3. Pandas, whose DataFrame will feel very familiar if you have experience with R. Using a DataFrame, we can easily manipulate, aggregate, and analyze our dataset. The data is shown in a table similar to an Excel spreadsheet or a DataFrame in R, and it is convenient to access the data by columns, rows, or otherwise. Repo project link: https://github.com/pandas-dev/pandas
  4. Matplotlib. Plotting is very important for data analysis. Why do we need plotting? The simple answer is that it makes things easier for everyone, and we know one picture can tell a thousand words. To generate visualizations from a dataset, we absolutely need data visualization tools. If you have experience with Excel, it is very easy: just select the table you want to plot and choose a plot type such as a bar chart, line chart, etc. In R, the most popular plotting tool is ggplot; you can use the standard ‘plot’ function, but if you want more advanced and more beautiful figures you need ggplot. How about Python? Matplotlib is the basic library for visualization in Python. Repo project link: https://github.com/matplotlib/matplotlib
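To give a tiny taste of the first three libraries together (using the standard import aliases; the numbers are just a toy example):

```python
import numpy as np
import pandas as pd
from scipy import stats

a = np.array([[1.0, 2.0], [3.0, 4.0]])    # NumPy: an n-dimensional array
print(a.mean())                           # 2.5

# SciPy: quick descriptive statistics over the flattened array
print(stats.describe(a.ravel()).minmax)   # (1.0, 4.0)

df = pd.DataFrame(a, columns=['x', 'y'])  # Pandas: a labeled table on top of the array
print(df['y'].sum())                      # 6.0
```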

Those are the core libraries you need when you start using Python for data analytics. There are tons of Python libraries out there; here are some that may be useful for you:

  1. SciKit-Learn, which you have to understand when you want to apply machine learning.
  2. Scrapy, to scrape data from the Web, when you want to gather data from websites for your analysis, for instance, collecting tweet data from Twitter.
  3. NLTK, if you want to do natural language processing.
  4. Theano, TensorFlow, and Keras, which you have to understand when you are not satisfied with NumPy performance, want to apply neural network algorithms, or are doing deep learning work.
  5. Interactive visualization tools. Matplotlib is the basic plotting tool, and it is enough for me as a researcher, especially for publications, but when we want dynamic or more interactive plots, we can use Seaborn, Plotly, or Bokeh.


If you do not want to think too much about how to install all of these libraries, just try Anaconda; it is really cool.

See ya next time

Brisbane, 24 November 2017

Python for Data Science Cheat Sheet

Here is a Python for Data Science cheat sheet that is quite helpful for refreshing our memory; for those just starting to use Python for data analysis, data mining, or data science, it also makes good reading material.

 

Python Basic for Data Science

Here is the cheat sheet:
Python basic

A good-quality PDF file can be downloaded here

 

Python NumPy Cheat Sheet

Here is the cheat sheet image:

Numpy Basic

A good-quality PDF file can be downloaded here

 

Python Pandas Cheat Sheet

Here is the cheat sheet image:

Pandas Basic

A good-quality PDF file can be downloaded here

 

Python Bokeh Interactive Visualization Cheat Sheet

Here is the cheat sheet image:

Bokeh

A good-quality PDF file can be downloaded here

 

I got the cheat sheets above from DataCamp.

I hope this helps.