Data Analysis · Data Mining · NumPy · Pandas · Python · SciKit-Learn

Numpy vs Pandas Performance

Hi guys!

In the last post, I wrote about how to deal with missing values in a dataset. That post is related to my PhD project. I will not go into the details of the project, but I need to replace a certain percentage (10, 20, …, 90%) of my dataset with NaN and then impute all those NaN values. In that post, I experimented with the Boston dataset, which is quite small (13 dimensions and 506 rows/tuples).

I ran into problems when I experimented with a bigger dataset, for instance the Flights dataset. This dataset consists of 11 dimensions and almost one million rows (tuples), which is quite large. See the figure below:

Replacing 80% of the values in the Flights dataset with NaN using Pandas operations takes around 469 seconds. That is really slow. Moreover, in this case I only worked on 8 dimensions (the numerical attributes only).

I can think of a few reasons for the slow performance: 1) the code itself; 2) using Pandas for a large number of operations; or 3) both.

I tried to find the answer and found two posts comparing the performance of Numpy and Pandas, including when to use each: ([1], [2])

After reading those posts, I decided to use Numpy instead of Pandas for this operation, because my dataset has a large number of tuples (almost one million).

This is how I implemented it. The function below replaces values with NaN:

 
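import random

import numpy as np

def dropout(a, percent):
    # create a copy so the input array is left untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask, sampled without replacement
    mask = random.sample(range(mat.size), prop)
    # replace with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
    np.put(mat, mask, np.nan)
    return mat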
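
As a hedged alternative, the same idea can be sketched with NumPy's own random generator, so that the index sampling also happens inside NumPy. The name `dropout_vectorized` and the `seed` parameter are my own assumptions for illustration; they are not part of the original experiment.

```python
import numpy as np

def dropout_vectorized(a, percent, seed=None):
    """Return a float copy of `a` with a fraction `percent` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    # np.array copies by default; float dtype is needed to hold NaN
    mat = np.array(a, dtype=float)
    n_replace = int(mat.size * percent)
    # sample flat indices without replacement
    idx = rng.choice(mat.size, size=n_replace, replace=False)
    mat.flat[idx] = np.nan
    return mat

# small usage example on a 4 x 5 array
data = np.arange(20, dtype=float).reshape(4, 5)
masked = dropout_vectorized(data, 0.8, seed=0)
n_missing = int(np.isnan(masked).sum())
print(n_missing, "of", masked.size, "cells are NaN")
```

Passing a seed makes the masked positions reproducible, which helps when the same missing-value pattern has to be reused across imputation methods.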

The code below imputes the missing values. It is based on a scikit-learn example (scikit-learn has a function for imputing missing values). I imputed all numerical missing values with the mean and all categorical missing values with the most frequent value:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        Columns of dtype object are imputed with the most frequent value 
        in column.
        Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)
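
To see what the fill values look like, here is a minimal sketch of the same logic without the class wrapper. The tiny data frame is made up for illustration, and I use `pd.api.types.is_numeric_dtype` instead of the object-dtype check, which is my own substitution so that the sketch does not depend on how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# made-up example data: one categorical and one numerical column
df_small = pd.DataFrame({
    "city": ["a", "b", np.nan, "a"],
    "price": [1.0, np.nan, 3.0, 5.0],
})

# mean for numerical columns, most frequent value for everything else
fill = pd.Series(
    [df_small[c].mean() if pd.api.types.is_numeric_dtype(df_small[c])
     else df_small[c].value_counts().index[0]
     for c in df_small],
    index=df_small.columns,
)

imputed = df_small.fillna(fill)
print(imputed)
```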

Here are the results:

In the figure above, you can see that I converted the Pandas data frame to a numpy array. Just use this command:

data = df.values

and your data frame will be converted to a numpy array. I then ran the dropout function while all the data was in numpy array form. In the end, I converted the data back to a Pandas dataframe after the operations finished.
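
The round trip can be sketched as follows. The data frame here is a made-up example, and `to_numpy(copy=True)` is used in place of `.values` so that the original frame is guaranteed to stay untouched; that choice is mine, not part of the original experiment.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# DataFrame -> NumPy array (explicit copy, so df_demo is not modified)
arr = df_demo.to_numpy(copy=True)

# do the heavy work on the array, here just setting one cell to NaN
arr[0, 1] = np.nan

# NumPy array -> DataFrame, restoring the column names and the index
df_back = pd.DataFrame(arr, columns=df_demo.columns, index=df_demo.index)
print(df_back)
```

Note that an array built from mixed column types ends up with dtype object, so the speed advantage is clearest when the columns are all numerical.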

Using Numpy operations to replace 80% of the data with NaN, including imputing all NaN values with the most frequent values, takes only 4 seconds. Moreover, in this case I worked on 11 dimensions (categorical and numerical attributes).

With this post, I just want to share that your choice matters. When we deal with a large number of tuples, we may consider choosing numpy instead of pandas. However, another important thing is that not everyone can write optimized code!!

See you again in the next post!

Data Analysis · Data Mining · Pandas · Python · SciKit-Learn

Impute NaN values with mean of column Pandas Python

Incomplete data, or missing values, are a common issue in data analysis. Systems and humans often collect data with missing values. We can still analyze data with missing values, but doing so means we are not aware of the quality of the data, and the missing values may produce wrong results. The common approach to dealing with missing values is to drop all tuples that contain them. The problem with dropping is that it may produce biased results, especially when many rows contain NaN values, because in the end we have to drop a large number of tuples. This approach can be used if the data has a small number of missing values. When the data has a large number of missing values, we have to repair them.
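
A small made-up example shows how quickly the dropping approach loses rows: only a quarter of the cells are missing, yet three of the four rows are dropped. The data frame is hypothetical and only meant as a sketch.

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

rows_before = len(df_gaps)
rows_after = len(df_gaps.dropna())             # rows with no NaN at all
missing_share = df_gaps.isna().to_numpy().mean()  # fraction of missing cells

print(f"missing cells: {missing_share:.0%}, rows kept: {rows_after} of {rows_before}")
```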

Many imputation methods have been proposed for repairing missing values. The simplest is to replace missing values with the mean, median, or mode. It can be the mean of the whole dataset or the mean of each column in the data frame.

In this experiment, we will use the Boston housing dataset. The Boston data frame has 506 rows and 14 columns. This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The Boston house-price data has been used in many machine learning papers that address regression problems. The MEDV attribute is the target (dependent variable), and the others are independent variables. This dataset is available in the scikit-learn library, so we can just import it directly. As usual, in this experiment I am going to use a Python Jupyter notebook. If you are not familiar with Jupyter Notebook, Pandas, Numpy, and other Python libraries, I have a couple of old posts that may be useful for you: 1) setup anaconda 2) understand python libraries for data science.

Let’s get started…


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston

boston_bunch = load_boston()
dfx = pd.DataFrame(boston_bunch.data, columns=boston_bunch.feature_names)  # independent variables
dfy = pd.DataFrame(boston_bunch.target, columns=['target'])  # dependent variable
boston = dfx.join(dfy)

We can use the command boston.head() to see the data and boston.shape to see its dimensions. The next step is to check the number of NaN values in the Boston dataset using the command below.


boston.isnull().sum()

The result shows that the Boston dataset has no NaN values. The question is, how do we change some values to NaN? Is that possible?


# Change 20% of the values to NaN at random
import collections
import random
df = boston  # alias, not a copy: the NaN values below are written into boston as well
replaced = collections.defaultdict(set)
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.2*len(ix)))
for row, col in ix:
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break

Using the code above, we can replace some values (20%) in the Boston dataset with NaN. We can also change the percentage of NaN values by changing the factor in the code above (see .2*). The NaN values are placed completely at random with respect to the whole data; the only constraint is that the loop never blanks out an entire row. Now, let’s check again whether there are any missing values in our Boston dataset.
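
For larger data frames, a vectorized variant may be faster than looping over every cell. The sketch below is my own alternative, not the code used in the experiment: it draws a random boolean mask and applies it with `DataFrame.mask`. Note the differences: the share of NaN values is only approximately 20%, and nothing prevents a whole row from becoming NaN. The data frame is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-in for a numerical dataset
df_full = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))

# True marks a cell that becomes NaN; each cell is picked with probability 0.2
nan_mask = rng.random(df_full.shape) < 0.2

# mask() returns a new DataFrame and leaves df_full untouched
df_missing = df_full.mask(nan_mask)

print(df_missing.isna().mean())  # share of NaN per column, roughly 0.2
```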


boston.isnull().sum()

The result shows that every column has around 20% NaN values. So how do we replace (impute) all those missing values with the mean of each column?


#fill NA with mean() of each column in boston dataset
df = df.apply(lambda x: x.fillna(x.mean()),axis=0)

Now, use the command df.head() to see the data (the result is assigned to df, so boston itself still contains the NaN values). We have repaired the missing values using the mean of each column. We can also impute the missing values using median() or mode() in place of mean().
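
A minimal sketch of the median and mode variants, on a made-up data frame. One detail worth knowing: `mode()` returns a DataFrame because a column can have several modes, so the sketch takes the first row with `.iloc[0]`.

```python
import numpy as np
import pandas as pd

df_imp = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 10.0],
    "y": [2.0, 2.0, np.nan, 5.0],
})

# column-wise median: fillna aligns the Series of medians with the columns
median_filled = df_imp.fillna(df_imp.median())

# column-wise mode: take the first mode of each column
mode_filled = df_imp.fillna(df_imp.mode().iloc[0])

print(median_filled)
print(mode_filled)
```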

This imputation method is the simplest one; there are many more sophisticated algorithms (e.g., regression, Monte Carlo, etc.) that can be used for repairing missing values. Maybe I’ll post about them next time.

All code and results can be accessed through this link: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/2%20Impute%20NaN%20values%20with%20mean%20of%20each%20columns%20.ipynb

Thank you, see you

Data Analysis · Data Mining · NumPy · Pandas · SciKit-Learn

Remove Duplicates from Correlation Matrix Python

Correlation is one of the most important tools that data analysts use in their analytical workflow. Using correlation, we can understand the mutual relationship or association between two attributes. Let’s start with an example. Say I want to analyze the “Boston housing dataset”; see the example code below. If you are not familiar with Jupyter Notebook, Pandas, Numpy, and other Python libraries, I have a couple of old posts that may be useful for you: 1) setup anaconda 2) understand python libraries for data science.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

boston = datasets.load_boston()  # Load the Boston Housing dataset, which is available in scikit-learn
boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])

We can use the command boston.head() to see the data and boston.shape to see its dimensions. The command below gives the correlation values among all attributes in the Boston housing dataset (in this experiment I used Pearson correlation).


dataCorr = boston.corr(method='pearson')
dataCorr

After running this command, we will see the correlation matrix, as in the figure below:

The question is how to remove the duplicates from this correlation matrix to make it more readable. I found a nice answer on StackOverflow; we can use this command:


dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]

# filtering out lower/upper triangular duplicates 
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)

dataCorr.sort_values(by=[0], ascending=False).head(10)  # Get the 10 highest pairwise attribute correlations

Finally, we get a table that consists of pairs of attributes and their correlation values, and most importantly it contains no duplicates.
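
Another way to get the same kind of table, sketched here as my own alternative to the StackOverflow answer, is to keep only the upper triangle of the matrix with a NumPy mask. The data is synthetic, and unlike the snippet above this sketch does not filter out correlations with an absolute value below 0.01.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-in for a numerical dataset
df_num = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = df_num.corr(method="pearson")

# True strictly above the diagonal: each pair once, no self-correlations
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() blanks everything else to NaN; stack() gives one row per pair
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

print(pairs.head(10))
```

With 4 columns this yields 4 * 3 / 2 = 6 pairs; the explicit `dropna()` keeps the result the same whether or not `stack()` drops NaN values on its own.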

I also found another way from someone on GitHub who created a nice function to remove these duplicates. Please see this link.

Thank you and see you next time.

Data Analysis · NumPy · Pandas

Python Pandas DataFrame Basics Tutorial

In this post, I am going to show you how to deal with data in Python. Before going there, you have to understand which Python libraries you need to know if you want to deal with data in Python. Python has tons of libraries, especially related to data science. I have a couple of old posts that may be useful for you: 1) setup anaconda 2) understand python libraries for data science. In this tutorial I use Jupyter Notebook; if you do not have it or are not familiar with it yet, please read the instructions above, otherwise just go on!

Let’s start with the simplest one. When you want to deal with data in Python, there is an amazing library called Pandas. If you are familiar with a spreadsheet tool such as MS Excel, Pandas is similar to that kind of tool: it shows our data in table format. The only difference is that in Excel you just drag and drop, whereas in Pandas you have to understand its standard syntax and commands.

Let’s get started.

To start analyzing data, you can import your data (e.g., csv, xls, etc.) into the Python environment using Pandas: import pandas as pd, then pd.read_csv('data.csv'). However, to make things easier, in this tutorial I just create my data rather than importing it from a file.


#Create DataFrame
import pandas as pd #when we want to use Pandas, we have to import it
import numpy as np #numpy is another useful library in python for dealing with number
df = pd.DataFrame(
    {'integer':[1,2,3,6,7,23,8,3],
     'float':[2,3.4,5,6,2,4.7,4,8],
     'string':['saya','aku', np.nan ,'cinta','kamu','a','b','indonesia']}
)

1. To show your DataFrame, just use this command!

#show data in DataFrame
df

2. If you want to access one or more rows of your DataFrame, you can use the loc syntax.

#Show data based on index
df.loc[1]
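
Since loc selects by label rather than by position, the difference only becomes visible when the index is not the default 0, 1, 2, …. A small sketch with a made-up frame and string labels:

```python
import pandas as pd

df_idx = pd.DataFrame({"value": [10, 20, 30]}, index=["x", "y", "z"])

by_label = df_idx.loc["y", "value"]      # label-based: row "y"
by_position = df_idx.iloc[0]["value"]    # position-based: first row
several = df_idx.loc[["x", "z"]]         # a list of labels returns a DataFrame

print(by_label, by_position)
print(several)
```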

3. If you only need some columns and ignore other columns, you can just select the columns:

#show data based on columns selected
df[['string','float']]

4. You can also apply an IF condition to your data, similar to the filter in Excel. Use this command and see the result.

#show data with condition
df[df['float']>4]
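
Several conditions can be combined with `&` (and) or `|` (or); each condition needs its own parentheses. A short sketch using the same numbers as the tutorial data (the name `df_cond` is just for this example):

```python
import pandas as pd

df_cond = pd.DataFrame({
    "integer": [1, 2, 3, 6, 7, 23, 8, 3],
    "float": [2, 3.4, 5, 6, 2, 4.7, 4, 8],
})

# rows where float > 4 AND integer < 10
both = df_cond[(df_cond["float"] > 4) & (df_cond["integer"] < 10)]
print(both)
```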

5. You can also rename columns using this command (rename returns a new DataFrame; assign the result if you want to keep it):

#rename column in DataFrame
df.rename(columns={'string':'char'})
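
A quick sketch of that behaviour on a made-up two-column frame: the original DataFrame keeps its column names unless the result is assigned.

```python
import pandas as pd

df_names = pd.DataFrame({"string": ["a", "b"], "float": [1.0, 2.0]})

# rename returns a new DataFrame; df_names itself is unchanged
renamed = df_names.rename(columns={"string": "char"})

print(list(df_names.columns))
print(list(renamed.columns))
```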

6. When I created the data, I added one row that contains a NaN (null) value. Missing values are a common issue in data. So, how do we deal with them? First, we need to know whether our DataFrame contains missing values, using this command.

#Show NaN value in DataFrame
df.isnull()

7. The simplest way to deal with missing values is to drop all rows that contain them. How do we drop missing values? Here is the command:

# Drop all rows contain NaN value
df.dropna()

8. We can also compute summaries of our data (e.g., mean, median, mode, max, etc.). Use this command and see what you get!

# Show mean, median, and maximum of a DataFrame column
mean = df['float'].mean()
print("mean =", mean)
median = df['float'].median()
print("median =", median)
max_value = df['float'].max()  # named max_value to avoid shadowing the built-in max()
print("max =", max_value)
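
The same three numbers can also be computed in one call with `agg`, which returns a Series indexed by the statistic names. A minimal sketch on the same float values (the variable names are just for this example):

```python
import pandas as pd

floats = pd.Series([2, 3.4, 5, 6, 2, 4.7, 4, 8], name="float")

# one call, several statistics
summary = floats.agg(["mean", "median", "max"])
print(summary)
```

For a broader overview, `describe()` on a column or on the whole DataFrame reports count, mean, standard deviation, quartiles, minimum, and maximum.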

Here is the result: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/1%20Basic%20Pandas%20Operation%20.ipynb

 

 

Data Analysis · Matplotlib · Plotting in Python

Plot multiple lines in one chart with different style Python matplotlib

Sometimes we need to plot multiple lines in one chart using different styles such as dots, lines, or dashes, or maybe with different colours as well. It is quite easy to do that in basic Python plotting using the matplotlib library.

We start with the simple one, only one line:


import matplotlib.pyplot as plt
plt.plot([1,2,3,4])

# when you want to give a label
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.show()

 

Let’s go to the next step: several lines with different colours and styles.


import numpy as np
import matplotlib.pyplot as plt

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

With only three lines it still seems easy, but what if there are many lines, e.g. six?


import matplotlib.pyplot as plt
import numpy as np

x=np.arange(6)

fig=plt.figure()
ax=fig.add_subplot(111)

ax.plot(x,x,c='b',marker="^",ls='--',label='Greedy',fillstyle='none')
ax.plot(x,x+1,c='g',marker=(8,2,0),ls='--',label='Greedy Heuristic')
ax.plot(x,(x+1)**2,c='k',ls='-',label='Random')
ax.plot(x,(x-1)**2,c='r',marker="v",ls='-',label='GMC')
ax.plot(x,x**2-1,c='m',marker="o",ls='--',label='KSTW',fillstyle='none')
ax.plot(x,x-1,c='k',marker="+",ls=':',label='DGYC')

plt.legend(loc=2)
plt.show()

Now, we can plot multiple lines with multiple styles in one chart.
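
When the number of lines grows, the styles do not have to be written out one by one. The sketch below is one possible approach, not the one used above: it cycles through a list of markers and line styles with `itertools.cycle`. The series names and values are made up, and the Agg backend is selected only so the sketch runs without a display.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(6)
# hypothetical series to plot, keyed by legend label
series = {"linear": x, "shifted": x + 1, "squared": x ** 2, "cubed": x ** 3}

markers = itertools.cycle(["^", "v", "o", "+", "s"])
linestyles = itertools.cycle(["-", "--", ":", "-."])

fig, ax = plt.subplots()
# zip stops when the dict is exhausted, so the endless cycles are safe here
for (label, y), marker, ls in zip(series.items(), markers, linestyles):
    ax.plot(x, y, marker=marker, ls=ls, label=label)

ax.legend(loc=2)
```

In a notebook, finish with plt.show() as in the examples above.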

These are some resources from the matplotlib documentation that may be useful:

  1. Marker types of matplotlib https://matplotlib.org/examples/lines_bars_and_markers/marker_reference.html
  2. Line styles matplotlib https://matplotlib.org/1.3.1/examples/pylab_examples/line_styles.html
  3. Matplotlib marker explanation https://matplotlib.org/api/markers_api.html

In this experiment, we defined each line manually, which can be hard if we want to generate a line chart from a dataset. In the next experiment, we use a real Excel dataset and plot the data as a line chart with different markers, without defining each line one by one -> just check it here https://pydatascience.org/2017/12/05/read-the-data-and-plotting-with-multiple-markers/

*I took some parts of the code from StackOverflow

Anaconda · Data Analysis

Python for Data Science using Anaconda

A few years ago, I had a complete environment set up for Python data analysis on my Macbook. I had Python with virtual environments (virtualenv), and of course one of my virtualenvs had a complete set of data science libraries. I used this virtualenv whenever I wanted to do data analysis. I also had some experience using docker-machine to be more productive and reproducible in analyzing data, but it was too heavy for my laptop, so I kept using virtualenv. I had heard about Anaconda, or conda, the platform that bundles all the data science libraries in one place, but I had never tried it. It was hard for me to move on.

After I started my PhD study at UQ Australia, I got a new computer in my office, and it runs Windows. I wanted to start working with data using Python on that computer, and I tried to remember all the steps I had to go through before I could do data analysis on it: installing Python, installing PIP, installing virtualenv, installing virtualenv wrapper, installing all the data science libraries into one of my virtualenvs, and then starting to work!

As a lazy guy, I did not want to do that. I went to the Anaconda website, decided to try Anaconda, and tarrraa!!!!!!!!

Just go to https://www.anaconda.com/download/, download the installer that matches your operating system, install it, and then launch the Anaconda Navigator.

 

It surprised me that I can even run RStudio from Anaconda Navigator. If you enjoy using Jupyter (IPython Notebook), just press launch for Jupyter. It also has a beautiful IDE for data analysis in Python, which is Spyder. It is really cool! Previously, when I used R, I always used RStudio as my IDE for data analysis, and now if you want to move to Python, you can use a similar IDE, which is Spyder.

If you want to know which data science libraries you need to install manually if you don’t want to use Anaconda, please visit this link. The picture below describes some Python libraries for data science that may be useful for you:

Bokeh · Data Analysis · Data Mining · Keras · Machine Learning · Matplotlib · NumPy · Pandas · Plotting in Python · Ploty · SciKit-Learn · SciPy · Seaborn

Python for Data Science

I have spent two years processing and manipulating data using R, and I mostly use this language for my research project. I had only heard of Python and had never tried it for my work before. But now that I use Python, I have really fallen in love with this language. Python is very simple, and it is known as one of the easiest languages to learn. The reason I previously used R is that it is supported by tons of libraries for scientific analysis, all of them open source. Now, with the popularity of Python, I can easily find all the libraries I need in Python, and all of them are open source as well.

There are core libraries that you must know when you start to do data analytics using Python:

  1. NumPy, which stands for Numerical Python. Python is different from R: R is a language designed for scientists, whereas Python is a general-purpose programming language. That’s why Python needs a library to handle numerical work such as complex arrays and matrices. Repo project link: https://github.com/numpy/numpy
  2. SciPy, a library for scientific computing that handles things such as statistical computing, linear algebra, optimization, etc. Repo project link: https://github.com/scipy/scipy
  3. Pandas. If you have experience with R, it is very similar to R’s DataFrame. Using a DataFrame, we can easily manipulate, aggregate, and analyze our dataset. The data is shown in a table similar to an Excel spreadsheet or a DataFrame in R, and it is convenient to access the data by columns, rows, or otherwise. Repo project link: https://github.com/pandas-dev/pandas
  4. Matplotlib. Plotting is very important for data analysis. Why do we need plotting? The simple answer is that it makes the data easier for anyone to understand; as we know, one picture can say 1000 words. To generate visualizations from a dataset, we absolutely need data visualization tools. If you have experience with Excel, it is very easy: just select the table that you want to plot and choose the plot type, such as bar chart, line chart, etc. In R, the most popular plotting tool is ggplot; you can use the standard ‘plot’ function in R, but if you want more advanced and more beautiful figures you need ggplot. How about Python? Matplotlib is the basic library for visualization in Python. Repo project link: https://github.com/matplotlib/matplotlib
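
To give a feel for how these libraries fit together, here is a tiny sketch using NumPy and Pandas only. The numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: arrays and matrix operations
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
col_means = matrix.mean(axis=0)   # mean of each column
product = matrix @ matrix         # matrix multiplication

# Pandas: the same numbers as a labelled table
frame = pd.DataFrame(matrix, columns=["a", "b"])
total_a = frame["a"].sum()

print(col_means)
print(product)
print(total_a)
```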

Those are the core libraries that you need when you start to use Python for data analytics. There are tons of other Python libraries out there; here are some that may be useful for you:

  1. SciKit-Learn: when you want to apply machine learning, you have to understand this.
  2. Scrapy: for scraping data from the Web when you want to gather data from websites for your analysis, for instance collecting tweets from Twitter.
  3. NLTK: if you want to do natural language processing.
  4. Theano, Tensorflow, Keras: when you are not satisfied with NumPy performance, or want to apply neural network algorithms or do deep learning, you have to understand these libraries.
  5. Interactive visualization tools: matplotlib is a basic plotting tool, and it is enough for me as a researcher, especially for publications, but when we want dynamic or more interactive plotting, we can use Seaborn, Plotly, or Bokeh.

[Figure: Python environment]

If you do not want to think too much about how to install all of those libraries, just try Anaconda; it is really cool.

See ya next time

Brisbane, 24 November 2017