Correlation is one of the most common tools in a data analyst's workflow. It tells us how strongly two attributes are related to each other. Let's start with an example: an analysis of the "Boston housing dataset", using the code below. If you are not familiar with Jupyter Notebook, Pandas, NumPy, and other Python libraries, I have a couple of older posts that may be useful for you: 1) setup anaconda 2) understand python libraries for data science.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the Boston Housing dataset bundled with scikit-learn
# (note: load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston = datasets.load_boston()
boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
We can use the command boston.head() to see the data, and boston.shape to see its dimensions. The command below gives the correlation value between every pair of attributes in the Boston housing dataset (in this experiment I used the Pearson correlation).
dataCorr = boston.corr(method='pearson')
dataCorr
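For reference, the Pearson coefficient of two attributes is their covariance divided by the product of their standard deviations. The sketch below checks that definition against corr on two made-up columns; the names x and y and the random data are placeholders of mine, not Boston attributes.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the Boston dataset)
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": y})

# Pearson r = sum of products of centered values,
# divided by the square root of the product of the sums of squares
xc = df["x"] - df["x"].mean()
yc = df["y"] - df["y"].mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# The same value, read from the pandas correlation matrix
r_pandas = df.corr(method="pearson").loc["x", "y"]

print(r_manual, r_pandas)
```

The two printed numbers should agree up to floating-point rounding.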
Calling boston.corr(...) returns the correlation matrix, as in the figure below:
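The question is: how do we remove the duplicates from this correlation matrix to make it more readable? The matrix is symmetric, so every pair of attributes appears twice, and each attribute is also paired with itself on the diagonal. I found a nice answer on Stack Overflow; we can use this command: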
# Keep |correlation| >= 0.01 and reshape the matrix into (attribute, attribute, value) rows
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
# Drop the pairs of an attribute with itself (the diagonal)
dataCorr = dataCorr[dataCorr['level_0'].astype(str) != dataCorr['level_1'].astype(str)]

# Filter out lower/upper triangular duplicates
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'], x['level_1']])), axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)

# Get the 10 highest correlations among pairwise attributes
dataCorr.sort_values(by=[0], ascending=False).head(10)
Finally, we get a table with one row per pair of attributes and its correlation value, and most importantly, without any duplicates.
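The same result can also be reached by masking the matrix so that only the entries above the diagonal survive. This is a minimal sketch of that idea, not the Stack Overflow answer itself. It runs on a small made-up DataFrame so it does not depend on the Boston data, and the column names attr_1, attr_2 and corr are my own choices.

```python
import numpy as np
import pandas as pd

# Small synthetic dataset (made up for illustration, not the Boston data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "d": -a + rng.normal(scale=0.5, size=200),
})

corr = df.corr(method="pearson")

# Boolean mask that is True strictly above the diagonal (k=1 excludes the diagonal)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() replaces everything outside the mask with NaN; stack() reshapes the
# matrix into long form, and dropna() removes any masked entries that remain
pairs = corr.where(upper).stack().dropna().reset_index()
pairs.columns = ["attr_1", "attr_2", "corr"]

# Highest correlations first
top_pairs = pairs.sort_values(by="corr", ascending=False).head(10)
print(top_pairs)
```

With four attributes this leaves six rows, one per unordered pair. The mask discards the diagonal and the lower triangle in one step, so no helper column is needed.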
I also found another way on GitHub, where someone wrote a nice function to remove this duplication. Please see this link.
Thank you and see you next time.