Data Analysis · Data Mining · NumPy · Pandas · SciKit-Learn

Remove Duplicates from Correlation Matrix Python

Correlation is one of the most important things that usually used by the data analysts in their analytical workflow. By using correlation, we can understand the mutual relationship or association between two attributes. Let’s start with the example. For instance, I want to do an analysis of the “Boston housing dataset”, let see the example code below. If you are not familiar with Jupyter Notebook, Pandas, Numpy, and other python libraries, I have a couple of old posts that may useful for you: 1) setup anaconda 2) understand python libraries for data science.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

boston = datasets.load_boston() #Load Boston Housing dataset, this dataset is available on Scikit-learn
boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
)

We can use the command boston.head() to see the data, and boston.shape to see the dimension of the data. We can easily use this command below to get correlation value among all attributes in Boston housing dataset. (e.g., in this experiment I used Pearson correlation).


dataCorr = boston.corr(method='pearson')
dataCorr

After using this command, we will see the matrix of correlation like in Figure below:

The question is, how to remove duplicates from this matrix of correlation to make it more readable? I found a nice answer on stackoverflow, we can use this command:


dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]

# filtering out lower/upper triangular duplicates 
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)

dataCorr.sort_values(by=[0], ascending=False).head(10) #Get 10 highest correlation of pairwaise attributes

Finally, we get the table that consists of the pair of attributes and the correlation values, and the most important thing is we do not have any duplication.

I also found another way from a guy in Github who create a nice function to remove this duplication. Please see through this link.

Thank you and see you next time.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.