Data Engineering with Google Cloud Professional Certificate

I previously posted about free courses from Coursera during the COVID-19 pandemic; this post is essentially a follow-up to that one.

In this post I want to talk about the Data Engineering with Google Cloud Professional Certificate course!

It started when Coursera made several paid courses free because of COVID-19, and I was keen to take advantage of the opportunity. In early May a friend told me there was a promotion for the Data Engineering with Google Cloud Professional Certificate course. Coursera always offers a one-week free trial for its paid courses, but this promotion was a one-month trial. I was immediately enthusiastic and optimistic that I could finish the course in under a month, so I would not have to pay at all. Enrolling does require entering a credit card. You are only charged if you have not finished after one month (in my case, because I got the one-month promo); with the regular one-week trial, you are charged after one week if you do not cancel. You can cancel at any time during the trial period and you will not be charged.

FYI: the promo link for the data engineering course was https://www.coursera.org/promo/dataEngineer, but it no longer works. To see promotions for other courses, you can use this link: https://www.coursera.org/promo/free-courses-college-students.

Alright, let me tell you about the course I took; it is genuinely interesting. This course prepares you for the professional Data Engineer certification on Google Cloud Platform. At first I thought it would be quick since it is "just one course", but it turns out this one program contains six courses, and each course has plenty of quizzes and exercises. The exercises are hands-on labs using Qwiklabs (see my videos on Google Cloud Platform). The program is actually designed to take more than two months. The final course also has an exam with a timer, and the technical part is hands-on using Qwiklabs as well. In short, if you want to take the Data Engineer certification on GCP, this program is worth it even if you have to pay for it. And try to finish it in under a month!

Now, about the certificates, since my Indonesian friends usually ask about this. Personally I am not that interested in the certificate itself; what matters is the knowledge. After all, this is a course-completion certificate, not the professional certification. That said, this program really is the preparation for taking the professional certification exam.

From this program I received seven certificates: each of the six courses gives its own certificate once completed, plus one certificate stating that all six courses have been completed.

You can see them at this link: https://www.coursera.org/account/accomplishments/professional-cert/X9HECYUG2LPR

Data Engineering with Google Cloud Professional Certificate
Google Cloud Platform Big Data and Machine Learning Fundamentals
Modernizing Data Lakes and Data Warehouses with GCP
Building Batch Data Pipelines on GCP
Building Resilient Streaming Analytics Systems on GCP
Smart Analytics, Machine Learning, and AI on GCP
Preparing for the Google Cloud Professional Data Engineer Exam

From these six courses I learned quite a lot, for example:

  • What a data lake and a data warehouse look like on Google Cloud Platform (GCP): for example, when we have many data sources, how to set up the pipeline, and the ETL vs. ELT concepts (Extract Transform Load vs. Extract Load Transform).
  • How to set up a pipeline for batch data analytics (for example, when we already have a Hadoop/Spark platform on-premises in our own data center and want to migrate to GCP), how to run a PySpark job on a GCP cluster using Dataproc, how to set up the master and worker nodes on GCP, and so on.
  • How to set up a pipeline for streaming data analytics. GCP has Pub/Sub messaging for event streams, then Dataflow for ETL, and BigQuery for analysis; if the stream is very fast, you can use Bigtable.
  • How machine learning works on GCP: from ready-to-use APIs where Google provides both the data and the model, to training on your own data in Google Cloud with a very easy GUI, up to the most advanced option of building your own model on your own data, for example with TensorFlow on GCP. GCP also has a lot of ML products, such as Vision, Language, Speech, and so on.
  • I also learned that BigQuery has a feature called BigQuery ML, which lets you train and evaluate models just by writing SQL queries (see the sketch after this list). The feature is still fairly limited; at the time of the course, BigQuery ML supported: 1) linear regression for forecasting, 2) binary/multi-class logistic regression for classification, 3) k-means clustering, and 4) importing TensorFlow models.
  • Finally, there is an exam with a timer; it has a theory part and, the fun part, a hands-on part using Qwiklabs. It is set up as a challenge that you have to complete!
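
To give a rough idea of what BigQuery ML looks like, here is a minimal sketch written from memory, with made-up dataset, table, and column names, assuming the google-cloud-bigquery Python client and credentials are already set up:

# Minimal BigQuery ML sketch -- dataset, table, and column names below are hypothetical
from google.cloud import bigquery

client = bigquery.Client()  # assumes a default project and credentials are configured

# Train a logistic regression model with a single CREATE MODEL statement
train_sql = """
CREATE OR REPLACE MODEL `my_dataset.purchase_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  is_mobile,
  operating_system,
  purchased AS label   -- BigQuery ML expects the target column to be called `label`
FROM `my_dataset.visits`
"""
client.query(train_sql).result()  # wait for the training job to finish

# Evaluate the trained model, also with SQL
for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `my_dataset.purchase_model`)").result():
    print(row)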

Info: I recorded about 20 videos while working on the Qwiklabs experiments. Some are already uploaded to my YouTube channel and new ones will be released once a week, every Monday at 18:45 Brisbane time (15:45 WIB).

Feel free to subscribe to my YouTube channel, Rischanlab.

All of the hands-on videos from this course will be in this playlist:

By the way, that is just Coursera; there are also edX and several other good online course portals. So, friends in Indonesia, take advantage of this opportunity. There are plenty of resources; it is just a matter of whether we use them or not.

I hope this is useful. Greetings from the banks of the Brisbane River.

Rischan Mafrur

Free Data Science Courses from Coursera during COVID-19

Who doesn't know Coursera? It has many courses, some free and some quite expensive. And quality is never in question, because the courses are provided by top universities and institutions. During the COVID-19 pandemic, Coursera is offering several previously paid courses for free, and you can even get a certificate.

I am actually concerned about what is happening in Indonesia: even in the middle of an outbreak, some people are trying to profit from it, for example by getting projects from the Kartu Prakerja (pre-employment card) program even though the quality of the online training material is poor. It would be better to simply provide free internet during the PSBB (large-scale social restrictions) period, so that people are free to choose whichever online training they want.

Here are my three certificates from the three courses I took on Coursera:

Certificate from Google Cloud
Certificate from LearnQuest (I learned Azure in this course)
Certificate from AWS

These three courses used to be paid but became free, with a certificate, during the corona outbreak. When you open a course page, just click enroll; the course fee will be shown there, but simply proceed and you will be taken to a page saying the course has already been paid for.

The links to the three courses I took are listed below:

During this fasting month I set out to learn a few cloud providers, especially for ML/machine learning. The best-known ones are #AWS, #Azure, and #GoogleCloud.

I had actually used AWS before, but only for storage and building APIs, never its machine learning features. These are the three certificates I got from the courses I took. The AWS one is probably just a "getting started" course, so the quizzes are only multiple choice; the Azure one also has hands-on assignments, with the source code attached so you just have to try it out.

Of the three courses, the best one in my opinion is the one from Google Cloud. The quizzes are not just multiple choice: you have to try things yourself, with a temporary username and password and a timer to follow the guide. The Google Cloud course has three Qwiklabs: building a chatbot, training an ML model to detect cloud types, and trying BigQuery ML to predict purchases. Pretty fun.

Other courses offered for free during the COVID-19 pandemic can be found at this link:

https://blog.coursera.org/coursera-together-free-online-learning-during-covid-19/

I hope this is useful. Have fun trying it out!

Explainable AI

A few days ago an invitation to a seminar appeared on the data science mailing list at my university, and the title looked very interesting: "Toward white-box machine learning". I immediately added it to my calendar so I would not forget to attend. The speaker is a researcher who got his PhD from ANU, then did a postdoc at NUS, and now works as a researcher at Griffith University. He is the author of the paper "Silas: High Performance, Explainable and Verifiable Machine Learning". I initially thought the talk would be about white-box deep learning, but it turns out the ML he uses is ensemble trees (number 4 in the figure below). Still, it is impressive: the framework he built, "Silas", can answer users' questions when they want to know why the system produced a particular output. (If you are curious, you can read the paper directly; Silas is also available, so we can try it.)

This figure is taken from one of the presenter's slides, and it makes the point clearly: a linear model, for example, is excellent for logical analysis because it is easy to understand and transparent; we can even tell why the model gave a particular answer. However, such linear models usually perform poorly. Now look at, say, SVMs or the currently trendy deep learning: deep learning is very accurate, but it is very opaque, what we call a "black box".

Another figure (the second one), which I took from KDnuggets, also illustrates the relationship between accuracy and explainability. As the figure shows, neural networks are the champions in terms of accuracy but of course the worst in terms of explainability. Conversely, regression, especially linear regression, is very good in terms of explainability but far behind in accuracy compared with DL/NN.

Explainable AI is very interesting; if I am not mistaken, last year DARPA poured around two billion dollars into explainable AI research. In my humble opinion, explainable AI matters most for things that involve people's lives: if we want to deploy deep learning in the ICU, in healthcare more broadly, or in self-driving cars, we really do need to consider explainability, because the decisions the AI makes are critical and we need to know why the system produced a given output. But for something like deep learning applied to games, I don't think we need to worry much about explainability. A good example is the 2016 documentary about AlphaGo from Google DeepMind, which beat the Go world champion Lee Sedol; we certainly don't need to fret about explainable AI in that context.

Here is the AlphaGo YouTube video:

So the question is: is anyone here interested in, or currently doing research on, explainable AI? It would be nice to grab a coffee and chat.

Greetings from the banks of the Brisbane River

Rischan

Numpy vs Pandas Performance

Hi guys!

In the last post, I wrote about how to deal with missing values in a dataset. Honestly, that post is related to my PhD project. I will not explain the details of the project here, but I need to replace a certain percentage (10%, 20%, ..., 90%) of my dataset with NaN and then impute all of those NaN values. In that post, I experimented with the Boston dataset, which is quite small (13 dimensions and 506 rows/tuples).

I ran into problems when I experimented with a bigger dataset, for instance the Flights dataset. This dataset consists of 11 dimensions and almost one million rows (tuples), which is quite large. See the figure below:

Replacing 80% of the values with NaN in the Flights dataset using Pandas operations takes around 469 seconds, which is really slow. Moreover, in this case I only worked on 8 dimensions (only the numerical attributes).

I can think of a few reasons for the poor performance: 1) the code itself; 2) using Pandas for a large number of element-wise operations; or 3) both.

I tried to find the answer and found two posts comparing the performance of NumPy and Pandas, including when we should use each of them: ([1], [2])

After reading those posts, I decided to use NumPy instead of Pandas for this operation, because my dataset has a large number of tuples (almost one million).
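
To illustrate where the difference comes from, here is a toy benchmark of my own (the sizes are arbitrary): setting cells one by one with .iloc in a Python loop, similar to what the code in my previous post did, versus replacing them in a single vectorized np.put call:

# Toy benchmark: cell-by-cell .iloc assignment vs. one vectorized np.put call
import time
import random
import numpy as np
import pandas as pd

rows, cols = 50_000, 8
df = pd.DataFrame(np.random.rand(rows, cols))
arr = df.values.copy()

n = int(arr.size * 0.1)                       # replace 10% of the values
flat_idx = random.sample(range(arr.size), n)  # flat indices into the array

start = time.time()
for idx in flat_idx:                          # slow: one .iloc assignment per cell
    df.iloc[idx // cols, idx % cols] = np.nan
print('pandas .iloc loop:', round(time.time() - start, 2), 's')

start = time.time()
np.put(arr, flat_idx, np.nan)                 # fast: a single vectorized call
print('numpy np.put     :', round(time.time() - start, 4), 's')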

Here is how I implemented it. The function below replaces values with NaN:

 
import random
import numpy as np

def dropout(a, percent):
    # work on a copy so the original array is untouched
    mat = a.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask, sampled without replacement
    mask = random.sample(range(mat.size), prop)
    # replace the selected positions with NaN
    np.put(mat, mask, [np.NaN] * len(mask))
    return mat

The code below is for missing-value imputation. It is based on a scikit-learn example (scikit-learn has functions for imputing missing values). I imputed all numerical missing values with the mean and all categorical missing values with the most frequent value:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in the column. Columns of other types are imputed with the mean
    of the column.
    """
    def fit(self, X, y=None):
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Here are the results:

In the figure above, you can see that I converted the Pandas DataFrame to a NumPy array. Just use this command:

data = df.values

and your DataFrame is converted to a NumPy array. I then ran the dropout function while all of the data was in NumPy array form, and converted it back to a Pandas DataFrame once the operations were finished.
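
Putting the pieces together, the whole round trip looks roughly like this (a sketch; df is assumed to be the original DataFrame):

# End-to-end sketch: DataFrame -> NumPy array -> dropout -> DataFrame -> impute
data = df.values                                       # convert the DataFrame to a NumPy array
data_nan = dropout(data, 0.8)                          # replace 80% of the values with NaN
df_nan = pd.DataFrame(data_nan, columns=df.columns)    # convert back to a DataFrame
df_imputed = DataFrameImputer().fit_transform(df_nan)  # impute the NaN values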

Using NumPy operations, replacing 80% of the data with NaN, including imputing all of the NaN values with the most frequent values, only takes 4 seconds. Moreover, in this case I worked on all 11 dimensions (categorical and numerical attributes).

The point of this post is that your choice matters. When we want to deal with a large number of tuples, we may want to consider choosing NumPy instead of Pandas. That said, another important factor is the code itself: not everyone writes optimized code!

See you again in the next post!

Impute NaN values with mean of column Pandas Python

Incomplete data, or missing values, is a common issue in data analysis. Data collected by systems or humans often contains missing values. We can still run an analysis on data with missing values, but that means we are ignoring the quality of the data, and the missing values may lead to wrong results. The common approach is to drop all tuples that contain missing values. The problem with this approach is that it can bias the results, especially when a large share of the rows contain NaN values, because we end up dropping a large number of tuples. Dropping works when the data has only a small number of missing values; when there are many, we have to repair them instead.

Many imputation methods have been proposed for repairing missing values. The simplest is to fill missing values with the mean, median, or mode, either of the whole dataset or of each column in the DataFrame.

In this experiment, we will use the Boston housing dataset. The Boston data frame has 506 rows and 14 columns. The dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University, and it has been used in many machine learning papers that address regression problems. The MEDV attribute is the target (dependent variable); the others are independent variables. The dataset is available in the scikit-learn library, so we can import it directly. As usual, I am using a Python Jupyter notebook for this experiment. If you are not familiar with Jupyter Notebook, Pandas, NumPy, and other Python libraries, I have a couple of old posts that may be useful for you: 1) setting up Anaconda and 2) understanding Python libraries for data science.

Let’s get started…


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

boston_bunch = load_boston()
dfx = pd.DataFrame(boston_bunch.data, columns=boston_bunch.feature_names)  # independent variables
dfy = pd.DataFrame(boston_bunch.target, columns=['target'])  # dependent variable
boston = dfx.join(dfy)

We can use the command boston.head() to see the data and boston.shape to see its dimensions. The next step is to check the number of NA values in the Boston dataset using the command below.


boston.isnull().sum()

The result shows that the Boston dataset has no NA values. The question is: how do we change some values to NA? Is that possible?


# Randomly change 20% of the values to NaN
import collections
import random

df = boston
replaced = collections.defaultdict(set)
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.2*len(ix)))
for row, col in ix:
    if len(replaced[row]) < df.shape[1] - 1:  # keep at least one value per row
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break

Using the code above, we replace 20% of the values in the Boston dataset with NA. We can also change the percentage of NA values by editing the code above (see the .2 factor). The NA values are placed completely at random with respect to the whole dataset. Now, let's check again whether there are any missing values in our Boston dataset:


boston.isnull().sum()

The result shows that every column has around 20% NaN values. So how do we replace (impute) all of those missing values with the mean of each column?


#fill NA with mean() of each column in boston dataset
df = df.apply(lambda x: x.fillna(x.mean()),axis=0)

Now use the command boston.head() to see the data. We have filled the missing values with the mean of each column. We can also impute the missing values using median() or mode() by swapping out the mean() function.
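
For example (small variants of the line above, not from the original notebook):

# Variant: impute with the median of each column
df = df.apply(lambda x: x.fillna(x.median()), axis=0)

# Variant: impute with the mode of each column;
# mode() returns a Series (ties are possible), so take its first value
df = df.apply(lambda x: x.fillna(x.mode().iloc[0]), axis=0)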

This imputation method is the simplest one; there are many more sophisticated algorithms out there (e.g., regression-based or Monte Carlo methods) that can be used to repair missing values. Maybe I will post about them next time.

All of the code and results can be accessed at this link: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/2%20Impute%20NaN%20values%20with%20mean%20of%20each%20columns%20.ipynb

Thank you, see you

Remove Duplicates from Correlation Matrix Python

Correlation is one of the most important tools data analysts use in their analytical workflow. Using correlation, we can understand the mutual relationship or association between two attributes. Let's start with an example: suppose I want to analyze the Boston housing dataset; see the example code below. If you are not familiar with Jupyter Notebook, Pandas, NumPy, and other Python libraries, I have a couple of old posts that may be useful for you: 1) setting up Anaconda and 2) understanding Python libraries for data science.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

boston = datasets.load_boston()  # load the Boston housing dataset, available in scikit-learn
boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])

We can use the command boston.head() to see the data and boston.shape to see its dimensions. We can easily use the command below to get the correlation values among all attributes in the Boston housing dataset (in this experiment I used the Pearson correlation).


dataCorr = boston.corr(method='pearson')
dataCorr

After running this command, we see a correlation matrix like the one in the figure below:

The question is: how do we remove the duplicates from this correlation matrix to make it more readable? I found a nice answer on Stack Overflow; we can use these commands:


# keep correlations with |r| >= 0.01 and flatten the matrix into rows
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
# drop self-correlations (the diagonal)
dataCorr = dataCorr[dataCorr['level_0'].astype(str) != dataCorr['level_1'].astype(str)]

# filter out the lower/upper triangular duplicates
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'], x['level_1']])), axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)

dataCorr.sort_values(by=[0], ascending=False).head(10)  # ten highest correlations of pairwise attributes

Finally, we get a table with the pairs of attributes and their correlation values, and, most importantly, without any duplicates.
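
Another common approach (my own sketch, not the GitHub function mentioned below) is to mask the upper triangle of the correlation matrix with NumPy before stacking it:

import numpy as np

corr = boston.corr(method='pearson')
mask = np.triu(np.ones(corr.shape, dtype=bool))   # True on and above the diagonal
pairs = corr.mask(mask).stack().reset_index()     # masked cells become NaN and stack() drops them
pairs.columns = ['attr_1', 'attr_2', 'correlation']
pairs.sort_values(by='correlation', ascending=False).head(10)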

I also found another approach from someone on GitHub who wrote a nice function to remove this duplication. Please see this link.

Thank you and see you next time.

Python Pandas DataFrame Basics Tutorial

In this post, I am going to show you how to deal with data in Python. Before we get there, you need to know which Python libraries matter when you want to work with data. Python has tons of libraries, especially for data science. I have a couple of old posts that may be useful for you: 1) setting up Anaconda and 2) understanding Python libraries for data science. In this tutorial I use Jupyter Notebook; if you are not familiar with it yet, please read the posts above, otherwise just keep going!

Let's start with the simplest case. When you want to deal with data in Python, there is an amazing library called Pandas. If you are familiar with a spreadsheet tool such as MS Excel, Pandas is similar: it shows your data as a table. The difference is that in Excel you just drag and drop, while in Pandas you have to learn its standard syntax and commands.

Let’s get started.

To start analyzing data, you can import your data (e.g., csv, xls, etc.) into the Python environment using Pandas: import pandas as pd, then pd.read_csv('data.csv'). However, to keep this tutorial simple, I will just create the data rather than import it from a file.
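
For reference, reading from a file looks like this (the file names below are just placeholders):

import pandas as pd

df_csv = pd.read_csv('data.csv')                   # read a CSV file into a DataFrame
df_xls = pd.read_excel('data.xlsx', sheet_name=0)  # read the first sheet of an Excel file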


#Create DataFrame
import pandas as pd #when we want to use Pandas, we have to import it
import numpy as np #numpy is another useful library in python for dealing with number
df = pd.DataFrame(
    {'integer':[1,2,3,6,7,23,8,3],
     'float':[2,3.4,5,6,2,4.7,4,8],
     'string':['saya','aku', np.NaN ,'cinta','kamu','a','b','indonesia']}
)

1. To show your DataFrame, just use this command!

#show data in DataFrame
df

2. If you want to access one or more rows of your DataFrame, you can do it using the loc syntax.

#Show data based on index
df.loc[1]

3. If you only need some columns and want to ignore the others, just select those columns:

#show data based on columns selected
df[['string','float']]

4. You can also apply an IF condition to your data, similar to a filter in Excel. Use this command and see the result.

#show data with condition
df[df['float']>4]

5. You can also rename columns using this command:

#rename column in DataFrame
df.rename(columns={'string':'char'})

6. When I created the data, I added one row that contains a NaN (null) value. Missing values are a common issue in data, so how do we deal with them? First, we need to know whether our DataFrame contains any missing values, using this command.

#Show NaN value in DataFrame
df.isnull()

7. The simplest way to deal with missing values is to drop them all. How do we drop missing values? Here is the command:

# Drop all rows contain NaN value
df.dropna()

8. We can also compute summaries of our data (e.g., mean, median, mode, max, etc.). Use these commands and see what you get!

#Show mean, median, and maximum in Data Frame
mean = df['float'].mean()
print("mean =", mean)
median = df['float'].median()
print("median = ", median)
max = df['float'].max()
print("max =", max)

Here is the result: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/1%20Basic%20Pandas%20Operation%20.ipynb

 

 

Linear Regression using Python

Anyone who wants to learn machine learning or become a data scientist usually learns linear regression first. Linear regression is the simplest machine learning algorithm and is generally used for forecasting. The goal of linear regression is to find the relationship between one or more independent variables and a dependent variable by fitting the best line. This best-fit line is known as the regression line and is defined by the linear equation Y = a*X + b.

For instance, consider the height of children versus their age. After collecting data on children's heights and their ages in months, we can plot the data in a scatter plot such as the figure below.

 

Linear regression finds the relationship between age as the independent variable and height as the dependent variable by fitting the best line through all the points in the scatter plot. The result can then be used for prediction, for instance: what height will a child be at 35 months?

 

How do we implement linear regression in Python?

First, to make things easier, I will generate a random dataset for our experiment.


import pandas as pd
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 1)              # 100 random values for the x variable
y = 2 + 3 * x + np.random.rand(100, 1)  # y is a linear function of x plus random noise
x[:10], y[:10]                          # show the first 10 rows of x and y

 

There are many ways to build a regression model; we can build one from scratch or just use a Python library. In this example, I use scikit-learn.


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from matplotlib import pyplot as plt

# Initialize the model
model = LinearRegression()
# Train the model - fit the data to the model
model.fit(x, y)
# Predict
y_predicted = model.predict(x)

# model evaluation (np is NumPy, imported earlier in this notebook)
rmse = np.sqrt(mean_squared_error(y, y_predicted))  # mean_squared_error returns the MSE, so take the square root
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:' ,model.coef_)
print('Intercept:', model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)

# plotting values
plt.scatter(x, y, s=5)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='r')
plt.show()
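
As a small follow-up (my own addition, not part of the original snippet), the fitted model can also predict values for new inputs:

# Predict y for a new (hypothetical) x value
x_new = np.array([[0.5]])
print('Prediction for x = 0.5:', model.predict(x_new))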

Tarraaa!, it’s easy right?

See you next time

Read the data and plotting with multiple markers

Let’s assume that we have Excel data and we want to plot it as a line chart with different markers. Why markers? Imagine we have plotted a line chart with multiple lines in different colours, but we only have black and white ink: after printing, all the lines will be black. That is why we need markers.

 

For instance, our data can be seen in the table above; this is just dummy data describing algorithm performance versus the number of k. We want to plot this data as a line chart. In a previous experiment, we plotted a line chart with multiple lines and multiple styles; however, there we declared each line statically, which becomes tedious if we have to declare every line one by one.

Let’s get started

The first step is to load our Excel data into a pandas DataFrame.


import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt

xl = pd.ExcelFile("Experiment_results.xlsx")

df = xl.parse("Sheet2", header=1, index_col=0)
df.head()

It’s very easy to load Excel data into a DataFrame, and we can use several handy parameters such as the sheet name, header, and index column. In this experiment I use "Sheet2" because my data lives in Sheet2; header=1 tells pandas to take the column names from the second row of the sheet (use header=None if your sheet has no header row), and index_col=0 uses the first column of the Excel sheet as the index of the DataFrame. Now we have the DataFrame shown in the table above.

The second step is to set the markers. As I said in the previous experiment, matplotlib supports a lot of markers, and of course I don't want to define them one by one manually. See the code below:


# build a list of valid markers from mpl.markers
valid_markers = [item[0] for item in mpl.markers.MarkerStyle.markers.items()
                 if item[1] != 'nothing'
                 and not item[1].startswith('tick')
                 and not item[1].startswith('caret')]

# alternatively: valid_markers = mpl.markers.MarkerStyle.filled_markers

markers = np.random.choice(valid_markers, df.shape[1], replace=False)

Now we have a list of valid markers in the valid_markers variable, and we randomly pick df.shape[1] of them (one per column) into the markers variable. Let's start plotting the data.


ax = df.plot(kind='line')
for i, line in enumerate(ax.get_lines()):
    line.set_marker(markers[i])

# adding the legend
ax.legend(ax.get_lines(), df.columns, loc='best')
plt.show()

Taraaaa!!!!, it’s easy, right?

 

 

The next question is: how do we plot a figure like the one below?

(Figure: plotting multiple lines with multiple styles)

Check it out here. 

Plot multiple lines on one chart with different style Python matplotlib

Sometimes we need to plot multiple lines on one chart using different styles such as dots, solid lines, or dashes, or maybe with different colours as well. It is quite easy to do this in basic Python plotting using the matplotlib library.

We start with the simple one, only one line:


import matplotlib.pyplot as plt
plt.plot([1,2,3,4])

# when you want to give a label
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.show()

 

Let’s go to the next step: several lines with different colours and different styles.


import numpy as np
import matplotlib.pyplot as plt

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

With only three lines it still seems easy, but what if there are many lines, e.g., six?


import matplotlib.pyplot as plt
import numpy as np

x=np.arange(6)

fig=plt.figure()
ax=fig.add_subplot(111)

ax.plot(x,x,c='b',marker="^",ls='--',label='Greedy',fillstyle='none')
ax.plot(x,x+1,c='g',marker=(8,2,0),ls='--',label='Greedy Heuristic')
ax.plot(x,(x+1)**2,c='k',ls='-',label='Random')
ax.plot(x,(x-1)**2,c='r',marker="v",ls='-',label='GMC')
ax.plot(x,x**2-1,c='m',marker="o",ls='--',label='KSTW',fillstyle='none')
ax.plot(x,x-1,c='k',marker="+",ls=':',label='DGYC')

plt.legend(loc=2)
plt.show()

Now, we can plot multiple lines with multiple styles on one chart.

These are some resources from the matplotlib documentation that may be useful:

  1. Marker types of matplotlib https://matplotlib.org/examples/lines_bars_and_markers/marker_reference.html
  2. Line styles matplotlib https://matplotlib.org/1.3.1/examples/pylab_examples/line_styles.html
  3. Matplotlib marker explanation https://matplotlib.org/api/markers_api.html

In this experiment we defined each line manually, which can be hard if we want to generate a line chart from a dataset. In the next experiment, we use a real Excel dataset and plot the data as a line chart with different markers without defining each line one by one -> just check it here https://pydatascience.org/2017/12/05/read-the-data-and-plotting-with-multiple-markers/

*Some parts of the code were taken from Stack Overflow.