Explainable AI

A few days ago, on the data science mailing list at my campus, there was a seminar invitation that looked very interesting from the title alone: "Toward white-box machine learning". I immediately added it to my calendar because I did not want to forget to attend. The speaker is a researcher who earned his PhD at ANU, then did a postdoc at NUS, and now works as a researcher at Griffith University. He is the author of the paper "Silas: High Performance, Explainable and Verifiable Machine Learning". At first I thought the talk would be about white-box deep learning, but it turns out the ML he uses is ensemble trees (number 4 in the figure below). Still, it is quite impressive: the framework he built, "Silas", can answer a user's question when the user wants to know why the system produced a particular output. (If you are curious, you can read the paper directly; Silas is also available, so we can try it ourselves.)

I took this figure from one of the presenter's slides. It makes the trade-off clear: a linear model, for example, is very good from a logical-analysis point of view because it is easy to understand and transparent; we can even tell why the model gave a particular answer. However, a linear model like this usually performs poorly. Now look at SVMs, or at deep learning, which is trending right now: deep learning is excellent in terms of accuracy but is very non-transparent, or, as we say, a "black box".

Another figure (the second one), which I took from KDnuggets, also illustrates the relationship between accuracy and explainability. As we can see in the figure, neural networks are the champions in terms of accuracy but, of course, the worst in terms of explainability. Conversely, regression, and especially linear regression, is very good in terms of explainability, but its accuracy falls far behind if we compare it with DL/NN, for example.

Explainable AI is a very interesting topic. If I am not mistaken, last year DARPA even poured around 2 billion dollars into explainable AI research. In my humble opinion, explainable AI matters most for things that involve people's lives: if we want to apply deep learning in the ICU, in healthcare more broadly, or in self-driving cars, then we really do have to consider explainability. Because the decisions the AI makes there are critical, we need to be able to ask why the system gave a particular output. But for, say, deep learning applied to games, I do not think we need to fuss over explainable AI. A good example is the 2016 documentary about AlphaGo from Google DeepMind, which beat the world Go champion Lee Sedol. We certainly do not need to worry too much about explainable AI in that context.

Here is the AlphaGo YouTube video:

 

So the question is: is anyone here interested in, or currently doing research on, explainable AI? It would be nice to chat about it over coffee.

Greetings from the banks of the Brisbane River

Rischan

Linear Regression using Python

For anyone who wants to learn machine learning or become a data scientist, the most obvious thing to learn first is linear regression. Linear regression is the simplest machine learning algorithm and is commonly used for prediction and forecasting. The goal of linear regression is to find the relationship between one or more independent variables and a dependent variable by fitting the best line. This best-fit line is known as the regression line and is defined by the linear equation Y = a*X + b, where a is the slope and b is the intercept.
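As a minimal sketch of what "fitting the best line" actually means (my own illustration, assuming ordinary least squares with a single independent variable), the slope a and intercept b can be computed directly from the data:

import numpy as np

def fit_line(x, y):
    # Ordinary least-squares estimates for y = a*x + b
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - a * x.mean()
    return a, b

The scikit-learn code later in this post computes the same least-squares solution under the hood.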

For instance, consider the height of children versus their age. After collecting data on children's heights and their ages in months, we can plot the data in a scatter plot such as the one in the figure below.

 

Linear regression will find the relationship between age as the independent variable and height as the dependent variable, by finding the best-fit line through all the points on the scatter plot. Finally, the fitted line can be used for prediction, for instance, to predict a child's height when they reach 35 months of age.
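To make the age-vs-height example concrete, here is a small sketch with invented numbers (the ages and heights below are made up purely for illustration, not real measurements) that fits a line and predicts the height at 35 months:

import numpy as np

# Hypothetical data: age in months, height in cm (made up for illustration)
age = np.array([18, 20, 22, 24, 26, 28, 30, 32])
height = np.array([76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.4])

# Fit height = a*age + b with a degree-1 least-squares polynomial
a, b = np.polyfit(age, height, 1)
print('Predicted height at 35 months: %.1f cm' % (a * 35 + b))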

 

How to implement this linear regression in Python?

First, to make things easier, I will generate a random dataset for our experiment.


import pandas as pd
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 1)              # 100 random values in [0, 1) for the independent variable
y = 2 + 3 * x + np.random.rand(100, 1)  # dependent variable: y = 2 + 3x plus uniform noise
x[:10], y[:10]                          # show the first 10 rows of x and y (in a notebook cell)

 

There are many ways to build a regression model: we can build it from scratch or just use a Python library. In this example, I use scikit-learn.


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from matplotlib import pyplot as plt

# Initialize the model
model = LinearRegression()
# Train the model - fit the data to the model
model.fit(x, y)
# Predict
y_predicted = model.predict(x)

# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))  # square root of MSE gives the RMSE
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:', model.coef_)
print('Intercept:', model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)

# plotting values
plt.scatter(x, y, s=5)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='r')
plt.show()
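A small usage note I added: once the model is fitted, predict expects a 2-D array (samples by features), so a single new value has to be wrapped in a nested list:

# Predict y for a new x value; since y was generated as 2 + 3x plus uniform noise
# with mean 0.5, the result should come out close to 2.5 + 3*0.5 = 4.0
new_x = [[0.5]]
print(model.predict(new_x))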

Tarraaa! Easy, right?

See you next time

Python for Data Science

I have spent two years processing and manipulating data with R, mostly for my research projects. Before that, I had only heard of Python and never tried it for my work. But now, after using Python, I have really fallen in love with this language. Python is very simple, and it is known as one of the easiest languages to learn. The reason I previously used R was that it is supported by tons of open-source libraries for scientific analysis. Now, with the popularity of Python, I can easily find all the libraries I need in Python, and all of them are open source as well.

There are core libraries that you must know when you start doing data analytics with Python:

  1. NumPy, which stands for Numerical Python. Python is different from R: R was built with scientists in mind, while Python is a general-purpose programming language. That is why Python needs a library to handle numerical work such as multi-dimensional arrays and matrices. Repo project link: https://github.com/numpy/numpy
  2. SciPy, the library for scientific computing; it covers things such as statistics, linear algebra, optimization, etc. Repo project link: https://github.com/scipy/scipy
  3. Pandas. If you have experience with R, its DataFrame will feel very familiar. Using a DataFrame we can easily manipulate, aggregate, and analyze our dataset. The data is shown in a table similar to an Excel spreadsheet or a DataFrame in R, and it is convenient to access the data by columns, rows, and so on. Repo project link: https://github.com/pandas-dev/pandas
  4. Matplotlib. Plotting is very important for data analysis. Why do we need plotting? The simple answer is that it makes the data easier for anyone to grasp; a picture is worth a thousand words. To generate visualizations from a dataset we need data visualization tools. If you have experience with Excel, it is very easy: just select the table you want to plot and pick a chart type, such as a bar chart or line chart. In R, the most popular plotting tool is ggplot; you can use the standard 'plot' function, but for more advanced and prettier figures you need ggplot. How about Python? Matplotlib is the basic library for visualization in Python, as shown in the short sketch after this list. Repo project link: https://github.com/matplotlib/matplotlib
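To give a feel for how these core libraries fit together, here is a minimal sketch (a toy example of my own, with made-up data) that generates numbers with NumPy, summarizes them with SciPy and pandas, and plots them with Matplotlib:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# NumPy: 50 evenly spaced x values and a noisy linear y
x = np.linspace(0, 10, 50)
y = 2 * x + np.random.normal(0, 1, 50)

# SciPy: quick statistical summary (linear regression of y on x)
result = stats.linregress(x, y)
print('slope:', result.slope, 'r-value:', result.rvalue)

# Pandas: put the data into a DataFrame and inspect it
df = pd.DataFrame({'x': x, 'y': y})
print(df.describe())

# Matplotlib: scatter plot of the data
plt.scatter(df['x'], df['y'], s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.show()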

Those are the basic core libraries you need when you start using Python for data analytics. There are tons of other Python libraries out there; here are some that may be useful for you:

  1. SciKit-Learn: when you want to apply machine learning, you have to understand this one.
  2. Scrapy: to scrape data from the Web, when you want to gather data from websites for your analysis, for instance collecting tweets from Twitter.
  3. NLTK: if you want to do natural language processing.
  4. Theano, TensorFlow, Keras: when you are not satisfied with NumPy's performance, want to apply neural network algorithms, or do deep learning in general, you have to understand these libraries.
  5. Interactive visualization tools: Matplotlib is the basic plotting tool and it is enough for me as a researcher, especially for publications, but when we want dynamic or more interactive plots we can use Seaborn, Plotly, or Bokeh.


If you do not want to think too much about how to install all of those libraries, just try Anaconda; it is really cool.

See ya next time

Brisbane, 24 November 2017