Machine Learning · Matplotlib · NumPy · Pandas · SciKit-Learn

Linear Regression using Python

Whoever wants to learn machine learning or become a data scientist, the most obvious thing to learn in the first time is linear regression. Linear regression is the simplest machine learning algorithm and it is generally used for forecasting. The goal of linear regression is to find a relationship between one or more independent variables and a dependent variable by fitting the best line. This best fit line is known as regression line and defined by a linear equation Y= a *X + b.

For instance, in the case of the height of children vs their age. After collecting the data of children height and their age in months, we can plot the data in a scatter plot such as in Figure below.

 

Linear regression will find the relationship between age as the independent variable and height as the dependent variable. Linear regression will find the best fit line from all points on the scatter plot. Finally, it can be used as a prediction, for instance, to predict what height the children when his age enter 35 months?

 

How to implement this linear regression in Python?

First, to make easier, I will generate a random dataset for our experiment.


import pandas as pd
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 1) #generate random number for x variable
y = 2 + 3 * x + np.random.rand(100, 1) # generate random number of y variable
x[:10], y[:10] #Show the first 10 rows of each x and y


 

There are many ways to build a regression model, we can build it from scratch or just use the library from Python. In this example, I use scikit-learn


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from matplotlib import pyplot as plt

# Initialize the model
model = LinearRegression()
# Train the model - fit the data to the model
model.fit(x, y)
# Predict
y_predicted = model.predict(x)

# model evaluation
rmse = mean_squared_error(y, y_predicted)
r2 = r2_score(y, y_predicted)

# printing values
print('Slope:' ,model.coef_)
print('Intercept:', model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)

# plotting values
plt.scatter(x, y, s=5)
plt.xlabel('x')
plt.ylabel('y')

# predicted values
plt.plot(x, y_predicted, color='r')
plt.show()


Tarraaa!, it’s easy right?

See you next time

Matplotlib · NumPy · Pandas · Plotting in Python

Read the data and plotting with multiple markers

Just assume we have excel data and we want to plot it on a line chart with different markers. Why markers? just imagine, we have plotted a line chart with multiple lines by different colour, but we only have black and white ink, after printing, all lines will be in black colour. That’s why we need markers.

 

For instance, our data can be seen in the Table above, this data about algorithms performance vs. the number of k. I want to plot this data to the line chart.  We already have the previous experiment, how to plot the line chart with multiple lines and multiple styles. However, in the previous experiment, we used the static declaration for each line. It will be hard if we have to declare one by one for each line.

Let’s just start to code..

The first step is to load Excel data to the DataFrame in pandas.


import pandas as pd
import numpy as np

import matplotlib as mpl

xl = pd.ExcelFile("Experiment_results.xlsx")

df = xl.parse("Sheet2", header=1, index_col=0)
df.head()

It’s very easy to load the Excel data to DataFrame, we can use some parameters which very useful such as sheet name, header, an index column. In this experiment, I use “Sheet2″ due to my data in the Sheet2, and I use ”1″ as the header parameter which means I want to load the header to the DataFrame, and if you don’t want to load it just fill it with ”0″. I also use index_col equal to “0”, which means I want to use the first column in my Excel dataset as the index in my DataFrame. Now we have a dataframe that can be seen in the Table above.

The second step is how to set the markers. As I said in the previous experiment that matplotlib supports a lot of markers. Of course, I don’t want to define one by one manually. Let see the code below:


# create valid markers from mpl.markers
valid_markers = ([item[0] for item in mpl.markers.MarkerStyle.markers.items() if
item[1] is not 'nothing' and not item[1].startswith('tick') and not item[1].startswith('caret')])

# valid_markers = mpl.markers.MarkerStyle.filled_markers

markers = np.random.choice(valid_markers, df.shape[1], replace=False)

Now, we have a list of markers inside the ‘markers’ variable. We need to select the markers randomly which are defined by df.shape[1] (the number of columns). Let start to plot the data.


ax = df.plot(kind='line')
for i, line in enumerate(ax.get_lines()):
line.set_marker(markers[i])

# adding legend
ax.legend(ax.get_lines(), df.columns, loc='best')
plt.show()

Taraaaa!!!!, it’s easy, right?

 

 

 

Data Analysis · Matplotlib · Plotting in Python

Plot multiple lines in one chart with different style Python matplotlib

Sometimes we need to plot multiple lines in one chart using different styles such as dot, line, dash, or maybe with different colour too. It is quite easy to do that in basic python plotting using matplotlib library.

We start with the simple one, only one line:


import matplotlib.pyplot as plt
plt.plot([1,2,3,4])

# when you want to give a label
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.show()

 

Let’s go to the next step, several lines with different colour and different styles.


import numpy as np
import matplotlib.pyplot as plt

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

If only three lines, it seems still easy, how if there are many lines, e.g: six lines.


import matplotlib.pyplot as plt
import numpy as np

x=np.arange(6)

fig=plt.figure()
ax=fig.add_subplot(111)

ax.plot(x,x,c='b',marker="^",ls='--',label='Greedy',fillstyle='none')
ax.plot(x,x+1,c='g',marker=(8,2,0),ls='--',label='Greedy Heuristic')
ax.plot(x,(x+1)**2,c='k',ls='-',label='Random')
ax.plot(x,(x-1)**2,c='r',marker="v",ls='-',label='GMC')
ax.plot(x,x**2-1,c='m',marker="o",ls='--',label='KSTW',fillstyle='none')
ax.plot(x,x-1,c='k',marker="+",ls=':',label='DGYC')

plt.legend(loc=2)
plt.show()

Now, we can plot multiple lines with multiple styles in one chart.

These are the resources from matplotlib that may useful:

  1. Marker types of matplotlib https://matplotlib.org/examples/lines_bars_and_markers/marker_reference.html
  2. Line styles matplotlib https://matplotlib.org/1.3.1/examples/pylab_examples/line_styles.html
  3. Matplotlib marker explanation https://matplotlib.org/api/markers_api.html

 

Experiment case using real Excel dataset and plot it to line chart with different markers automatically without defining one by one for each line can be seen on this post https://pydatascience.org/2017/12/05/read-the-data-and-plotting-with-multiple-markers/

*Some part of the codes, I took from StackOverflow

Anaconda · Data Analysis

Python for Data Science using Anaconda

I am a lazy guy, when I already have my setup environment, it is hard for me to move on. In my lovely MacBook, I had Python with python virtual environments (virtualenv), of course, one of my virtualenv has a complete data science libraries. I used this virtualenv when I want to do data analysis. I also had experiences using docker-machine to be more productive and reproducible in analyzing data but it is heavy so I keep using virtualenv.

I have heard about Anaconda or conda which is the platform that bundles all data science libraries to one plate and you just enjoy it, but I never tried yet. Still, it is hard to move on. Then, after I have a new computer in my office and it is Windows. I wanted to start working with data using Python in my computer and I tried to remember what steps I have to do? Installing Python, installing PIP, installing virtualenv, installing virtualenv wrapper, installing all data science libraries to one of my virtualenv, and start working.

As the lazy guy, I do not want to do that. I went to anaconda website and I decided to try Anaconda and tarrraa!!!!!!!!

Just go to https://www.anaconda.com/download/ download the installer which matches with your operating system, install it then launch the Anaconda Navigator.

 

It was surprising me, I even can run Rstudio using Anaconda Navigator. If you enjoy using Jupyter (IPython Notebook), you just need to press launch for the Jupyter. It is also supported by one of beautiful IDE to do data analysis in Python which is Spider. It is really cool. Previously when I used R, I always use Rstudio as my IDE to do data analysis and now if you want to move to Python, you can use a similar IDE which is Spider.

If you want to know what kind of data science libraries that you need to install manually if you don’t want to use Anaconda, please visit this link. The picture below describes the Python environment for Data Science which may useful for you:

Bokeh · Data Analysis · Data Mining · Keras · Machine Learning · Matplotlib · NumPy · Pandas · Plotting in Python · Ploty · SciKit-Learn · SciPy · Seaborn

Python for Data Science

I have been two years doing processing and manipulating data using R and mostly I use this language for my research project. I only heard and never tried Python to analyze my data before. But now, after I use Python, I really fall in love with this language. Python is very simple and it is been known that this language is the easiest one to be learned. The reason why previously I used R was this language supported by tons of libraries for scientific analysis and all of those are open source. Now, with the popularity of Python, all libraries that I need, I can find it easily in Python and all of them open source as well.

There are the core libraries that you must know when you start to do data analytics using Python:

  1. NumPy, it stands for Numerical Python. Python is different with R, the purpose of R language is for scientist and on the other side, Python is just the general programming language. So, it is needed a library which can handle numerical things such as complex arrays and matrics in Python. Repo project link: https://github.com/numpy/numpy
  2. SciPy, this library is for scientific and it handles such as statistic computing, linear algebra, optimation etc. Repo project link: https://github.com/scipy/scipy
  3. Pandas, when you ever play with R, it is very similar with DataFrame. By using DataFrame, we can easily to manipulate, aggregate, and doing analysis on our dataset. The data will be shown in a table like in Excel or DataFrame in R and it convenient to access the data by columns, rows or else. Repo project link: https://github.com/pandas-dev/pandas
  4. Matplotlib, Plotting is very important for data analysis. To make the data easy to read by people and we know that one picture can descript 1000 words, we absolutely need the data visualization tools. If you have experience with Excel, it is very easy, just block the table that you want to plot and select the plotting types such as Bar chart, line chart, etc. In R the most popular tools for plotting is ggplot, basically, you can use standard library ‘plot’ in R but if you want more advanced and more beautiful figure you need to use ggplot.  This library is the basic library for visualizing your data similar as I explained above, Repo project link: https://github.com/matplotlib/matplotlib

Those are the core basic libraries that you need when you start to use this language for data analytics. There are still many libraries that very useful such as:

  1. SciKit-Learn, when you want to apply machine learning on your data analytics.
  2. Scrapy, to scrap the data from the internet, when you want to gather the data from websites for your analysis. I used tweepy library to collect tweets data from Twitter.
  3. NLTK, if you want to do natural language processing.
  4. Theano, Tensorflow, Keras, when you are not satisfied with NumPy performance or want to apply neural network algorithm or doing deep learning stuff, these libraries are very useful for it.
  5. Interactive Visualization Tools, matplotlib is basic plotting tools and it is enough for me as researcher especially for publications, but when you want a dynamic plotting or more interactively you can use Seaborn, Ploty, or Bokeh.

pythonenvironment

If you are a lazy guy like me to install all the stuff above, you can try to use Anaconda, it is really cool. 

 

See ya on the next post..

Brisbane, 24 November 2017