Python Pandas DataFrame Basics Tutorial

In this post, I am going to show you how to work with data in Python. Before going further, you should know which Python libraries you need for working with data. Python has tons of libraries, especially for data science. I have a couple of older posts that may be useful for you: 1) setup anaconda 2) understand python libraries for data science. In this tutorial I use Jupyter Notebook. If you do not have it or are not familiar with it yet, please read the posts above; otherwise, just read on!

Let's start with the simplest case. For working with data, Python has an amazing library called Pandas. If you are familiar with a spreadsheet tool such as MS Excel, Pandas is similar: it shows our data in the form of a table. The main difference is that in Excel you just drag and drop, whereas in Pandas you have to understand the standard syntax and commands of the library.

Let’s get started.

To start analyzing data, you can import your data (e.g., CSV or XLS files) into the Python environment using Pandas: import pandas as pd, then pd.read_csv('data.csv'). However, to keep things simple, in this tutorial I just create my data rather than import it from a file.
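
If you want to try the import step without having a file at hand, here is a minimal sketch. The CSV text and the names csv_text and df_csv are made up for illustration; with a real file you would pass its path (for example 'data.csv') to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as 'data.csv'
csv_text = "integer,float,string\n1,2.0,saya\n2,3.4,aku\n3,5.0,\n"

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())   # first rows of the table
print(df_csv.shape)    # (number of rows, number of columns)
```

The empty last field in the third row is read as a missing value (NaN).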


#Create DataFrame
import pandas as pd #when we want to use Pandas, we have to import it
import numpy as np #numpy is another useful Python library for dealing with numbers
df = pd.DataFrame(
    {'integer':[1,2,3,6,7,23,8,3],
     'float':[2,3.4,5,6,2,4.7,4,8],
     'string':['saya','aku', np.nan ,'cinta','kamu','a','b','indonesia']} #np.nan marks a missing value
)

1. To show your DataFrame, just use this command!

#show data in DataFrame
df

2. If you want to access one or more rows of your DataFrame, you can use the loc syntax.

#Show data based on index
df.loc[1]
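
The same loc syntax also works for several rows, or for rows and columns together. Here is a small sketch using the DataFrame from this tutorial; the variable names row, rows, block and cell are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

row = df.loc[1]                            # one row, returned as a Series
rows = df.loc[[0, 2]]                      # several rows by label, returned as a DataFrame
block = df.loc[1:3, ['integer', 'float']]  # label slice (end label included) plus chosen columns
cell = df.loc[1, 'string']                 # a single value

print(block)
```

Note that a loc slice such as 1:3 includes the end label, unlike a normal Python list slice.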

3. If you only need some columns and want to ignore the others, you can just select them:

#show data based on columns selected
df[['string','float']]
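
One detail worth seeing for yourself is the difference between single and double brackets. A minimal sketch, again with the tutorial's data (one_column and sub_table are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

one_column = df['float']              # single brackets: a Series
sub_table = df[['string', 'float']]   # double brackets (a list of names): a DataFrame

print(type(one_column).__name__)
print(type(sub_table).__name__)
```

The columns come back in the order you list them, not in the order of the original table.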

4. You can also apply a condition to your data, similar to a filter in Excel. Use this command and see the result.

#show data with condition
df[df['float']>4]
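
Conditions can also be combined. The sketch below uses & for "and" and | for "or"; each condition needs its own parentheses. The names both and either are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

# Rows where float > 4 AND integer < 10
both = df[(df['float'] > 4) & (df['integer'] < 10)]

# Rows where float > 4 OR string equals 'saya'
either = df[(df['float'] > 4) | (df['string'] == 'saya')]

print(both)
print(either)
```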

5. You can also rename columns by using this command:

#rename column in DataFrame
df.rename(columns={'string':'char'})
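
Keep in mind that rename gives you back a new DataFrame and leaves the original as it was. If you want to keep the new names, assign the result to a variable, as in this small sketch (renamed is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

# rename returns a new DataFrame; df itself is not changed
renamed = df.rename(columns={'string': 'char'})

print(list(df.columns))       # original column names
print(list(renamed.columns))  # 'string' is now 'char'
```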

6. When I created the data, I added one row that contains a NaN (null) value. Missing values are a common issue in data. So, how do we deal with them? First, we need to know whether our DataFrame contains missing values by using this command.

#Show NaN value in DataFrame
df.isnull()
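
On a big table, a grid of True/False values is hard to read. One possible way to summarize it is to count the missing values per column and to show only the rows that have one. The names missing_per_column and rows_with_missing are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

# True counts as 1, so summing the boolean table counts NaN values per column
missing_per_column = df.isnull().sum()

# Keep only the rows that have at least one NaN in any column
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```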

7. The simplest way to deal with missing values is to drop every row that contains one. How do we drop them? Here is the command:

# Drop all rows that contain a NaN value
df.dropna()
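
Like rename, dropna returns a new DataFrame instead of changing the original. The sketch below also shows the subset parameter, which limits the check to the columns you name. The variable names cleaned and numeric_ok are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

# Drop every row with a NaN in any column; df itself keeps all its rows
cleaned = df.dropna()

# Only look for NaN in the numeric columns; the NaN in 'string' is ignored
numeric_ok = df.dropna(subset=['integer', 'float'])

print(len(df), len(cleaned), len(numeric_ok))
```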

8. We can also compute summaries of our data (e.g., mean, median, mode, max, etc.). Use these commands and see what you get!

#Show mean, median, and maximum in DataFrame
mean = df['float'].mean()
print("mean =", mean)
median = df['float'].median()
print("median =", median)
maximum = df['float'].max() #named maximum so Python's built-in max() is not overwritten
print("max =", maximum)
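
The mode is mentioned above but not shown, so here is a minimal sketch for it, together with describe, which gives several summaries in one call. The names modes and summary are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'integer': [1, 2, 3, 6, 7, 23, 8, 3],
     'float': [2, 3.4, 5, 6, 2, 4.7, 4, 8],
     'string': ['saya', 'aku', np.nan, 'cinta', 'kamu', 'a', 'b', 'indonesia']}
)

# mode() returns a Series because several values can tie for most frequent
modes = df['float'].mode()
print("mode =", modes.iloc[0])

# describe() reports count, mean, std, min, quartiles and max at once
summary = df['float'].describe()
print(summary)
```

In describe, the row labeled 50% is the median.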

Here is the result: https://github.com/rischanlab/PyDataScience.Org/blob/master/python_notebook/1%20Basic%20Pandas%20Operation%20.ipynb