I have spent two years processing and manipulating data with R, and I mostly use that language for my research project. I had only heard of Python and never tried it to analyze my data before. But now, after using Python, I have really fallen in love with this language. Python is very simple, and it is known as one of the easiest languages to learn. The reason I previously used R was that it is supported by tons of libraries for scientific analysis, all of them open source. Now, with the popularity of Python, all the libraries I need are easy to find in Python, and all of them are open source as well.
These are the core libraries you must know when you start doing data analytics with Python:
- NumPy, which stands for Numerical Python. Python is different from R: R was made for scientists, while Python is a general-purpose programming language. So Python needs a library that can handle numerical things such as complex arrays and matrices. Repo project link: https://github.com/numpy/numpy
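To give a small taste of what NumPy handles, here is a minimal sketch (the numbers are made up for illustration):

```python
import numpy as np

# A 2x2 matrix and a vector
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])

print(A * 2)           # element-wise scaling
print(A @ v)           # matrix-vector product -> [3. 7.]
print(A.T)             # transpose
print(A.mean(axis=0))  # column means -> [2. 3.]
```

Operations like these are vectorized in C under the hood, which is why NumPy is so much faster than plain Python loops.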
- SciPy, a library for scientific computing that handles things such as statistical computations, linear algebra, optimization, etc. Repo project link: https://github.com/scipy/scipy
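As a minimal sketch of those three areas (statistics, linear algebra, optimization), with made-up sample data:

```python
import numpy as np
from scipy import stats, linalg, optimize

# Statistics: one-sample t-test on some hypothetical measurements
data = np.array([2.1, 2.5, 1.9, 2.4, 2.2])
t_stat, p_value = stats.ttest_1samp(data, popmean=2.0)

# Linear algebra: solve the system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)  # -> [2. 3.]

# Optimization: minimize a simple quadratic
res = optimize.minimize_scalar(lambda t: (t - 3.0) ** 2)
print(x, res.x)
```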
- Pandas, whose DataFrame will feel familiar if you have ever played with R. Using a DataFrame, we can easily manipulate, aggregate, and analyze our dataset. The data is shown as a table, like a sheet in Excel or a data frame in R, and it is convenient to access the data by columns, rows, and so on. Repo project link: https://github.com/pandas-dev/pandas
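A minimal sketch of that column/row access and aggregation, using a tiny made-up table:

```python
import pandas as pd

# A small hypothetical dataset with two groups
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [10, 20, 30, 40],
})

print(df["value"].mean())                  # access a column -> 25.0
print(df[df["value"] > 15])                # filter rows
print(df.groupby("group")["value"].sum())  # aggregate per group
```

If you know R, `df[df["value"] > 15]` is essentially the same idea as subsetting a data frame with a logical vector.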
- Matplotlib. Plotting is very important for data analysis. To make the data easy for people to read (we all know one picture is worth a thousand words) we absolutely need data visualization tools. If you have experience with Excel, it is very easy: just select the table you want to plot and pick a chart type such as a bar chart or line chart. In R, the most popular tool for plotting is ggplot; you can use the standard `plot` function, but if you want more advanced and more beautiful figures you need ggplot. Matplotlib is the basic library for visualizing your data in Python, similar to what I described above. Repo project link: https://github.com/matplotlib/matplotlib
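A minimal line-chart sketch with matplotlib (the data and the `sine.png` filename are just for illustration; the `Agg` backend lets it run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("sine.png")  # writes the figure as a PNG image
```

The same `fig`/`ax` objects give you fine control over ticks, labels, and styles, which is what makes matplotlib good enough for publication figures.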
Those are the core libraries you need when you start using this language for data analytics. There are still many other very useful libraries, such as:
- SciKit-Learn, when you want to apply machine learning on your data analytics.
- Scrapy, to scrape data from the internet when you want to gather data from websites for your analysis. I used the tweepy library to collect tweet data from Twitter.
- NLTK, if you want to do natural language processing.
- Theano, TensorFlow, Keras, when you are not satisfied with NumPy's performance, want to apply neural network algorithms, or do deep learning work; these libraries are very useful for that.
- Interactive visualization tools. Matplotlib is a basic plotting tool, and it is enough for me as a researcher, especially for publications, but when you want more dynamic or interactive plots you can use Seaborn, Plotly, or Bokeh.
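To show what "applying machine learning" with scikit-learn from the list above looks like, here is a minimal sketch that trains a classifier on the iris dataset bundled with the library (the model choice and parameters are just an example, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a small built-in dataset: 150 flowers, 4 features, 3 species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a simple classifier and score it on held-out data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Every scikit-learn model follows this same `fit`/`predict`/`score` pattern, which makes it easy to swap algorithms in and out.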
If you are too lazy, like me, to install all the stuff above one by one, you can try Anaconda; it bundles most of these libraries in one installer, and it is really cool.
See ya on the next post..
Brisbane, 24 November 2017