Top 20 Python libraries for data science in 2018

Python continues to hold a leading position in solving data science tasks and challenges. Last year we published a blog post reviewing the Python libraries that had proved most helpful at the time. This year, we expanded the list with new libraries and took a fresh look at the ones we already covered, focusing on the updates made during the year. Our selection actually contains more than 20 libraries, as some of them are alternatives to one another that solve the same problem. We have therefore grouped them, since it is difficult to single out one leader at the moment.
1. NumPy (Commits: 17911, Contributors: 641)
2. SciPy (Commits: 19150, Contributors: 608)
3. Pandas (Commits: 17144, Contributors: 1165)
4. StatsModels (Commits: 10067, Contributors: 153)
5. Matplotlib (Commits: 25747, Contributors: 725)
6. Seaborn (Commits: 2044, Contributors: 83)
7. Plotly (Commits: 2906, Contributors: 48)
8. Bokeh (Commits: 16983, Contributors: 294)
9. Pydot (Commits: 169, Contributors: 12)
10. Scikit-learn (Commits: 22753, Contributors: 1084)
11. XGBoost / LightGBM / CatBoost (Commits: 3277 / 1083 / 1509, Contributors: 280 / 79 / 61)
12. Eli5 (Commits: 922, Contributors: 6)
13. TensorFlow (Commits: 33339, Contributors: 1469)
14. PyTorch (Commits: 11306, Contributors: 635)
15. Keras (Commits: 4539, Contributors: 671)
16. Dist-keras / elephas / spark-deep-learning (Commits: 1125 / 170 / 67, Contributors: 5 / 13 / 11)
17. NLTK (Commits: 13041, Contributors: 236)
18. SpaCy (Commits: 8623, Contributors: 215)
19. Gensim (Commits: 3603, Contributors: 273)
20. Scrapy (Commits: 6625, Contributors: 281)
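To give a taste of how the foundational entries on this list fit together, here is a minimal NumPy sketch; the array values are invented purely for illustration:

```python
import numpy as np

# Invented example data: simulated daily temperatures for one week
temps = np.array([21.5, 22.0, 19.8, 23.1, 24.6, 22.9, 20.4])

mean_temp = temps.mean()           # average over the week
hottest_day = int(temps.argmax())  # index of the warmest reading

print(round(float(mean_temp), 2), hottest_day)
```

Vectorized operations like these are what Pandas, Scikit-learn, and most of the other libraries above build on.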

Discrete or continuous modeling?

On Tuesday we held our conference 'Insurance, Actuarial Science, Data & Models', and Dylan Possamaï gave a very interesting concluding talk. In the introduction, he came back briefly to a nice discussion we often have in economics about the kind of model we should consider. It was about optimal control. In many applications, we start with a one-period economy, then a two-period economy, and pretend that we can extend it to an n-period economy. And then the continuous case can also be considered. A few years ago, I was working on sports games and optimal effort strategies (within a game, i.e. over a fixed time). It was a discrete model, and I was running simulations to get an efficient frontier, where coaches might say 'ok, now we have a large enough (positive) difference, and we are getting closer to the end of the game, so we can lower the effort, i.e. top players can relax a little bit' (it was on basketball games). I asked a good friend of mine, Romuald, to help me with some technical parts of the proofs, but he did not like my discrete-time model very much and wanted to move to continuous time. And for six years now, we have kept saying that someday we should get back to that paper….

A Comparative Review of the Deducer GUI for R

Deducer is a free and open-source Graphical User Interface for the R software, one that provides beginners a way to point and click their way through analyses. It also integrates into an environment designed to help programmers be more productive. Deducer is available on Windows, Mac, and Linux; there is no server version. This post is one of a series of reviews that aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. However, the reviews will include a cursory description of the programming support that each GUI offers.

Anomaly Detection for Business Metrics with R

The larger and more complex the business, the more metrics and dimensions it has. One day you realize that it is impossible to track them all with your eyes alone. Reducing the number of metrics and/or dimensions would keep us from tracking all aspects of the business, or force us to analyze aggregated data (for example, without dimensions), which can substantially smooth out or hide the anomalies. In such a situation, anomalies may be detected only well after they occur, or missed entirely. We want instead to react immediately: to learn about an event as soon as possible, identify its causes, and understand what to do about it. For this, we can use an anomaly detection system to identify abnormal values, collect the corresponding events centrally, and monitor a much larger number of metrics and dimensions than human capabilities allow.
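The post builds its system in R; as a language-agnostic sketch of the underlying idea, here is a deliberately simple rule in Python that flags values far from the historical mean. The revenue figures and the 2.5-sigma threshold are invented for illustration; a real system like the one described would also handle trend, seasonality, and many metrics at once:

```python
def detect_anomalies(values, threshold=2.5):
    """Return indices of values more than `threshold` std devs from the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    if std == 0:
        return []  # a perfectly flat series has no outliers by this rule
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * std]

# Invented daily revenue figures with one obvious outlier
daily_revenue = [100, 102, 98, 101, 99, 103, 100, 250, 97, 101]
print(detect_anomalies(daily_revenue))  # flags the spike at index 7
```

The point of collecting flagged indices centrally, as the post suggests, is that one monitoring process can then watch thousands of such metric/dimension combinations.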

Hotfix for Microsoft R Open 3.5.0 on Linux

On Monday, we learned about a serious issue with the installer for Microsoft R Open on Linux-based systems. (Thanks to Norbert Preining for reporting the problem.) The issue was that the installation and de-installation scripts would modify the system shell, and did not use the standard practices to create and restore symlinks for system applications. The Microsoft R team developed a solution to the problem with the help of some Debian experts at Microsoft, and last night issued a hotfix for Microsoft R Open 3.5.0 which is now available for download. With this fix, the MRO installer no longer relinks /bin/sh to /bin/bash, and instead uses dpkg-divert for Debian-based platforms and update-alternatives for RPM-based platforms. We will also request a discussion with the Debian maintainers of R to further review our installation process. Finally, with the next release – MRO 3.5.1, scheduled for August 9 – we will also include the setup code (including the installation scripts) in the MRO GitHub repository for everybody to inspect and give feedback on.

Monte Carlo Part Two

In a previous post, we reviewed how to set up and run a Monte Carlo (MC) simulation of future portfolio returns and the growth of a dollar. Today, we will run that simulation many, many times and then visualize the results. Our ultimate goal is to build a Shiny app that allows an end user to build a custom portfolio, simulate returns, and visualize the results. If you just can't wait, a link to the final Shiny app is available here.
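The original post implements this in R with Shiny; the core idea of repeating the growth-of-a-dollar simulation many times can be sketched in Python as follows. The return parameters are invented, and `simulate_growth` / `run_simulations` are hypothetical helpers for illustration, not functions from the post:

```python
import random

def simulate_growth(n_months, mean_return, stdev, seed=None):
    """Simulate the growth of $1 with normally distributed monthly returns."""
    rng = random.Random(seed)
    value = 1.0
    path = [value]
    for _ in range(n_months):
        monthly_return = rng.gauss(mean_return, stdev)
        value *= 1 + monthly_return
        path.append(value)
    return path

def run_simulations(n_sims, n_months, mean_return, stdev, seed=0):
    """Run the simulation many times and collect the ending dollar values."""
    rng = random.Random(seed)
    return [
        simulate_growth(n_months, mean_return, stdev, seed=rng.random())[-1]
        for _ in range(n_sims)
    ]

# Invented parameters: 0.5% mean monthly return, 4% volatility, 10 years
end_values = run_simulations(n_sims=100, n_months=120,
                             mean_return=0.005, stdev=0.04)
print(min(end_values), max(end_values))
```

Plotting the distribution of `end_values` (or the full paths) gives the kind of visualization the post builds, and wrapping the parameters in input widgets is what the eventual Shiny app does on the R side.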

How to Eliminate Silos in Company-Wide Data Analytics

'Silos' are something of a buzzword, but the concept they describe warrants your attention. Silos emerge when a cluster of individuals in your company (usually within a specific department) have trouble communicating or collaborating with another cluster of individuals in your company (usually within another department). In some ways, this is a natural result of building a company: if you want your sales team to focus on sales and your marketing team to focus on marketing, it will eventually be difficult for your sales and marketing staff to collaborate on a mutual problem. But if you want your company's data to be streamlined, accessible, and impactful to your organization's bottom line, you'll need to eliminate these silos, or at least mitigate their development.

reticulate – another step towards a multilingual and collaborative way of working

R, Julia, Python – today's data scientists have a choice between numerous programming languages, each with its own strengths and weaknesses. Would it not be convenient to bring those languages together and use the individual strengths of each one? The 'reticulate' package for R takes a step in this direction.

Generating Text with RNNs in 4 Lines of Code

Want to generate text with little trouble, and without building and tuning a neural network yourself? Let's check out a project which allows you to 'easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.'