Command Line Tricks For Data Scientists

For many data scientists, data manipulation begins and ends with Pandas or the Tidyverse. In theory, there is nothing wrong with this notion. It is, after all, why these tools exist in the first place. Yet these options can often be overkill for simple tasks like delimiter conversion. Aspiring to master the command line should be on every developer’s list, especially data scientists. Learning the ins and outs of your terminal will undeniably make you more productive. Beyond that, the command line serves as a great history lesson in computing. Take awk, a data-driven scripting language that first appeared in 1977, co-created by Brian Kernighan, the K in the legendary K&R book. Today, some 40 years later, awk remains relevant, with new books still appearing every year! Thus, it’s safe to assume that an investment in command line wizardry won’t depreciate any time soon.


Data science as a language: challenges for computer science – a position paper

In this paper, I posit that, from a research point of view, Data Science is a language. More precisely, Data Science is doing science using computer science as a language for the datafied sciences, much as mathematics is the language of, e.g., physics. From this viewpoint, three (classes of) challenges for computer science are identified, complementing the challenges that the closely related Big Data problem already poses to computer science. I discuss the challenges with references to what are, in my opinion, related and interesting directions in computer science research; note that I claim neither that these directions are the most appropriate way to solve the challenges nor that the cited references represent the best work in their field, only that they are inspirational to me. So, what are these challenges? Firstly, if computer science is to be a language, what should that language look like? While our traditional specifications, such as pseudocode, are an excellent way to convey what has been done, they fail for more mathematics-like reasoning about computations. Secondly, if computer science is to function as a foundation of other, datafied, sciences, its own foundations should be in order. While we have excellent foundations for supervised learning, e.g., loss functions to optimize and, more generally, PAC learning (Valiant in Commun ACM 27(11):1134-1142, 1984), this is far less true for unsupervised learning. Kolmogorov complexity, or, more generally, Algorithmic Information Theory, provides a solid base (Li and Vitányi in An Introduction to Kolmogorov Complexity and Its Applications, Springer, Berlin, 1993). It provides an objective criterion for choosing between competing hypotheses, but it lacks, e.g., an objective measure of the uncertainty of a discovery that the datafied sciences need. Thirdly, the datafied sciences come with new conceptual challenges. Data-driven scientists come up with data analysis questions that sometimes do, and sometimes don’t, fit our conceptual toolkit. Clearly, computer science does not suffer from a lack of interesting, deep research problems. However, the challenges posed by data science point to a large reservoir of untapped problems: interesting, stimulating problems, not least because they are posed by our colleagues in the datafied sciences. It is an exciting time to be a computer scientist.


A Comprehensive List Of R Packages For Portfolio Analysis

R is a free statistical computing environment, so there are usually multiple ways, and multiple packages, to achieve a particular statistical or quantitative output. Here I discuss a concise list of R packages that one can use for modeling financial risk and/or optimizing portfolios with the utmost efficiency and effectiveness. The intended audience for this article is financial market analysts interested in using R, as well as quantitatively inclined readers with a background in finance, statistics, or mathematics.
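The article itself surveys R packages, but as a flavour of the underlying task, here is a minimal, purely illustrative mean-variance sketch in Python: it computes closed-form minimum-variance weights from the sample covariance matrix of synthetic returns. The data and figures are made up for demonstration only.

```python
import numpy as np

# Synthetic daily returns for four hypothetical assets (illustration only).
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.01, size=(1000, 4))

# Sample covariance matrix of the asset returns.
cov = np.cov(returns, rowvar=False)

# Closed-form minimum-variance portfolio: w = inv(S) 1 / (1' inv(S) 1).
ones = np.ones(cov.shape[0])
w = np.linalg.solve(cov, ones)
w /= w.sum()

print("minimum-variance weights:", np.round(w, 3))
print("portfolio volatility:", np.sqrt(w @ cov @ w))
```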


How to prove it in math: why deeper decision trees will never have higher expected cross entropy?

What I discuss here is not only the math derivation, which is usually glossed over for decision trees, but also the following questions: what does the cross-entropy really mean for a decision tree, and how does it lead to over-fitting? The expected cross-entropy is commonly used as the objective function for growing a decision tree, and you can find its definition everywhere. Let’s start our story with a simple example.
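To make that objective concrete, here is a minimal sketch (my illustration, not the article’s code) of the entropy of a node and the expected cross-entropy of a candidate split, i.e., the entropy of each child weighted by the fraction of samples it receives. By the concavity of entropy, this weighted average can never exceed the parent’s entropy, which is the claim in the title.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of the empirical label distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def expected_cross_entropy(children):
    """Entropy of each child node, weighted by its share of the samples."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * entropy(c) for c in children)

# Toy example: a parent node with six samples, split into two children.
parent = np.array([0, 0, 0, 1, 1, 1])
children = [np.array([0, 0, 0, 1]), np.array([1, 1])]

print(entropy(parent))                    # 1.0 bit
print(expected_cross_entropy(children))   # about 0.54, no larger than the parent
```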


Top 5 GitHub Repositories and Reddit Discussions for Data Science & Machine Learning (April 2018)

GitHub and Reddit are two of the most popular platforms when it comes to data science and machine learning. The former is an awesome tool for sharing and collaborating on code and projects, while the latter is the best platform out there for engaging with data science enthusiasts from around the world.
This year, we have covered the top GitHub repositories each month, and from this month onwards we will also be including the top Reddit threads that generated the most interesting and intriguing discussions in the machine learning space.
April saw some amazing Python libraries being open-sourced. From Deep Painterly Harmonization, a library that makes manipulated images look ultra-realistic, to Swift for TensorFlow, this article covers the best from last month.

GitHub Repositories
1. Deep Painterly Harmonization
2. Swift for TensorFlow
3. MUNIT: Multimodal UNsupervised Image-to-image Translation
4. GluonNLP
5. PyTorch GAN


Bayesian Inference with Backfitting MCMC

Recently I’ve been reading about Bayesian additive regression trees (BART). For those interested, the paper is here. I use similar notation when describing the backfitting procedure in this post. I believe this MCMC method was first described here. BART is a Bayesian nonparametric model that can be fit using backfitting MCMC, which is cool in itself. I thought I’d write a post describing the intuition behind backfitting, along with a toy example.
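BART’s backfitting MCMC draws each tree from its conditional posterior given the residuals left by all the other trees. The deterministic analogue below (my toy sketch, not the post’s code) conveys that residual-cycling intuition for a simple additive model, using a crude polynomial smoother in place of trees.

```python
import numpy as np

# Toy additive model: y = f1(x1) + f2(x2) + noise.
rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = np.sin(x1) + 0.5 * x2 ** 2 + rng.normal(0, 0.1, n)

def smooth(x, r, degree=5):
    """Crude smoother: fit a polynomial to the partial residuals r."""
    p = np.polynomial.Polynomial.fit(x, r, degree)
    vals = p(x)
    return vals - vals.mean()   # centre each component for identifiability

# Backfitting: cycle through the components, refitting each one to the
# residuals left after subtracting the current estimates of the others.
mu = y.mean()
f1_hat, f2_hat = np.zeros(n), np.zeros(n)
for _ in range(20):
    f1_hat = smooth(x1, y - mu - f2_hat)
    f2_hat = smooth(x2, y - mu - f1_hat)

print("residual std:", np.std(y - mu - f1_hat - f2_hat))  # close to the noise level
```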


Qualitative Data Science: Using RQDA to analyse interviews

Qualitative data science sounds like a contradiction in terms. Data scientists generally solve problems using numerical methods; even the analysis of text is reduced to a numerical problem using Markov chains, topic analysis, sentiment analysis and other mathematical tools. Scientists and professionals consider numerical methods the gold standard of analysis. There is, however, a price to pay when relying on numbers alone. Numerical analysis reduces the complexity of the social world, and when analysing people, numbers present an illusion of precision and accuracy. Giving primacy to quantitative research in the social sciences reduces the dynamics of reality to statistics, losing the narrative of the people that the research aims to understand. Being both an engineer and a social scientist, I acknowledge the importance of both numerical and qualitative methods. My dissertation used a mixed-method approach to review the relationship between employee behaviour and customer perception in water utilities. This article introduces some aspects of qualitative data science with an example from my dissertation: I show how I analysed interview data using both quantitative and qualitative methods, and I demonstrate why qualitative data science is better suited to understanding text than numerical methods alone. The most recent version of the code is available on my GitHub repository.


Statistical Sins: Is Your Classification Model Any Good?

A regression model returns the linear weights applied to the predictor variables to reproduce the outcome, and it will highlight whether each predictor was significantly related to the outcome. But a big question you may be asking of your binomial model is: how well does it predict the outcome? Specifically, how can you examine whether your regression model is correctly classifying cases?
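One standard way to answer that question is a confusion matrix plus per-class metrics. Here is a minimal scikit-learn sketch on made-up data (my illustration, not the post’s code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Made-up data: a binary outcome driven by two predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
pred = model.predict(X)   # classify with the default 0.5 threshold

print(confusion_matrix(y, pred))        # rows: actual class, columns: predicted
print(classification_report(y, pred))   # precision, recall and F1 per class
```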


How efficient are multifactorial experiments?

I recently described why we might want to conduct a multifactorial experiment, and I alluded to the fact that this approach can be quite efficient. It is efficient in the sense that it is possible to test the impact of multiple interventions simultaneously using an overall sample size comparable to what would be required to test a single intervention in a more traditional RCT. I demonstrate that here, first with a continuous outcome and then with a binary outcome.
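As a rough illustration of that efficiency claim (my sketch, not the post’s code, and with made-up effect sizes), the simulation below fits a 2x2 factorial design with ordinary least squares: every subject contributes to estimating both intervention effects, so one single-trial-sized sample tests two interventions at once.

```python
import numpy as np
import statsmodels.api as sm

# 2x2 factorial: every subject contributes to estimating BOTH effects,
# so one single-trial-sized sample tests two interventions at once.
rng = np.random.default_rng(7)
n = 200                                   # total sample size
a = rng.integers(0, 2, n)                 # assignment to intervention A
b = rng.integers(0, 2, n)                 # assignment to intervention B
y = 1.0 + 0.5 * a + 0.5 * b + rng.normal(0, 1, n)   # assumed additive effects

X = sm.add_constant(np.column_stack([a, b]))
fit = sm.OLS(y, X).fit()
print(fit.summary(xname=["const", "A", "B"]))   # both effects from one sample
```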


How to build analytic products in an age when data privacy has become critical

Privacy-preserving analytics is not only possible; with GDPR about to come online, it will become necessary to incorporate privacy into your data products.


Complementary learning for AI-based predictive quality and maintenance

Predictive quality and maintenance (PQM) solutions, which harness data gathered both by the Internet of Things (IoT) and by traditional legacy systems, focus on detecting and addressing quality and maintenance issues before they turn into serious problems, such as those that cause unplanned downtime. Unplanned downtime is a major cost driver in any industry that must maintain large inventories of capital assets. For an airline, for example, delaying flights due to unplanned maintenance can cost thousands of dollars each minute. Unplanned shutdowns of oil platforms can run into the millions of dollars. And in manufacturing plants, the costs of disruptions go directly to the bottom line. It is the goal of every organization to eliminate unplanned downtime in favor of planned maintenance. PQM solutions can help with planned maintenance, too, by shortening maintenance operations windows.


Stock Market Predictions with LSTM in Python

Discover Long Short-Term Memory (LSTM) networks in Python and how you can use them to make stock market predictions!
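As a taste of what the tutorial covers, here is a minimal windowed-LSTM sketch in Keras on a synthetic random-walk series standing in for real price data (my illustration, not the tutorial’s code; the architecture and hyperparameters are arbitrary):

```python
import numpy as np
from tensorflow import keras

# Synthetic random-walk "price" series standing in for real stock data.
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 1200)) + 100.0
prices = (prices - prices.mean()) / prices.std()    # normalise

# Turn the series into (window of 50 past values) -> (next value) pairs.
window = 50
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X[..., np.newaxis]                              # shape: (samples, 50, 1)

model = keras.Sequential([
    keras.layers.Input(shape=(window, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:1000], y[:1000], epochs=5, batch_size=32, verbose=0)

print("held-out MSE:", model.evaluate(X[1000:], y[1000:], verbose=0))
```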


Evaluation of Topic Modeling: Topic Coherence

In this article, we will go through the evaluation of topic modeling by introducing the concept of topic coherence, since topic models give no guarantee of the interpretability of their output. Topic modeling provides us with methods to organize, understand and summarize large collections of textual information, and there are many techniques for obtaining topic models. Latent Dirichlet Allocation (LDA) is a widely used topic modeling technique for extracting topics from textual data.
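For a concrete starting point, here is a minimal gensim sketch (my illustration, not the article’s code) that fits a tiny LDA model and scores it with the ‘c_v’ coherence measure; the toy corpus is made up and far too small for meaningful topics:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Tiny tokenised corpus (illustration only; real corpora are far larger).
texts = [
    ["stock", "market", "price", "trade"],
    ["market", "trade", "price", "share"],
    ["topic", "model", "lda", "corpus"],
    ["corpus", "model", "topic", "word"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit a two-topic LDA model, then score it with the 'c_v' coherence measure;
# higher coherence suggests more interpretable topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```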


Visualize Market Basket analysis in R

In this article, we will go through market basket analysis (MBA) in R, with a focus on visualizing the results. We will use the Instacart customer orders data, publicly available on Kaggle. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.
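The article works in R (with visualization on top), but the core mining step looks like this minimal Python analogue using mlxtend on toy transactions (my illustration; the items and thresholds are made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy grocery transactions, a stand-in for the real Instacart orders.
transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs", "butter"],
]

# One-hot encode the baskets, then mine frequent itemsets and rules.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)

print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```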