Google Vision Tricked by Illusions?

This tweet by Max Woolf was being talked about everywhere for a while – Is it a Duck or a Rabbit? For Google Cloud Vision, it depends how the image is rotated.

How to use Test Driven Development in a Data Science Workflow

Every software developer knows about Test-Driven Development (or TDD for short), but not enough people in data science and machine learning do. This is surprising, since TDD can add a lot of speed and quality to data science projects, too. In this post, I walk you through the main ideas behind TDD and a code example that illustrates both the merit of TDD for data science and how to actually implement it in a Python-based project.
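As a minimal sketch of the red-green cycle in a data context (the helper function and its behaviour here are hypothetical, not taken from the post), the tests are written first and drive the implementation:

```python
# A TDD-style sketch: the tests below are written before the code and
# pin down the behaviour of a (hypothetical) data-cleaning helper.

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to compute a median from")
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]

# Tests written first (pytest would collect and run these):
def test_fills_gaps_with_median():
    assert fill_missing_with_median([1, None, 3]) == [1, 2, 3]

def test_raises_on_all_missing():
    try:
        fill_missing_with_median([None])
        assert False, "expected ValueError"
    except ValueError:
        pass
```

The edge case in the second test (an all-missing column) is exactly the kind of silent failure that TDD forces you to decide about up front instead of discovering in production.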

Extremely Imbalanced Data – Fraud Detection

To fight fraud, we have to detect it first. When finding fraud, you have to consider:
• if you try to find all cases of fraud, some legitimate cases will be mislabeled as fraud. This will cause innocent people to be charged with committing fraud.
• if you try to keep innocent people from being charged, you will mislabel some of the fraudsters as innocent. In this case, your company will lose more money.
It is inevitable that your fraud detection algorithm will not be perfect. Which way will you lean?
Let’s look at this data.
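The lean-one-way-or-the-other trade-off above is just precision versus recall as the decision threshold moves. A toy sketch (the labels and scores below are made up for illustration):

```python
# Toy illustration of the precision/recall trade-off in fraud detection.
# Labels and scores are made up; 1 = fraud, 0 = innocent.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.5, 0.7, 0.6, 0.8, 0.9]

def precision_recall(threshold):
    flagged = [l for l, s in zip(labels, scores) if s >= threshold]
    tp = sum(flagged)               # fraudsters correctly flagged
    fp = len(flagged) - tp          # innocents wrongly accused
    fn = sum(labels) - tp           # fraudsters we let through
    precision = tp / (tp + fp) if flagged else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# A low threshold catches all the fraud but accuses more innocents;
# a high threshold spares the innocent but lets fraudsters through.
for t in (0.4, 0.7):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# → threshold=0.4: precision=0.57, recall=1.00
# → threshold=0.7: precision=0.67, recall=0.50
```

Which threshold is right depends on the relative cost your business assigns to a false accusation versus a missed fraudster.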

Data Science Code Refactoring Example

When learning to code for data science, we don’t usually think about modifying our code for performance. We code to transform our data, produce visualizations, and construct our ML models. But if our code is going to be used in a dashboard or app, we have to consider whether it is optimal. In this code example, we will make a small modification to an ecdf function for speed.
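The post's exact refactor may differ, but a representative before/after for an ECDF looks like this: push the per-point work out of the Python interpreter and into vectorized NumPy calls.

```python
import numpy as np

# Before: a loop-based ECDF. Clear, but does Python-level work per point.
def ecdf_loop(data):
    x = sorted(data)
    n = len(data)
    y = [(i + 1) / n for i in range(n)]
    return x, y

# After: a vectorized refactor. np.sort and np.arange do the same work in C,
# which matters when the function is called repeatedly from a dashboard.
def ecdf_fast(data):
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / len(data)
    return x, y
```

Both return the sorted sample values and the cumulative proportion at each value; on large arrays the vectorized version is typically faster by an order of magnitude or more (worth confirming with `%timeit` on your own data).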

Assessing Causality from Observational Data using Pearl’s Structural Causal Models

In 20th century statistics classes, it was common to hear the statement: ‘You can never prove causality.’ As a result, researchers published results saying ‘x is associated with y’ as a way of circumventing the issue of causality yet implicitly suggesting that the association is causal. As an example from my former discipline, political science, there was an interest in determining how representative democracy works. Do politicians respond to voters, or do voters just update their policy beliefs to line up with the party they’ve always preferred? It turns out that this is a very difficult question to answer, so political scientists interested in publishing choose their language carefully and pronounce that policy ‘congruence’ exists between voters and politicians. The upshot is that there now exists a scholarly literature on ‘voter-party congruence,’ which tells you exactly nothing about how democracy works but allows democracy researchers to get their papers past peer review.

What is Augmented Data Science and Why is it Important to My Business?

If Data Science was once the sole domain of analysts and data scientists, Augmented Data Science represents the democratized view of this domain. With Augmented Data Science, the average business user can engage with advanced analytics tools that allow for automated machine learning (AutoML), leveraging sophisticated analytical techniques and algorithms in a guided environment whose auto-recommendations and suggestions lead users through the complex world of data science with ease. As Augmented Data Science is integrated into the average enterprise, the domain of data science proliferates, as does knowledge and understanding of analytical techniques. Citizen X roles, like the much-discussed Citizen Data Scientist, Analytics Translator, Data Translator, Citizen Integrator, and Citizen Developer, will emerge, cascading knowledge and leveraging power users as liaisons with IT and data science staff. The propagation of these tools throughout the enterprise will improve decisions, planning, and competitive advantage.

Two New Frameworks that Google and DeepMind are Using to Scale Deep Learning Workflows

‘Your greatest strength can become your biggest weakness,’ says the old proverb, and that certainly applies to deep learning models. The entire deep learning space was made possible in part by the ability of deep neural networks to scale across GPU topologies. However, that same ability to scale resulted in computationally intensive programs that prove operationally challenging to most organizations. From training to optimization, the lifecycle of deep learning programs requires robust infrastructure building blocks to parallelize and scale computation workloads. While deep learning frameworks are evolving at a rapid pace, the corresponding infrastructure models remain relatively nascent. Just last week, both Google and DeepMind unveiled separate efforts for enabling the parallelization of deep learning models across large GPU infrastructures.

Plotting text and image vectors using t-SNE

John Micternan rightly pointed out that ‘in presentation lies the true entertainment’: if you can present your data the way the other person wishes, they will be in a state of bliss. Imagine a world where all data could be communicated visually; both my brain and my boss would thank me for the gesture. Given the human brain’s comfort with pictorial content, and considering the claim that 90% of the information transmitted to the brain is visual, we have tried to plot sentence and image vectors in a 2-D space, where each spatial spot represents a sentence and similar sentence vectors are placed in close spatial proximity.
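The projection step itself is compact. A minimal sketch (the post's actual sentence and image embeddings will differ; random vectors stand in for them here) using scikit-learn's t-SNE:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
# Stand-ins for sentence/image embeddings: 100 vectors of dimension 50.
vectors = rng.rand(100, 50)

# Project to 2-D; vectors that are close in the original space should
# land near each other in the plane, ready for a scatter plot.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(vectors)
print(coords.shape)  # (100, 2)
```

Each row of `coords` is then one point in the scatter plot, labelled with its sentence or image; perplexity is the main knob worth tuning for your own data.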

The Pareto Principle for Data Scientists

More than a century ago, Vilfredo Pareto, a professor of Political Economy, published the results of his research on the distribution of wealth in society. The dramatic inequalities he observed, e.g. 20% of the people owned 80% of the wealth, surprised economists, sociologists, and political scientists. Over the last century, several pioneers in varied fields observed this disproportionate distribution in several situations, including business. The theory that a vital few inputs/causes (e.g. 20% of inputs) directly influence a significant majority of the outputs/effects (e.g. 80% of outputs) came to be known as the Pareto Principle – also referred to as the 80-20 rule.

An interesting and intuitive view of AUC

A bunch of resources about AUC are available online. They usually start by explaining true positives, sensitivity, type I and II errors, FPR, and so on. Many of them are nice and explain the concepts in detail, but some of those concepts may confuse people without an analytics background. What I’d like to point out here is that one can actually learn AUC well without knowing all of these technical terms.
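One such term-free view (a standard equivalent definition, sketched here with toy scores): AUC is simply the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
import itertools

# Toy model scores; 1 = positive class, 0 = negative class.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]

# For every (positive, negative) pair, count it a win when the positive
# outranks the negative; ties count as half.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in itertools.product(pos, neg))
auc = wins / (len(pos) * len(neg))
print(auc)  # → 0.6666666666666666
```

No confusion matrix required: a perfect ranker scores 1.0, a coin flip scores 0.5, and this pairwise count agrees with the usual ROC-curve-area definition.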

How to build a k-NN in Node.js without TensorFlow

k-NN is a simple and intuitive instance-based learner that works well when the training data is large. It also consumes comparatively more memory, because the model has to memorize all the points in the training data. To build a k-NN in Node.js, one would first think of TensorFlow.js, a popular machine learning framework that offers an API in JavaScript. But what if you are an experienced researcher who wants to tweak the model a little to see if there can be an improvement? Then you will need to go down a level and change the core architecture. And if you are a beginner, being able to write the model from scratch will surely improve your understanding.
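The from-scratch core is small in any language; here is a sketch in Python rather than JavaScript for brevity (the structure, distance function plus sort plus majority vote, translates line for line to Node.js):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Rank every training point by distance to the query...
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], query))[:k]
    # ...then take a majority vote among the k nearest labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage: two well-separated clusters.
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5)))  # a
print(knn_predict(X, y, (5.5, 5.5)))  # b
```

Because the "model" is just the stored training set, the memory cost noted above is visible directly: every prediction scans all of `X`, which is also where an experienced researcher would start twisting the algorithm (approximate neighbour search, weighted votes, custom distances).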

Putting ML in production II: logging and monitoring

In our previous post we showed how to use Apache Kafka’s Python API (Kafka-Python) to productionise an algorithm in real time. In this post we will focus more on the ML aspects: specifically, how to log information during the (re)training process and monitor the results of the experiments. To that end we will use MLflow along with Hyperopt or HyperparameterHunter.

Succeed in the Intelligent Era with an End-to-End Data Management Framework

The last decade has seen unprecedented advancements in artificial intelligence. We have moved towards a data-centric approach, and data is the center of everything digital. The data collected through different sources is refined, analyzed, and orchestrated with data platforms to generate intelligent insights that can facilitate the growth of any organization. The spread of these data platforms, coupled with the advancements in artificial intelligence, enables what has come to be known as the intelligent era. Enterprises are now making smart decisions – backed by actionable insights that also give them the guidance they need for the future. As an SAP partner, I was given the opportunity to explore the SAP Data Hub and get insights into the data management challenges organizations face in the new Intelligent Era.

Are R ecosystems the future?

Over the past 6 months I’ve been creating, refining, and delivering a variety of ‘Introduction to R’ training courses. The more I do this, the more I come to the view that not nearly enough is made of taking an ecosystem-oriented view to packages.

Predicting Stock Prices with Echo State Networks

The stock market is typically viewed as a chaotic time series, and companies often apply advanced stochastic methods to try to make reasonably accurate predictions so that they can get the upper hand and make money. This is essentially the idea behind all investment banking, especially for market traders. I do not claim to know much about the stock market (I am, after all, a scientist and not an investment banker), but I do know a reasonable amount about machine learning and stochastic methods. One of the greatest problems in this area is accurately predicting chaotic time series in a reliable manner. The idea of predicting the dynamics of chaotic systems is somewhat counterintuitive, given that something chaotic, by definition, does not behave in a predictable manner. The study of time series predates the stock market, but saw a marked increase in popularity as individuals tried to leverage the market to ‘beat the system’ and become wealthy. To do this, people had to develop reliable methods of estimating market trends from prior information. First, let us talk about some properties of time series that make them easy to analyze, so that we can appreciate why time series analysis can get pretty tough when we look at the stock market.
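The post builds a full echo state network; the core idea fits in a short sketch (a fixed random reservoir whose linear readout is the only trained part; a noisy sine series stands in here for a far messier price series):

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy series: a noisy sine wave as a stand-in for real prices.
t = np.arange(400)
series = np.sin(0.1 * t) + 0.05 * rng.randn(400)

n_reservoir = 100
# Fixed random input and recurrent weights; these are never trained.
W_in = rng.uniform(-0.5, 0.5, (n_reservoir, 1))
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

# Drive the reservoir with the series and collect its states.
states = np.zeros((len(series) - 1, n_reservoir))
x = np.zeros(n_reservoir)
for i in range(len(series) - 1):
    x = np.tanh(W_in[:, 0] * series[i] + W @ x)
    states[i] = x

# Train only a ridge-regression readout to predict the next value.
targets = series[1:]
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_reservoir),
                        states.T @ targets)
pred = states @ W_out
mse = float(np.mean((pred[100:] - targets[100:]) ** 2))
print(mse)  # small, near the injected noise level, after a washout period
```

Scaling the recurrent weights to a spectral radius below one is what gives the reservoir its fading memory (the "echo state" property); on a genuinely chaotic series, the same architecture applies but reliable accuracy is limited to short horizons.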

Optimizing Jupyter Notebook: Tips, Tricks, and nbextensions

Jupyter Notebooks are a web-based, interactive tool that the machine learning and data science community uses a lot. They are used for quick testing, as a reporting tool, or even as highly sophisticated learning material in online courses. So here in this blog, I’m going to list a few of the shortcuts, magic commands, and nbextensions.

Jump Detection and Noise Separation by a Singular Wavelet Method for Predictive Analytics of High-Frequency Data

High-frequency data is big data in finance: a large number of intra-day transactions arriving irregularly in financial markets. Given the high frequency and irregularity, such data require efficient tools to filter out the noise (i.e. jumps) arising from the anomaly, irregularity, and heterogeneity of financial markets. In this article, we use a recurrently adaptive separation algorithm, based on the maximal overlap discrete wavelet transform (MODWT), that can effectively: (1) identify time-variant jumps, (2) extract time-consistent patterns from the noise (jumps), and (3) denoise marginal perturbations. In addition, the proposed algorithm enables reinforcement learning to optimize a multiple-criteria decision or convex programming when reconstructing the wavelet-denoised data. Using simulated data, we show that the proposed approach performs efficiently in comparison with other conventional methods documented in the literature. We also apply our method in an empirical study using high-frequency data from the US stock market and confirm that it can significantly improve the accuracy of predictive analytics models for financial market returns.
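The paper's recursive MODWT algorithm is considerably more elaborate, but the underlying wavelet idea can be illustrated in a few lines (a hypothetical simulated path, a single level of Haar detail coefficients, and the standard universal threshold): abrupt jumps produce outlying detail coefficients, while ordinary noise stays below the threshold.

```python
import numpy as np

rng = np.random.RandomState(0)
# Simulated "price" path: small Gaussian increments plus two injected jumps.
n = 512
path = np.cumsum(0.01 * rng.randn(n))
path[201:] += 1.5   # upward jump between t=200 and t=201
path[401:] -= 2.0   # downward jump between t=400 and t=401

# One level of a Haar transform: detail coefficients respond to abrupt
# local changes, so outlying details mark candidate jumps.
details = (path[1::2] - path[0::2]) / np.sqrt(2)
sigma = np.median(np.abs(details)) / 0.6745            # robust noise scale
threshold = sigma * np.sqrt(2 * np.log(len(details)))  # universal threshold
# Even path index at the start of each pair straddling a detected jump.
jump_idx = 2 * np.where(np.abs(details) > threshold)[0]
print(jump_idx)
```

Denoising is then the reverse move: shrink the sub-threshold details toward zero and invert the transform, keeping the jumps intact. The overlap in MODWT avoids the blind spots that this simple non-overlapping pairing has at pair boundaries.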

Comparative Quality Estimation for Machine Translation. An Application of Artificial Intelligence on Language Technology using Machine Learning of Human Preferences

In this thesis we focus on Comparative Quality Estimation: the automatic process of analysing two or more translations produced by a Machine Translation (MT) system and expressing a judgment about their comparison. We approach the problem from a supervised machine learning perspective, with the aim of learning from human preferences. As a result, we create the ranking mechanism, a pipeline that includes the necessary tasks for ordering several MT outputs of a given source sentence in terms of relative quality.
Quality Estimation models are trained to statistically associate the judgments with qualitative features. For this purpose, we design a broad set of features with a particular focus on those with a grammatical background. Through an iterative feature-engineering process, we investigate several feature sets, settle on the ones that achieve the best performance, and draw linguistically intuitive observations about the contribution of individual features. Additionally, we employ several feature selection and machine learning methods to take advantage of these features. We suggest the use of binary classifiers after decomposing the ranking into pairwise decisions. In order to reduce the number of uncertain decisions (ties), we weight the pairwise decisions with their classification probability.
Through a set of experiments, we show that the ranking mechanism can learn and reproduce rankings that correlate with those given by humans. Most importantly, it can be successfully compared with state-of-the-art reference-aware metrics and other known ranking methods for several language pairs. We also apply this method to a hybrid MT system combination and show that it is able to improve the overall translation performance. Finally, we examine the correlation between common MT errors and decoding events of phrase-based statistical MT systems. Through evidence from the decoding process, we identify some cases where long-distance grammatical phenomena cannot be captured properly.
An additional outcome of this thesis is the open-source software Qualitative, which implements the full pipeline of the ranking mechanism and the system-combination task. It integrates a multitude of state-of-the-art natural language processing tools and can support the development of new models. Apart from its use in experiment pipelines, it can serve as an application back-end for web applications in real-use scenarios.
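One simple way to recombine probability-weighted pairwise decisions into a single ranking is to sum each candidate's weighted wins (the thesis's actual mechanism is more involved; the system names and probabilities below are made up for illustration):

```python
# Hypothetical pairwise classifier outputs for four MT outputs A-D:
# (better, worse, probability). Weighting by probability softens the
# effect of uncertain, near-tie decisions.
decisions = [
    ("A", "B", 0.9), ("A", "C", 0.6), ("C", "A", 0.55),
    ("B", "C", 0.7), ("D", "A", 0.8), ("D", "B", 0.95), ("D", "C", 0.85),
]

# Aggregate weighted pairwise wins into one score per system.
score = {}
for better, worse, prob in decisions:
    score[better] = score.get(better, 0.0) + prob
    score.setdefault(worse, 0.0)

ranking = sorted(score, key=score.get, reverse=True)
print(ranking)  # → ['D', 'A', 'B', 'C']
```

Note how the contradictory A-vs-C pair (0.6 one way, 0.55 the other) resolves gracefully: the weighted sums let the slightly more confident direction win without either decision being discarded outright.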