Book Memo: “Probabilistic Data Structures and Algorithms for Big Data Applications”

A technical book about popular space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications. The purpose of this book is to introduce technology practitioners, including software architects and developers, as well as technology decision makers, to probabilistic data structures and algorithms. Reading this book, you will get a theoretical and practical understanding of probabilistic data structures and learn about their common uses.

R Packages worth a look

The Attributable Fraction (AF) Described as a Function of Disease Heritability, Prevalence and Intervention Specific Factors (AFheritability)
The AFfunction() function returns an estimate of the Attributable Fraction (AF) and a plot of the AF as a function of heritability, disease …

Computation and Visualization of Package Download Counts and Percentiles (packageRank)
Compute and visualize the cross-sectional and longitudinal number and rank percentile of package downloads from RStudio’s CRAN mirror.

Amazon Redshift Tools (redshiftTools)
Efficiently upload data into an Amazon Redshift database using the approach recommended by Amazon. …

Spy on Your R Session (matahari)
Conveniently log everything you type into the R console. Logs are stored as tidy data frames which can then be analyzed using ‘tidyverse’ style tools.

Agglomerative Partitioning Framework for Dimension Reduction (partition)
A fast and flexible framework for agglomerative partitioning. ‘partition’ uses an approach called Direct-Measure-Reduce to create new variables that ma …

Tools and Palettes for Bivariate Thematic Mapping (biscale)
Provides a ‘ggplot2’ centric approach to bivariate mapping. This is a technique that maps two quantities simultaneously rather than the single value th …

If you did not already know

Moving Average Convergence Divergence (MACD) google
MACD, short for moving average convergence/divergence, is a trading indicator used in technical analysis of stock prices, created by Gerald Appel in the late 1970s. It is supposed to reveal changes in the strength, direction, momentum, and duration of a trend in a stock’s price. The MACD indicator (or ‘oscillator’) is a collection of three time series calculated from historical price data, most often the closing price. These three series are: the MACD series proper, the ‘signal’ or ‘average’ series, and the ‘divergence’ series which is the difference between the two. The MACD series is the difference between a ‘fast’ (short period) exponential moving average (EMA), and a ‘slow’ (longer period) EMA of the price series. The average series is an EMA of the MACD series itself.
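The three series described above are straightforward to compute. As a minimal plain-Python illustration using the common (12, 26, 9) periods (a sketch, not tied to any charting or trading library):

```python
def ema(prices, period):
    """Exponential moving average with smoothing factor 2/(period+1),
    seeded with the first price."""
    alpha = 2 / (period + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26, signal=9):
    """Return the MACD line (fast EMA - slow EMA), the signal line
    (EMA of the MACD line), and their difference (the divergence)."""
    macd_line = [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
    signal_line = ema(macd_line, signal)
    divergence = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, divergence
```

As a sanity check, a constant price series yields zeros everywhere, since the fast and slow EMAs coincide.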

Bessel’s Correction google
In statistics, Bessel’s correction, named after Friedrich Bessel, is the use of n – 1 instead of n in the formula for the sample variance and sample standard deviation, where n is the number of observations in a sample. This corrects the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation, but often increases the mean squared error in these estimations. …
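The correction is a one-line change in the variance formula. A quick illustrative sketch:

```python
def sample_variance(xs, bessel=True):
    """Sample variance: divides the sum of squared deviations by n-1
    (Bessel's correction) when bessel=True, by n otherwise."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return ss / (n - 1) if bessel else ss / n
```

For the sample [1, 2, 3, 4] the uncorrected (biased) estimate is 1.25, while the corrected estimate is 5/3 ≈ 1.667.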

Orbit google
Orbit is a composable framework for orchestrating change processing, tracking, and synchronization across multiple data sources. Orbit is written in TypeScript and distributed on npm through the @orbit organization. Pre-built distributions are provided in several module formats and ES language levels. Orbit is isomorphic – it can run both in modern browsers and in the Node.js runtime. …

Distilled News

Building Efficient Custom Datasets in PyTorch

PyTorch has been around my circles as of late and I had to try it out despite being comfortable with Keras and TensorFlow for a while. Surprisingly, I found it quite refreshing and likable, especially as PyTorch features a Pythonic API, a more opinionated programming pattern and a good set of built-in utility functions. One feature I particularly enjoy is the ability to easily craft a custom Dataset object which can then be used with the built-in DataLoader to feed data when training a model. In this article, I will be exploring the PyTorch Dataset object from the ground up, with the objective of building a dataset for handling text files and showing how one could optimize the pipeline for a certain task. We start by going over the basics of the Dataset utility with a toy example and work our way up to the real task. Specifically, we want to create a pipeline that feeds character first names from The Elder Scrolls (TES) series, together with each name’s race and gender, as one-hot tensors. You can find this dataset on my website.
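To make the idea concrete without requiring PyTorch itself, here is a sketch of the map-style Dataset protocol the article builds on: any object implementing __len__ and __getitem__ can be handed to torch.utils.data.DataLoader. The sample tuples and one-hot encoding below are hypothetical placeholders, not the article's actual data or code:

```python
class NameDataset:
    """Minimal map-style dataset: implements __len__ and __getitem__,
    the same protocol torch.utils.data.Dataset expects."""

    def __init__(self, samples):
        # samples: list of (name, race, gender) tuples
        self.samples = samples
        # fixed vocabulary of races, in sorted order for stable indexing
        self.races = sorted({race for _, race, _ in samples})

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, race, gender = self.samples[idx]
        # one-hot encode the race over the known vocabulary
        one_hot = [1.0 if r == race else 0.0 for r in self.races]
        return name, one_hot, gender
```

Once torch is available, an instance can be wrapped directly, e.g. DataLoader(ds, batch_size=32), to get batching and shuffling for free.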

Building a Connection Between Decision Maker and Data-Driven Decision Process

It is quite common that most company decisions are based on feelings, intuition or personal experience. The reasons for such patterns have organizational, technical and process-oriented backgrounds. For instance, there is no structured way to deal with analytical results on both sides – organizational and technical – simultaneously. Usually, the people doing the analysis (e.g. data scientists) and the people using its results (e.g. decision makers) are different persons. As a result, such a structure leads to ambiguity and misunderstanding between the involved parties. In order to bridge the existing gap between data scientists and decision makers, we introduced the Data Product Profile, which links data science and data-driven decision processes.

Plotting Markowitz Efficient Frontier with Python

This article is a follow-up to the article about calculating the Sharpe ratio. Knowing how to get the Sharpe ratio, we will simulate a few thousand possible portfolio allocations and draw the outcomes in a chart. With this we can easily find the best allocation of our stocks for any given level of risk we are willing to take. As in the previous article, we will need to load our data. I’ll be using four simple files with two columns – a date and a closing price. You can use your own data, or find something in Quandl, which is very good for this purpose. We have 4 companies – Amazon, IBM, Cisco, and Apple. Each file is loaded as amzn, ibm, cisco and aapl respectively. The head of aapl is printed below.
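The simulation step the article describes can be sketched in plain Python. This is a toy version with made-up inputs (the article itself works from real price files, presumably via pandas/numpy); mean_returns and cov here are hypothetical:

```python
import random

def simulate_portfolios(mean_returns, cov, n_portfolios=5000, rf=0.0, seed=0):
    """Randomly sample long-only weight vectors summing to 1 and record
    each portfolio's expected return, volatility, and Sharpe ratio."""
    rng = random.Random(seed)
    k = len(mean_returns)
    results = []
    for _ in range(n_portfolios):
        raw = [rng.random() for _ in range(k)]
        total = sum(raw)
        w = [x / total for x in raw]  # normalize weights to sum to 1
        # expected portfolio return: weighted average of asset returns
        ret = sum(wi * mi for wi, mi in zip(w, mean_returns))
        # portfolio variance: w' C w
        var = sum(w[i] * w[j] * cov[i][j] for i in range(k) for j in range(k))
        vol = var ** 0.5
        results.append((ret, vol, (ret - rf) / vol))
    return results
```

Plotting volatility against return for all sampled portfolios traces out the cloud whose upper-left edge is the efficient frontier.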

NLP vs. NLU: from Understanding a Language to Its Processing

As artificial intelligence progresses and technology becomes more sophisticated, we expect existing concepts to embrace this change – or change themselves. Similarly, in the domain of computer-aided processing of natural languages, shall the concept of natural language processing give way to natural language understanding? Or is the relation between the two concepts subtler and more complicated than a merely linear progression of technology? In this post we’ll scrutinize the concepts of NLP and NLU and their niches in AI-related technology. Importantly, though sometimes used interchangeably, they are actually two different concepts that have some overlap. First of all, they both deal with the relationship between a natural language and artificial intelligence. They both attempt to make sense of unstructured data, like language, as opposed to structured data like statistics, actions, etc. However, NLP and NLU stand in contrast to many other data mining techniques.

Illustrated Machine Learning Cheatsheets

My twin brother Afshine and I created this set of illustrated Machine Learning cheatsheets covering the content of the CS 229 class, which I TA-ed in Fall 2018 at Stanford. They can (hopefully!) be useful to all future students of this course as well as to anyone else interested in Machine Learning.

Basic Statistics Concepts Every Data Scientist Should Know

Data science is a multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems. At the core is data: troves of raw information streaming in and stored in enterprise data warehouses, with much to learn by mining it and advanced capabilities to build with it. Data science is ultimately about using this data in creative ways to generate business value. The broader field of data science includes mathematics, statistics, computer science and information science. For a career as a Data Scientist, you need a strong background in statistics and mathematics; big companies will always give preference to those with good analytical and statistical skills. In this blog, we will be looking at the basic statistical concepts every data scientist must know. Let’s understand them one by one in the next section.
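As a tiny taste of the most basic of these concepts (mean, median, and sample standard deviation), here is an illustrative sketch not taken from the article itself:

```python
def describe(xs):
    """Return the mean, median, and sample standard deviation
    (n-1 denominator) of a numeric sample."""
    n = len(xs)
    mean = sum(xs) / n
    s = sorted(xs)
    # middle element for odd n, average of the two middle elements for even n
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, median, var ** 0.5
```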

Basic Linear Algebra Concepts for Machine Learning

The field of Data Science has seen exponential growth in the last few years. Though the concept was prevalent previously as well, the recent hype is the result of the variety and huge volume of unstructured data being generated across different industries, and the enormous potential hidden beneath those data. On top of that, the massive computational power of modern-day computers has made it even more possible to mine such huge chunks of data. Data Science is a study comprising several disciplines, from exploratory data analysis to predictive analytics. There are various tools and techniques that professionals use to extract information from the data. However, a common misconception is to focus more on those tools than on the math behind the data modelling. People tend to put too much importance on the Machine Learning algorithms instead of the Linear Algebra or Probability concepts required to fetch relevant meaning from the data. Thus, in this blog post, we cover one of the prerequisites of Data Science, Linear Algebra, and some of the basic concepts that you should learn.
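Two of the most elementary operations such a primer covers, the inner product and matrix-vector multiplication, fit in a few lines of plain Python (illustrative only; in practice one would use numpy):

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    """Multiply a matrix A (given as a list of rows) by a vector x:
    each output entry is the inner product of a row of A with x."""
    return [dot(row, x) for row in A]
```

For example, multiplying the diagonal matrix [[1, 0], [0, 2]] by [5, 6] scales the second coordinate, giving [5, 12].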

Statistics for Data Science: Introduction to t-test and its Different Types (with Implementation in R)

Every day we find ourselves testing new ideas, finding the fastest route to the office, the quickest way to finish our work, or simply finding a better way to do something we love. The critical question, then, is whether our idea is significantly better than what we tried previously. These ideas that we come up with on such a regular basis – that’s essentially what a hypothesis is. And testing these ideas to figure out which one works and which one is best left behind, is called hypothesis testing. Hypothesis testing is one of the most fascinating things we do as data scientists. No idea is off-limits at this stage of our project. I have personally seen so many insights coming out of hypothesis testing – insights most of us would have missed if not for this stage!
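The article's implementation is in R; as a language-neutral sketch of the core computation, Welch's two-sample t statistic (the unequal-variances variant) is only a few lines. This is an illustration, not the article's code:

```python
def welch_t(x, y):
    """Welch's two-sample t statistic: difference of sample means divided
    by the standard error under unequal variances."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # sample variances with Bessel's n-1 denominator
    vx = sum((xi - mx) ** 2 for xi in x) / (nx - 1)
    vy = sum((yi - my) ** 2 for yi in y) / (ny - 1)
    return (mx - my) / ((vx / nx + vy / ny) ** 0.5)
```

Comparing the statistic against the t distribution (with Welch-Satterthwaite degrees of freedom) then yields the p-value that decides the hypothesis test.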

Personalized treatment effects with model4you

Typical models estimating treatment effects assume that the treatment effect is the same for all individuals. Model-based recursive partitioning allows us to relax this assumption and to estimate stratified treatment effects (model-based trees) or even personalised treatment effects (model-based forests). With model-based trees one can compute treatment effects for different strata of individuals. The strata are found in a data-driven fashion and depend on characteristics of the individuals. Model-based random forests allow for a similarity estimation between individuals in terms of model parameters (e.g. intercept and treatment effect). The similarity measure can then be used to estimate personalised models. The R package model4you implements these stratified and personalised models in the setting of two randomly assigned treatments, with a focus on ease of use and interpretability: clinicians and other users can take the model they usually use for estimating the average treatment effect and, with a few lines of code, get a visualisation that is easy to understand and interpret.

One of the most familiar settings for a machine learning engineer is having access to a lot of data, but modest resources to annotate it. Everyone in that predicament eventually goes through the logical steps of asking themselves what to do when they have limited supervised data, but lots of unlabeled data, and the literature appears to have a ready answer: semi-supervised learning. And that’s usually when things go wrong.

Let’s get it right

Article: Machine intelligence makes human morals more important

Machine intelligence is here, and we’re already using it to make subjective decisions. But the complex way AI grows and improves makes it hard to understand and even harder to control. In this cautionary talk, techno-sociologist Zeynep Tufekci explains how intelligent machines can fail in ways that don’t fit human error patterns – and in ways we won’t expect or be prepared for. ‘We cannot outsource our responsibilities to machines,’ she says. ‘We must hold on ever tighter to human values and human ethics.’

Article: AI – Fear, uncertainty, and hope

How to cope with AI and start becoming a part of it. If you open a news site today, you are almost sure to be met with an article about AI, robotics, quantum computing, genetic engineering, autonomous vehicles, natural language processing, and other technologies from the box called ‘The fourth industrial revolution’. Rating these technologies makes no sense, as they all have a staggering potential to change our world forever. Artificial intelligence, however, is already surging into all of the other technologies. Facilitating the mastery of big data, pattern recognition, and prediction is an inherent quality of AI, and it is frequently applied to support ground-breaking discoveries in other technologies. I once heard a driving instructor compare holding the hands on the steering wheel to holding a gun, because of how dangerous it is to drive a car. AI is also dangerous, and we need to face its dark side too, not only revel in the glorious benefits it brings us. Anything else would be reckless driving.

Article: Towards Trans-Inclusive AI

AI systems ‘think’ like those who designed them – with a heteronormative conception of gender. They exclude transgender people and reinforce gender stereotypes. Worse, governments across the world spend billions of dollars to scale cis-sexist AI to new industries like government agencies and to new applications like image recognition, with little regard for their gendered impacts. The computer science community, the tech community, and government agencies should be more accountable for the gendered impacts of their algorithms. They need to learn to analyze the gendered impacts of algorithms using queer and transgender theory, then apply that learning to the design, deployment, and monitoring of AI algorithms in society.

Article: 9 Steps Toward Ethical AI

Few current laws address the use of artificial intelligence. That puts companies under greater pressure to reassure the public that their AI applications are ethical and fair.

Article: Will Big Data Affect Opinion Polls?

Statisticians have recently felt pressure to substitute sample surveys with the new opportunities offered by Big Data. Some authors suggest that opinion polls and other random sample surveys have become obsolete in the new era of Big Data. The author discusses the relationship between survey-based and Big Data-based approaches to the measurement of consumers’ and public opinion. Special attention is given to traditional opinion polls.

Article: The Hitchhiker’s Guide to AI Ethics

A machine learning algorithm. OpenAI’s GPT2 language model is trained to predict text. It’s huge, complex, takes months of training over tons of data on expensive computers; but once that’s done it’s easy to use. A prompt (‘The Hitchhiker’s Guide to AI Ethics is a’) and a little curation is all it took to generate my raving review using a smaller version of GPT2. The text has some obvious errors but it is a window into the future. If AI can generate human-like output can it also make human-like decisions? Spoiler alert: yes it can, it already is. But is *human-like* good enough? What happens to TRUST in a world where machines generate human-like output and make human-like decisions? Can I trust an autonomous vehicle to have seen me? Can I trust the algorithm processing my housing loan to be fair? Can we trust the AI in the ER enough to make life and death decisions for us? As technologists we must flip this around and ask: How can we make algorithmic systems trustworthy? Enter Ethics. To understand more we need some definitions, a framework, and lots of examples. Let’s go!

Article: AI TRAPS: Automating Discrimination

A close look at how AI & algorithms reinforce prejudices and biases of its human creators and societies, and how to fight discrimination.

Article: Understanding Dataism: Manipulation and Threat Behind AI

‘If you experience something – record it. If you record something – upload it. If you upload something – share it’ – this phrase best explains dataism, the new 21st-century religion focused mainly on the rapid development of technology, Internet obsession, and general data-worship. Artificial Intelligence, Machine Learning, Data Science, Big Data… these things and more are so powerful and new to the majority of us that we start to fear what kind of dangerous transformations they may bring. Science fiction, however, is quickly becoming science fact – the future is the machine. But isn’t it too early to assume something like that and declare that technology will destroy humanity? The truth is, right now Dataism is far from being a pure religion in its meaning, or a scientific concept grounded in true laws. It is rather a complex of fears and a vision that AI and related technologies are nothing more than manipulation and threat. It is a weird comparison, but just like capitalism, Dataism began as a neutral scientific theory and is now mutating into a religion that claims to determine right and wrong. So, should we believe in Dataism or not? Let’s investigate.

If you did not already know

Distributed Cooperative Logistics Platform (DCLP) google
Supply chains and logistics have a growing importance in the global economy. Supply chain information systems around the world are heterogeneous, and each one can both produce and receive massive amounts of structured and unstructured data in real time, usually generated by information systems, connected objects, or manually by humans. This heterogeneity is due to Logistics Information Systems components and processes being developed with different modelling methods and running on many platforms; hence, decision making is difficult in such a multi-actor environment. In this paper we identify some current challenges and integration issues between separately designed Logistics Information Systems (LIS), and we propose a Distributed Cooperative Logistics Platform (DCLP) framework based on NoSQL, which facilitates real-time cooperation between stakeholders and improves decision making in a multi-actor environment. We also include a case study of a Hospital Supply Chain (HSC), and a brief discussion of perspectives and the future scope of work. …

Diverse Online Feature Selection google
Online feature selection has been an active research area in recent years. We propose a novel diverse online feature selection method based on Determinantal Point Processes (DPP). Our model aims to provide diverse features which can be composed in either a supervised or unsupervised framework. The framework aims to promote diversity based on the kernel produced on a feature level, through at most three stages: feature sampling, local criteria and global criteria for feature selection. In the feature sampling stage, we sample the incoming stream of features using a conditional DPP. The local criteria are used to assess and select streamed features (i.e. only when they arrive); we use unsupervised scale-invariant methods to remove redundant features and, optionally, supervised methods to introduce label information to assess relevant features. Lastly, the global criteria use regularization methods to select a globally optimal subset of features. This three-stage procedure continues until there are no more features arriving or some predefined stopping condition is met. We demonstrate through experiments that this approach yields better compactness, and is comparable to and in some instances outperforms other state-of-the-art online feature selection methods. …

Pairwise Augmented GAN google
We propose a novel autoencoding model called Pairwise Augmented GANs. We train a generator and an encoder jointly and in an adversarial manner. The generator network learns to sample realistic objects. In turn, the encoder network at the same time is trained to map the true data distribution to the prior in latent space. To ensure good reconstructions, we introduce an augmented adversarial reconstruction loss. Here we train a discriminator to distinguish two types of pairs: an object with its augmentation and the one with its reconstruction. We show that such adversarial loss compares objects based on the content rather than on the exact match. We experimentally demonstrate that our model generates samples and reconstructions of quality competitive with state-of-the-art on datasets MNIST, CIFAR10, CelebA and achieves good quantitative results on CIFAR10. …

Document worth reading: “Graph Kernels: A Survey”

Graph kernels have attracted a lot of attention during the last decade, and have evolved into a rapidly developing branch of learning on structured data. During the past 20 years, the considerable research activity that occurred in the field resulted in the development of dozens of graph kernels, each focusing on specific structural properties of graphs. Graph kernels have proven successful in a wide range of domains, ranging from social networks to bioinformatics. The goal of this survey is to provide a unifying view of the literature on graph kernels. In particular, we present a comprehensive overview of a wide range of graph kernels. Furthermore, we perform an experimental evaluation of several of those kernels on publicly available datasets, and provide a comparative study. Finally, we discuss key applications of graph kernels, and outline some challenges that remain to be addressed. Graph Kernels: A Survey
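As a concrete taste of the idea, one of the simplest graph kernels compares two labelled graphs through their vertex-label histograms: the kernel value is the inner product of the two label-count vectors. This toy sketch is an illustration of the general notion, not one of the specific kernels the survey evaluates:

```python
from collections import Counter

def vertex_label_kernel(labels_g1, labels_g2):
    """Vertex label histogram kernel: count how often each label occurs
    in each graph, then take the inner product of the count vectors."""
    c1, c2 = Counter(labels_g1), Counter(labels_g2)
    return sum(c1[label] * c2[label] for label in c1)
```

Like all kernels, it is symmetric in its two arguments, and richer graph kernels (e.g. those based on walks or subtree patterns) refine the same recipe by counting more structure-aware features than bare vertex labels.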

Book Memo: “Automated Machine Learning”

This open access book presents the first comprehensive overview of general methods in Automated Machine Learning (AutoML), collects descriptions of existing systems based on these methods, and discusses the first series of international challenges of AutoML systems. The recent success of commercial ML applications and the rapid growth of the field has created a high demand for off-the-shelf ML methods that can be used easily and without expert knowledge. However, many of the recent machine learning successes crucially rely on human experts, who manually select appropriate ML architectures (deep learning architectures or more traditional ML workflows) and their hyperparameters. To overcome this problem, the field of AutoML targets a progressive automation of machine learning, based on principles from optimization and machine learning itself. This book serves as a point of entry into this quickly-developing field for researchers and advanced students alike, as well as providing a reference for practitioners aiming to use AutoML in their work.