Propensity Score Methods for Causal Inference: an Overview

Propensity score methods are popular and effective statistical techniques for reducing selection bias in observational data to increase the validity of causal inference based on observational studies in behavioral and social science research. Some methodologists and statisticians have raised concerns about the rationale and applicability of propensity score methods. In this review, we addressed these concerns by reviewing the development history and the assumptions of propensity score methods, followed by the fundamental techniques of and available software packages for propensity score methods. We especially discussed the issues in and debates about the use of propensity score methods. This review provides beneficial information about propensity score methods from the historical point of view and helps researchers to select appropriate propensity score methods for their observational studies.

Building Google Dataset Search and Fostering an Open Data Ecosystem

At a very high level, Google Data Search relies on dataset providers, big and small, adding structured metadata on their sites using the open standard. The metadata specifies the salient properties of each dataset: its name and description, spatial and temporal coverage, provenance information, and so on. Dataset Search uses this metadata, links it with other resources that are available at Google (more on this below!), and builds an index of this enriched corpus of metadata. Once we built the index, we can start answering user queries – and figuring out which results best correspond to the query.

A Multivariate Time Series Guide to Forecasting and Modeling (with Python codes)

Time is the most critical factor that decides whether a business will rise or fall. That´s why we see sales in stores and e-commerce platforms aligning with festivals. These businesses analyze years of spending data to understand the best time to throw open the gates and see an increase in consumer spending. But how can you, as a data scientist, perform this analysis? Don´t worry, you don´t need to build a time machine! Time Series modeling is a powerful technique that acts as a gateway to understanding and forecasting trends and patterns.

Demystifying crucial statistics in Python

Learn about the basic statistics required for Data Science and Machine Learning in Python.

Factor Levels in R

This tutorial takes course material from DataCamp’s free Intro to R course and allows you to practice Factors.

Visualize your Portfolio´s Performance and Generate a Nice Report with R

There are few things more exciting than seeing your stocks values going up! I started investing last year in stocks and, like visualization and R lover, I couldn´t help but create some nice plots and functions to automate the process of watching it happen.

A/B Testing with a Data Science Approach

How to perform statistically significant A/B tests

Make Python Pandas go fast

Suppose you have a Data Analysis batch job that runs every hour on a dedicated machine. As the weeks go by, you notice that the inputs are getting larger and the time taken to run it gets longer, slowly nearing the one hour mark. You worry that subsequent executions might begin to ‘run into’ each other and cause your business pipelines to misbehave. Or perhaps you´re under SLA to deliver results for a batch of information within a given time constraint, and with the batch size slowly increasing in production, you´re approaching the maximum allotted time.

Featuretools on Spark

Apache Spark is one of the most popular technologies on the big data landscape. As a framework for distributed computing, it allows users to scale to massive datasets by running computations in parallel either on a single machine or on clusters of thousands of machines. Spark can be used with Scala, R, Java, SQL, or Python code and its capabilities have led to a rapid adoption as the size of datasets – and the need for methods to work with them – increase.

Time Series with Matrix Profile

If you are here, you are likely aware of what a Distance Matrix (DM) is. If not, think about those tables that used to be on maps with the distance between cities. It is widely used in TS for clustering, classification, motif search, etc. However, even for modestly sized datasets, the algorithms can take months to compute it and even with speed-up techniques (i.e., indexing, lower-bounding, early abandoning) they can be, at best, one or two orders of magnitude faster. Matrix Profile it´s like a DM but faster (much faster) to compute. Figure 1 shows a DM and a Matrix Profile. As you can see, in the Matrix Profile, as the name suggests, you see the Profile of a DM. It stores the minimum Euclidean distance of every subset of one TS (think of a Sliding Window) with another (or itself, called Self-Join). It also stores a companion vector called Profile Index, that gives us the index of each nearest neighbor.

Manifold Learning: The Theory Behind It

Manifold Learning has become an exciting application of geometry and in particular differential geometry to machine learning. However, I feel that there is a lot of theory behind the algorithm that is left out, and understanding it will benefit in applying the algorithms more effectively.

A comprehensive Machine Learning workflow with multiple modelling using caret and caretEnsemble in…

A comprehensive Machine Learning workflow with multiple modelling using caret and caretEnsemble in R

Data Preparation and Preprocessing is just as important creating the actual Model in Data Sciences

This article focuses on the data understanding and pre-processing using a few steps viz. Preliminary feature selection on the basis of correlation matrix, outlier treatment and transformation on target variable. In the corresponding article, I will also talk about steps like missing value treatment and feature engineering.

Performance Exploding! Introduction to TigerGraph Developer Edition

Graph database is a database type that grows with the popularity of Big Data. After years of focusing on business market, TigerGraph publish its free developer edition. TigerGraph itself claims to be Graph Database 3.0, with capability to handle huge amount of data, supporting realtime updates and queries, and it also defines a new language called GSQL. This is no doubt a great news to graph database fans.

Neural Networks: an Alternative to ReLU

Rectified Linear Units still have several advantages. They are easy to compute, ideal for specialized hardware architectures like Google’s TPU. They are non-linear, creating sharp boundaries that a single straight line cannot do. And, they create a durable gradient (on their positive side, at least), which differentiates them from activation functions which taper – sigmoid and logistic functions for example. At those activations’ taper, the slope is nearly horizontal, so gradient descent makes little attempt to modify the neuron, and learning progresses slowly. ReLUs’ right half has constant, positive slope, so it receives the same strength of signal from gradient descent at all those points.

Thinking Fast & Slow – A Machine Learning Practitioner´s Perspective

I just finished reading ‘Thinking Fast And Slow’ by research psychologist and Nobel prize winner Daniel Kahneman, or should I say listening to the audio version of it (I am lazy). I highly recommend it, but if you don’t want to read it, you are in luck, I am going to talk about what I liked most!

Microsoft’s AI-powered Sketch2Code builds websites and apps from drawings

Microsoft has developed an AI-powered web design tool capable of turning sketches of websites into functional HTML code. Called Sketch2Code, Microsoft AI’s senior product manager Tara Shankar Jana explained that the tool aims to ’empower every developer and every organisation to do more with AI’. It was born out of the ‘intrinsic’ problem of sending a picture of a wireframe or app designs from whiteboard or paper to a designer to create HTML prototypes. To break this process Microsoft developed a web-based application which cuts out the extra human element (in this case the designer). Instead, images taken of sketches are sent to AI servers based on Microsoft’s Azure cloud infrastructure.

What to do with the data? The evolution of data platforms in a post big data world

Note: I’ve had the eminent thought leader Esteban Kolsky, founder and managing principal of ThinkJar, doing guest posts before on this blog. Time and again, the guy simply nails what the core of contemporary thinking is and how to approach it. This time, he goes to the heart of how the business world is evolving and what it takes to have a transformative success – and that means ecosystems and platforms. This post is the first of two that he will have here. (Part two comes next week.) The idea for these posts grew out of research that Esteban just finished for Radius, a company that characterizes itself as providing Customer Data Platforms (CDP) for B2B revenue teams. This research inspired more than simply a post with market data; this is significant thinking on where data platforms are going in a world that has solved (more or less) big data. So, Esteban, start the ball rolling…

Introduction to the RStudio Package Manager

R’s package ecosystem is awesome, with more than 13K packages on CRAN and countless more on GitHub, Bioconductor, and hundreds of updates or additions each day. But, this ecosystem can be intimidating for organizations relying on R. Data scientists face challenges discovering packages, sharing code, and reproducing work. IT organizations struggle to manage change control while keeping users secure and compliant with InfoSec requirements. This webinar will introduce RStudio Package Manager, a new product to organize R packages across teams, departments, and organizations.