The Data Calculator: Data Structure Design and Cost Synthesis From First Principles, and Learned Cost Models

Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.

Churn prediction

Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. The companies are interested in identifying segments of these customers because the price for acquiring a new customer is usually higher than retaining the old one. For example, if Netflix knew a segment of customers who were at risk of churning they could proactively engage them with special offers instead of simply losing them.

Graph Databases. What’s the Big Deal?

Continuing the analysis on semantics and data science, it’s time to talk about graph databases and what they have to offer us.

Confluo: Millisecond-level Queries on Large-scale Live Data

Confluo is a system for real-time distributed analysis of multiple data streams. Confluo simultaneously supports high throughput concurrent writes, online queries at millisecond timescales, and CPU-efficient ad-hoc queries via a combination of data structures carefully designed for the specialized case of multiple data streams, and an end-to-end optimized system design.

10 Tips for Choosing the Optimal Number of Clusters

Clustering is one of the most common unsupervised machine learning problems. Similarity between observations is defined using some inter-observation distance measures or correlation-based distance measures.

Check-mark State Recognition will take NLP projects to the next level!

Check-mark state detection from scanned images and further analysis on them can add meaningful features to NLP projects.

Building an interactive computer vision demo in a few hours on AWS DeepLens

Building an interactive computer vision demo in a few hours on AWS DeepLens and Slack. A step-by-step guide to build a demo that can help you explain what you do to kids, impress clients, or automate actions by detecting people.

An Introduction to the Bootstrap Method

Bootstrap is a powerful, computer-based method for statistical inference without relying on too many assumption. The first time I applied the bootstrap method was in an A/B test project. At that time I was like using an powerful magic to form a sampling distribution just from only one sample data. No formula needed for my statistical inference. Not only that, in fact, it is widely applied in other statistical inference such as confidence interval, regression model, even the field of machine learning. That’s lead me go through some studies about bootstrap to supplement the statistical inference knowledge with more practical other than my theory mathematical statistics classes.

Practical tips for class imbalance in binary classification

Binary classification problem is arguably one of the simplest and most straightforward problems in Machine Learning. Usually we want to learn a model trying to predict whether some instance belongs to a class or not. It has many practical applications ranging from email spam detection to medical testing (determine if a patient has a certain disease or not).

Use-cases of Google’s Universal Sentence Encoder in Production

Before building any Deep Learning model in Natural Language Processing (NLP), test embedding plays a major role. The text embedding converts text (words or sentences) into a numerical vector.

Understanding Markov Decision Processes

At a high level intuition, a Markov Decision Process(MDP) is a type of machine learning, reinforcement learning to be specific. The model allows machines and agents to determine the ideal behavior within a specific environment, in order to maximize the model’s ability to achieve a certain goal/destination/or state of being. This goal is determined by us, and MDP seeks to optimize the solution to achieve this goal. This optimization is done with a reward feedback system, where given the desired output, a score is passed to the agent. This is known as the reinforcement signal.

Quick Hit: Automating Production Graphics Uploads in R Markdown Documents with googledrive

As someone who measures all kinds of things on the internet as part of his $DAYJOB, I can say with some authority that huge swaths of organizations are using cloud-services such as Google Apps, Dropbox and Office 365 as part of their business process workflows. For me, one regular component that touches the ‘cloud’ is when I have to share R-generated charts with our spiffy production team for use in reports, presentations and other general communications. These are typically project-based tasks and data science team members typically use git- and AWS-based workflows for gathering data, performing analyses and generating output. While git is great at sharing code and ensuring the historical integrity of our analyses, we don’t expect the production team members to be or become experts in git to use our output. They live in Google Drive and thanks to the googledrive?? package we can bridge the gap between code and output with just a few lines of R code.

Understanding Entity Embeddings and It’s Application

As of late I’ve been reading a lot on entity embeddings after being tasked to work on a forecasting problem. The task at hand was to predict the salary of a given job title, given the historical job ads data that we have in our data warehouse. Naturally, I just had to seek out how this can be solved using deep learning?-?since it’s a lot more sexier nowadays to do stuff in deep learning instead of plain ‘ol linear regression (if you’re reading this, and since only a data scientist would ever be reading this post, I’m sure you’d understand 🙂 ). And what better way to go about learning deep learning than to look for examples from Lo and behold?-?they actually do have an example of that.

Deep learning from a programmer’s perspective (aka Differentiable Programming)

Any program can be expressed as a sequence of elementary operations, interleaved with basic control flow and loops. As any experienced programmer knows, writing a function is much simpler that ensuring that the function is correct. This is why in modern programming practice, any function should come with one or more unit tests, that evaluate its correct behavior on a series of use cases: https://…/href In machine learning terminology, this set of use cases is called the test set. An intriguing property of unit tests is that they are generally simpler both to design and to implement than the function they are testing (which is why an entire programming phylosophy is built on top of them).

Time Series of Price Anomaly Detection

Also known as outlier detection, anomaly detection is a data mining process used to determine types of anomalies found in a data set and to determine details about their occurrences. Automatic anomaly detection is critical in today’s world where the sheer volume of data makes it impossible to tag outliers manually. It has a wide range of applications such as fraud detection, system health monitoring, fault detection, and event detection systems in sensor networks, and so on. But I would like to apply anomaly detection to hotel room prices. The reason is somewhat selfish.