An Introduction to Clustering and different methods of clustering

Have you come across a situation when a Chief Marketing Officer of a company tells you – “Help me understand our customers better so that we can market our products to them in a better manner!” I did and the analyst in me was completely clueless what to do! I was used to getting specific problems, where there is an outcome to be predicted for various set of conditions. But I had no clue, what to do in this case. If the person would have asked me to calculate Life Time Value (LTV) or propensity of Cross-sell, I wouldn’t have blinked. But this question looked very broad to me! This is usually the first reaction when you come across a unsupervised learning problem for the first time! You are not looking for specific insights for a phenomena, but what you are looking for are structures with in data with out them being tied down to a specific outcome. The method of identifying similar groups of data in a data set is called clustering. Entities in each group are comparatively more similar to entities of that group than those of the other groups. In this article, I will be taking you through the types of clustering, different clustering algorithms and a comparison between two of the most commonly used cluster methods.

Logistic Regression Regularized with Optimization

In the first part of this exercise, we will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision. Our task is to build a classification model that estimates an applicant’s probability of admission based the scores from those two exams.

Beginner’s Guide to Customer Segmentation

In this post I’m going to talk about something that’s relatively simple but fundamental to just about any business: Customer Segmentation. At the core of customer segmentation is being able to identify different types of customers and then figure out ways to find more of those individuals so you can… you guessed it, get more customers! In this post, I’ll detail how you can use K-Means clustering to help with some of the exploratory aspects of customer segmentation. I’ll be walking through the example using Yhat’s own Python IDE, Rodeo, which you can download for Windows, Mac or Linux here. If you’re using a Windows machine, Rodeo ships with Python (via Continuum’s Miniconda). How convenient!

Prophet: forecasting at scale

Today Facebook is open sourcing Prophet, a forecasting tool available in Python and R. Forecasting is a data science task that is central to many activities within an organization. For instance, large organizations like Facebook must engage in capacity planning to efficiently allocate scarce resources and goal setting in order to measure performance relative to a baseline. Producing high quality forecasts is not an easy problem for either machines or for most analysts. We have observed two main themes in the practice of creating a variety of business forecasts:
• Completely automatic forecasting techniques can be brittle and they are often too inflexible to incorporate useful assumptions or heuristics.
• Analysts who can produce high quality forecasts are quite rare because forecasting is a specialized data science skill requiring substantial experience.

BlueData Brings DevOps Agility to Data Science Operations with Spark, R, and Python

BlueData, provider of a leading Big-Data-as-a-Service (BDaaS) software platform, announced the new winter release for the BlueData EPIC software platform. This new release delivers several new enhancements for data science operations, bringing DevOps agility and collaboration to data science teams as well as support for new machine learning use cases. More organizations are now building data science teams and the role of the data scientist is the #1 job in the U.S. for the second year in a row. Data scientists are highly skilled at developing advanced analytical models and prototypes; their data-driven innovations can be game-changing. But the siloed efforts and custom-crafted prototypes of individual data scientists can be difficult to scale, reproduce, and share across multiple users. What works for an ad-hoc model in development may not necessarily work in production; what works as a one-off prototype on a laptop might not work as a consistent and repeatable process in a distributed computing environment.

Moving from R to Python: The Libraries You Need to Know


The Anatomy of Deep Learning Frameworks

Deep Learning, whether you like it or not is here to stay, and with any tech gold-rush comes a plethora of options that can seem daunting to newcomers. If you were to start off with deep learning, one of the first questions to ask is, which framework to learn? I’d say instead of a simple trial-and-error, if you try to understand the building blocks of all these frameworks, it would help you make an informed decision. Common choices include Theano, TensorFlow, Torch, and Keras. All of these choices have their own pros and cons and have their own way of doing things. After exploring the white papers and the dev docs, I could understand the design choices and was able to abstract the fundamental concepts that are common to all of these. In this post, I have tried to sketch out these common principles which would help you better understand the frameworks and for the brave hearts among you, provide a guide on how to implement your own deep learning framework. The interesting thing about these principles is that they are not specific to DL alone, they are applicable whenever you want to do a series of computations on data. Hence, most DL frameworks can be used for non-DL tasks as well.

Why Deep Learning Works 3: BackProp minimizes the Free Energy

Deep Learning is presented as Energy-Based Learning Indeed, we train a neural network by running BackProp, thereby minimizing the model error-which is like minimizing an Energy.

Reinforcement Learning in R

Reinforcement learning has gained considerable traction as it mines real experiences with the help of trial-and-error learning to model decision-making. Thus, this approach attempts to imitate the fundamental method used by humans of learning optimal behavior without the requirement of an explicit model of the environment. In contrast to many other approaches from the domain of machine learning, reinforcement learning works well with learning tasks of arbitrary length and can be used to learn complex strategies for many scenarios, such as robotics and game playing.

Iteration and closures in R

I recently read an interesting thread on unexpected behavior in R when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty to try and re-state and slow down the discussion of the problem (and fix) for clarity. The issue is: are references or values captured during iteration? Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). It is confusing (especially in R, which pushes so far in the direction of value oriented semantics) and best demonstrated with concrete examples.