Probability Distributions

This article contains the definitions of probability distributions for reference. Where possible, the definitions are consistent with R.
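
As a rough illustration of what "consistent with R" can mean in practice, here is a minimal Python sketch that mirrors R's naming convention (a d/p/q prefix for density, cumulative probability, and quantile) and R's parameterization (mean and sd for the normal, rate for the exponential). The wrapper functions are hypothetical helpers written for this sketch; they are not taken from the linked article.

```python
import math
from statistics import NormalDist


def dnorm(x, mean=0.0, sd=1.0):
    """Normal density, using R's (mean, sd) parameterization."""
    return NormalDist(mean, sd).pdf(x)


def pnorm(q, mean=0.0, sd=1.0):
    """Normal cumulative distribution function P(X <= q)."""
    return NormalDist(mean, sd).cdf(q)


def qnorm(p, mean=0.0, sd=1.0):
    """Normal quantile function (inverse of pnorm), for 0 < p < 1."""
    return NormalDist(mean, sd).inv_cdf(p)


def dexp(x, rate=1.0):
    """Exponential density, parameterized by rate (not scale), as in R."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


print(dnorm(0.0), pnorm(1.96), qnorm(0.975), dexp(1.0, rate=2.0))
```

Sticking to one parameterization matters most for distributions such as the exponential, where rate and scale are reciprocals and are easy to confuse.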

Cheatsheet – Python & R codes for common Machine Learning Algorithms

In his famous book Think and Grow Rich, Napoleon Hill narrates the story of Darby, who, after digging for a gold vein for a few years, walked away when he was just three feet from it! Now, I don’t know whether the story is true or false. But I surely know of a few Data Darbys around me. These people understand the purpose of machine learning and how to execute it, yet use just a set of 2–3 algorithms on whatever problem they are working on. They don’t update themselves with better algorithms or techniques, because these seem too tough or too time-consuming. Like Darby, they are surely missing out on a lot of action after getting this close! In the end, they give up on machine learning by saying it is very computation-heavy, or it is very difficult, or “I can’t improve my models above a threshold, so what’s the point?” Have you heard them? Today’s cheat sheet aims to change a few Data Darbys into machine learning advocates. Here’s a collection of the 10 most commonly used machine learning algorithms with their code in Python and R. Considering the rising usage of machine learning in building models, this cheat sheet should serve as a good code guide to help you put these algorithms to use. Good luck!

Big Data Top Trends in 2015

• Simple Interfaces for Non-Data Scientists
• Easy Availability of Sensor Driven Data
• Valuable Customer Insights in the Offing
• Cloud Impacts Big Data Profits Positively
• NoSQL or SQL?
• Faster in-Memory Databases
• Recognition of the Positives of HR Analytics

Java Machine Learning Tools & Libraries

This is a list of 25 Java Machine learning tools & libraries.

Python: Learn Data Analysis by analyzing weed prices

Analysing Weed Pricing across US – Data Analysis Workshop

Sharks, Landsharks, Geoplotting, and KDTrees

It’s been somewhat of a sharky summer these past 3 months: there was a smattering of attacks up and down the east coast of the US, professional surfer Mick Fanning had a close call at a competition in South Africa, and several beaches were closed in California after a Great White took a bite out of a surfboard! So now, with the end of summer officially here (at least in the northern hemisphere), we thought it would be interesting to dig into some shark attack data. In this post, we’ll look through the Global Shark Attack File, check out some of the characteristics of shark attacks, and then dive into some geo-plotting with Matplotlib Basemap.

Deeplearning4j on Spark

Given that deep learning is computationally intensive, if you’re working with large datasets, you should think about how to train deep neural networks in parallel. With Spark standalone, Deeplearning4j can run multi-threaded on your local machine; i.e. you don’t need a cluster or the cloud.

Introduction to Spark

After lots of ground-breaking work led by the UC Berkeley AMPLab, Spark was developed to utilize distributed, in-memory data structures to improve speeds by orders of magnitude for many data processing workloads. There are wonderful resources online if you are interested in learning more about why Spark is a crossover hit for data scientists, or you can read some of the original papers on the Apache Spark homepage.

Data Science for Internet of Things – practitioner course

Created by Data Science and IoT professionals, the course covers infrastructure (Hadoop – Spark), programming and modelling (R, time series), and IoT. The course starts in Nov 2015, is delivered online, and will have limited participants.

C5.0 Class Probability Shrinkage

I was recently asked to explain a potential disconnect in C5.0 between the class probabilities shown in the terminal nodes and the values generated by the prediction code. Here is an example using the iris data:

Fully customizable legends for rleafmap

This is a functionality I have wanted to add for some time… and finally it’s here! I just pushed a new version of rleafmap to GitHub, which brings the ability to attach legends to data layers. You simply need to create a legend object with the function layerLegend and then pass this object via the legend argument when you create your data layer. Thus, a map can contain different legends, each of them independent. This is cool because it means that when you mask a data layer with the layer control, its legend will also disappear. You can create legends in five different styles to better suit your data: points, lines, polygons, icons and gradient.

How to perform a Logistic Regression in R

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y can, in general, assume several different values. In the simplest case, y is binary, meaning that it can assume either the value 1 or 0. A classical example used in machine learning is email classification: given a set of attributes for each email, such as the number of words, links and pictures, the algorithm should decide whether the email is spam (1) or not (0). In this post we call the model “binomial logistic regression”, since the variable to predict is binary. However, logistic regression can also be used to predict a dependent variable that can assume more than 2 values. In this second case we call the model “multinomial logistic regression”. A typical example would be classifying films as “entertaining”, “borderline” or “boring”.
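
The linked post works in R; purely as an illustration of the binomial case described above, here is a minimal pure-Python sketch that fits a logistic regression by batch gradient descent on the log-loss. The toy data (a single predictor, loosely "number of links in an email"), the learning rate, and the function names are all assumptions made for this example and do not come from the post.

```python
import math
import random


def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, xi):
    """Estimated P(y = 1 | x) for weights w, with the intercept stored in w[0]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def fit_logistic(X, y, lr=0.1, epochs=3000):
    """Fit the weights by batch gradient descent on the average log-loss."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi  # derivative of the log-loss w.r.t. z
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data: label 1 = spam, 0 = not spam.
rng = random.Random(0)
X = [[rng.gauss(1.0, 1.0)] for _ in range(100)] + [[rng.gauss(4.0, 1.0)] for _ in range(100)]
y = [0] * 100 + [1] * 100

w = fit_logistic(X, y)
preds = [1 if predict_proba(w, xi) >= 0.5 else 0 for xi in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print("weights (intercept, slope):", w)
print("training accuracy:", accuracy)
```

In practice you would rely on a library routine (such as glm with a binomial family in R) rather than hand-written gradient descent; the sketch only shows that the model outputs a probability, which is then thresholded at 0.5 to obtain the class.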

How do you know if your model is going to work? Part 3: Out of sample procedures

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four-part mini-series “How do you know if your model is going to work?” we develop out-of-sample procedures.
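
One common out-of-sample procedure is k-fold cross-validation: fit on part of the data and score on the part that was held out. The following is a minimal Python sketch of that idea, assuming a simple least-squares line as the model and synthetic data; it is not the code from the series, and all names in it are made up for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds; each point is held out exactly once."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / k


# Synthetic data (an assumption for this sketch): y = 2 + 0.5 x + noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

in_sample = mse(fit_line(xs, ys), xs, ys)
out_of_sample = k_fold_mse(xs, ys)
print("in-sample MSE:", in_sample)
print("5-fold cross-validated MSE:", out_of_sample)
```

The in-sample error is computed on the same points the model was fitted to, so it tends to flatter the model; the cross-validated figure is usually the more honest estimate of how the model will do on new data.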

Building Packages in R – Part 1: The Skeleton

Last week we took some time to set up the prerequisites for building R packages, so if you haven’t done so already, feel free to take a look at Part 0: Setting Up R. This week we will look at the basic structure of an R package: its skeleton, if you will.

urlshorteneR: A package for shortening URLs

This is a small package I put together quickly to satisfy an immediate need: generating abbreviated URLs in R. As it happens, I require this functionality in a couple of projects, so it made sense to have a package to handle the details. It’s not perfect, but it does the job. The code is available from GitHub along with vague usage information.