Your Load Generator is Probably Lying to You – Take the Red Pill and Find Out Why

Nearly all load generation and monitoring tools fail to measure what you think they measure. The charts you thought were full of relevant information about how your system is performing are really telling you a lie. Your sensory inputs are being jammed.

Introduction to Logistic Regression with R

In my previous blog post I explained linear regression; in today’s post I will explain logistic regression. Consider a scenario where we need to predict a medical condition of a patient (HBP): HAVE HIGH BP or NO HIGH BP, based on some observed attributes – age, weight, smoking status, systolic value, diastolic value, race, etc. In this scenario we have to build a model that takes the above-mentioned attributes as input values and HBP as the response variable. Note that the response variable (HBP) takes a value from a fixed set of classes: HAVE HIGH BP or NO HIGH BP.
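The post itself uses R; as a minimal sketch of the same idea in Python with scikit-learn, here is a logistic regression fit on synthetic patient data (the feature names, coefficients, and data below are illustrative, not from the post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
age = rng.integers(20, 80, n)
weight = rng.normal(80, 15, n)
is_smoking = rng.integers(0, 2, n)
systolic = rng.normal(120, 15, n)

# Made-up generative rule: higher age/weight/systolic and smoking
# raise the odds of high blood pressure.
logit = -15 + 0.05 * age + 0.03 * weight + 0.8 * is_smoking + 0.08 * systolic
p = 1 / (1 + np.exp(-logit))
hbp = (rng.random(n) < p).astype(int)  # 1 = HAVE HIGH BP, 0 = NO HIGH BP

X = np.column_stack([age, weight, is_smoking, systolic])
model = LogisticRegression(max_iter=1000).fit(X, hbp)

# Predicted probability of HBP for one hypothetical patient.
print(model.predict_proba([[65, 95, 1, 150]])[0, 1])
```

The R equivalent in the post would use `glm(..., family = binomial)`; the mechanics (classes as the response, attributes as inputs) are the same.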

Getting started with Pandas

We have made use of Python’s Pandas package in a variety of posts on the site. These have showcased some of Pandas’ abilities including the following:
• DataFrames for data manipulation with built-in indexing
• Handling of missing data
• Data alignment
• Melting/stacking and Pivoting/unstacking data sets
• Groupby feature allowing split -> apply -> combine operations on data sets
• Data merging and joining
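A few of the abilities listed above can be sketched in one short example (the data below is made up for illustration): handling missing data, groupby (split -> apply -> combine), and merging.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [10, np.nan, 7, 5],   # one missing value
})

# Handling of missing data: fill the gap with the column mean.
sales["units"] = sales["units"].fillna(sales["units"].mean())

# Groupby: split by store, apply a sum, combine into one result.
totals = sales.groupby("store", as_index=False)["units"].sum()

# Merging: join store-level metadata onto the totals.
stores = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})
merged = totals.merge(stores, on="store")
print(merged)
```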

Bayesian Survival Analysis in Python with pymc3

Survival analysis studies the distribution of the time to an event. Its applications span many fields across medicine, biology, engineering, and social science. This post shows how to fit and analyze a Bayesian survival model in Python using pymc3.
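The post fits its model with pymc3; as a self-contained sketch of the underlying idea, here is a tiny grid-approximation posterior for an exponential survival model with right-censoring (the times and censoring flags below are made up):

```python
import numpy as np

times = np.array([2.0, 3.5, 1.2, 8.0, 6.4, 4.1])  # time to event or censoring
event = np.array([1,   1,   1,   0,   0,   1])    # 1 = event observed, 0 = censored

lam_grid = np.linspace(0.01, 2.0, 2000)           # candidate hazard rates

# Exponential log-likelihood with censoring: an observed event at time t
# contributes log(lam) - lam*t; a censored observation contributes -lam*t.
loglik = event.sum() * np.log(lam_grid) - lam_grid * times.sum()

# Flat prior on the grid -> the posterior is the normalized likelihood.
post = np.exp(loglik - loglik.max())
post /= post.sum()

post_mean = (lam_grid * post).sum()
print(f"posterior mean hazard: {post_mean:.3f}")
```

pymc3 automates exactly this kind of computation (with MCMC instead of a grid) for far richer models than the exponential one sketched here.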

Understanding Support Vector Machine algorithm from examples (along with code)

Mastering machine learning algorithms isn’t a myth at all. Most beginners start by learning regression. It is simple to learn and use, but does that solve our purpose? Of course not, because you can do so much more than just regression! Think of machine learning algorithms as an armory packed with axes, swords, blades, bows, daggers, etc. You have various tools, but you ought to learn to use them at the right time. As an analogy, think of ‘Regression’ as a sword capable of slicing and dicing data efficiently, but incapable of dealing with highly complex data. ‘Support Vector Machines’, on the contrary, is like a sharp knife – it works on smaller datasets, but on them it can be much stronger and more powerful in building models. By now, I hope you’ve mastered Random Forest, the Naive Bayes algorithm and ensemble modeling. If not, I’d suggest you take a few minutes to read about them as well. In this article, I shall guide you from the basics to advanced knowledge of a crucial machine learning algorithm: support vector machines.
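As a minimal taste of what the article covers in depth, here is the basic fit/predict cycle for an SVM in scikit-learn on toy data (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so a linear kernel suffices.
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```

Swapping `kernel="linear"` for `"rbf"` is the usual next step when the classes are not linearly separable.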

September 2015: Scripts of the Week

Our top scripts from September give you: fork-friendly code for exploring large datasets, tips for quickly using pandas to answer questions about your data, and an intro to bag-of-words in R. Plus, one Kaggler digs deeper into gender stereotypes in the medical field and finds a surprising conclusion.

Integrating Python and R into a Data Analysis Pipeline – Part 1

For a conference about the R language, EARL London 2015 saw a surprising number of discussions about Python. I like to think that at least some of this had to do with the fact that, the day before the conference, we ran a 3-hour workshop outlining various strategies for integrating Python and R. This is the first in a series of three blog posts that:
• outline the basic strategy for integrating Python and R;
• run through the different steps involved in this process; and
• give a real example of how and why you would want to do this.
This post kicks everything off by:
• covering the reasons why you may want to include both languages in a pipeline;
• introducing ways of running R and Python from the command line; and
• showing how you can accept inputs as arguments and write outputs to various file formats.

Rapid Development & Performance in Spark For Data Scientists

Spark is a cluster computing framework that can significantly increase the efficiency and capabilities of a data scientist’s workflow when dealing with distributed data. However, deciding which of its many modules, features and options are appropriate for a given problem can be cumbersome. Our experience at Stitch Fix has shown that these decisions can have a large impact on development time and performance. This post will discuss strategies at each stage of the data processing workflow that data scientists new to Spark should consider employing for high-productivity development on big data.

5 Tools Everyone in the Big Data Analytics Industry Should Be Using

• Tableau Desktop and Server
• Splunk
• Pentaho BA
• Karmasphere
• Skytree Server

Investigating Data Scientists, their Skills and Team Makeup

A new survey of 490 data professionals from small to large companies, conducted by AnalyticsWeek in partnership with Business Over Broadway, provides a look into the field of data science.

Time Series IoT applications in Railroads

In this article, we have highlighted some use cases of time series data for railroads. There are many more factors that could be considered, especially the choice of technology for implementing these time series algorithms. In subsequent sections, we will show how some of these use cases could be implemented using the R programming language.

Who’s alike? Use BigObject feature vectors to find similarities

Cluster analysis is a common technique for grouping a set of objects so that objects in the same group share certain attributes. It’s commonly used in marketing and sales planning to define market segmentations. Here at BigObject we adopt a simple approach to exploring the similarities between objects: we calculate a “feature vector” based on given attributes and use the resulting score to determine which objects are “alike.” This is a simple example showing how to use BigObject to extract product features and then find similar products in your retail data. We use the default sample data in the BigObject docker image to demonstrate the task. You may run the docker image on your own computer or play around in our sandbox.
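BigObject computes the feature vectors with its own commands; as a generic illustration of the underlying idea, here is cosine similarity between made-up product feature vectors in plain numpy (the products and attribute values are invented for this sketch):

```python
import numpy as np

# Hypothetical feature vectors, e.g. per-attribute purchase scores.
products = {
    "cola":      np.array([0.9, 0.1, 0.3]),
    "diet cola": np.array([0.8, 0.2, 0.3]),
    "coffee":    np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the other products by similarity to "cola".
target = products["cola"]
ranked = sorted(
    ((name, cosine(target, vec)) for name, vec in products.items() if name != "cola"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # "diet cola" should rank above "coffee"
```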