Logistic Regression in One Picture

Logistic regression fits data to a particular equation (an S-shaped, or logistic, curve) so you can make predictions from your data. This type of regression is a good choice when modeling binary variables, which come up frequently in real life (e.g. work or don’t work, marry or don’t marry, buy a house or rent…). The logistic regression model is popular, in part, because it gives probabilities between 0 and 1. Let’s say you were modeling the risk of credit default: values closer to 0 indicate a tiny risk, while values closer to 1 mean a very high risk. The following image shows an example of how one might tailor a logistic model to credit-score-based risk.
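The excerpt stops at the picture, but the probability idea is easy to sketch in code. The snippet below is a hypothetical illustration only (made-up credit scores and default labels, fit with scikit-learn’s LogisticRegression), not the article’s own model:

```python
# Minimal sketch: logistic regression for credit-default risk (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: credit scores and whether the borrower defaulted (1 = default).
credit_score = np.array([[480], [520], [560], [600], [640], [680], [720], [760], [800]])
defaulted = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])

model = LogisticRegression()
model.fit(credit_score, defaulted)

# predict_proba returns values between 0 and 1: near 1 means very high risk,
# near 0 means a tiny risk of default.
print(model.predict_proba([[510], [750]])[:, 1])
```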


K-nn Clustering Explained in One Picture

This is a simple overview of the k-NN process. Perhaps the most challenging step is finding a k that’s ‘just right’. The square root of n can put you in the ballpark, but ideally you should use a training set (i.e. a nicely categorized set) to find a ‘k’ that works for your data. Remove a few categorized data points and make them ‘unknowns’, testing a few values for k to see what works.
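The excerpt doesn’t show that search, but it is short to sketch. Here is a minimal, hedged illustration of trying a few values of k with cross-validation, using the iris data purely as a stand-in for ‘a nicely categorized set’:

```python
# Minimal sketch: pick k for k-NN by validating a few candidates on labeled data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# sqrt(n) only gives a ballpark; test candidates around it and keep the best.
for k in [3, 5, 7, 9, 11, 13]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```

Cross-validation automates the ‘remove a few categorized points and treat them as unknowns’ step by holding out a different slice of the labeled data in each fold.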


GPU Databases On the Rise at NVIDIA #GTC19

While at the NVIDIA GPU Technology Conference (GTC) this week in Silicon Valley, I noted how well GPU databases are resonating with the marketplace. The vendors in this space were clearly well regarded, judging by the crowds at the technical sessions these companies presented and by the general buzz on the conference floor. Interest in GPU databases is certainly on the rise.


How to Speed Up Gradient Boosting by a Factor of Two

At STATWORX, we are not only helping customers find, develop, and implement a suitable data strategy, but we also spend some time doing research to improve our own tool stack. This way, we can give back to the open-source community. One special focus of my research lies in tree-based ensemble methods. These algorithms are bread-and-butter tools in predictive learning, and you can find them as a standalone model or as an ingredient in an ensemble in almost every supervised learning Kaggle challenge. Renowned models are Random Forest by Leo Breiman, Extremely Randomized Trees (Extra-Trees) by Pierre Geurts, Damien Ernst & Louis Wehenkel, and Multiple Additive Regression Trees (MART; also known as Gradient Boosted Trees) by Jerome Friedman.


Learning Theory: (Agnostic) Probably Approximately Correct Learning

In my previous article, I discussed what Empirical Risk Minimization (ERM) is and gave a proof that it yields a satisfactory hypothesis under certain assumptions. Now I want to discuss Probably Approximately Correct (PAC) learning (which is quite a mouthful but kinda cool), a generalization of ERM. For those who are not familiar with ERM, I suggest reading my previous article on the topic, since it is a prerequisite for understanding PAC learning.
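The excerpt leaves the formal statement to the article itself; for reference, the standard agnostic PAC definition (in the usual textbook notation, which the article may write slightly differently) looks like this:

```latex
% Agnostic PAC learnability (standard formulation)
\mathcal{H} \text{ is agnostically PAC learnable if there exist }
m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N} \text{ and an algorithm } A \text{ such that for every }
\epsilon, \delta \in (0,1) \text{ and every distribution } \mathcal{D} \text{ over } \mathcal{X} \times \mathcal{Y},
\text{ running } A \text{ on } m \ge m_{\mathcal{H}}(\epsilon, \delta) \text{ i.i.d. samples returns } h \text{ with }
\Pr\!\left[ L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon \right] \ge 1 - \delta.
```

The plain (realizable) PAC setting is the special case where some hypothesis in the class achieves zero risk, so the min term drops out.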


Statistical Overview of Linear Regression (Examples in Python)

In statistics we are often looking for ways to quantify relationships between factors and responses in real life. Broadly, we can divide the responses we want to understand into two types: categorical responses and continuous responses. In the categorical case, we are looking to see how certain factors influence which type of response we get out of a set of options. For example, consider a data set concerning brain tumors: the factors might include the size of the tumor and the age of the patient, while the response variable would be whether the tumor is benign or malignant. These types of problems are usually called classification problems and can indeed be handled by a special type of regression called logistic regression. In the continuous case, however, we are looking to see how much our factors influence a measurable change in our response variable. In our particular example we will be looking at the widely used Boston Housing data set, which can be found in the scikit-learn library. While this data set is commonly used for predictive multiple regression models, we are going to focus on a statistical treatment to understand how features like an increase in room size affect housing value.
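The excerpt doesn’t include the code itself; as a hedged sketch of the statistical (rather than purely predictive) treatment it describes, the following assumes an older scikit-learn release that still ships load_boston, plus statsmodels for the regression summary:

```python
# Sketch: a statistical look at the Boston Housing data with OLS.
# Assumes a scikit-learn version that still provides load_boston and that
# statsmodels is installed; the choice of features here is illustrative.
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target  # median home value (in $1000s)

# Ordinary least squares with an intercept; the summary reports coefficients,
# standard errors and p-values, e.g. how RM (average rooms per dwelling)
# relates to housing value.
ols = sm.OLS(y, sm.add_constant(X[["RM", "LSTAT"]])).fit()
print(ols.summary())
```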


Implementing the XOR Gate using Backpropagation in Neural Networks

Implementing logic gates using neural networks helps us understand the mathematical computation by which a neural network processes its inputs to arrive at a certain output. This neural network will deal with the XOR logic problem. An XOR (exclusive OR) gate is a digital logic gate that gives a true output only when its two inputs differ from each other.
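The excerpt stops before the implementation; the toy network below is a hedged stand-in for the idea (numpy only, with an illustrative 2-4-1 architecture and learning rate), not necessarily the article’s exact code:

```python
# Sketch: a tiny feed-forward network learning XOR via backpropagation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR: true only when the inputs differ

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # hidden layer (4 units, illustrative)
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # output layer
lr = 1.0

for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (squared-error loss; sigmoid'(s) = s * (1 - s))
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3))  # should approach [0, 1, 1, 0]
```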


Programming Best Practices For Data Science

Data scientists and chief data officers are the hot hires these days, and government agencies at all levels are working to get more out of their rapidly growing troves of data. Determining how to approach all that data, however, can be daunting. ‘To solve business problems, develop new products and services and optimize processes,’ TDWI’s Dave Stodder wrote, ‘organizations increasingly need analytics insights produced by data science teams with a diverse set of technical skills and business knowledge who are also good communicators.’ To make the most of such investments, the report recommends:


Programming Best Practices For Data Science

Ethical considerations are also a very important area that should be reported on. It is important to think critically about this area and assess the impact of the system’s capabilities; ethical considerations need to be part of the product design and planning process. Data science is an emerging discipline and will most likely evolve over time, so follow interesting problems, people and technologies into the future of what data science will become. The data science life cycle generally comprises the following components:
• data retrieval
• data cleaning
• data exploration and visualization
• statistical or predictive modeling
While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow.
Often, the entire data science life cycle ends up either as an arbitrary mess of cells in a Jupyter Notebook or as a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical/predictive modeling.
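The excerpt stops short of the article’s workflow advice; as one hedged illustration of moving beyond a pile of notebook cells, each life-cycle phase above could become its own function in a plain script (the file name, column names and model choice here are placeholders):

```python
# Illustration only: one function per life-cycle phase instead of scattered cells.
import pandas as pd
from sklearn.linear_model import LinearRegression

def retrieve_data(path: str) -> pd.DataFrame:
    """Data retrieval: read the raw data from disk (path is a placeholder)."""
    return pd.read_csv(path)

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Data cleaning: drop rows with missing values (illustrative rule)."""
    return df.dropna()

def explore_data(df: pd.DataFrame) -> None:
    """Data exploration and visualization: quick numeric summaries."""
    print(df.describe())

def model_data(df: pd.DataFrame, target: str) -> LinearRegression:
    """Statistical or predictive modeling on the cleaned data."""
    X = df.drop(columns=[target])
    return LinearRegression().fit(X, df[target])

if __name__ == "__main__":
    df = clean_data(retrieve_data("data.csv"))   # "data.csv" is hypothetical
    explore_data(df)
    model = model_data(df, target="y")           # "y" is a hypothetical column
```

Keeping each phase behind a small, named function makes it easier to re-run, test and reorder the steps as the problem forces you to switch between them.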


Data Scientist’s Guide to Summarization

Did you ever face a situation where you had to scroll through a 400-word article only to realize that it contains just 4 key points? All of us have been there. In this age of information, with content being generated every second around the world, it has become quite difficult to extract the most important information in a reasonable amount of time. Unfortunately, we have only 24 hours in a day, and that is not going to change. In recent years, advances in machine learning and deep learning techniques have paved the way for automatic text summarization, which might solve this problem for us. In this article, we will give an overview of text summarization, the different techniques of text summarization, the various resources that are available, our results, and the challenges we faced while implementing these techniques.
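As a small taste of the extractive family of techniques the article covers, the toy summarizer below scores sentences by word frequency and keeps the top few; it is a hedged sketch, not one of the article’s implementations:

```python
# Sketch: frequency-based extractive summarization (illustrative only).
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # score each sentence by the total frequency of the words it contains
    scores = [sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # keep original sentence order

print(summarize("Text summarization condenses a document. "
                "Extractive methods pick existing sentences. "
                "Abstractive methods generate new sentences instead."))
```

Abstractive techniques, by contrast, generate new sentences (typically with sequence-to-sequence or transformer models) rather than selecting existing ones.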


Using R and H2O to identify product anomalies during the manufacturing process.

We will identify anomalous products on the production line by using measurements from testing stations together with deep learning models. Anomalous products are not failures; rather, they are products close to the measurement limits. Flagging them lets us display warnings before the process starts producing failed products, so the stations can receive maintenance in time.
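The article itself works in R; as a rough sketch of the same autoencoder-based idea using H2O’s Python API (the file name and columns are hypothetical, and the real post’s architecture may differ):

```python
# Sketch: score products by autoencoder reconstruction error with H2O (Python API).
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
measurements = h2o.import_file("station_measurements.csv")  # placeholder path
features = measurements.columns  # assume all columns are numeric test measurements

autoencoder = H2OAutoEncoderEstimator(activation="Tanh", hidden=[8, 2, 8], epochs=50)
autoencoder.train(x=features, training_frame=measurements)

# Per-product reconstruction error: unusually high values flag products drifting
# toward the measurement limits before outright failures appear.
reconstruction_mse = autoencoder.anomaly(measurements)
print(reconstruction_mse.head())
```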


How cdata Control Table Data Transforms Work

With all of the excitement surrounding cdata-style, control-table-based data transforms (the cdata ideas having been named as the ‘replacements’ for tidyr’s current methodology by the tidyr authors themselves!), I thought I would take a moment to describe how they work.


Breaking down Mean Average Precision (mAP)

If you have come across the PASCAL Visual Object Classes (VOC) and MS Common Objects in Context (COCO) challenge, or dabbled with projects involving information retrieval and re-identification (ReID), you might then be quite familiar with a metric called mAP.
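The excerpt only names the metric; as a hedged reminder of what it computes, the sketch below averages per-class average precision (the area under each precision-recall curve) into mAP using made-up scores. Detection benchmarks such as VOC and COCO additionally match predictions to ground truth via IoU thresholds before computing these curves:

```python
# Sketch: mean Average Precision as the mean of per-class AP (toy scores).
import numpy as np
from sklearn.metrics import average_precision_score

y_true = {  # 1 = relevant / correct for that class, 0 = not
    "cat": np.array([1, 0, 1, 1, 0]),
    "dog": np.array([0, 1, 0, 1, 1]),
}
y_score = {  # model confidence scores for the same items
    "cat": np.array([0.9, 0.8, 0.7, 0.4, 0.2]),
    "dog": np.array([0.6, 0.9, 0.3, 0.8, 0.5]),
}

ap_per_class = {c: average_precision_score(y_true[c], y_score[c]) for c in y_true}
mAP = float(np.mean(list(ap_per_class.values())))
print(ap_per_class, mAP)
```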


Will Scientific Research be able to avoid Artificial Intelligence pitfalls?

It’s now obvious that AI, machine learning and deep learning are no longer buzzwords, as they are becoming more and more present in every industry. Although the trend was overhyped in 2017, we are now certain that these technologies will be ubiquitous by 2020. Scientific research has not been left behind, and AI has triggered a significant change in this domain. Machine learning techniques such as medical image processing, biological feature identification by segmentation, and AI-based medical diagnosis are being officially adopted by health researchers, often with an uncritical approach.


Deep Learning Literature with Kaggle and Google Cloud Platform

I first discovered Kaggle around 4 years ago, when I was just starting out on my journey into the world of data. Their ‘Titanic’ dataset and the associated kernels helped me get up to speed on a lot of modern data science tools and techniques, and I’ve been hooked ever since. The platform has several parts, including machine-learning competitions, datasets, kernels, discussion forums and courses. The datasets section is my personal favourite, as it offers a vast playground to explore open data from any field you can imagine. You can then share your work via kernels (notebooks), which are a mix of text, code and code outputs.