Understanding and coding Neural Networks From Scratch in Python and R

In this article, I will discuss the building blocks of a neural network from scratch and focus on developing the intuition needed to apply neural networks. We will code in both “Python” and “R”. By the end of this article, you will understand how neural networks work, how we initialize weights, and how we update them using back-propagation.
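As a rough illustration of the two ideas named above (initializing weights and updating them from the error gradient), here is a minimal sketch, not taken from the article itself. It trains a single sigmoid neuron on the OR function, which is the simplest case of the update that back-propagation applies layer by layer. The function names, learning rate, epoch count and squared-error loss are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=2000, lr=0.5, seed=0):
    """Train one sigmoid neuron with per-sample gradient descent."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Initialize weights with small random values and the bias with zero.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Forward pass: weighted sum followed by the activation.
            z = sum(w * x for w, x in zip(weights, inputs)) + bias
            out = sigmoid(z)
            # Backward pass for the loss 0.5 * (out - target)^2:
            # dLoss/dz = (out - target) * sigmoid'(z), with sigmoid'(z) = out * (1 - out).
            delta = (out - target) * out * (1.0 - out)
            # Move each parameter against its gradient.
            weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
            bias -= lr * delta
    return weights, bias


def predict(weights, bias, inputs):
    """Return the neuron's output for one input vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Toy data (assumed for illustration): the logical OR of two inputs.
OR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_DATA)
```

A full network repeats the same forward and backward steps for every layer, passing the error term backwards through the weights of the layer above.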

Managing Spark data handles in R

When working with big data in R (say, using Spark and sparklyr), we have found it very convenient to keep data handles in a neat list or data_frame.

Data Science for Business – Time Series Forecasting Part 1: EDA & Data Preparation

Data Science is a fairly broad term and encompasses a wide range of techniques, from data visualization to statistics and machine learning models. But the techniques are only tools in a (sometimes very messy) toolbox. And while it is important to know and understand these tools, here I want to go at it from a different angle: What is the task at hand that data science tools can help tackle, and what question do we want to have answered? A straightforward business problem is to estimate future sales and future income. Based on past experience, i.e. data from past sales, data science can help improve forecasts and generate models that describe the main factors of influence. This, in turn, can then be used to develop actions based on what we have learned, like where to increase advertising, how much of which products to keep in stock, etc.

5 ways to measure running time of R code

A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available. A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time). In the following, I briefly go through the syntax of each of the five options and present my conclusions at the end.

Free Data Science Resources for Beginners

In this guide, we’ll share 65 free data science resources that we’ve hand-picked and annotated for beginners. To become a data scientist, you have a formidable challenge ahead. You’ll need to master a variety of skills, ranging from machine learning to business analytics. However, the rewards are worth it. Organizations will prize alchemists who can turn raw data into smarter decisions, better products, happier customers, and ultimately more profit. Plus, you’ll get to solve interesting problems and master new, impactful technologies. If that sounds like a career you’d enjoy, then bookmark this page and read on, because we compiled this list just for you.

How Bayesian inference works

Bayesian inference is a way to get sharper predictions from your data. It’s particularly useful when you don’t have as much data as you would like and want to juice every last bit of predictive strength from it. Although it is sometimes described with reverence, Bayesian inference isn’t magic or mystical. And even though the math under the hood can get dense, the concepts behind it are completely accessible. In brief, Bayesian inference lets you draw stronger conclusions from your data by folding in what you already know about the answer. Bayesian inference is based on the ideas of Thomas Bayes, a nonconformist Presbyterian minister in London about 300 years ago. He wrote two books, one on theology, and one on probability. His work included his now famous Bayes Theorem in raw form, which has since been applied to the problem of inference, the technical term for educated guessing. The popularity of Bayes’ ideas was aided immeasurably by another minister, Richard Price. He saw their significance, refined them and published them. It would be more accurate and historically just to call Bayes’ Theorem the Bayes-Price Rule.
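To make "folding in what you already know" concrete, here is a minimal sketch of a Bayesian update, not taken from the article. The coin hypotheses, the prior of 0.9 versus 0.1, and the function name `posterior` are assumptions made up for the example. Each observation multiplies the current belief by the likelihood of that observation and then renormalizes, which is Bayes' Theorem applied to a discrete set of hypotheses.

```python
def posterior(prior, likelihood):
    """Apply Bayes' Theorem over a discrete set of hypotheses.

    prior:      hypothesis -> P(hypothesis)
    likelihood: hypothesis -> P(observation | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(observation)
    return {h: p / evidence for h, p in unnormalized.items()}


# Assumed prior knowledge: most coins are fair.
prior = {"fair": 0.9, "biased": 0.1}
# Assumed probability of heads under each hypothesis.
p_heads = {"fair": 0.5, "biased": 0.9}

belief = dict(prior)
for _ in range(5):  # observe five heads in a row
    belief = posterior(belief, p_heads)
```

After five heads, the belief in the biased coin has risen from 0.1 to roughly two thirds: the data shifted the conclusion, but the prior kept it from jumping all the way to certainty.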

Fast clustering algorithms for massive datasets

Here we discuss two potential algorithms that can perform clustering extremely fast on big data sets, as well as the graphical representation of such complex clustering structures. By extremely fast, we mean a computational complexity of order O(n), or even faster, such as O(n/log n). This is much faster than good Hierarchical Agglomerative Clustering algorithms, which are typically O(n^2 log n). By big data, we mean several million, possibly a billion, observations.

Principal Component Analysis using R

One of the most commonly faced problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional and sparse data. We often face a situation where we have a large set of features and few data points, or data with very long feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power of the model. This scenario is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space (also known as dimensionality reduction) often reduces the effects of the curse of dimensionality. In this blog, we will discuss principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in data of high dimension.

Website Crawler & Sentiment Analysis

When starting with Sentiment Analysis, the first question that comes to mind is where and how we can crawl oceans of data for our analysis. Normally, crawling websites or social media is a reasonable way to get access to public opinion data. In this post, I want to share how I crawled a website with a web crawler and then processed the data for Sentiment Analysis, in order to develop an application that ranks universities based on users’ opinions crawled from the social media website Twitter.

Machine Learning Workflows in Python from Scratch Part 1: Data Preparation

This post is the first in a series of tutorials for implementing machine learning workflows in Python from scratch, covering the coding of algorithms and related tools from the ground up. The end result will be a handcrafted ML toolkit. This post starts things off with data preparation.

Automate your Machine Learning in Python – TPOT and Genetic Algorithms

Automatic Machine Learning (AML) is a pipeline that enables you to automate the repetitive steps in your Machine Learning (ML) problems and so save time to focus on the parts where your expertise has higher value. What is great is that it is not only some vague idea: there are applied packages that build on standard Python ML packages such as scikit-learn. Anyone familiar with Machine Learning will, in this context, most probably recall the term grid search. And they will be entirely right to do so. AML is in fact an extension of grid search as applied in scikit-learn; however, instead of iterating over a predefined set of values and their combinations, it searches for optimal solutions across methods, features, transformations and parameter values. AML “grid search” therefore does not have to be an exhaustive search over the space of possible configurations. One great application of AML is a package called TPOT, which offers, for example, genetic algorithms to mix the individual parameters within a configuration and arrive at the optimal setting. In this post I will briefly present some basics of AML and then dive into applications using the TPOT package, including its genetic algorithm solution optimization.
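The contrast between exhaustive grid search and a genetic search can be sketched in a few lines of plain Python. This is a toy illustration only: it does not use TPOT or scikit-learn, and the search space, the `score` function (a stand-in for a cross-validated model score) and all parameter values are hypothetical.

```python
import itertools
import random

# Hypothetical configuration space, for illustration only.
SPACE = {
    "depth": [1, 2, 3, 4, 5, 6, 7, 8],
    "rate": [0.01, 0.03, 0.1, 0.3, 1.0],
    "scaler": ["none", "standard", "minmax"],
}


def score(config):
    """Stand-in for a cross-validated model score (higher is better)."""
    bonus = {"none": 0.0, "standard": 0.05, "minmax": 0.02}[config["scaler"]]
    return 1.0 - 0.02 * abs(config["depth"] - 5) - abs(config["rate"] - 0.1) + bonus


def grid_search(space):
    """Score every combination in the space and return the best one."""
    keys = list(space)
    candidates = (dict(zip(keys, values))
                  for values in itertools.product(*space.values()))
    return max(candidates, key=score)


def genetic_search(space, population_size=10, generations=15, seed=0):
    """Evolve a small population instead of enumerating the whole space."""
    rng = random.Random(seed)
    keys = list(space)
    population = [{k: rng.choice(space[k]) for k in keys}
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the better half as parents.
        population.sort(key=score, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each setting comes from one of the two parents.
            child = {k: rng.choice([a[k], b[k]]) for k in keys}
            # Mutation: occasionally resample one setting at random.
            if rng.random() < 0.3:
                mutated = rng.choice(keys)
                child[mutated] = rng.choice(space[mutated])
            children.append(child)
        population = parents + children
    return max(population, key=score)


best_grid = grid_search(SPACE)
best_genetic = genetic_search(SPACE)
```

Grid search evaluates all 120 combinations here, while the genetic search evaluates only a small population per generation and is not guaranteed to find the optimum. The trade-off favors the genetic approach as the space grows to include whole methods and transformations.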

Challenges in Machine Learning for Trust

With explosive growth in the number of transactions, fraud can no longer be detected manually, and Machine Learning-based methods are required. We examine the main challenges in using Machine Learning for Trust.


A platform that helps you build, manage and monitor deep learning models