The data universe is expanding rapidly – it’s time we started recognizing just how big this field is, and that working in one part of it doesn’t automatically require us to be experts in all of it. Instead of expecting data people to be able to do all of it, let’s start asking one another, ‘Which kind are you?’ Most importantly, it’s time we asked ourselves that same question.
Various texts on the Poisson process explain how the Poisson distribution is the limiting case of the Binomial distribution, i.e., as n → ∞, the Binomial distribution’s PMF morphs into the Poisson distribution’s PMF. At least that is how the math works. But the PMF is more than just math. It can be used to model real events happening in our lives. If I am a restaurant owner, I can use the Poisson PMF to determine how many booths to install and how many chefs and waiters to hire. I can use the Poisson PMF to do capacity planning for my business. So in the context of real phenomena, if the probability of k events occurring in a time interval t is what we want to know, why is the Poisson PMF, the way it is structured, able to answer this question so well? That’s the question we’ll answer in this article.
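The limiting relationship is easy to check numerically. The sketch below (the parameter values are my own illustration) holds the expected count n·p = λ fixed while n grows; the Binomial PMF converges to the Poisson PMF:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 4.0   # expected number of events, e.g. customer arrivals per hour
k = 2       # number of events whose probability we want

# Hold n*p = lam fixed and let n grow: the Binomial PMF approaches the Poisson PMF.
for n in (10, 100, 10_000):
    print(n, binomial_pmf(k, n, lam / n), poisson_pmf(k, lam))
```

With n = 10,000 the two probabilities already agree to several decimal places, which is the convergence the texts describe.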
As visual creatures, humans are sensitive to visual signal impairments such as blockiness, blurriness, noisiness, and transmission loss. Thus, I have focused my research on how image quality affects user behavior in web applications. Lately, several studies have tested the effect of low-quality images on websites. A Cornell University study shows that poor pictures negatively impact the user experience, website conversion ratio, how long people stay on the website, and trust/credibility. The authors use a deep neural network model trained on a publicly available dataset from LetGo.com, with the objective of measuring the effect of image quality on sales and perceived trustworthiness. They find that listings with high predicted image quality are 1.25x more likely to sell, but they could not measure the effect of image quality on trustworthiness. I conducted a similar study to measure the impact of image quality on listing clicks and leads. Using a private dataset, I found that listings with excellent image quality got 1.31x more clicks and 1.5x more leads.
Breakthroughs in artificial intelligence complicate the belief that our intellect – even our creativity – will remain unrivaled. The concept of human distinctiveness is central to our self-understanding; however, time and again, what was believed to be our distinguishing feature proved no more than a mirage. For millennia, humans have asserted themselves as superior to other living forms. From Mesopotamia to China, the mythologies of ancient civilizations traced humankind’s origin to divine creation. Religious doctrines have echoed and solidified this sentiment. In the Bible and the Qur’an, humans are portrayed as divinely ordained to have dominion over all creation. Even in Buddhism, where all things living are connected in an endless cycle of reincarnation, only a human can attain enlightenment.
We show which metric to use for visualizing and determining the optimal number of clusters, one that works much better than the usual practice, the elbow method. Clustering is an important part of the machine learning pipeline for business or scientific enterprises utilizing data science. As the name suggests, it helps to identify congregations of closely related (by some measure of distance) data points in a blob of data which would otherwise be difficult to make sense of. However, the process of clustering mostly falls under the realm of unsupervised machine learning, and unsupervised ML is a messy business. There are no known answers or labels to guide the optimization process or measure our success against. We are in uncharted territory.
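One such metric is the silhouette coefficient (my choice for illustration here, not necessarily the exact metric the article settles on): for each point, it compares the mean distance to its own cluster against the mean distance to the nearest other cluster. A minimal NumPy sketch on a toy 1-D dataset:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) averaged over points,
    where a = mean distance to the point's own cluster and
    b = lowest mean distance to any other cluster."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.abs(X[:, None] - X[None, :])          # pairwise distances (1-D data)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        b = min(D[i, labels == other].mean() for other in set(labels) if other != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated 1-D blobs.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
good = [0, 0, 0, 1, 1, 1]       # the natural k=2 partition
bad = [0, 0, 1, 1, 2, 2]        # an arbitrary k=3 partition
print(silhouette(X, good), silhouette(X, bad))
```

The natural partition scores close to 1 while the arbitrary one scores far lower, which is exactly the kind of unambiguous signal the elbow method fails to give.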
In my last article, I made the case for democratizing data science. The article resonated with many people, and I was asked this follow-up question: it is great to imagine a future where data science is accessible to everyone, but starting from today, what could I do in my company to initiate the change? This is a common question we receive from our customers. We are seeing more and more consumer insight managers, strategists, and HR specialists being asked by their leadership teams to incorporate data into their decision making. Over the years, we have helped many companies build their data-driven culture. While there are many different paths that lead to success, the failures all look more or less the same: a data strategy championed by an executive at the highest level is imposed on the middle management team. Middle management reluctantly initiates a few projects and implements some changes, but within a year to 18 months, the strategy is never mentioned again. A top-down strategy doesn’t work. To initiate the culture change, you have to start from the bottom. Here I summarize a few steps, learned from our customers, that could help you start building a data-driven culture from the position you are in today.
Many machine learning algorithms cannot handle categorical variables because they require all inputs to be numerical. For this reason, we need a way to transform the values of a categorical variable into numbers. The general idea behind the most popular methods for encoding categorical data is to map each category to a number.
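The two most common such mappings, label (ordinal) encoding and one-hot encoding, can be sketched in plain Python (libraries such as scikit-learn’s LabelEncoder/OneHotEncoder or pandas’ get_dummies do the same thing at scale):

```python
colors = ["red", "green", "blue", "green", "red"]

# Label (ordinal) encoding: map each category to a single integer.
categories = sorted(set(colors))                # ['blue', 'green', 'red']
to_int = {c: i for i, c in enumerate(categories)}
label_encoded = [to_int[c] for c in colors]     # [2, 1, 0, 1, 2]

# One-hot encoding: one binary column per category, avoiding the
# artificial ordering that label encoding introduces.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(label_encoded)
print(one_hot)
```

Label encoding is compact but implies an order ('blue' < 'green' < 'red'); one-hot encoding avoids that at the cost of one column per category.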
Deep learning is one of the most promising fields of artificial intelligence, with proven success in areas ranging from computer vision to natural language processing. In this article, you will learn how to build a neural network with no prior domain knowledge.
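To preview what “building a neural network” amounts to, here is a minimal from-scratch sketch in NumPy (the architecture, learning rate, and XOR task are arbitrary choices of mine, not necessarily the article’s example): a two-layer network trained by backpropagation.

```python
import numpy as np

# A minimal two-layer neural network trained with gradient descent on XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(2000):
    h = sigmoid(X @ W1 + b1)                    # forward pass
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass: chain rule through the squared-error loss.
    d_out = 2 * (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(losses[0], losses[-1])    # training loss falls as the network learns
```

Every deep-learning framework automates exactly these two steps, the forward pass and the gradient update.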
In the realm of unsupervised learning algorithms, Gaussian Mixture Models or GMMs are special citizens. GMMs are based on the assumption that all data points come from a finite mixture of Gaussian distributions with unknown parameters. They are parametric generative models that attempt to learn the true data distribution. Hence, once we learn the Gaussian parameters, we can generate data from the same distribution as the source.
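The learning step is usually done with expectation-maximization (EM). Below is a deliberately stripped-down 1-D sketch of the idea (a stand-in for what e.g. sklearn.mixture.GaussianMixture does; the data, and the simplification of fixing variances and weights so only the two means are learned, are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data from a two-component mixture with true means -4 and 3.
data = np.concatenate([rng.normal(-4, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])                      # initial guesses for the means
for _ in range(50):
    # E-step: responsibility of each component for each point
    # (unit variances and equal weights assumed, so constants cancel).
    dens = np.exp(-0.5 * (data[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update each mean as the responsibility-weighted average.
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)                                       # close to the true means -4 and 3

# Generative use: once the parameters are learned, sample new data.
comp = rng.integers(0, 2, size=10)
samples = rng.normal(mu[comp], 1.0)
```

The final sampling lines show the “generative” part of the claim above: with the learned parameters we can draw fresh points from the same distribution as the source.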
The material discussed here is also of interest to machine learning, AI, big data, and data science practitioners, as much of the work is based on heavy data processing, algorithms, efficient coding, testing, and experimentation. The article offers not just two new conjectures, but also paths and suggestions to solve these problems. The last section contains a few new, original exercises, some with solutions, that may be useful to students, researchers, and instructors offering math and statistics classes at the college level: they range from easy to very difficult. Some great probability theorems are also discussed, in layman’s terms: see section 1.2. The two deep conjectures highlighted in this article (conjectures B and C) are related to the digit distribution of well-known math constants such as Pi or log 2, with an emphasis on the binary digits of SQRT(2). This is an old problem, one of the most famous in mathematics, still unsolved today.
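The objects of study are easy to compute exactly. The sketch below (the choice of n and the uniformity check are my own illustration) uses exact integer arithmetic to extract the first binary digits of SQRT(2): the first n fractional bits are encoded in floor(2^n · sqrt(2)) = isqrt(2 · 4^n).

```python
from math import isqrt

# First n fractional binary digits of sqrt(2), computed exactly
# with integer arithmetic (no floating-point rounding involved).
n = 10_000
bits = bin(isqrt(2 * 4**n))[2:]     # '1' followed by the n fractional bits
ones = bits[1:].count("1") / n      # empirical frequency of 1s

print(bits[:12], ones)
```

The conjectured behavior is that these digits look like fair coin flips; empirically, the frequency of 1s hovers near 0.5, but whether that holds in the limit is precisely the unsolved question.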
Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning (ML) and Artificial Intelligence (AI). Though RL has existed for many decades, the sleeping giant has awoken only recently, after the explosion of neural-network-based deep learning. This blog is an attempt to explain the basic concepts of Reinforcement Learning using simple examples and explanations that anyone with elementary English knowledge can understand. Many ideas and concepts are borrowed from numerous books, videos, and blogs, and I am indebted to those scholars and writers. This is the first part in this series of blogs on RL.
In the first blog of the series, I covered the basic terminology needed to understand RL. In this blog I will cover the RL problem formulation using the Markov Decision Process (MDP), the Bellman equation, and solving an MDP using dynamic programming. First, I will introduce a few notations (don’t be intimidated by these; they are not as scary as they might look at first sight, and they are mainly needed for mathematical expression). Below are a few commonly used notations that we will refer to from time to time, and we will introduce a few more along the way of this RL journey.
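To preview where this is heading, here is a minimal value-iteration sketch on a hypothetical 3-state MDP (the states, dynamics, and rewards are made up by me for illustration). It solves the Bellman optimality equation V(s) = max_a Σ_s' P(s'|s,a)·(R + γ·V(s')) by repeated sweeps, which is exactly the dynamic-programming approach covered in this blog:

```python
states = [0, 1, 2]                 # state 2 is terminal
actions = ["left", "right"]
gamma = 0.9                        # discount factor

# P[s][a] -> list of (probability, next_state, reward); made-up dynamics
# where moving right from state 1 reaches the terminal state with reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
}

V = {s: 0.0 for s in states}
for _ in range(100):               # sweep until (effectively) converged
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in actions)
         for s in states}

# Greedy policy with respect to the converged values.
policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[s][a]))
          for s in states}
print(V, policy)
```

The values converge to V(1) = 1 and V(0) = 0.9 (the discounted reward one step further away), and the greedy policy correctly moves right toward the terminal state.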
I am currently working on a computer vision project, and I wanted to look into image pre-processing to help improve the machine learning models that I am planning to build. Image pre-processing involves applying image filters to an image. This article will compare a number of the most well-known image filters. Image filters can be used to reduce the amount of noise in an image and to enhance the edges in an image. There are two types of noise that can be present in an image: speckle noise and salt-and-pepper noise. Speckle noise is noise that occurs during image acquisition, while salt-and-pepper noise (which refers to sparsely occurring white and black pixels) is caused by sudden disturbances in an image signal. Enhancing the edges of an image can help a model detect its features.
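The classic remedy for salt-and-pepper noise is the median filter. Here is a minimal 3x3 version in plain NumPy (libraries such as OpenCV’s cv2.medianBlur or scipy.ndimage.median_filter do the same thing far faster; the toy image below is my own example):

```python
import numpy as np

def median_filter_3x3(img):
    """Replace each pixel with the median of its 3x3 neighborhood."""
    padded = np.pad(img, 1, mode="edge")
    windows = [padded[r:r + img.shape[0], c:c + img.shape[1]]
               for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0)

clean = np.full((8, 8), 100.0)     # a flat gray image
noisy = clean.copy()
noisy[3, 3] = 255.0                # a "salt" pixel
noisy[5, 5] = 0.0                  # a "pepper" pixel

restored = median_filter_3x3(noisy)
print(restored[3, 3], restored[5, 5])   # both back to 100.0
```

Because each output pixel is a neighborhood median rather than a mean, isolated white and black outliers are discarded entirely instead of being smeared into their surroundings.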
Explanation of the ‘AOD-Net: All-in-One Dehazing Network’ paper by Boyi Li et al. (ICCV 2017) and a tutorial to implement the same in TensorFlow. Haze degrades image quality and limits visibility, especially in outdoor settings. This consequently hurts performance on higher-level tasks such as object detection and recognition. The AOD network proposed by Boyi Li et al. is an end-to-end CNN that de-hazes an image: it takes a hazy image as input and generates a de-hazed image. In this post, I explain the main components of the AOD-Net paper along with a step-by-step tutorial to implement AOD-Net in TensorFlow.
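As I read the paper, the “all-in-one” idea is that instead of estimating the transmission map and atmospheric light separately, the network learns a single unified map K(x) and recovers the clean image as J(x) = K(x)·I(x) - K(x) + b, with b a constant bias. A plain-NumPy sketch of just that output module (the image, the constant K-map standing in for the CNN’s prediction, and the shapes are my own illustration):

```python
import numpy as np

def aod_output_module(I, K, b=1.0):
    """Clean-image estimate J(x) = K(x) * I(x) - K(x) + b.
    In the real network K is produced by a small CNN; here it is a stand-in."""
    return K * I - K + b

rng = np.random.default_rng(0)
I = rng.uniform(0.4, 1.0, size=(4, 4, 3))   # hypothetical hazy input in [0, 1]
K = np.full_like(I, 1.5)                    # constant stand-in for the CNN output

J = np.clip(aod_output_module(I, K), 0.0, 1.0)
print(J.shape)
```

Training the full model then reduces to learning the CNN that predicts K(x), which is what the TensorFlow tutorial in the post builds step by step.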
With Julia and JuliaBox you can make impressive data visualizations with almost no programming knowledge and no need to install anything. Julia is a relatively new language for data analysis. It has a high-level syntax and is designed to be easy to use and understand. Some have called it the new Python. Unlike Python, though, it is a compiled language, which means that while it is as easy to write as Python, it runs much faster because it is converted to low-level code that is more easily understood by a computer. This is great if you have to deal with large data sets that require a lot of processing. Julia is also much less fussy than Python about how a program is laid out. (Python is one of the few languages that forces the programmer to lay out code in a particular way, with certain parts indented by a fixed number of spaces or tabs. This makes for easy-to-read code but can be a bit fiddly unless you have a good editor.) Julia has all the features that you would expect of a modern programming language, but here we are going to take a look at Julia’s data visualization capabilities. These are both impressive and easy to use.
Stochastic gradient descent is a very popular and common algorithm used across machine learning and, most importantly, forms the basis of training neural networks. In this article, I have tried my best to explain it in detail, yet in simple terms. I highly recommend going through linear regression before proceeding with this article.
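Since linear regression is the recommended starting point, here is a minimal sketch of SGD fitting y = w·x + b (the synthetic data, learning rate, and step count are my own choices): at each step the gradient comes from a single randomly chosen sample rather than the full dataset, which is what makes it “stochastic”.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 200)   # true w = 3, b = 2, plus noise

w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    i = rng.integers(len(x))                  # pick one random sample
    err = (w * x[i] + b) - y[i]               # prediction error on that sample
    w -= lr * err * x[i]                      # gradient of 0.5*err^2 w.r.t. w
    b -= lr * err                             # gradient of 0.5*err^2 w.r.t. b

print(w, b)                                    # close to the true 3 and 2
```

Swapping the single-sample gradient for the average over all 200 points would turn this into plain (batch) gradient descent; SGD trades a noisier update for far cheaper steps, which is why it scales to neural networks.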