Dimensionality Reduction – PCA, ICA and Manifold learning

In this blog, I primarily discuss:
• Algorithms for dimensionality reduction, including PCA (Principal Component Analysis), ICA (Independent Component Analysis), and projection & manifold learning.
• Applications in various fields, with special emphasis on manifold learning in research.
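As a minimal illustration of the first topic, plain PCA can be sketched in a few lines of numpy (an illustrative sketch on synthetic data, not code from the post):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)            # PCA requires centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]             # directions of maximal variance
    explained_var = (S ** 2) / (len(X) - 1)    # variance along each component
    return X_centered @ components.T, explained_var[:n_components]

# 100 samples in 5 dimensions, with most variance in the first two axes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 0.5, 0.2, 0.1])
Z, var = pca(X, n_components=2)
```

Because the singular values come back sorted, the first returned component always explains at least as much variance as the second.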

Semantic search

A brief post on semantics, search, and semantic search: definitions, examples, implementation, and reference papers in under five minutes.
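The core mechanic of semantic search – ranking documents by embedding similarity rather than keyword overlap – can be sketched as follows (the vectors below are made-up toy embeddings; a real system would produce them with a trained model):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: angle-based closeness of two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical pre-computed document embeddings
docs = {
    "how to reset a password": np.array([0.9, 0.1, 0.0]),
    "pasta recipes for beginners": np.array([0.0, 0.2, 0.9]),
    "recover a locked account": np.array([0.8, 0.3, 0.1]),
}
query = np.array([0.85, 0.2, 0.05])  # toy embedding for "forgot my login"

# Rank documents by semantic closeness to the query
ranked = sorted(docs, key=lambda d: cosine_sim(query, docs[d]), reverse=True)
```

Note that the query shares no keywords with the top results; the ranking is driven entirely by vector proximity.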

Airflow to orchestrate your machine learning algorithms

As a data engineer, a big challenge is managing, scheduling, and running workflows to prepare data, generate reports, and run algorithms. The scope of this post is to suggest a possible, quick-to-implement solution for these activities, with a simple example.
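Airflow models a workflow as a DAG of tasks, each running only after its upstream dependencies finish. As a toy illustration of that idea (plain Python with the standard library, not Airflow's API), a hypothetical extract → clean → report/train pipeline could be ordered like this:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on
dag = {
    "clean": {"extract"},
    "report": {"clean"},
    "train_model": {"clean"},
}

# A valid execution order: every task appears after its dependencies
order = list(TopologicalSorter(dag).static_order())
```

An orchestrator like Airflow adds scheduling, retries, and monitoring on top of exactly this dependency-ordering idea.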

Multivariate Time Series Forecasting Using Random Forest

In my earlier post (Understanding Entity Embeddings and It's Application) [1], I talked about solving a forecasting problem using entity embeddings – basically representing tabular data as vectors and using them as input to a neural-network-based model. This time around, though, I'll be doing the same via a different technique called Random Forest.
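The key reframing this approach relies on – turning a time series into a tabular supervised-learning problem via lagged features – might look like this (a hypothetical helper, not the author's code; a real pipeline would then fit, e.g., a random forest regressor on X and y):

```python
import numpy as np

def make_lag_features(series, n_lags=3):
    """Turn a 1-D series into a supervised (X, y) pair using lagged values.
    Row i holds the n_lags values preceding y[i + n_lags], which becomes
    the target - the tabular framing a random forest needs."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

series = np.arange(10, dtype=float)   # toy series 0..9
X, y = make_lag_features(series, n_lags=3)
# X[0] == [0, 1, 2], y[0] == 3
```

For the multivariate case, lagged columns from each series are simply concatenated into one wider feature matrix.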

Applied AI: Going From Concept to ML Components

Opening your mind to different ways of applying machine learning to the real world. By Abraham Kang with special thanks to Kunal Patel and Jae Duk Seo for being a sounding board and providing input for this article.

Data Science as Software: from Notebooks to Tools [Part 1]

Nowadays there is a lot of hype concerning Data Science, and many people only know certain aspects of this interesting field of work. A big challenge occurs when Data Scientists produce great MVPs (Minimum Viable Products) but fail to go on from there. In this article I want to show what a possible journey in Data Science might look like, starting from the very beginning of a project to getting it ready for handover. The goal of this article series is to show the different phases of a Data Science project from the perspective of a developer and provide best practices for those working in this field.

Apache Druid (part 1): A Scalable Timeseries OLAP Database System

Apache Druid was created by advertising analytics company Metamarkets and has since been used by many companies, including Airbnb, Netflix, Nielsen, eBay, Paypal and Yahoo. It combines ideas from OLAP databases, time-series databases, and search systems to create a unified system for a broad range of use cases. Druid was open-sourced in 2012 under the GPL license, moved to the Apache 2 license in 2015, and joined the Apache Software Foundation as an incubating project in 2018.

Build vs. Buy – A Scalable Machine Learning Infrastructure

In this blog post we'll look at what parts a machine learning platform consists of and compare building your own infrastructure from scratch with buying a ready-made service that does everything for you.

Imbalanced Class Sizes and Classification Models: A Cautionary Tale

For a recent data science project, I developed a supervised learning model to classify the booking location of a first-time user of the vacation home site Airbnb. This dataset is available on Kaggle as part of a 2015 Kaggle competition. For my project, I decided to group users into two groups: those who booked their first trip within the U.S.A. and Canada, and those who booked their first trip elsewhere internationally, essentially turning this into a binary classification problem. Sounds simple, right?
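It isn't simple, because with imbalanced classes raw accuracy is misleading. A toy illustration of the trap (the 90/10 split below is hypothetical, not the actual Airbnb class ratio):

```python
# With a 90/10 class split, the trivial "always predict majority" model
# scores 90% accuracy while catching zero minority cases.
y_true = [0] * 90 + [1] * 10   # 0 = domestic booking, 1 = international
y_pred = [0] * 100             # majority-class baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_minority = sum(t == 1 and p == 1
                      for t, p in zip(y_true, y_pred)) / 10
# accuracy == 0.9, recall_minority == 0.0
```

This is why metrics such as recall, precision, or F1 on the minority class matter far more than accuracy for imbalanced problems.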

Friend Recommendation Using Heterogeneous Network Embeddings

Imagine Snoopy without Woodstock, Calvin without Hobbes, Friends without Rachel, Batman without Robin, or Mowgli without Baloo. Social platforms thrive on the ability of their members to find relevant friends to interact with. The network effect is what drives growth, time spent, and daily active users on the application. This is even more important for Hike because Hike is a network for close friends. So we need to make sure that finding friends, inviting them and adding them to the network is easy.

32 Statistical Concepts Explained in Simple English – Part 11

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more.

Easyalluvial 0.2.0 released

easyalluvial allows you to build exploratory alluvial plots (Sankey diagrams) with a single line of code while automatically binning numerical variables. In version 0.2.0, marginal histograms improve the visibility of those numerical variables. Furthermore, a method has been added that creates model-agnostic four-dimensional partial dependence alluvial plots to visualise the response of statistical models.

Finding a Difference that Matters

A previous article that I wrote tried to predict housing prices in Boston, the city where I live. A critical factor in improving performance was adding a variable for the neighborhood where the property was located. If neighborhood matters, we should be able to first show statistically that mean values differ across neighborhoods, then go a level deeper and understand how neighborhoods compare to each other.
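Testing whether mean values differ across several groups is the job of a one-way ANOVA; its F statistic can be sketched directly in numpy (synthetic prices for three hypothetical neighborhoods, not the Boston data):

```python
import numpy as np

def one_way_f(groups):
    """F statistic for a one-way ANOVA: ratio of between-group to
    within-group variance. Large values suggest the means differ."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical prices (in $100k) for three neighborhoods
rng = np.random.default_rng(1)
a = rng.normal(4.0, 0.5, 50)   # cheaper neighborhood
b = rng.normal(5.5, 0.5, 50)
c = rng.normal(7.0, 0.5, 50)   # pricier neighborhood
f_stat = one_way_f([a, b, c])  # large F => group means likely differ
```

In practice one would compare F against the F distribution (or use a library routine) to get a p-value, then follow up with pairwise comparisons to see how the neighborhoods differ.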

Machine learning for data cleaning and unification

In conclusion, data cleaning and unification at the source are essential to create trustworthy analytics for organizations downstream. It is important to recognize that data quality problems cannot be solved properly in isolation, and machine learning solutions that offer holistic approaches to cleaning and unifying data may be the best solution. At the same time, we must understand that in order to develop scalable ML pipelines that work at the organizational level we must ensure that these solutions build upon legacy operations and bring humans into the loop.

Data Science Software Used in Journals: Stat Packages Declining (including R), AI/ML Software Growing

In my neverending quest to track The Popularity of Data Science Software, it’s time to update the section on Scholarly Articles. The rapid growth of R could not go on forever and, as you’ll see below, its use actually declined over the last year.

Statistically Controlling for Confounding Constructs Is Harder than You Think

Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest – in some cases approaching 100% – when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application (http://…/ivy) that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity.
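A simulation in the spirit of the paper's argument (an illustrative sketch, not the authors' code): the outcome depends only on a latent construct T, yet because the controlled-for measure X is unreliable, a second measure Z of the same construct appears "incrementally valid" far more often than the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sims, rejections = 500, 200, 0

for _ in range(n_sims):
    T = rng.normal(size=n)                 # latent true construct
    X = T + rng.normal(scale=1.0, size=n)  # unreliable measure of T
    Z = T + rng.normal(scale=0.5, size=n)  # second measure of the SAME construct
    Y = T + rng.normal(size=n)             # outcome driven only by T

    # OLS of Y on [1, X, Z]; test whether Z's coefficient is "significant"
    D = np.column_stack([np.ones(n), X, Z])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    resid = Y - D @ beta
    sigma2 = resid @ resid / (n - 3)
    se = np.sqrt(sigma2 * np.linalg.inv(D.T @ D).diagonal())
    if abs(beta[2] / se[2]) > 1.96:        # "Z has incremental validity"
        rejections += 1

false_positive_rate = rejections / n_sims  # far above the nominal 5%
```

Because X measures T with error, controlling for X leaves residual construct variance that Z soaks up, producing exactly the construct-level false positives the paper describes.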