One thing prevalent in most data science departments is messy notebooks and messy code. There are examples of beautiful notebooks out there, but for the most part notebook code is rough … really rough. Not to mention all the files, functions, visualizations, reporting metrics, etc. scattered through files and folders with no real structure to them.
In this blog, we will discuss the workflow of a machine learning project, covering all the steps required to build a proper machine learning project from scratch. We will go over data pre-processing, data cleaning, feature exploration, and feature engineering, and show the impact they have on machine learning model performance. We will also cover a couple of pre-modelling steps that can help improve model performance.
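To give a small taste of the pre-processing steps before we dive in, here is a minimal, dependency-free sketch of two common operations: mean imputation of missing values and z-score standardization. The column name and values are made up for illustration.

```python
import math

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Z-score scaling: shift to zero mean, rescale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]

ages = [22, None, 35, 41, None, 58]  # hypothetical raw column with gaps
clean = impute_mean(ages)            # -> no missing entries
scaled = standardize(clean)          # -> zero mean, unit variance
```

In practice you would use a library such as pandas or scikit-learn for this, but the math underneath is exactly what the sketch shows.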
In this article, I will introduce you to the algorithm at the heart of AlphaGo – Monte Carlo Tree Search (MCTS). This algorithm has one main purpose – given the state of a game, choose the most promising move. To give you some context behind AlphaGo, we’ll first briefly look at the history of game-playing AI programs. Then, we’ll see the components of AlphaGo, the game tree concept, a few tree search algorithms, and finally dive into how the MCTS algorithm works.
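At the core of MCTS is the selection step, which balances exploitation (moves that have paid off) against exploration (moves tried rarely), typically via the UCB1 formula. Here is a minimal sketch; the child statistics are hypothetical numbers, not from any real game.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if visits == 0:
        # Unvisited children score infinity, so each is tried at least once.
        return float("inf")
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

# Hypothetical child statistics at one node: move -> (total_reward, visits)
children = {"move_a": (7, 10), "move_b": (4, 5), "move_c": (0, 0)}
parent_visits = 15
best = max(children, key=lambda m: ucb1(*children[m], parent_visits))
# best is "move_c": it has never been visited, so it must be explored first.
```

Note how `move_b` outscores `move_a` despite a lower win total: its smaller visit count earns a bigger exploration bonus.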
In these days of data accumulation, there is a global craving for innovative business uses of AI at all levels. Maybe it’s time to stop and reflect on that burning desire to use AI everywhere and consider ‘the Law of the Instrument’ for a moment: ‘if all you have is a hammer, everything looks like a nail’.
In my previous post I covered the theory behind Variational Autoencoders. It’s time now to get our hands dirty and develop some code that can lead us to a better comprehension of this technique. I decided to use TensorFlow since I want to improve my skills with it and adapt to the latest changes being pushed toward the 2.0 release. Let’s code!
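Before jumping into the TensorFlow code, it helps to see the two VAE-specific pieces in isolation: the reparameterization trick and the Gaussian KL term. This is a dependency-free sketch for a single latent dimension; in the actual model these become tensor operations, but the math is identical.

```python
import math, random

def reparameterize(mu, logvar, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1), so the sample stays
    a deterministic function of mu and logvar (gradients can flow)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension."""
    return -0.5 * (1.0 + logvar - mu ** 2 - math.exp(logvar))

# A standard-normal posterior (mu=0, logvar=0) carries zero KL cost;
# pushing mu away from 0 makes the regularizer positive.
random.seed(0)
z = reparameterize(0.0, 0.0)
```

The VAE loss is then reconstruction error plus this KL term summed over latent dimensions.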
In a previous post, I introduced the theory of the support vector machine (SVM). Now, I will further explain how SVMs work with four different exercises! The first part will show how to perform classification with a linear kernel and how the regularization parameter C impacts the resulting hyperplane. Then, the second part will show how to work with a Gaussian kernel to generate a non-linear hyperplane. The third part simulates overlapping classes, and we will use cross-validation to find the best parameters for the SVM. Finally, we will build a very simple spam classifier using SVM.
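The role of C can be read directly off the soft-margin objective the SVM minimizes: a margin term plus C times the hinge losses of points that violate the margin. A stdlib-only sketch, evaluating that objective at one fixed (made-up) hyperplane on toy points:

```python
def soft_margin_objective(w, b, points, C):
    """0.5*||w||^2 + C * sum of hinge losses.
    Larger C punishes margin violations harder, so the optimizer
    bends the hyperplane toward fitting every training point."""
    margin_term = 0.5 * sum(wi ** 2 for wi in w)
    hinge = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in points)
    return margin_term + C * hinge

# Toy 2-D data: the third point sits inside the margin of w.x + b = 0
points = [((2.0, 1.0), +1), ((-2.0, -1.0), -1), ((0.2, 0.0), +1)]
w, b = (1.0, 0.0), 0.0
loose = soft_margin_objective(w, b, points, C=0.1)   # violation barely matters
strict = soft_margin_objective(w, b, points, C=10.0)  # violation dominates
```

With a small C, the same violation costs almost nothing; with a large C, it dominates the objective, which is exactly the trade-off the first exercise explores.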
In a previous blog post, I explained how we can leverage the k-means clustering algorithm to count the number of red baubles on a Christmas tree. This method fails, however, if we put Christmas tinsel on the tree. Let’s find a solution for this more difficult case.
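As a refresher on the k-means idea from the earlier post, here is a minimal 1-D Lloyd's algorithm separating bright bauble pixels from dark background pixels by their red-channel value. The pixel values are hypothetical, chosen only to illustrate the clustering.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars: assign each value to the nearest
    centroid, then move each centroid to the mean of its cluster."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # evenly spaced init
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical red-channel values: dark background vs bright red baubles
reds = [12, 18, 25, 20, 230, 241, 235, 228, 15, 244]
centroids, clusters = kmeans_1d(reds, k=2)
# The size of the bright cluster is the bauble pixel count we'd report.
```

Tinsel breaks this approach because its pixels land between the two clusters, which is why the harder case needs a different technique.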
I was starting a project where I had to quickly check if a package, Flask, worked with the Python installed on my machine. As I ran the command to install Flask, it alerted me that the package was already installed, as I had Anaconda on my machine. But when I tried to run the Hello World Flask app from Sublime Text 3, the console gave an error that it could not find the Flask module. I was confused and started reading online about the problem. I discovered that Anaconda had the Flask module, but the Python I was using inside Sublime Text did not. I set out to find a solution: to understand how to set up Python properly, install the correct packages in the right place, and configure the Sublime Text build system. My online research led me to virtual environments, something I had not read about before.
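The core of the fix is giving each project its own interpreter. Python's standard library can create one directly via the `venv` module; a minimal sketch (the environment name and temp location are arbitrary):

```python
import os, tempfile, venv

# Create an isolated environment so packages installed for one Python
# (e.g. Anaconda's) don't leak into, or hide from, the interpreter
# your editor's build system actually runs.
env_dir = os.path.join(tempfile.mkdtemp(), "flask-env")
venv.EnvBuilder(with_pip=False).create(env_dir)  # with_pip=True also bootstraps pip

# Every environment gets its own interpreter plus a pyvenv.cfg marker file.
cfg = os.path.join(env_dir, "pyvenv.cfg")
```

From the shell, the equivalent is `python -m venv flask-env`; after activating it, `pip install Flask` puts the package where that environment's interpreter can find it.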
I was recently asked by a startup I’m consulting for (BigPanda) to give my opinion about the structure and flow of data science projects, which made me think about what makes them unique. Both managers and the different teams in a startup might find the differences between a data science project and a software development one unintuitive and confusing. If not stated and accounted for explicitly, these fundamental differences can cause misunderstanding and clashes between data scientists and their peers. Similarly, researchers coming from academia (or highly research-oriented industry research groups) might face their own challenges when arriving at a startup or a smaller company. They might find it difficult to incorporate new types of inputs, such as product and business needs, tighter infrastructure and compute constraints, and customer feedback, into their research and development process. The aim of this post, then, is to present the characteristic project flow that I have identified in the working process of both my colleagues and myself in recent years. Hopefully, this can help both data scientists and the people working with them to structure data science projects in a way that reflects their uniqueness.
I made a C++ implementation of Mask R-CNN with the PyTorch C++ frontend. The code is based on the PyTorch implementation from multimodallearning and the Keras implementation from Matterport. The project was made for educational purposes and can be used as a comprehensive example of the PyTorch C++ frontend API. Besides the regular API, you will find how to: load data from the MS COCO dataset, create custom layers, and manage weights across language boundaries (Python to C++).
Hyperparameter Optimisation for Text Classification with Flair
HBase is an open-source, distributed, versioned, non-relational, column-oriented database built on top of HDFS. All data is stored in the form of key-value pairs.
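Conceptually, an HBase table is a set of nested key-value maps: row key, then column family, then column qualifier, then timestamp, down to the cell value. A sketch of that logical model as plain Python dicts (the table contents and row-key scheme are made up for illustration):

```python
# Logical view of one HBase table:
# row key -> column family -> qualifier -> timestamp -> cell value
table = {
    "user#1001": {
        "info": {
            "name": {1700000000: "alice"},
            # HBase keeps multiple timestamped versions of a cell:
            "city": {1700000000: "Oslo", 1690000000: "Bergen"},
        },
    },
}

def get_latest(table, row, family, qualifier):
    """Mimics a default HBase read: return the newest version of a cell."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]
```

This is only the logical model; the real store sorts rows lexicographically by key and persists column families in separate files on HDFS.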
Having effective data structures enables and empowers different types of analyses. Looking at some of the traditional use cases for e-commerce, we can extract five distinct areas of focus that we might want to address. We could also add data structures related to clickstreams and conversion optimization, marketing spend, and product recommendations … but these would need to be the subject of a separate post.
Feature extraction and storage is one of the most important and often overlooked aspects of machine learning solutions. Features play a key role in helping machine learning models process and understand datasets for training and production. If you are building a single machine learning model, feature extraction seems like a very basic thing to do, but that picture gets really complicated as your team scales. Picture a large organization with dozens of data science teams cranking out machine learning models. Each team needs to process different datasets and extract the corresponding features, which becomes computationally expensive and nearly impossible to scale. Building mechanisms for reusing features across different models is one of the key challenges faced by high-performing machine learning teams.
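The reuse idea can be sketched as a tiny in-memory feature registry: feature definitions are registered once, computed once per dataset, and served from a cache to every team that asks. All names here are hypothetical, and a real feature store would persist results to shared storage rather than a process-local dict.

```python
class FeatureRegistry:
    """Toy feature store: compute each feature once per dataset, reuse everywhere."""
    def __init__(self):
        self._definitions = {}  # feature name -> function(dataset) -> values
        self._cache = {}        # (feature name, dataset id) -> computed values
        self.computations = 0   # counts actual (expensive) feature computations

    def register(self, name, fn):
        self._definitions[name] = fn

    def get(self, name, dataset_id, dataset):
        key = (name, dataset_id)
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = self._definitions[name](dataset)
        return self._cache[key]

registry = FeatureRegistry()
registry.register("total_spend", lambda rows: [sum(r) for r in rows])

orders = [[10, 5], [3, 7]]  # hypothetical per-user order amounts
team_a = registry.get("total_spend", "orders_v1", orders)  # computed
team_b = registry.get("total_spend", "orders_v1", orders)  # cache hit
```

Two teams asking for the same feature on the same dataset trigger only one computation, which is precisely the cost the paragraph above says explodes without such a mechanism.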