Entropy: How Decision Trees Make Decisions

You’re a Data Scientist in training. You’ve come a long way from writing your first line of Python or R code. You know your way around Scikit-Learn like the back of your hand. You spend more time on Kaggle than Facebook now. You’re no stranger to building awesome random forests and other tree-based ensemble models that get the job done. However, you’re nothing if not thorough. You want to dig deeper and understand some of the intricacies and concepts behind popular machine learning models. Well, so do I. In this blog post, I will introduce Entropy as a general statistical concept, use it to introduce Information Gain, and then explain why these two fundamental concepts form the basis of how decision trees build themselves on the data we supply them.
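As a quick preview of the two quantities the post builds on, here is a minimal sketch in Python with NumPy (the function names are my own, not from any library): entropy measures the impurity of a set of labels, and information gain measures how much a split reduces it.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                                         # 1.0 bit: maximally impure
print(information_gain(labels, ["yes", "yes"], ["no", "no"]))  # 1.0: a perfect split
print(information_gain(labels, ["yes", "no"], ["yes", "no"]))  # 0.0: a useless split
```

A decision tree greedily picks, at each node, the split with the highest information gain.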


End To End Guide For Machine Learning Project

This article aims to provide an end-to-end guide to implementing a successful machine learning project.


Dow Jones Stock Market Index (4/4): Trade Volume GARCH Model

This is the final post in a four-part series. In this fourth post, I am going to build an ARMA-GARCH model for the Dow Jones Industrial Average (DJIA) daily trade volume log ratio. You can read the other three parts at the following links: part 1, part 2, and part 3.
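The post itself works in R; purely as an illustrative sketch, here is roughly the analogous specification in Python using the arch package (note that arch_model supports an AR rather than a full ARMA mean, and the input file name below is hypothetical):

```python
import numpy as np
import pandas as pd
from arch import arch_model

# Hypothetical input: a series of DJIA daily trade volumes.
volume = pd.read_csv("djia_volume.csv", index_col=0, parse_dates=True)["volume"]

# The quantity modeled in the post: the log ratio of consecutive volumes.
log_ratio = np.log(volume / volume.shift(1)).dropna()

# AR(1) mean with a GARCH(1,1) conditional variance.
model = arch_model(log_ratio, mean="AR", lags=1, vol="GARCH", p=1, q=1)
result = model.fit(disp="off")
print(result.summary())
```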


Docker image for NLP

NLP is one of the main directions of our work at Poteha Labs. We do text analysis, chatbot development and information retrieval. Therefore, we regularly use Flair, Natasha, TensorFlow, PyTorch and NLTK, sometimes encountering languages other than English. Existing solutions are not always suitable for every problem we face: some are difficult to launch, others are too complicated and heavy. Consequently, we’ve compiled our own Docker image with all the convenient frameworks for NLP, including deep learning. It suits almost 80% of our tasks and saves the time spent installing the frameworks.


Data and Container Placement in Scalable Data Analytics Platforms

Distributed dataflow systems process large volumes of data in parallel on multiple machines. In production, multiple dataflow applications are scheduled for execution in virtual containers on a per-job basis. Furthermore, they access datasets partitioned into datablocks across the cluster machines’ disks. Runtime performance is important for many of these jobs, as their users expect fast results. However, optimizing performance is difficult, because dataflow jobs are very diverse and used in a wide variety of domains such as relational processing, machine learning, and graph processing. Container and datablock placement decisions impact a job’s runtime performance significantly. Furthermore, changing placements affects runtime performance without modifying the application’s code, and thus can be applied to many jobs without much configuration effort on the user’s side. However, jobs benefit differently from placement decisions, because resource demands differ from job to job. Hence, there is no single placement strategy that is optimal for all possible jobs. Besides that, users require secure long-term data retention for their documents and datasets.

This thesis presents container and datablock placement strategies to optimize the runtime performance of distributed dataflow applications running on shared data analytics platforms. It contributes two placement methods. The first improves the efficiency of a job’s dataflow operations and the degree of data locality by colocating its input datablocks and containers on a selected set of nodes. The second places a job’s containers based on the network distances between containers and the job’s input datablocks, as well as on container interference. In addition, this thesis explores the problem of data retention in shared data analytics platforms, contributing a method of storing and accessing lineage metadata through smart contracts executed on a decentralized blockchain network.

The methods presented in this thesis have been implemented in a research prototype that has been integrated with Hadoop and Ethereum. For evaluation, we used a 64-node commodity cluster and workloads consisting of Flink applications from the domains of relational processing, machine learning, and graph processing. We compared the runtime performance of workloads scheduled with our methods against Hadoop’s default placement method. For our blockchain-based data retention method, we measured the overhead in terms of additional response time and reported the costs of using it on Ethereum’s blockchain network.


Capsule Networks: The New Deep Learning Network

A guide and introduction to understanding and using Capsule Networks. Convolutional networks have been hugely successful in the field of deep learning, and they are the primary reason why deep learning is so popular right now! They seem to work well enough, but they have drawbacks in their basic architecture that cause them to not work very well for some tasks. Convolutional neural nets detect features in images and learn how to recognize objects with these features. Layers near the start detect really simple features like edges, while deeper layers detect more complex features like eyes, noses, or an entire face. The network then uses all of these learned features to make a final prediction. Herein lie the flaws of this system: no spatial information is used anywhere in ConvNets, and the pooling function used to connect layers is really inefficient.
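To make the pooling criticism concrete, here is a small illustration of my own in Python with NumPy (not taken from the guide itself): two feature maps with the strong activation in different positions produce the identical max-pooled output, so the exact spatial location is thrown away.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a square 2D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps whose strong activation sits at different
# positions inside the same 2x2 pooling window.
a = np.zeros((4, 4)); a[0, 0] = 9
b = np.zeros((4, 4)); b[1, 1] = 9

# Both pool to [[9 0], [0 0]]: the 9's exact location is lost.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```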


Building a Conversational Chatbot for Slack using Rasa and Python - Part 2

This is the concluding part of the article Building a Conversational Chatbot for Slack using Rasa and Python - Part 1. In the first part, we discussed Rasa Stack in detail: an open-source machine learning toolkit that lets developers expand bots beyond answering simple questions. We then used two Rasa modules, Rasa NLU and Rasa Core, to build a fully functional chatbot capable of checking in on people’s mood and taking the necessary actions to cheer them up. These actions included showing the user an image of a dog, cat or bird, depending on the user’s choice.


Bayesian A/B Testing with Python: the easy guide

If you landed on this page, you already know what an A/B test is, but maybe you have some doubts about how to analyse the results within a Bayesian framework. To start the post, then, let me point out the practical difference between the Bayesian approach and the classical, ‘frequentist’ one: the difference lies in the form the metric you want to study takes. For frequentists, it’s a simple number, while for Bayesians it’s a distribution. While having a distribution instead of a number may appear to be just a complication, it turns out to be extremely useful in many cases, A/B tests included. And with this post, you will see that in practice it is not so difficult to deal with. In fact, it takes just a few lines of code.
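As a minimal sketch of what those few lines can look like (the counts below are made up for illustration): model each variant’s conversion rate as a Beta posterior and estimate the probability that B beats A by sampling.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: (conversions, visitors) for each variant.
conv_a, n_a = 120, 1000
conv_b, n_b = 145, 1000

# With a uniform Beta(1, 1) prior, each conversion rate has a
# Beta(successes + 1, failures + 1) posterior: a distribution, not a number.
samples_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=100_000)
samples_b = rng.beta(conv_b + 1, n_b - conv_b + 1, size=100_000)

# Probability that B's true conversion rate exceeds A's.
print((samples_b > samples_a).mean())
```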


AzureR packages now on CRAN

The suite of AzureR packages for interfacing with Azure services from R is now available on CRAN. If you missed the earlier announcements, this means you can now use the install.packages function in R to install these packages, rather than having to install from the GitHub repositories. Updated versions of these packages will also be posted to CRAN, so you can get the latest versions simply by running update.packages.
The packages are:
• AzureVM (CRAN): a package for working with virtual machines and virtual machine clusters. It allows you to easily deploy, manage and delete a VM from R. A number of templates are provided based on the Data Science Virtual Machine, or you can supply your own template. Introductory vignette.
• AzureStor: a package for working with storage accounts. In addition to a Resource Manager interface for creating and deleting storage accounts, it also provides a client interface for working with the storage itself. You can upload and download files and blobs, list file shares and containers, list files within a share, and so on. It supports authenticated access to storage via either a key or a shared access signature (SAS). Introductory vignette.
• AzureContainers: a package for working with containers in Azure: specifically, it provides an interface to Azure Container Instances, Azure Container Registry and Azure Kubernetes Service. You can easily create an image, push it to an ACR repo, and then deploy it as a service to AKS. It also provides lightweight shells to docker, kubectl and helm (assuming these are installed). Vignettes on deploying with the plumber package and with Microsoft ML Server.
• AzureRMR: the base package of the AzureR family; it interfaces with Azure Resource Manager and can be extended to interface with other services. There’s an introductory vignette, and another on extending AzureRMR.


Autonomous Testing Is Like Autonomous Driving: The AI Needs Human Assistance

The US Society of Automotive Engineers (SAE) has defined a scale to describe the autonomous capabilities that self-driving cars have – i.e., their levels of automation.


Animating Data Transformations

A common statistical procedure is to convert a tall dataset into a wide one, or vice versa. A tall, or ‘tidy’, format meets the following criteria (see https://…/tidy-data.html for more information):
• Each variable forms a column.
• Each observation forms a row.
• Each type of observational unit forms a table.
In this blog post, we will outline the steps needed to create a wide-to-tall animated GIF. To do so, we’ve followed and adapted a tutorial produced by Luis Verde Arregoitia here:
https://…/animate-untangle
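The tutorial (and our adaptation) is in R, but purely to illustrate the reshape being animated, here is the same wide-to-tall transformation in Python with pandas, on made-up data:

```python
import pandas as pd

# A wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["Norway", "Chile"],
    "1999": [100, 200],
    "2000": [110, 190],
})

# Melt to the tall/tidy form: each variable is a column,
# each observation is a row.
tall = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tall)
```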


Andrew Ng’s Machine Learning Course in Python (Anomaly Detection)

This is the last part of my Python implementation of Andrew Ng’s Machine Learning Course, and I am very excited to finally complete the series. To give you some perspective, it took me a month to convert the code to Python and write an article for each assignment. If any of you were hesitating to do your own implementation, be it in Python, R or Java, I strongly recommend you go for it. Coding these algorithms from scratch not only reinforces the concepts taught, but also gives you practice in data science programming in the language you are most comfortable with.


An introduction to context-oriented programming in Kotlin

In this article I will try to describe an interesting new phenomenon that appeared as a by-product of the fascinating progress made by the Kotlin development team: a new approach to library and application architecture design, which I call context-oriented programming.


A Non-Technical Reading List for Data Science

1. The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t by Nate Silver
2. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil
3. (Tie) Algorithms to Live By: The Computer Science of Human Decisions by Brian Christian and Tom Griffiths
4. How Not to Be Wrong: The Power of Mathematical Thinking by Jordan Ellenberg
5. Thinking, Fast and Slow by Daniel Kahneman
6. (Dark horse): The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb