Classification from scratch, bagging and forests 10/8

Tenth post of our series on classification from scratch. Today, we'll see the heuristics of the algorithm inside bagging techniques.

Classification from scratch, boosting 11/8

Eleventh post of our series on classification from scratch. This should be the last one… unless I forgot something important. So today, we discuss boosting.

Why the Future of Machine Learning is Tiny

When Azeem asked me to give a talk at CogX, he asked me to focus on just a single point that I wanted the audience to take away. A few years ago my priority would have been convincing people that deep learning was a real revolution, not a fad, but there have been enough examples of shipping products that that question seems answered. I knew this was true before most people not because I'm any kind of prophet with deep insights, but because I'd had a chance to spend a lot of time running hands-on experiments with the technology myself. I could be confident of the value of deep learning because I had seen with my own eyes how effective it was across a whole range of applications, and knew that the only barrier to seeing it deployed more widely was how long it takes to get from research to deployment. Instead I chose to speak about another trend that I am just as certain about, and that will have just as much impact, but which isn't nearly as well known. I'm convinced that machine learning can run on tiny, low-power chips, and that this combination will solve a massive number of problems we have no solutions for right now. That's what I'll be talking about at CogX, and in this post I'll explain more about why I'm so sure.

How to think about AI and machine learning technologies, and their roles in automation

In this post, I share slides and notes from a talk Roger Chen and I gave in May 2018 at the Artificial Intelligence Conference in New York. Most companies are beginning to explore how to use machine learning and AI, and we wanted to give an overview and framework for how to think about these technologies and their roles in automation. Along the way, we describe the machine learning and AI tools that can be used to enable automation.

Anomaly Detection in R

Imagine you are a credit card company and you know of a particular customer who makes a purchase of $25 every week. You guess that this purchase is his fixed weekly ration, but one day this customer makes a different purchase of $700. This development will not just startle you but also compel you to talk to the customer and find out the reason, so that you can approve the transaction. This is because the customer's behavior had become fixed, and the change was so large that it was not expected. We call such an event an anomaly. Anomalies are hard to detect because they can also be real phenomena. Let's say that the customer in the example above made the usual purchases while he was living alone and is starting a family this week, so this is the first of many future purchases of similar magnitude. Or he is throwing a party this week and this was a one-time large purchase. In all these cases, the customer will be classified as making an ‘abnormal’ choice. We as the credit card company need to know which of these cases are genuine and which are mistakes that can be corrected by reconfirming with the customer. Detecting such anomalies is very useful, especially in the BFSI industry, with the primary use in credit card transactions. Such anomalies can be signs of fraud or theft. Someone making multiple transactions of small amounts from the same credit card, making one very large transaction that is a few orders of magnitude larger than the average, or making transactions from an unfamiliar location are examples that can be caused by fraudsters and must be caught. Given how widely anomaly detection has been adopted, let's study the ways we can detect anomalies.
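To make the $25-versus-$700 example concrete, here is a minimal sketch in Python (the post itself works in R, and this is not its method). It assumes a robust "modified z-score" built on the median and the median absolute deviation; the function name `flag_anomalies`, the 3.5 cutoff, and the purchase amounts are illustrative assumptions, not values from the post.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return the indices of amounts that look anomalous.

    Assumption: a modified z-score based on the median and the median
    absolute deviation (MAD), with a conventional cutoff of 3.5.
    """
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        # Every typical value is identical, so any deviation stands out.
        return [i for i, a in enumerate(amounts) if a != center]
    return [
        i
        for i, a in enumerate(amounts)
        if abs(0.6745 * (a - center) / mad) > threshold
    ]


# Hypothetical weekly purchases: roughly $25 each week, then one $700 purchase.
weekly_purchases = [24, 25, 26, 25, 23, 27, 25, 700]
print(flag_anomalies(weekly_purchases))  # index of the $700 purchase
```

The median and MAD are used instead of the mean and standard deviation because a single huge purchase would inflate the latter and partly hide itself. Note that the sketch only flags the purchase; deciding whether it is fraud, a party, or a new household still requires talking to the customer.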

Information Security: Anomaly Detection and Threat Hunting with Anomalize

Information Security (InfoSec) is critical to a business. For those new to InfoSec, it is the state of being protected against the unauthorized use of information, especially electronic data. A single malicious threat can cause massive damage to a firm, large or small. It's for this reason that, when I (Matt Dancho) saw Russ McRee's article, ‘Anomaly Detection & Threat Hunting with Anomalize’, I asked him to repost it on the Business Science blog. In his article, Russ speaks to the use of our new R package, anomalize, as a way to detect threats (aka ‘threat hunting’). Russ is Group Program Manager of the Blue Team (the internal security team that defends against real attackers) for Microsoft's Windows and Devices Group (WDG), now part of the Cloud and AI (C+AI) organization. He writes toolsmith, a monthly column for information security practitioners, and has written for other publications including Information Security, (IN)SECURE, SysAdmin, and Linux Magazine. The data Russ routinely deals with is massive in scale: He processes security event telemetry of all types (operating systems, network, applications, service layer) for all of Windows, Xbox, the Universal Store (transactions/purchases), and a few others. Billions of events in short order.

Running RStudio (1.2) Background Jobs

The forthcoming RStudio 1.2 release has a new ‘Jobs’ feature for running and managing background R tasks. I did a series of threaded screencaps on Twitter but that doesn't do the feature justice. So I threw together a quick splainer on how to run R and Python (despite RStudio not natively supporting Python) code in the background while you get other stuff done, then work with the results.

Microsoft R Open 3.5.0 now available

Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a major new framework for handling data in R, with some major behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future).

Why Bother with Shiny?

For the last week we’ve been talking on the blog and Twitter about some of the functionality in Shiny and how you can learn it. But, if you haven’t already made the leap and started using Shiny, why should you?

Scale-Invariant Clustering and Regression

The impact of a change of scale, for instance using years instead of days as the unit of measurement for one variable in a clustering problem, can be dramatic. It can result in a totally different cluster structure. Frequently, this is not a desirable property, yet it is rarely mentioned in textbooks. I think all clustering software should state in its user guide that the algorithm is sensitive to scale. We illustrate the problem here, and propose a scale-invariant methodology for clustering. It applies to all clustering algorithms, as it consists of normalizing the observations before classifying the data points. It is not a magic solution, and it has its own drawbacks as we will see. In the case of linear regression, there is indeed no problem, and this is one of the few strengths of this technique.
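The days-versus-years effect can be reproduced with three made-up points. The sketch below is in Python and assumes z-score standardization as the normalization step, which may differ from the exact methodology the post proposes; the data and the helper names `standardize` and `nearest_to_first` are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column (assumes no column is constant)."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


def nearest_to_first(rows):
    """Index of the point closest (Euclidean distance) to the first point."""
    return min(range(1, len(rows)), key=lambda i: dist(rows[0], rows[i]))


# Hypothetical observations: (duration, some other measurement).
in_days = [(0.0, 0.0), (730.0, 1.0), (10.0, 5.0)]
in_years = [(d / 365.0, y) for d, y in in_days]

# On the raw data, the nearest neighbour depends on the unit of measurement.
print(nearest_to_first(in_days), nearest_to_first(in_years))

# After standardization, both units give the same nearest neighbour.
print(nearest_to_first(standardize(in_days)),
      nearest_to_first(standardize(in_years)))
```

Since most clustering algorithms are driven by such pairwise distances, a neighbour that flips with the unit is enough to change the cluster structure. One drawback is visible even here: standardizing gives every variable equal weight, which is itself a modelling choice.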

Packaging and Distributing Your Python Project to PyPI for Installation Using pip

You might have worked with several languages such as Java, C++, and Python and created a number of projects, but unfortunately these projects are buried and no one knows about them. Why not bring these projects to life by making them available online? This tutorial will explain the steps required to package your Python projects, build them into distribution formats using setuptools, upload them to the Python Package Index (PyPI) repository using twine, and finally install them using Python installers such as pip and conda. The platform used in this tutorial is Linux Ubuntu 18.04 with Python 3.6.5. But you can still use other platforms such as Windows with little or no difference in the commands used.

Easy APA Formatted Bayesian Correlation

The Bayesian framework is the right way to go for psychological science. To facilitate its use for newcomers, we implemented the bayes_cor.test function in the psycho package, a user-friendly wrapper for the correlationBF function of the great BayesFactor package by Richard D. Morey.