Statistics Sunday: Introduction to Regular Expressions

In my last Statistics Sunday post, I briefly mentioned the concept of regular expressions, also known as regex (though note that in some contexts, these refer to different things – see here). A regular expression is a text string, which you ask your program to match. You can use this to look for files with a particular name or extension, or search a corpus of text for a specific word or word(s) that match a certain pattern. This concept is used in many programming languages, and R is no exception. In order to use a regular expression, you need a regular expression engine. Fortunately, R base and many R packages come with a regular expression engine, which you call up with a function. In R base, those functions are things like grep, grepl, regexpr, and so on. Today, I want to talk about some of the syntax you use to describe different patterns of text characters, which can include letters, digits, white space, and punctuation, starting from the most simple regular expressions. Typing out a specific text string matches only that. In last week’s example, I was working with files that all start with the word ‘exam.’ So if I had other files in that folder that didn’t follow that naming convention, I could look specifically for those by simply typing exam into the pattern attribute of my list.files() function. These files follow exam with a date, formatted as YYYYMMDD, and a letter to delineate each file from that day (a to z). Pretty much all of our files end in a, but in the rare instances where they send two files for a day, that one would end in b. How can I write a regular expression that matches this pattern?

Improving Binning by Bootstrap Bumping

In the post (https://…onic-binning-based-on-isotonic-regression ), a more robust version of monotonic binning based on the isotonic regression was introduced. Nonetheless, due to the loss of granularity, the predictability has been somewhat compromised, which is a typical dilemma in the data science. On one hand, we don’t want to use a learning algorithm that is too greedy and therefore over-fits the data at the cost of simplicity and generality. On the other hand, we’d also like to get the most predictive power out of our data for better business results. It is worth mentioning that, although there is a consensus that advanced ensemble algorithms are able to significantly improve the prediction outcome, both bagging and boosting would also destroy the simple structure of binning outputs and therefore might not be directly applicable in this simple case.

Time Series Forecasting with LSTMs and Prophet: Predict your email workload

Time series forecasting offers a fantastic playground to test data science algorithms. After all, how great would it be if one is able to forecast the future? Typical datasets that are used to demonstrate forecasting algorithms are stock charts, sales, and meteorological data. Here, we will try something that is more relatable to every online user, namely the number of e-mails you will receive. According to an e-mail statistics report, the average office worker received 85 e-mails per day in 2014. Let’s try to make an accurate prediction based on your historical inbox data. For this, we will explore the use of LSTMs and Facebook’s Prophet. The aim here is on how to prepare the data for the different algorithms and providing a qualitative overview rather than finetuning. Arguably, the results will vary significantly on the number of e-mails you typically receive and hence the data you have for training.

Algorithmic Trading using Sentiment Analysis on News Articles

Well I was supposed to study for my exams but I was very tempted to explore this while I was taking breaks. I realized that I could use elements from my previous projects to work on this. I previously wrote about Automating Social Media Contests with Web Scraping which is relevant to news scraping and I also wrote about Generating Singlish Text Messages where I worked with text data. Isn’t it interesting how it could all overlap?

RetinaNet: how Focal Loss fixes Single-Shot Detection

Object detection is a tremendously important field in computer vision needed for autonomous driving, video surveillance, medical applications, and many other fields. Have you heard buzzwords such as YOLO or SSD? In this article, I will explain to you several important advances recently made in state-of-the-art object detection using deep learning and lead you to great resources that demonstrate how to implement them. After reading this article you will have a good overview of the field, no prior knowledge in object detection required.

A Beginner’s Guide to Brain-Computer Interface and Convolutional Neural Networks

The big picture of brain-computer interface and AI + Research papers

Translating SQL to pandas

Like a person with SQL background and a person that works a lot with SQL, first steps with pandas were little bit difficult for me. I would always think in terms of SQL and then wonder why pandas is so not-intuitive. But with the time I got used to a syntax and found my own associations between these two. So I decide to create kind of cheat sheet of basic SQL commands translated into pandas language.

Plotting wind highways using rWind

Our manuscript about rWind R package has been recently accepted for publication in Ecography! As you know, rWind is a tool used to download and manage wind data, with some utilities that make easy to include wind information in ecological or evolutionary analyses (or others!). Though there are several examples in the Supporting material of the publication, and others in the vignette included in the CRAN version, here you have a full tutorial with the main utilities of rWind… enjoy!


A tool for exploring a docker image, layer contents, and discovering ways to shrink your Docker image size.

Probabilistic Models of Cognition

This book explores the probabilistic approach to cognitive science, which models learning and reasoning as inference in complex probabilistic models. We examine how a broad range of empirical phenomena, including intuitive physics, concept learning, causal reasoning, social cognition, and language understanding, can be modeled using probabilistic programs (using the WebPPL language).

If You Can’t Measure It, You Can’t Improve It !!!

Online shopping has changed quite a lot in the past few years. Online stores like Amazon, treats their customers at a more personal level. They understand your interests in certain products based on your online shopping activities (viewing items, adding items to your cart, and eventually making a purchase). For instance, you go on Amazon and search for an item and then you click on some of the search results. The next time you visit Amazon, you can see that there is a specific section that recommends similar products for you based on what you searched for the last time. The story does not end here and as you interact more with the online store, you receive more personalized recommendations including ‘Customers who bought this item also bought’ and it shows you a list of items that are bought frequently together. Also some stores send promotional emails offering products that are targeting customers who are more likely to purchase those items.

Understanding a Machine Learning workflow through food

Through food?! Yes, you got that right, through food! 🙂 Imagine yourself ordering a pizza and, after a short while, getting that nice, warm and delicious pizza delivered to your home. Have you ever wondered the workflow behind getting such a pizza delivered to your home? I mean, the full workflow, from the sowing of tomato seeds to the bike rider buzzing at your door! It turns out, it is not so different from a Machine Learning workflow. Really! Let’s check it out!

Hex – Creating Intelligent Opponents with Minimax Driven AI (Part 1: – Pruning)

In today’s article, I am going to show you how to create intelligent opponents with Alpha-Beta Minimax algorithm. We will create an agent that can successfully compete with humans in the classic Hex game. After the end of this article, you will be able to create adversarial search agents that can competitively play zero-sum and perfect information games.

Comparison of BI and Analytics Platforms

In this tutorial, you will learn about different factors that are taken into consideration before choosing your Analytics Platform both at the organization and individual level.

Transfer Learning with Convolutional Neural Networks in PyTorch

Although Keras is a great library with a simple API for building neural networks, the recent excitement about PyTorch finally got me interested in exploring this library. While I’m one to blindly follow the hype, the adoption by researchers and inclusion in the library convinced me there must be something behind this new entry in deep learning. Since the best way to learn a new technology is by using it to solve a problem, my efforts to learn PyTorch started out with a simple project: use a pre-trained convolutional neural network for an object recognition task. In this article, we’ll see how to use PyTorch to accomplish this goal, along the way, learning a little about the library and about the important concept of transfer learning.