Building Efficient Custom Datasets in PyTorch

PyTorch has been around my circles as of late and I had to try it out despite being comfortable with Keras and TensorFlow for a while. Surprisingly, I found it quite refreshing and likable, especially as PyTorch features a Pythonic API, a more opinionated programming pattern and a good set of built-in utility functions. One that I enjoy particularly well is the ability to easily craft a custom Dataset object which can then be used with the built-in DataLoader to feed data when training a model. In this article, I will be exploring the PyTorch Dataset object from the ground up with the objective of making a dataset for handling text files and how one could go about optimizing the pipeline for a certain task. We start by going over the basics of the Dataset utility with a toy example and work our way up to the real task. Specifically, we want to create a pipeline to feed first names of character names, from The Elder Scrolls (TES) series, the race of those character names and the gender of the names as one-hot tensors. You can find this dataset on my website.

Building a Connection Between Decision Maker and Data-Driven Decision Process

It is quite common that most of companies’ decisions are made based on feelings, intuitions or personal experiences. The reasons for such patterns have organizational, technical and process oriented backgrounds. For instance, there is no structured way to deal with the analytical results on both sides simultaneously – organizational and technical. Usually, in case of analytics the ones doing analysis (e.g. data scientists) and the ones using results of analytics (e.g. decision makers) are different persons. As a result, such a structure leads to ambiguity and misunderstanding between the involved parties. In order to bridge the existing gap between data scientists and decision makers, we introduced the Data Product Profile which links both data science and data-driven decision processes.

Plotting Markowitz Efficient Frontier with Python

This article is a follow up on the article about calculating the Sharpe Ratio. After knowing how to get the Sharpe ratio, we will simulate over a few thousand possible portfolio allocations, and draw the outcomes in a chart. With this we can easily find out the best allocation for our stocks for any given level of risk we are willing to take. Like in the previous article, we will need to load our data. I’ll be using four simple files with two columns – a date and closing price. You can use your own data, or find something in Quandl, which is very good for this purpose. We have 4 companies – Amazon, IBM, Cisco, and Apple. Each file is loaded as amzn, ibm, cisco and aapl respectively. The head of aapl is printed below.

NLP vs. NLU: from Understanding a Language to Its Processing

As artificial intelligence progresses and technology becomes more sophisticated, we expect existing concepts to embrace this change – or change themselves. Similarly, in the domain of computer-aided processing of natural languages, shall the concept of natural language processing give way to natural language understanding? Or is the relation between the two concepts subtler and more complicated that merely linear progressing of a technology? In this post we’ll scrutinize over the concepts of NLP and NLU and their niches in the AI-related technology. Importantly, though sometimes used interchangeably, they are actually two different concepts that have some overlap. First of all, they both deal with the relationship between a natural language and artificial intelligence. They both attempt to make sense of unstructured data, like language, as opposed to structured data like statistics, actions, etc. However, NLP and NLU are opposites of a lot of other data mining techniques.

Illustrated Machine Learning Cheatsheets

My twin brother Afshine and I created this set of illustrated Machine Learning cheatsheets covering the content of the CS 229 class, which I TA-ed in Fall 2018 at Stanford. They can (hopefully!) be useful to all future students of this course as well as to anyone else interested in Machine Learning.

Basic Statistics Concepts Data Scientist Should know

Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems. At the core is data. Troves of raw information, streaming in and stored in enterprise data warehouses. Much to learn by mining it. Advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value The broader fields of understanding what data science includes mathematics, statistics, computer science and information science. For career as Data Scientist, you need to have a strong background in statistics and mathematics. Big companies will always give preference to those with good analytical and statistical skills. In this blog, we will be looking at the basic statistical concepts which every data scientists must know. Let’s understand them one by one in the next section.

Basic Linear Algebra Concepts for Machine Learning

The field of Data Science has seen exponential growth in the last few years. Though the concept was prevalent previously as well the recent hype is the result of the variety of huge volumes of unstructured data that is getting generated across different industries and the enormous potential that’s hidden beneath those data. On top of the top, the massive computational power that modern day computers possess has made it even more possible to mine such huge chunks of data. Now Data Science is a study which is comprised of several disciplines starting from exploratory data analysis to predictive analytics. There are various tools and techniques that professionals use to extract information from the data. However, there is a common misconception among them is to focus more on those tools rather than the math behind the data modelling. People tend to put too much importance on the Machine Learning algorithms instead of the Linear Algebra or the Probability concepts that are required to fetch relevant meaning from the data. Thus, in this blog post, we would cover one of the pre-requisites in Data Science i.e. Linear Algebra and some of the basic concepts that you should learn.

Statistics for Data Science: Introduction to t-test and its Different Types (with Implementation in R)

Every day we find ourselves testing new ideas, finding the fastest route to the office, the quickest way to finish our work, or simply finding a better way to do something we love. The critical question, then, is whether our idea is significantly better than what we tried previously. These ideas that we come up with on such a regular basis – that’s essentially what a hypothesis is. And testing these ideas to figure out which one works and which one is best left behind, is called hypothesis testing. Hypothesis testing is one of the most fascinating things we do as data scientists. No idea is off-limits at this stage of our project. I have personally seen so many insights coming out of hypothesis testing – insights most of us would have missed if not for this stage!

Personalized treatment effects with model4you

Typical models estimating treatment effects assume that the treatment effect is the same for all individuals. Model-based recursive partitioning allows to relax this assumption and to estimate stratified treatment effects (model-based trees) or even personalised treatment effects (model-based forests). With model-based trees one can compute treatment effects for different strata of individuals. The strata are found in a data-driven fashion and depend on characteristics of the individuals. Model-based random forests allow for a similarity estimation between individuals in terms of model parameters (e.g. intercept and treatment effect). The similarity measure can then be used to estimate personalised models. The R package model4you implements these stratified and personalised models in the setting with two randomly assigned treatments with a focus on ease of use and interpretability so that clinicians and other users can take the model they usually use for the estimation of the average treatment effect and with a few lines of code get a visualisation that is easy to understand and interpret.

One of the most familiar settings for a machine learning engineer is having access to a lot of data, but modest resources to annotate it. Everyone in that predicament eventually goes through the logical steps of asking themselves what to do when they have limited supervised data, but lots of unlabeled data, and the literature appears to have a ready answer: semi-supervised learning. And that’s usually when things go wrong.