The following story is one of the most often told in the Data Science community: some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and trucks on the other. After reaching a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they reverse-engineered the black box (no small feat in itself) and found that most of the military pictures were taken at dusk or dawn, while most of the civilian pictures were taken under brighter weather conditions. The neural net had learned the difference between light and dark!
In this post in the R:case4base series we will look at string manipulation with base R, and provide an overview of a wide range of functions for everyday string-processing needs.
Here, let me tell you something about some awesome libraries that R has. I consider these libraries to be the top libraries for Data Science. They offer a wide range of functions and are quite useful for Data Science operations. I’ve used them and still use them for most of my day-to-day Data Science work. Without wasting any further time, let me get you started with awesome R stuff.
1. dplyr
2. ggplot2
3. esquisse
4. Bioconductor
5. shiny
6. lubridate
7. knitr
8. mlr
9. quanteda.dictionaries
10. DT
11. Rcrawler
12. caret
13. rmarkdown
14. leaflet
15. janitor
16. Other R libraries worth mentioning:
16.1. ggvis
16.2. plotly
16.3. rCharts
16.4. rbokeh
16.5. broom
16.6. stringr
16.7. magrittr
16.8. slidify
16.9. rvest
16.10. future
16.11. RMySQL
16.12. RSQLite
16.13. prophet
16.14. glmnet
16.15. text2vec
16.16. SnowballC
16.17. quantmod
16.18. rstan
16.19. swirl
16.20. DataScienceR
This article is part of my series that delves into how to optimize portfolio allocations. The series explains the mathematics and design principles behind my open-source portfolio optimization library, OptimalPortfolio. The first article dealt with finding market invariants. Having found them, we must now estimate the distribution of the market invariants in order to extract useful information from them.
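As a sketch of that estimation step, one simple (and strong) assumption is that the invariants are i.i.d. normal, in which case the sample mean and sample covariance are the maximum-likelihood estimates. The prices below are synthetic and the code is illustrative, not taken from OptimalPortfolio itself:

```python
import numpy as np

# Treat daily log-returns of each asset as the market invariants, then
# estimate their distribution with the sample mean and covariance
# (the MLE under an i.i.d. normal assumption).
rng = np.random.default_rng(0)

# Synthetic price paths for 3 assets over 251 days (demonstration only).
prices = 100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, size=(251, 3)), axis=0))

invariants = np.diff(np.log(prices), axis=0)   # log-returns: log P_t - log P_{t-1}

mu_hat = invariants.mean(axis=0)               # estimated mean vector
sigma_hat = np.cov(invariants, rowvar=False)   # estimated covariance matrix

print(mu_hat.shape, sigma_hat.shape)           # (3,) (3, 3)
```

In practice one would question the normality assumption (returns have fat tails) and consider shrinkage or robust estimators, which is exactly where the series goes next.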
Nowadays, many companies (Netflix, Amazon, Uber, but also smaller ones) constantly run experiments (A/B tests) in order to test new features and implement those that users find best and that, in the end, lead to revenue growth. The data scientists’ role is to help evaluate these experiments; in other words, to verify whether the results of these tests are reliable and can or should be used in decision making. In this article I provide an introduction to power analysis. Briefly speaking, power expresses the confidence in the conclusions drawn from the results of an experiment. It can also be used to estimate the sample size required for an experiment, i.e., a sample size at which, with a given level of confidence, we should be able to detect an effect. An effect can mean many things: for instance, more frequent conversion within a group, but also a higher average spend of customers going through a certain signup flow in an online shop.
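To make the sample-size use concrete, here is the standard normal-approximation formula for a two-sided, two-sample test of means, using only the Python standard library (the function name and its defaults are my own):

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-sample z-test
    of means, where effect_size is Cohen's d (standardized mean difference).
    A t-test needs an observation or two more; this is the usual
    normal-approximation formula n = 2 * ((z_{1-a/2} + z_power) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile giving desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)

# To detect a "medium" effect (d = 0.5) with 80% power at alpha = 0.05:
print(sample_size_per_group(0.5))  # 63 observations per group
```

Note how quickly the required sample grows as the effect shrinks: a small effect of d = 0.2 already needs several hundred observations per group.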
In this blog post, I will explain how to evaluate policies and find optimal ones using Dynamic Programming. This series of blog posts contains a summary of the concepts explained in Introduction to Reinforcement Learning by David Silver.
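As a preview, iterative policy evaluation, the first Dynamic Programming routine in that material, can be sketched in a few lines. The toy MDP and the names below are mine, not from the lectures:

```python
# P maps (state, action) to a list of (probability, next_state, reward)
# transitions; policy maps each state to action probabilities pi(a|s).
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
policy = {"s0": {"go": 1.0}, "s1": {"stay": 1.0}}
gamma = 0.9  # discount factor

def evaluate_policy(P, policy, gamma, theta=1e-8):
    """Repeatedly apply the Bellman expectation backup until the value
    function stops changing by more than theta."""
    V = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s in V:
            v = sum(
                pi * p * (r + gamma * V[s2])
                for a, pi in policy[s].items()
                for p, s2, r in P[(s, a)]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v   # in-place (Gauss-Seidel style) update
        if delta < theta:
            break
    return V

V = evaluate_policy(P, policy, gamma)
```

Because "stay" in s1 earns 2 forever, V(s1) converges to 2 / (1 - 0.9) = 20, and V(s0) follows from the Bellman equation; policy improvement then builds on exactly this computation.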
In this article we’ll be learning about Natural Language Processing (NLP), which helps computers analyze text, e.g., to detect spam emails or autocorrect typos. We’ll see how NLP tasks are carried out to understand human language.
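To make that concrete, here is a minimal sketch of a typical first NLP step, tokenization, followed by a deliberately naive keyword-based spam score. The example email and word list are invented, and this is a toy heuristic, not a real classifier:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

email = "WINNER!! Claim your FREE prize now. Reply now to claim your prize."
tokens = tokenize(email)
counts = Counter(tokens)

# Toy spam heuristic: count occurrences of known "spammy" words.
SPAM_WORDS = {"winner", "free", "prize", "claim"}
spam_hits = sum(counts[w] for w in SPAM_WORDS)
print(tokens[:4], spam_hits)  # ['winner', 'claim', 'your', 'free'] 6
```

Real spam filters replace the hand-picked word list with probabilities learned from labeled data (e.g., naive Bayes), but the tokenize-and-count pipeline is the same.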
Recently, I have been working on an Optical Character Recognition (OCR) project. One of the challenges is that the pre-trained OCR model outputs incorrect text. Besides the performance of the OCR model itself, image quality and layout alignment are other major sources of text errors.
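One common post-correction step for such errors is to snap each OCR token to the nearest word in a known vocabulary by string similarity. The sketch below uses Python's difflib; the vocabulary and the garbled output are invented, and this is not the method of any particular OCR library:

```python
import difflib

# Hypothetical vocabulary of words we expect in the documents.
VOCAB = ["invoice", "total", "amount", "date", "customer"]

def correct_token(token, vocab=VOCAB, cutoff=0.7):
    """Return the closest vocabulary word, or the token unchanged if no
    candidate is similar enough (so dates, numbers, etc. pass through)."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

# Typical OCR confusions: l/i, 1/l, i/t.
ocr_output = ["lnvoice", "tota1", "amouni", "2021-07-15"]
corrected = [correct_token(t) for t in ocr_output]
print(corrected)  # ['invoice', 'total', 'amount', '2021-07-15']
```

The cutoff matters: too low and correct rare strings get overwritten, too high and real errors survive; a language model over token sequences is the usual next refinement.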
In this post, I will give an introduction to the Support Vector Machine (SVM) classifier. This post is part of a series in which I will explain the SVM, including all the necessary minute details and the mathematics behind it. It will be easy, believe me! Without any delay let’s begin. Suppose we’re given these two samples of blue stars and purple hearts (just a schematic representation; no real data are used here), and our job is to find the line that separates them best. What do we mean by best here?
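By "best", an SVM means the line with the maximum margin: the largest possible distance to the closest points of either class. As a rough preview, the sketch below trains a linear classifier by sub-gradient descent on the regularized hinge loss (the objective an SVM minimizes) using made-up 2-D data; all names and constants are illustrative:

```python
import random

# Two well-separated toy clusters standing in for the stars and hearts.
random.seed(0)
stars  = [( 2 + random.random(),  2 + random.random()) for _ in range(20)]  # label +1
hearts = [(-2 - random.random(), -2 - random.random()) for _ in range(20)]  # label -1
data = [(x, +1) for x in stars] + [(x, -1) for x in hearts]

w, b, lam, lr = [0.0, 0.0], 0.0, 0.01, 0.1
for epoch in range(200):                       # sub-gradient descent on the
    for (x1, x2), y in data:                   # regularized hinge loss
        margin = y * (w[0] * x1 + w[1] * x2 + b)
        if margin < 1:                         # inside the margin: push it out
            w[0] += lr * (y * x1 - lam * w[0])
            w[1] += lr * (y * x2 - lam * w[1])
            b += lr * y
        else:                                  # otherwise only shrink w
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]

correct = sum(1 for (x1, x2), y in data
              if y * (w[0] * x1 + w[1] * x2 + b) > 0)
print(correct, "of", len(data), "points classified correctly")
```

The regularization term is what pushes the solution toward the widest margin rather than just any separating line; the series will make that trade-off precise.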
All researchers are familiar with the importance of delivering a paper that is written in a clean and organized way. However, the same thing can often not be said about the way that we organize and maintain the code and data used in the backend (i.e. code and data layer) of a research project. This part of a project is usually not visible and good intentions to keep it organized tend to be one of the first things to fly out the window when a deadline is approaching. While understandable, I consider this to be an area with a lot of room for improvement. We spend the large majority of our time interacting with this part of the project and we can save ourselves a lot of time and frustration if we keep it clean and organized!
This is my renewed self-torturing attempt at learning machine learning. A couple of months ago, I earned a Machine Learning Engineer Nanodegree from the online nanodegree factory Udacity. “Nano” probably denotes that it is infinitesimally small and insignificant.