An Introductory Guide to Maximum Likelihood Estimation (with a case study in R)

Interpreting how a model works is one of the most basic yet critical aspects of data science. You build a model that gives you pretty impressive results, but what was the process behind it? As a data scientist, you need to have an answer to this oft-asked question. For example, let's say you built a model to predict the stock price of a company. You observed that the stock price increased rapidly overnight. There could be multiple reasons behind it. Finding the most likely reason is what Maximum Likelihood Estimation is all about. This concept is used in economics, MRIs, and satellite imaging, among other things.
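As a quick sketch of the idea (in Python here, though the article's case study uses R, and with made-up data): for a normal model, maximizing the likelihood over candidate means recovers the sample average.

```python
import numpy as np

# Hypothetical example: estimate the mean of a normally distributed quantity
# by maximizing the log-likelihood. For a normal model with known variance,
# the maximum likelihood estimate of the mean is the sample average.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

def neg_log_likelihood(mu, x, sigma=0.5):
    # Negative log-likelihood of a N(mu, sigma^2) model
    return 0.5 * np.sum(((x - mu) / sigma) ** 2) \
        + len(x) * np.log(sigma * np.sqrt(2 * np.pi))

# Grid search over candidate means (a crude optimizer, for illustration only)
candidates = np.linspace(0, 4, 4001)
nll = [neg_log_likelihood(m, data) for m in candidates]
mu_hat = candidates[int(np.argmin(nll))]

print(round(mu_hat, 2))       # close to the sample mean
print(round(data.mean(), 2))  # the analytical MLE for the normal mean
```

The grid search is only there to make "maximizing the likelihood" concrete; in practice you would use a numerical optimizer or the closed-form solution.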

Hypothesis Testing I: Prerequisites

In our day-to-day routine we come across many questions, such as:
1. How many liters of water should be allotted, on average, to a household in a region?
2. How much apple juice should I stock for my guests?
3. By what amount should we increase credit card limits for a certain group of customers?
4. Should we provide additional tuition for language courses to science students?
To figure out answers to these questions at the very basic level, you don't really need any hardcore statistics knowledge. For example, let's take the first and last questions from the ones listed above.

Text mining on the command line

For the last couple of days, I have been thinking about writing something on my recent experience using raw bash commands and regex to mine text. Of course, there are more sophisticated tools and libraries available to process text without writing so many lines of code. For example, Python has the built-in regex module ‘re’, which has many rich features for processing text. ‘BeautifulSoup’, on the other hand, has nice built-in features to clean raw web pages. I use these tools for faster processing of large text corpora, and when I feel too lazy to write code. Most of the time, though, I prefer the command line. I feel at home on the command line, especially when I work with text data. In this tutorial, I use bash commands and regex to process raw and messy text data. I assume readers have a basic familiarity with regex and bash commands. I show how bash commands like ‘grep’, ‘sed’, ‘tr’, ‘column’, ‘sort’, ‘uniq’, and ‘awk’ can be used with regex to process raw and messy text and then extract information. As an example, I use the complete works of Shakespeare provided by Project Gutenberg, in cooperation with World Library, Inc.
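For comparison, here is the Python-side analogue the author alludes to: the classic `tr | sort | uniq -c | sort -rn` word-frequency pipeline expressed with the built-in ‘re’ module. The sample text is a placeholder standing in for the Gutenberg corpus.

```python
import re
from collections import Counter

# A tiny stand-in for the Shakespeare corpus
text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer"""

# Lowercase and split on non-letters (like `tr -cs "a-z'" '\n'`),
# then count and rank (like `sort | uniq -c | sort -rn`)
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

for word, n in counts.most_common(3):
    print(n, word)
```

The bash pipeline and this snippet produce the same ranked frequency table; which one is faster to write depends mostly on where you feel at home.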

Dimensionality Reduction – Does PCA really improve classification outcome?

I have come across a couple of resources about dimensionality reduction techniques. This topic is definitely one of the most interesting ones; it is great to think that there are algorithms able to reduce the number of features by choosing the most important ones that still represent the entire dataset. One of the advantages pointed out by authors is that these algorithms can improve the results of classification tasks. In this post, I am going to verify this claim using Principal Component Analysis (PCA) to try to improve the classification performance of a neural network over a dataset. Does PCA really improve classification outcome? Let's check it out.
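To make the setup concrete, here is a minimal PCA sketch via SVD on synthetic data (not the article's dataset or network): the reduced features would then be fed to the classifier.

```python
import numpy as np

# Synthetic data with deliberately redundant features, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] * 2.0       # feature 3 duplicates feature 0
X[:, 4] = X[:, 1] - X[:, 0]   # feature 4 is a linear combination

# PCA by hand: center the data, take the top-k right singular vectors
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
X_reduced = Xc @ Vt[:k].T     # 200 x 3 instead of 200 x 5

# Explained variance ratio: how much structure the k components keep
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(explained, 3))
```

Because two of the five features are exact linear combinations, three components capture essentially all the variance; real datasets are rarely this clean, which is exactly what makes the article's question interesting.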

Why I rarely use apply

In this short post, I talk about why I'm moving away from using the apply function.

Seasonal decomposition of short time series

Many users have tried to do a seasonal decomposition with a short time series, and hit the error ‘Series has less than two periods’. The problem is that the usual methods of decomposition (e.g., decompose and stl) estimate seasonality using at least as many degrees of freedom as there are seasonal periods. So you need at least two observations per seasonal period to be able to distinguish seasonality from noise. However, it is possible to use a linear regression model to decompose a time series into trend and seasonal components, and then some smoothness assumptions on the seasonal component allow a decomposition with fewer than two full years of data.
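The regression idea can be sketched in a few lines (in Python here, with synthetic monthly data; the underlying method is language-agnostic): regress the series on a linear trend plus one dummy per month, using only 18 months of observations.

```python
import numpy as np

# 18 months of synthetic data: linear trend plus a sinusoidal seasonal
# pattern plus noise -- fewer than the two full years that the usual
# decomposition functions require.
n, period = 18, 12
t = np.arange(n)
season = np.sin(2 * np.pi * (t % period) / period)
y = 0.5 * t + 3.0 * season + np.random.default_rng(1).normal(0, 0.1, n)

# Design matrix: intercept, linear trend, and one dummy per month
X = np.zeros((n, 2 + period))
X[:, 0] = 1.0
X[:, 1] = t
X[np.arange(n), 2 + (t % period)] = 1.0

# Least squares handles the rank deficiency (intercept + full dummies)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
trend = coef[0] + coef[1] * t  # the estimated trend component
print(round(np.abs(y - fitted).max(), 2))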

Producing web content with R

Introducing the Kernelheaping Package II

In the first part of Introducing the Kernelheaping Package I showed how to compute and plot kernel density estimates on rounded or interval censored data using the Kernelheaping package. Now, let's make a big leap forward to the 2-dimensional case. Interval censoring can be generalised to rectangles or even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes, which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate. For the German capital city of Berlin, for example, there exists an open data initiative, where data on e.g. demographics is publicly available.
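The building block behind such smooth point estimates is the 2-dimensional kernel density estimate. A bare-bones Gaussian version (in Python rather than R, with made-up points and bandwidth) looks like this:

```python
import numpy as np

# Synthetic point data clustered around (2, -1), standing in for
# geocoded observations; h is an arbitrary illustrative bandwidth.
rng = np.random.default_rng(7)
points = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(300, 2))
h = 0.3

def kde2d(x, y, pts, h):
    # Average of Gaussian bumps centred on each observed point
    d2 = ((pts[:, 0] - x) ** 2 + (pts[:, 1] - y) ** 2) / (2 * h * h)
    return np.exp(-d2).sum() / (len(pts) * 2 * np.pi * h * h)

# Density is high near the cluster centre and low far away
print(kde2d(2.0, -1.0, points, h) > kde2d(5.0, 5.0, points, h))
```

The Kernelheaping package's contribution is handling the case where the points themselves are only known up to an area (a district, a zip code), which this plain KDE does not address.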

Using Machine Learning for Causal Inference

Machine Learning (ML) is still an underdog in the field of economics. However, it has gained more and more recognition in recent years. One reason for being an underdog is that in economics and other social sciences one is interested not only in prediction but also in causal inference. Thus, many ‘off-the-shelf’ ML algorithms solve a fundamentally different problem. We here at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization.

Machine learning in finance: Why, what & how

Machine learning in finance may work magic, even though there is no magic behind it (well, maybe just a little bit). Still, the success of a machine learning project depends more on building an efficient infrastructure, collecting suitable datasets, and applying the right algorithms. Machine learning is making significant inroads into the financial services industry. Let's see why financial companies should care, what solutions they can implement with AI and machine learning, and how exactly they can apply this technology.

Practical Apache Spark in 10 minutes. Part 1 – Ubuntu installation

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009, and Databricks was later founded by the creators of Spark in 2013. The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g., machine learning) and streaming over large datasets in a wide range of data stores (e.g., HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python, and Scala. In this article, we are going to walk you through the installation process of Spark, as well as Hadoop, which we will need in the future. So follow the instructions to start working with Spark.

Best Machine Learning Tools: Experts’ Top Picks

The best-trained soldiers can't fulfill their mission empty-handed. Data scientists have their own weapons – machine learning (ML) software. There is already a cornucopia of articles listing reliable machine learning tools with in-depth descriptions of their functionality. Our goal, however, was to get the feedback of industry experts. And that's why we interviewed data science practitioners – gurus, really – regarding the useful tools they choose for their projects. The specialists we contacted have various fields of expertise and work at companies such as Facebook and Samsung. Some of them represent AI startups (Objection Co, NEAR.AI, and Respeecher); some teach at universities (Kharkiv National University of Radioelectronics). The AltexSoft data science team joined the discussion, too.
• Python: a popular language with high-quality machine learning and data analysis libraries
• C++: a middle-level language used for parallel computing on CUDA
• R: a language for statistical computing and graphics
• pandas: enhancing data analysis and modeling in Python
• matplotlib: a Python library for quality visualizations
• Jupyter notebook: collaborative work capabilities
• Tableau: powerful data exploration capabilities and interactive visualization
• NumPy: an extension package for scientific computing with Python
• scikit-learn: easy-to-use machine learning framework for numerous industries
• NLTK: Python-based human language data processing platform
• TensorFlow: flexible framework for large-scale machine learning
• TensorBoard: a good tool for model training visualization
• PyTorch: easy to use tool for research
• Keras: lightweight, easy-to-use library for fast prototyping
• Caffe2: deep learning library with mobile deployment support
• Apache Spark: the tool for distributed computing
• MemSQL: a database designed for real-time applications

Smoothing Time Series Data

You are conducting an exploratory analysis of time-series data. To make sure you have the best picture of your data, you'll want to separate long-term trends and seasonal changes from the random fluctuations. In this article, we'll describe some of the time smoothers commonly used to help you do this. These include both global methods, which involve fitting a regression over the whole time series, and more flexible local methods, where we relax the constraint imposed by a single parametric function.
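The simplest local method is a centred moving average. A minimal sketch (synthetic data; the window length is an arbitrary choice, and wider windows give smoother output):

```python
import numpy as np

# Noisy sine wave standing in for a real time series
rng = np.random.default_rng(3)
t = np.linspace(0, 4 * np.pi, 200)
y = np.sin(t) + rng.normal(0, 0.3, t.size)

# Centred moving average: each point becomes the mean of its neighbourhood
window = 11
kernel = np.ones(window) / window
smoothed = np.convolve(y, kernel, mode="same")

# The smoother tracks the underlying signal more closely than the raw data
raw_err = np.abs(y - np.sin(t)).mean()
smooth_err = np.abs(smoothed - np.sin(t)).mean()
print(round(raw_err, 3), round(smooth_err, 3))
```

Note that `mode="same"` zero-pads at the boundaries, so the first and last few smoothed values are biased; the more careful local methods the article covers handle the edges explicitly.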

Data engineering: A quick and simple definition

As the data space has matured, data engineering has emerged as a separate and related role that works in concert with data scientists. Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more. Ryan Blue, a senior software engineer at Netflix and a member of the company's data platform team, says roles on data teams are becoming more specific because certain functions require unique skill sets. ‘For a long time, data scientists included cleaning up the data as part of their work,’ Blue says. ‘Once you try to scale up an organization, the person who is building the algorithm is not the person who should be cleaning the data or building the tools. In a modern big data system, someone needs to understand how to lay that data out for the data scientists to take advantage of it.’

The System Design Primer

Learning how to design scalable systems will help you become a better engineer. System design is a broad topic. There is a vast amount of resources scattered throughout the web on system design principles. This repo is an organized collection of resources to help you learn how to build systems at scale.

Intro to optimization in deep learning: Busting the myth about batch normalization

The myth we are going to tackle is whether Batch Normalization indeed solves the problem of Internal Covariate Shift. Though batch normalization has been around for a few years and has become a staple in deep architectures, it remains one of the most misunderstood concepts in deep learning. Does Batch Norm really solve internal covariate shift? If not, then what does it do? Is your entire deep learning education a lie? Let's find out!
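Before debating what batch norm achieves, it helps to recall what it actually computes at training time: per-feature standardization over the mini-batch, followed by a learned scale and shift. A numpy sketch (gamma and beta are fixed at their initial values 1 and 0 here; in a network they are trained):

```python
import numpy as np

# Fake mini-batch of activations: 64 examples x 10 features
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 10))

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Standardize each feature over the batch, then scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

out = batch_norm(batch)
print(round(float(abs(out.mean())), 4), round(float(out.std()), 4))
```

Whatever the layer's inputs look like, its outputs have roughly zero mean and unit variance per feature, which is the starting point for the article's discussion of what that buys you.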

In linear regression statistical modeling, we try to analyze and visualize the correlation between two numeric variables (a bivariate relation). This relation is often visualized using a scatterplot. The aim of understanding this relationship is to predict the change in one variable given the other.

The first step after data gathering and manipulation is creating your linear model by selecting the two numeric variables. For a small dataset, and with a little bit of domain knowledge, we can find such critical variables and start our analysis. But at times, some pre-examination can make our variable selection simpler. This step can be referred to as the selection of prominent variables for our ‘LM’ model. So in this post, we will discuss two very straightforward methods which can visually help us identify correlations between all the available variables in our dataset.
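One common pre-examination step of this kind is a correlation matrix over all numeric variables, used to shortlist candidates for the model. A sketch (in Python with numpy here, though such posts often use R; the variables are synthetic stand-ins):

```python
import numpy as np

# Three synthetic variables: x2 is strongly related to x1, x3 is noise
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(0, 0.2, 100)
x3 = rng.normal(size=100)

# Pairwise Pearson correlations between all variables
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
```

The high x1–x2 correlation flags that pair as a strong candidate for a bivariate linear model, while x3's near-zero correlations suggest it adds little; a pairs/scatterplot matrix is the visual counterpart of the same check.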

Beginners Ask “How Many Hidden Layers/Neurons to Use in Artificial Neural Networks?”

Beginners in artificial neural networks (ANNs) are likely to ask some questions. Some of these questions include: What is the number of hidden layers to use? How many hidden neurons should be in each hidden layer? What is the purpose of using hidden layers/neurons? Does increasing the number of hidden layers/neurons always give better results? I am pleased to say that we can answer such questions. To be clear, answering them might be too complex if the problem being solved is complicated. By the end of this article, you will at least get an idea of how these questions are answered and be able to test yourself on simple examples.
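The classic illustration of why hidden layers matter is XOR: it is not linearly separable, yet a single hidden layer of two neurons solves it. The weights below are hand-picked for illustration, not trained:

```python
import numpy as np

# XOR truth table: output is 1 exactly when the inputs differ
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def step(z):
    return (z > 0).astype(int)

# Hidden layer: neuron 1 fires for "x1 OR x2", neuron 2 for "x1 AND x2"
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
# Output neuron: OR minus AND gives XOR
W2 = np.array([1.0, -1.0])
b2 = -0.5

hidden = step(X @ W1 + b1)
pred = step(hidden @ W2 + b2)
print(pred)  # [0 1 1 0], matching the XOR targets
```

No single-layer (linear) network can reproduce this table, which is the simplest honest answer to "what is the purpose of hidden layers"; how many you need beyond one is the harder question the article takes on.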