One of the very first learning algorithms that you’ll encounter when studying data science and machine learning is least squares linear regression. Linear regression is one of the easiest learning algorithms to understand; it’s suitable for a wide array of problems, and is already implemented in many programming languages. Most users are familiar with the lm() function in R, which allows us to perform linear regression quickly and easily. But one drawback to the lm() function is that it takes care of the computations to obtain parameter estimates (and many diagnostic statistics, as well) on its own, leaving the user out of the equation. So, how can we obtain these parameter estimates from scratch? In this post, I will outline the process from first principles in R. I will use only matrices, vectors, and matrix operations to obtain parameter estimates using the closed-form linear algebraic solution. After reading this post, you’ll see that it’s actually quite simple, and you’ll be able to replicate the process with your own data sets (though using the lm() function is of course much more efficient and robust).
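As a preview of that closed-form solution: the estimates are beta_hat = (X'X)^{-1} X'y. A minimal sketch in R, using the built-in mtcars data as a stand-in (the post's own data set may differ):

```r
# Closed-form least squares: beta_hat = (X'X)^{-1} X'y
# mtcars is used here only as a convenient built-in example
y <- mtcars$mpg
X <- cbind(1, mtcars$wt)                 # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
# The estimates match lm() to numerical precision:
coef(lm(mpg ~ wt, data = mtcars))
```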
word2vec is an algorithm that transforms words into vectors, so that words with similar meanings end up lying close to each other. Moreover, it allows us to use vector arithmetic to work with analogies, for example the famous king – man + woman = queen. I will try to explain how it works, with special emphasis on the meaning of vector differences, while omitting as many technicalities as possible.
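To make the analogy arithmetic concrete, here is a toy sketch in R. The 2-d vectors are hand-made for illustration, not real word2vec embeddings (real ones live in hundreds of dimensions):

```r
# Hand-made toy vectors (NOT real embeddings), chosen so the analogy works out
vecs <- rbind(king  = c(0.9, 0.8),
              man   = c(0.5, 0.2),
              woman = c(0.5, 0.9),
              queen = c(0.9, 1.5))
# king - man + woman should land near queen
target <- vecs["king", ] - vecs["man", ] + vecs["woman", ]
# Nearest word by cosine similarity
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sims <- apply(vecs, 1, function(v) cosine(v, target))
names(which.max(sims))   # "queen"
```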
Goodness of fit tests are a collection of statistical techniques used to summarize the differences between observed and expected values. These tests can be used for hypothesis testing, testing residual normality, distribution comparisons, and outcome frequency distribution comparisons. It can be difficult to know which statistical test to select for a given problem domain, and if the wrong assumptions are made, the results often won’t point you to the mistake. In this post, I detail three common goodness of fit tests, the R code to perform them, and how to interpret the results, using beer (Statistics + Analytics) in the examples.
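As a taste of one such test, here is Pearson’s chi-squared goodness of fit test in R, with hypothetical beer-sales counts (the post’s own examples and numbers will differ):

```r
# Hypothetical counts: do four beer styles sell in a 4:3:2:1 ratio?
observed <- c(ipa = 410, lager = 290, stout = 205, porter = 95)
chisq.test(observed, p = c(0.4, 0.3, 0.2, 0.1))
# A large p-value means the observed frequencies are consistent
# with the hypothesized distribution
```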
In two weeks, the third year of the Data Science for Actuaries program will start. I will be there for the introduction to R. The slides, created with slidify, are available online.
Data visualization is a key part of any data science workflow, but it is frequently treated as an afterthought or an inconvenient extra step in reporting the results of an analysis. Taking such a stance is a mistake — as the cliché goes, a picture is worth a thousand words. Data visualization should really be part of your workflow from the very beginning, as there is a lot of value and insight to be gained from just looking at your data. Summary statistics often don’t tell the whole story; Anscombe’s quartet is an unforgettable demonstration of this principle. Furthermore, the impact of an effective visualization is difficult to match with words and will go a long way toward ensuring that your work gets the recognition it deserves.
JSON support arrived in SQL Server in the same version that introduced R integration. Both were very highly anticipated and awaited by the community. SQL Server has powerful statements for converting data to and from JSON when storing it into / retrieving it from the engine (FOR JSON, JSON_VALUE, etc.). And since JSON is gaining popularity for data exchange, I was curious to try it in combination with R. I will simply convert a system table into a JSON array using the FOR JSON clause.
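On the R side, the jsonlite package produces the same array-of-objects shape that FOR JSON AUTO returns, which makes the round trip straightforward. A small sketch (the data frame here is made up for illustration; in the demo it would come from a SQL Server query):

```r
library(jsonlite)
# A stand-in data frame mimicking a couple of system-table rows
df <- data.frame(name = c("sysobjects", "syscolumns"), object_id = c(1, 2))
toJSON(df)            # an array of objects, one per row, like FOR JSON AUTO
fromJSON(toJSON(df))  # and back to a data frame
```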
When analyzing query execution performance, there are many ways to find which queries might cause unwanted load or stall the server. To encourage the DBA community to start taking advantage of the R language and the world of data science, I have created a demo showing how statistics on numerous queries can be stored for later analysis. The demo uses an unsupervised (or undirected) method to group similar query statistics (and hence queries), making it easier to track down which queries might be potential system stoppers. Before running some queries to generate their running statistics, we clear the cache and prepare a table for storing statistics from sys.dm_exec_query_stats.
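The unsupervised grouping step can be sketched with k-means in R. The statistics below are simulated (real values would come from sys.dm_exec_query_stats), and the column names are only illustrative:

```r
set.seed(42)
# Simulated stats for 20 queries: 10 cheap, 10 expensive
stats <- data.frame(
  total_worker_time   = c(rnorm(10, 1e5, 1e4), rnorm(10, 5e6, 5e5)),
  total_logical_reads = c(rnorm(10, 2e3, 2e2), rnorm(10, 8e4, 8e3))
)
# Scale first so both measures contribute equally, then group into 2 clusters
cl <- kmeans(scale(stats), centers = 2)
cl$cluster   # cluster membership: the expensive queries end up together
```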
In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.
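For readers who want the “how” in code before the “why”: a minimal decision tree in R using the rpart package on the built-in iris data (an illustration, not the article’s own example):

```r
library(rpart)
# Fit a classification tree predicting species from the four measurements
fit <- rpart(Species ~ ., data = iris)
print(fit)                               # the fitted splits
predict(fit, iris[1, ], type = "class")  # classify one observation
```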
The last month, yep, the past year, I’ve been working on creating maps easily with highcharter.