6 Practices to enhance the performance of a Text Classification Model

1. Domain Specific Features in the Corpus
2. Use An Exhaustive Stopword List
3. Noise Free Corpus
4. Eliminating features with extremely low frequency
5. Normalized Corpus
6. Use Complex Features: n-grams and part of speech tags
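Several of the practices above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the article's own code: the stopword list is deliberately tiny, the corpus is invented, and the helper names (`tokenize`, `extract_features`, `build_vocabulary`) and the `min_count` threshold are assumptions made for the example. Part-of-speech tags and domain-specific features are left out because they need an external tagger or domain knowledge.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be far more exhaustive.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "this"}


def tokenize(text):
    """Strip HTML-like tags (noise), lowercase (normalization), keep alphabetic tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", text.lower())


def extract_features(text, n=2):
    """Return unigrams plus n-grams built from the stopword-filtered tokens."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams


def build_vocabulary(corpus, min_count=2):
    """Keep only features that occur at least min_count times across the corpus."""
    counts = Counter(f for doc in corpus for f in extract_features(doc))
    return {f for f, c in counts.items() if c >= min_count}


# Invented three-document corpus, for illustration only.
corpus = [
    "The movie is GREAT!!! <br> Great acting.",
    "Great acting and a great plot.",
    "This plot is boring.",
]
vocabulary = build_vocabulary(corpus)
print(sorted(vocabulary))
```

Raising `min_count` prunes more rare features, which tends to shrink the feature space at the cost of dropping some informative but infrequent terms.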


Stream Processing and Streaming Analytics – How It Works

Understanding the basic operating characteristics of Stream Processing and In-Stream Analytics will help clarify whether it’s useful in your situation.


How to Compare Distribution by Using Density Plots in R

Like histograms, density plots show the distribution of data, and they are especially useful for comparing distributions. For example, I often compare the levels of different risk factors (e.g. cholesterol, glucose, body mass index) among individuals with and without cardiovascular disease. Density plots can also illustrate how the distribution of a particular variable changes over time.
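To make the idea concrete, here is a minimal sketch of the kernel density estimate that underlies a density plot. It is written in standard-library Python rather than R, so it is not the article's code. The cholesterol values are made up, and the `gaussian_kde` helper and the bandwidth of 10 are assumptions chosen only for illustration.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` with Gaussian kernels."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)

    return density


# Made-up cholesterol values (mg/dL) for two hypothetical groups.
without_cvd = [172, 180, 185, 190, 195, 198, 205, 210]
with_cvd = [205, 215, 222, 230, 236, 240, 248, 260]

density_without = gaussian_kde(without_cvd, bandwidth=10.0)
density_with = gaussian_kde(with_cvd, bandwidth=10.0)

# Evaluate both curves on a shared grid so they can be compared point by point.
for x in range(160, 281, 20):
    print(f"{x:>4}  {density_without(x):.4f}  {density_with(x):.4f}")
```

Plotting the two columns against the grid would give the overlaid curves of a density plot; where one curve sits above the other, that group is more concentrated at that value.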


A Neural Network in 11 lines of Python (Part 1)

I learn best with toy code that I can play with. This tutorial teaches backpropagation via a very simple toy example, a short Python implementation.
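As a rough flavour of what such toy code can look like, the sketch below trains a single sigmoid layer with gradient descent, using the chain rule that backpropagation is built on. It is not the tutorial's implementation: it uses only the standard library, and the dataset, starting weights, learning rate of 1.0, and iteration count are all assumptions picked for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset: the target equals the first input column; the third column acts as a bias.
inputs = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
targets = [0, 0, 1, 1]

weights = [0.1, -0.2, 0.05]  # arbitrary small starting values

for _ in range(10000):
    gradients = [0.0, 0.0, 0.0]
    for x, y in zip(inputs, targets):
        out = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        # Chain rule: derivative of 0.5 * (out - y)**2 with respect to the pre-activation.
        delta = (out - y) * out * (1.0 - out)
        for j, xi in enumerate(x):
            gradients[j] += delta * xi
    # Full-batch gradient descent step with a learning rate of 1.0.
    weights = [w - 1.0 * g for w, g in zip(weights, gradients)]

predictions = [sigmoid(sum(w * xi for w, xi in zip(weights, x))) for x in inputs]
print([round(p, 2) for p in predictions])
```

With a hidden layer added, the same `delta` term would be propagated one layer further back, which is where the name backpropagation comes from.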


R and Impala: it’s better to KISS than using Java

Interacting with Impala from R is pretty straightforward: just install and load the RImpala package, which uses the JDBC driver to communicate with Impala. It does the job very well for fetching aggregated data from the database, but gets extremely slow when loading more than a thousand or so rows, a problem you cannot resolve by throwing more hardware at it.


Instrumental Variables

We all ‘know’ that correlation does not imply causation, and that unmeasured and unknown factors can confound a seemingly obvious inference. But who has not been tempted by the seductive quality of strong correlations?