Named Entity Recognition for Twitter Microposts using Distributed Word Representations

A simple feedforward neural network for Named Entity Recognition (10 entity types) that takes as input word2vec embeddings trained on 400 million tweets, developed as part of the ACL W-NUT 2015 NER shared task paper.

Essentials of Machine Learning Algorithms (with Python and R Codes)

We are probably living in the most defining period of human history: the period when computing moved from large mainframes to PCs to the cloud. But what makes it defining is not what has happened, but what is coming our way in the years to come. What makes this period exciting for someone like me is the democratization of tools and techniques that followed the boost in computing. Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars per hour. But reaching here wasn’t easy! I had my dark days and nights.

Data Science for Competitive Intelligence

Competitive Intelligence (CI) is the process of collecting, aggregating and analyzing external data for the benefit of a company. A good introduction to the subject can be found in Competitive Intelligence Advantage: How to Minimize Risk, Avoid Surprises, and Grow Your Business in a Changing World. I particularly appreciate the use cases showing the distinction between competitive intelligence and competitor analysis. The below image provides a good view of CI.

Tutorial: How to establish quality and correctness of classification models? Part 5 – Lift curve

In this last part of the tutorial we will discuss the LIFT curve.
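
For readers new to the term, lift compares the positive rate among the top-scored cases with the overall positive rate. Below is a minimal Python sketch, assuming binary labels (0/1) and classifier scores. The function name `lift_curve` and the equal-sized binning are illustrative assumptions, not taken from the tutorial.

```python
def lift_curve(y_true, scores, n_bins=10):
    """Cumulative lift at n_bins equally spaced cut-offs (hypothetical helper).

    y_true: list of 0/1 labels; scores: list of model scores (higher = more
    likely positive). Returns a list of n_bins lift values.
    """
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("y_true and scores must be non-empty and equal length")
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("lift is undefined when there are no positives")

    # Rank the labels by descending model score.
    ranked = [y for _, y in sorted(zip(scores, y_true),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    base_rate = total_pos / n  # positive rate with no model at all

    lifts = []
    for b in range(1, n_bins + 1):
        k = max(1, (b * n) // n_bins)      # size of the top-scored group
        rate_in_top_k = sum(ranked[:k]) / k
        lifts.append(rate_in_top_k / base_rate)
    return lifts


labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lifts = lift_curve(labels, model_scores, n_bins=5)
```

A lift above 1 for the first cut-offs means the model concentrates positives at the top of its ranking. At the final cut-off, which covers all cases, the lift is 1 by construction.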

38 Seminal Articles Every Data Scientist Should Read

Here is a selection containing both external and internal papers, focusing on various technical aspects of data science and big data. Feel free to add your favorites.

Semantic Data Modeling For Fun and Profit

… This is where it’s useful to start pulling out a bit of RDF for modeling. Turtle is a very terse notation for expressing triple assertions, and it’s frequently a useful way to test a model before getting deep into the logical guts of that model. For instance, I will typically identify a generic Entity class that contains useful descriptor properties, then build up from there. …

Analyzing Flight Data: A Gentle Introduction to GraphX in Spark

Graphs are probably one of the coolest representations of information. And they’re not just your typical line chart (although those do qualify in a sense); these are networks of interconnected vertices. What’s so cool about networks is that they represent a lot of both tangible and abstract things. Take a social network (a common example of a graph). Each person in a social network is connected to others by some sort of relationship, be it a friendship or a marriage. That’s a pretty simple one; now let’s think a bit more abstractly.

Frequentism and Bayesianism V: Model Selection

Here I am going to dive into an important topic that I’ve not yet covered: model selection. We will take a look at this from both a frequentist and Bayesian standpoint, and along the way gain some more insight into the fundamental philosophical divide between frequentist and Bayesian methods, and the practical consequences of this divide.

The neural networks behind Google Voice transcription

Over the past several years, deep learning has shown remarkable success on some of the world’s most difficult computer science challenges, from image classification and captioning to translation to model visualization techniques. Recently we announced improvements to Google Voice transcription using Long Short-term Memory Recurrent Neural Networks (LSTM RNNs)—yet another place neural networks are improving useful services. We thought we’d give a little more detail on how we did this.

Statistics – Understanding Basic Concepts and Dispersion

In the last post, titled Statistics – Understanding the Levels of Measurement, we saw what variables are and how we measure them based on the different levels of measurement. In this post, we will talk about some of the basic concepts that are important to get started with statistics and then dive deep into the concept of dispersion.

R: The Keep Function

Occasionally when I am jotting some code I find myself creating several temporary variables with the intention of later getting rid of them. These variables involve quick names that are defined in a local scope and get quite confusing out of their context. If the project expands, however, I find myself with three options for playing with the next level of the project: remember what I’ve already used and try to avoid it, rename all of the temporary variables (e.g. rewrite the code of the base level of the project), or wipe the variables for use later. This decision is usually made based on how the project is going. If the project is going well, I’ll go back and dutifully rewrite the initial code to track variables in a more unique way. If the project is still a bit shaky, I will clear the variable names that I tend to use as temporary variables and keep exploring. Remembering variable names never turns out well for me; I inevitably forget that a variable was defined in a previous section, use it thinking I had redefined it (when I didn’t), and wonder at the strange results I get.

Turning your R (or Python) models into APIs

More and more real-world systems are relying on data science and analytical models to deliver sophisticated functionality or improved user experiences. For example, Microsoft combined the power of advanced predictive models and web services to develop the real-time voice translation feature in Skype. Facebook and Google continuously improve their deep learning models for better face recognition features in their photo service. Some have characterised this trend as a shift from Software-as-a-Service (SaaS) to an era of Models-as-a-Service (MaaS). These models are often written in statistical programming languages (e.g., R, Python), which are especially well suited to analytical tasks. With analytical models playing an increasingly important role in real-world systems, and with more models being developed in R and Python, we need powerful ways of turning these models into APIs for others to consume.

How do you know if Your Data has Signal?

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what the true causes or predictors of the phenomenon you are interested in are, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post. In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal.
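
The excerpt does not spell out the article's heuristic, so the following is not that method. It is a generic permutation test, sketched in Python, that illustrates the underlying idea: a variable has signal if its association with the outcome is stronger than what shuffled outcomes typically produce. The function names, the use of Pearson correlation as the score, and the default parameters are all assumptions made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=1000, seed=0):
    """Estimate how often shuffled outcomes match or beat the observed |r|.

    A small value suggests x carries signal about y; a large value suggests
    the observed association is indistinguishable from noise.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)          # break any real x-y relationship
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_permutations + 1)


x_signal = list(range(30))
y_outcome = [2 * v + 1 for v in x_signal]   # outcome fully determined by x
p_signal = permutation_p_value(x_signal, y_outcome)

x_constant = [1] * 10                        # a variable that cannot carry signal
p_constant = permutation_p_value(x_constant, list(range(10)), n_permutations=50)
```

When screening many variables this way, some will pass the threshold by chance alone, so the threshold should be chosen with the number of candidate variables in mind.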