Sentiment Analysis: Going Beyond Positive and Negative
At AYLIEN, we gathered 11 million+ tweets mentioning ‘Apple’, ‘iPhone’, ‘iOS’, ‘iPad’, ‘Mac’, ‘iPod’, ‘Macbook’, ‘iCloud’, ‘OS X’, ‘iWatch’ and ‘#AppleLive’ from the 4th of September to the 10th of September, with a view to analyzing the tweets and gaining insight into the voice of Apple followers.
Ask a Data Scientist: Confounding Variables
This week’s question is from a reader who asks for an explanation of confounding variables and why they’re important in data science projects.
Data science without statistics is possible, even desirable
The purpose of this article is to clarify a few misconceptions about data and statistical science.
1. Data science heavily uses new statistical science
2. Data science uses a bit of old statistical science
3. Data science uses some operations research statistical science
4. Data science does not use most old statistical science
5 basic rules of data organization
The 5 rules, in 1999
1. All data must be numeric
2. Each variable must occupy the same location for each case
3. All codes for all variables must be mutually exclusive
4. Each variable should contain maximum information
5. For each case, there should be a numeric code for every variable
Updated 5 rules, 2014
1. Most data is not numeric, and not formatted (raw text such as user reviews)
2. Fixed-length format is now obsolete: it uses far too much space when a field varies in size between 0 and 100 kilobytes, or could be an image. Besides, using spaces as field separators is a terrible idea, as field values themselves contain spaces.
3. Non-exclusive codes provide for a richer data set. Should someone be either Asian or Hispanic? Why not both?
4. Don’t ask for or collect overly detailed information, because of privacy issues. Better to have 3 age groups than to ask for the exact age (people will lie nowadays). Also, any survey question should allow for answers such as “not available”, “other”, “do not want to answer”.
5. Numeric codes are not a good idea when the number of potential cases is very large (big data): you may run out of space if you use too few bytes to store them. They also make clustering of cases more difficult. Use a good tagging system instead; text-based tagging or coding works well and will help business analysts work more efficiently.
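As a rough illustration of updated rules 3 to 5, here is a minimal Python sketch of a survey record that allows non-exclusive text tags, coarse age groups, and explicit "no answer" options. The class name, field names, and the three age bands are assumptions made up for this example; they do not come from the article.

```python
from dataclasses import dataclass, field

# Hypothetical answer options: three coarse age groups plus explicit
# non-answers, in the spirit of updated rule 4.
AGE_GROUPS = (
    "under 30",
    "30 to 59",
    "60 and over",
    "not available",
    "do not want to answer",
)


@dataclass
class Respondent:
    review: str                                  # raw, unformatted text (rule 1)
    ethnicity: set = field(default_factory=set)  # non-exclusive text tags (rules 3 and 5)
    age_group: str = "not available"             # coarse group, not exact age (rule 4)

    def __post_init__(self):
        # Reject anything that is not one of the allowed answer options.
        if self.age_group not in AGE_GROUPS:
            raise ValueError(f"unknown age group: {self.age_group!r}")


person = Respondent(
    review="Fast delivery, would buy again.",
    ethnicity={"Asian", "Hispanic"},  # both tags at once, not either/or
    age_group="30 to 59",
)
```

Storing tags as a set of strings means a respondent can carry any number of labels without the codes having to be mutually exclusive.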
GraphLab: Getting started with text analytics
So how do data scientists get started with text data? Regardless of the ultimate goal, the first step in text processing is typically feature engineering …
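One of the simplest forms of feature engineering for text is a bag-of-words count. The sketch below uses only the Python standard library and is not the GraphLab API; the function names and the sample review are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(tokenize(text))


review = "Great phone. Great battery, but the screen scratches."
features = bag_of_words(review)
```

Here `features["great"]` is 2 and every other token has a count of 1; word order is discarded, which is the defining trade-off of this representation.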
A “did you mean?” Feature for R
Most search engines have a “did you mean?” feature, where suggestions are given in the presence of likely typos. And while search engines use sophisticated NLP methods on their vast amounts of user-generated data to create accurate suggestions, you can get by with some ancient spellchecker techniques. So a little while ago, I did just that with the Rdym package for R.
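To give a feel for how far simple string matching can go, here is a minimal Python sketch built on the standard library's `difflib.get_close_matches`. It is not the Rdym package and may not match the technique that package uses; the function name `did_you_mean` and the vocabulary are assumptions for this example.

```python
import difflib


def did_you_mean(word, vocabulary, n=3, cutoff=0.6):
    """Suggest up to n vocabulary entries that closely resemble word.

    Returns an empty list when the word is already known.
    """
    if word in vocabulary:
        return []
    # get_close_matches ranks candidates by a similarity ratio in [0, 1]
    # and keeps only those scoring at least `cutoff`.
    return difflib.get_close_matches(word, vocabulary, n=n, cutoff=cutoff)


known_names = ["print", "paste", "plot", "predict", "summary"]
suggestions = did_you_mean("sumary", known_names)
```

With this vocabulary, the mistyped `sumary` yields `["summary"]`. Raising `cutoff` makes suggestions stricter; lowering it makes them more permissive.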
Spatial visualization with R – Tutorial
Visualizing spatial data is one of the most popular applications of R. This tutorial is an introduction to the visualization of spatial data and map making in R.
LDA on Ferguson Grand Jury I
I spent a few hours learning about LDA (Latent Dirichlet Allocation) using a package called Mallet. The Mallet machine learning package provides an interface to the Java implementation of latent Dirichlet allocation. To process a text file with Mallet, though, a stop list of words is required in the same path. Once again, I benefit from the fact that quite a few such lists are available on the internet, typically containing unimportant words and tag marks, so that the algorithm can be instructed to skip them.
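The role of a stop list can be sketched in a few lines of Python. This is not Mallet's own code or file handling; it assumes a stop list with one word per line, and the function names and sample text are made up for the example.

```python
def load_stop_words(text):
    """Parse a stop list given as one word per line into a lowercase set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


# A tiny illustrative stop list; real lists contain hundreds of entries.
stop_list_text = "the\nof\nand\na\nto\n"
stop_words = load_stop_words(stop_list_text)

tokens = "the testimony of the witness".split()
kept = remove_stop_words(tokens, stop_words)
```

After filtering, `kept` holds only `testimony` and `witness`, so a topic model would see the content-bearing words rather than the filler.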
Tim Bowles on multivariate stats with vegan
Tim Bowles gave this presentation on the vegan package to the Davis R Users’ Group. The screencast and slides are below. You can also download Tim’s RStudio project with all the code, data, figures, and slides presented here. Thanks to Tim for a great session!
Predictive modelling fun with the caret package
I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it made it seem to do predictive modelling that was a little more than just the basics.
Identifying Position Change Groupings in Rank Ordered Lists
The title says it all, doesn’t it?! Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.
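One possible reading of "position change groupings" can be sketched in Python: between two consecutive orderings, find each contiguous block of positions whose occupants were reshuffled among themselves. This is an assumption about what a grouping means, not necessarily the approach taken in the original post, and the driver codes below are invented.

```python
def change_groups(before, after):
    """Return blocks of adjacent positions whose occupants swapped among themselves.

    `before` and `after` list the same items in rank order. Each returned
    group is given in its `after` order.
    """
    groups = []
    seen_before, seen_after = set(), set()
    start = 0
    for i, (b, a) in enumerate(zip(before, after)):
        seen_before.add(b)
        seen_after.add(a)
        # The block closes once both orderings hold the same set of items.
        if seen_before == seen_after:
            if i > start:  # a block of one item means no change
                groups.append(after[start:i + 1])
            seen_before.clear()
            seen_after.clear()
            start = i + 1
    return groups


lap_1 = ["HAM", "ROS", "VET", "ALO", "BOT"]
lap_2 = ["ROS", "HAM", "VET", "BOT", "ALO"]
groups = change_groups(lap_1, lap_2)
```

For these two laps the result is two groups, `["ROS", "HAM"]` and `["BOT", "ALO"]`, while the unchanged third place is left out.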
Five Big Data Trends for 2015
1. A Connected Future: The Internet of Things Taking Off
2. A Shift Towards Data-Driven Cultures
3. Owning Up to Your Own Identity – Claiming Your Personal Data
4. Big Data Security Analytics Gaining Traction
5. Time to Experiment with Data Lakes