Data Scientists: Making Sense of Big Data
Currently, the data scientist has one key priority: explaining the applications and benefits of Big Data to the enterprise. This article contains insights from an HR consultant, an analyst, and a researcher on the challenge of grasping the fundamentals of Big Data and communicating them to the organization. See more at: http://www.news-sap.com/data-scientists-making-sense-big-data/
Tableau 9.0 Connects Directly to R Data Files
Tableau 9.0 will be released soon. Tableau 8 already integrates with some R functionality, but 9.0 goes further and allows a direct connection to R data files. Tableau continues to remove friction between itself and R, which further supports its strong Gartner position.
Data Integrity – A Sequence of Words Lost in the World of Big Data
The subject of this blog post might seem rudimentary to those who fully understand the importance of managing data properly. If you are one of them, hopefully you will still find the post worth reading and will add constructive feedback to the discussion. The purpose of this post is to highlight the need to keep data clean and orderly so that the results of analysis are reliable: if data integrity is intact, the information derived from the data will be trustworthy and actionable.
LDA automatically assigns topics to text documents. How is it done? What are its limitations? What is the best open-source library to use in your code?
In this post we’re going to describe how topics can be automatically assigned to text documents; this process is named, unsurprisingly, topic-modelling. It works like fuzzy (or soft) clustering since it’s not a strict categorisation of the document to its dominant topic.
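To make the "soft clustering" idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, using only the Python standard library. It is a toy illustration, not the method or library the linked post recommends: the function name lda_gibbs, the hyperparameter values, and the four tiny documents are all assumptions made up for this example. The point is the shape of the output: each document gets a proportion for every topic rather than a single label.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper, illustration only).

    docs is a list of token lists. Returns one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # tokens per (document, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1

                # Resample the topic in proportion to
                # (topic's weight in this document) * (word's weight in the topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions: the "soft" assignment.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


# Made-up documents, already tokenised.
docs = [
    ["goal", "match", "team", "goal"],
    ["team", "coach", "match", "league"],
    ["election", "vote", "party", "vote"],
    ["party", "election", "team", "vote"],
]
theta = lda_gibbs(docs, num_topics=2)
for doc, proportions in zip(docs, theta):
    print(doc, [round(p, 2) for p in proportions])
```

Each printed row sums to 1, so a document that mixes vocabulary from both themes can carry weight on both topics instead of being forced into one. On a corpus this small the exact numbers are not meaningful.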
A Word is Worth a Thousand Vectors
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to recommend the maternity line when a customer says she’s in her ‘third trimester’, identify a medical professional when she writes that she ‘used to wear scrubs to work’, and distill ‘taking a trip’ into a Fix for vacation clothing. Word vectors (also referred to as distributed representations) are an amazing alternative that sweeps away most of the issues of dealing with NLP. They let us ignore the difficult-to-understand grammar and syntax of language while retaining the ability to ask and answer simple questions about a text. The goal of this post is to be a motivating introduction to word vectors and to demonstrate their real-world utility.
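As a rough illustration of how word vectors answer "simple questions about a text", the sketch below ranks words by cosine similarity. The three-dimensional vectors are invented by hand for this example; real word vectors are learned from large amounts of text and have far more dimensions. The names toy_vectors, cosine_similarity and most_similar are likewise assumptions, not anything taken from the Stitch Fix post.

```python
import math

# Hand-made, purely illustrative vectors (NOT learned from data).
toy_vectors = {
    "trimester": [0.9, 0.1, 0.0],
    "maternity": [0.8, 0.2, 0.1],
    "scrubs":    [0.1, 0.9, 0.0],
    "nurse":     [0.2, 0.8, 0.1],
    "vacation":  [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Other words in the vocabulary, ranked by similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("trimester", toy_vectors):
    print(f"{other:10s} {score:.3f}")
```

With these made-up numbers, "maternity" ranks closest to "trimester" and "nurse" ranks closest to "scrubs", which is the kind of association the post describes, reached without parsing grammar or syntax.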
R/dplyr: Extracting Data Frame Column Value for Filtering With %in%
I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step. …
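The linked post works in R with dplyr and its code is not reproduced here. As a loose analogue in plain Python, the sketch below shows the same two-step idea: pull the values out of one column, then keep only the rows of another table whose value is in that set. The tables, column names and helper functions are all hypothetical.

```python
# Hypothetical tables, represented as lists of dicts (one dict per row).
teams = [
    {"team": "red", "active": True},
    {"team": "blue", "active": False},
    {"team": "green", "active": True},
]
people = [
    {"name": "Ann", "team": "red"},
    {"name": "Bob", "team": "blue"},
    {"name": "Cho", "team": "green"},
    {"name": "Dee", "team": "blue"},
]


def column_values(rows, column):
    """Extract the distinct values of one column as a set."""
    return {row[column] for row in rows}


def filter_in(rows, column, allowed):
    """Keep rows whose `column` value is in `allowed` (the role %in% plays in R)."""
    return [row for row in rows if row[column] in allowed]


# Step 1: extract the column values to filter by.
active_teams = column_values([t for t in teams if t["active"]], "team")

# Step 2: use them in a later filtering step.
active_people = filter_in(people, "team", active_teams)
print([p["name"] for p in active_people])
```

Extracting the values into a plain set first, rather than passing a whole table to the filter, is the step the post's title refers to.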
Machine Learning Table of Elements Decoded
Machine learning packages for Python, Java, Big Data, Lua/JS/Clojure, Scala, C/C++, CV/NLP, and R/Julia are represented using a cute but ill-fitting metaphor of a periodic table. We extract the useful links.
NYC Motor Vehicle Collisions – Street-Level Heat Map
In this post I will extend a previous analysis that created a borough-level heat map of NYC motor vehicle collisions. The data is from NYC Open Data. In particular, I will go from borough-level to street-level collisions. The code is very similar to that of the previous analysis, with a few more functions that map streets to colors. Below, I load the ggmap package and the data, and keep only collisions with longitude and latitude information.
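The original analysis is written in R with ggmap. As a hedged sketch of the filtering step only, the Python below keeps the rows that have both a latitude and a longitude. The sample rows are invented for illustration, and the column names (BOROUGH, ON STREET NAME, LATITUDE, LONGITUDE) are assumptions about the file layout, so check them against the actual download.

```python
import csv
import io

# Invented sample rows; a real run would read the file from NYC Open Data.
SAMPLE_CSV = """BOROUGH,ON STREET NAME,LATITUDE,LONGITUDE
MANHATTAN,BROADWAY,40.7580,-73.9855
BROOKLYN,ATLANTIC AVENUE,,
QUEENS,QUEENS BOULEVARD,40.7282,-73.8317
BRONX,GRAND CONCOURSE,40.8448,
"""


def rows_with_coordinates(csv_text, lat_col="LATITUDE", lon_col="LONGITUDE"):
    """Return only the rows where both coordinate columns are filled in."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        lat = row[lat_col].strip()
        lon = row[lon_col].strip()
        if lat and lon:
            # Convert the coordinates to floats so they can be plotted later.
            kept.append({**row, lat_col: float(lat), lon_col: float(lon)})
    return kept


located = rows_with_coordinates(SAMPLE_CSV)
print(len(located), "of 4 sample rows have coordinates")
```

Rows missing either coordinate are dropped because they cannot be placed on a street-level map.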