The Stanford Natural Language Processing Group – Software
The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs. These packages are widely used in industry, academia, and government.
7 Skills/Attitudes to Become a Better Data Scientist
1. Cultivate ‘intellectual curiosity’
2. Have a solid understanding of the business
3. Have good communication skills
4. Know more than one programming language for data analysis
5. Know SQL and relational databases
6. Participate in competitions
7. Stay up to date by reading books, blogs, news, journals, and MOOCs, and by listening to podcasts
TrendVis is a plotting package that uses matplotlib to create information-dense, sparkline-like, quantitative visualizations of multiple disparate data sets in a common plot area against a common variable. This plot type is particularly well-suited for time-series data. We discuss the rationale behind and the challenges associated with adapting matplotlib to this particular plot style, the TrendVis API and architecture, and various features available for users to customize and enhance the readability of their figures while walking through a sample workflow.
Generation of Handwritten Text Based on Recurrent Neural Networks
Type a message into the text box, and the network will try to write it out longhand (this paper explains how it works). Be patient, it can take a while!
CrowdFlower Winners’ Interview: 3rd place, Team Quartet
The goal of the CrowdFlower Search Results Relevance competition was to come up with a machine learning algorithm that can automatically evaluate the quality of the search engine of an e-commerce site. Given a query (e.g. ‘tennis shoes’) and an ensuing result (‘adidas running shoes’), the goal is to score the result on relevance, from 1 (least relevant) to 4 (most relevant). To train the algorithm, teams had access to a set of 10,000 query/result pairs that were manually labeled by CrowdFlower using the aforementioned classes.
Count data: To Log or Not To Log
Count data are widely collected in ecology, for example when one counts the number of birds or the number of flowers. These data naturally follow a Poisson or negative binomial distribution and are therefore sometimes tricky to fit with standard linear models (LMs). A traditional approach has been to log-transform such data and then fit LMs to the transformed data. Recently a paper advocated against the use of such a transformation, since it led to high bias in the estimated coefficients. More recently another paper argued that log-transformation of count data followed by an LM led to a lower type I error rate (i.e. saying that an effect is significant when it is not) than generalized linear models (GLMs). What should we do then?
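To make the two approaches concrete, here is a minimal sketch in Python that fits both to the same simulated counts. It is not taken from the linked post or from either paper: the simulated data, the `log(y + 1)` transform, the true coefficients, and the function names `fit_log_lm` and `fit_poisson_glm` are all assumptions chosen for illustration. The Poisson GLM is fitted with a hand-written iteratively reweighted least squares loop so that only NumPy is needed.

```python
import numpy as np


def fit_log_lm(x, y):
    """Ordinary least squares on log(y + 1); returns [intercept, slope]."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)
    return beta


def fit_poisson_glm(x, y, n_iter=25):
    """Poisson regression with a log link, fitted by IRLS; returns [intercept, slope]."""
    X = np.column_stack([np.ones_like(x), x])
    # Start from an intercept-only model at the log of the sample mean.
    beta = np.array([np.log(y.mean() + 1e-8), 0.0])
    for _ in range(n_iter):
        eta = X @ beta              # linear predictor
        mu = np.exp(eta)            # fitted means
        z = eta + (y - mu) / mu     # working response
        w = mu                      # IRLS weights for Poisson with log link
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta


# Simulated counts (illustrative only): log-mean is linear in x.
rng = np.random.default_rng(0)
true_intercept, true_slope = 0.5, 1.0
x = rng.uniform(0.0, 1.0, 2000)
y = rng.poisson(np.exp(true_intercept + true_slope * x)).astype(float)

lm_beta = fit_log_lm(x, y)
glm_beta = fit_poisson_glm(x, y)

print("true slope:         ", true_slope)
print("log-LM slope:       ", lm_beta[1])
print("Poisson GLM slope:  ", glm_beta[1])
```

Comparing the two printed slopes against the true value gives a feel for the coefficient bias discussed above. Assessing the type I error argument would need a different experiment: simulate many data sets with a true slope of zero and count how often each method reports a significant effect.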
Doodling With 3d Animated Charts in R
Doodling with some Gapminder data on child mortality and GDP per capita in PPP$, I wondered whether a 3d plot of the data over time would show different trajectories for different countries, perhaps revealing different development pathways. Here are a couple of quick sketches, generated using R (this is the first time I’ve tried to play with 3d plots…)