Course Announcement: 36-401, Modern Regression, Fall 2015

This course is the first one in our undergraduate sequence where the students have to bring together probability, statistical theory, and analysis of actual data. I have mixed feelings about doing this through linear models. On the one hand, my experience of applied problems is that there are really very few situations where the ‘usual’ linear model assumptions can be maintained in good conscience. On the other hand, I suspect it is usually easier to teach people the more general ideas if they’ve thoroughly learned a concrete special case first; and, perhaps more importantly, whatever the merits of (e.g.) Box-Cox transformations might actually be, it’s the sort of thing people will expect statistics majors to know…

Free Resources for Beginners on Deep Learning and Neural Network

Machines have already started their march towards artificial intelligence. Deep Learning and Neural Networks are probably the hottest topics in machine learning research today. Companies like Google, Facebook and Baidu are investing heavily in this field of research. Researchers believe that machine learning will strongly influence human life in the near future. Human tasks will be automated using robots with a negligible margin of error. I’m sure many of us would never have imagined machine learning having such gigantic power. To ignite your desire, I’ve listed the best tutorials on Deep Learning and Neural Networks available on the internet today. I’m sure this will be of help to you! Take your first step today.

Topic Modeling of Twitter Followers

In this post, we explore LDA, an unsupervised topic modeling method, in the context of Twitter timelines. Given a Twitter account, is it possible to find out what subjects its followers are tweeting about? Knowing the evolution or the segmentation of an account’s followers can give a marketing department actionable insights into the near-real-time concerns of existing or potential customers. Carrying out topic analysis of the followers of politicians can produce a view complementary to opinion polls. The goal of this post is to explore my own followers, 698 at the time of writing, and find out what they are tweeting about through topic modeling of their timelines.

Introducing agate: a Better Data Analysis Library for Journalists

In greater depth, agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.

Environmental Monitoring using Big Data

In this post, I will cover in depth a Big Data use case: monitoring and forecasting air pollution. A typical Big Data use case in the modern Enterprise includes the collection and storage of sensor data, executing data analytics at scale, generating forecasts, creating visualization portals, and automatically raising alerts in the case of abnormal deviations or threshold breaches. This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using Axibase Time-Series Database and the R language.

Applications of Chi-Square Tests

This morning, in our mathematical statistics class, we’ve seen the use of the chi-square test. The first one was a goodness-of-fit test for a multinomial distribution.
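
As a rough illustration of the goodness-of-fit case mentioned above, here is a minimal sketch of Pearson's chi-square statistic for multinomial counts. The function name `chi_square_gof` and the die-roll counts are hypothetical and chosen only for this example; they do not come from the original post.

```python
def chi_square_gof(observed, probs):
    """Pearson chi-square statistic for a multinomial goodness-of-fit test.

    observed: list of category counts
    probs:    hypothesized category probabilities (should sum to 1)
    Returns (statistic, degrees_of_freedom).
    """
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    n = sum(observed)
    # Expected count in each category under the hypothesized distribution
    expected = [n * p for p in probs]
    # Sum of (observed - expected)^2 / expected over the categories
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # No parameters are estimated from the data, so df = categories - 1
    df = len(observed) - 1
    return stat, df


# Hypothetical data: 60 rolls of a die, tested against a fair die
observed = [8, 9, 12, 11, 6, 14]
probs = [1 / 6] * 6
stat, df = chi_square_gof(observed, probs)
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
```

For these made-up counts the statistic is 4.2 on 5 degrees of freedom, which is below the usual 5% critical value of about 11.07, so the fair-die hypothesis would not be rejected.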

USFD2: Annotating Temporal Expresions and TLINKs for TempEval-2

We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors temporal expressions, and a maximum entropy classifier assigns temporal link labels, based on features that include descriptions of associated temporal signal words. USFD2 identified temporal expressions successfully, and correctly classified their type in 90% of cases. Determining the relation between an event and time expression in the same sentence was performed at 63% accuracy, the second highest score in this part of the challenge.

Understanding the Bayesian approach to false discovery rates (using baseball statistics)

In my last few posts, I’ve been exploring how to perform estimation of batting averages, as a way to demonstrate empirical Bayesian methods. We’ve been able to construct both point estimates and credible intervals based on each player’s batting performance, while taking into account that we have more information about some players than others. But sometimes, rather than estimating a value, we’re looking to answer a yes-or-no question about each hypothesis, and thus classify them into two groups. For example, suppose we were constructing a Hall of Fame, where we wanted to include all players that have a batting probability (chance of getting a hit) greater than .300. We want to include as many players as we can, but we need to be sure that each belongs. In the case of baseball, this is just for illustration: in real life, there are a lot of other, better metrics to judge a player by! But the problem of hypothesis testing appears whenever we’re trying to identify candidates for future study. We need a principled approach to decide which players are worth including, one that also handles multiple testing problems. (Are we sure that any players actually have a batting probability above .300? Or did a few players just get lucky?) To solve this, we’re going to apply a Bayesian approach to a method usually associated with frequentist statistics, namely false discovery rate control. This approach is very useful outside of baseball, and even outside of beta/binomial problems. We could be asking which genes in an organism are related to a disease, which answers to a survey have changed over time, or which counties have an unusually high incidence of a disease. Knowing how to work with posterior predictions for many individuals, and come up with a set of candidates for further study, is an essential skill in data science.