Statistics@Springer

Infographic with Links

GitXiv: Text Understanding from Scratch

Character-level text classification using convolutional networks

Get Knowledge from Best Ever Data Science Discussions on Reddit

There is a ton of useful knowledge on Reddit that you can learn from, as anyone who is active there will agree. Hence, in this article, we have summarized some of the best discussions related to Machine Learning, Deep Learning, Neural Networks, Artificial Intelligence, Python, R Programming, Big Data and Statistics. I hope you benefit from it, even if you don’t follow Reddit religiously or haven’t found your place in the community.

Book Review: Practical Text Analytics

Text analytics has recently become a useful adjunct to other machine learning methods and is of great interest to data scientists and big data engineers alike. In “Practical Text Analytics: Interpreting text and unstructured data for business intelligence,” Dr. Steven Struhl provides a timely and lucid discussion of the topic, highlighting the fundamental issues involved in preparing, analyzing, and presenting textual data for meaningful interpretation. The book is a relevant contribution to the field and should interest a wide range of audiences.

Getting Started with Spark DataFrames

Spark is a really awesome tool for easily doing distributed computations to process large-scale data. To be honest, most people probably don’t need Spark for their own side projects – most of that data will fit in memory or work well in a traditional database like PostgreSQL. That being said, there is a good chance you will need Spark if you do data science work for your job: a lot of companies have a tremendous amount of data, and Spark is a great tool for processing it effectively. In case you are not familiar with the MapReduce structure, here is a very brief introduction. Spark is based on the MapReduce paradigm but makes some nice improvements over Hadoop, the original open-source implementation. A few of these improvements include the ability to cache data in memory, a simpler API supported in multiple languages (Scala, Python, and Java, I believe), and some really nice libraries – including machine learning and SQL libraries. In my opinion, these additions make Spark a powerful tool with a relatively easy learning curve. My goal today is to show you how to get started with Spark and introduce you to Spark DataFrames.
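The original post may well use PySpark or Scala; purely as an illustration of the DataFrame idea, here is a minimal sketch using SparkR (the R binding), assuming the Spark 1.x sqlContext-style API that was current at the time:

```r
# Minimal SparkR sketch: turn a local R data frame into a Spark DataFrame
# and run a couple of distributed operations on it.
library(SparkR)

sc <- sparkR.init(master = "local")      # start a local Spark context
sqlContext <- sparkRSQL.init(sc)         # SQL context needed for DataFrames

df <- createDataFrame(sqlContext, faithful)   # distribute a built-in dataset
head(df)                                      # peek at the first rows
head(select(df, df$eruptions))                # column selection
head(filter(df, df$waiting < 50))             # row filtering, executed by Spark

sparkR.stop()
```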

Tufte in R

The idea behind Tufte in R is to use R – the most powerful open-source statistical programming language – to replicate the excellent visualisation practices developed by Edward Tufte. It’s not a novel approach – there are plenty of excellent R functions and related packages written by people with much more programming expertise than I have. I simply collect those resources in one place in an accessible and replicable format, adding a few bits of my own coding discoveries.
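For a flavour of what this looks like in practice, here is a minimal sketch of a Tufte-style plot using ggplot2 with theme_tufte() from the ggthemes package (one of the community packages the project draws on); it is an illustration, not code taken from the project itself:

```r
# Tufte-style minimal-ink scatterplot: serif typeface, no ticks,
# no grid, no box -- only the data and the axis labels remain.
library(ggplot2)
library(ggthemes)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 1) +
  theme_tufte(ticks = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```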

What does 90% accuracy mean?

If 100,000 people are given the test and 13 have early-stage pancreatic cancer, about 10 or 11 of the 13 cases will have positive tests, but so will 11,000 healthy people. Of those who test positive, 99.9% will not have pancreatic cancer. This might still be useful, but it’s not what most people would think of as 90% accuracy.
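The arithmetic behind those figures is just the positive predictive value computed from the numbers quoted above (10.5 is the midpoint of the “10 or 11” detected cases):

```r
# Positive predictive value from the numbers quoted above:
# 13 true cases among 100,000 people tested, ~10-11 of them detected,
# and roughly 11,000 healthy people who also test positive.
true_positives  <- 10.5
false_positives <- 11000
ppv <- true_positives / (true_positives + false_positives)
ppv        # ~0.00095: about 0.1% of positive tests are real cases
1 - ppv    # ~0.999:   99.9% of positive tests are false alarms
```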

Microsoft Launches Its First Free Online R Course on edX

Today, Microsoft and DataCamp launched an exciting new course on edX.org covering the basics of the statistical programming language R. This four-week course is free for everyone, and no prior knowledge of programming or data science is required.

A Short Introduction to Bioconductor

One of the keys to R’s success as a software environment for data analysis is the availability of user-contributed packages. Most useRs will be familiar with (and very grateful for) the Comprehensive R Archive Network (CRAN). The packages available on CRAN, nearly 7,000 at last count, cover common data analysis tasks, such as importing data and plotting, through to more specialised ones, such as parsing data from the web, analysing financial time series, or analysing data from clinical trials. What may be less familiar to useRs is another large R package repository and software development project: Bioconductor.
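As a quick taste, here is a minimal sketch of installing and using a Bioconductor package. It uses the biocLite() installer that Bioconductor documented at the time (current releases use BiocManager::install() instead), and the package choice is just an example:

```r
# Install a Bioconductor package with the project's own installer,
# then use it: Biostrings provides classes for biological sequences.
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")

library(Biostrings)
dna <- DNAString("ACGTACGTAA")
reverseComplement(dna)   # sequence manipulation from a Bioconductor package
```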

Achieve Charting Zen With TauCharts

There was some chatter on the twitters this week about a relatively new D3-based charting library called TauCharts (also @taucharts). The API looked pretty clean and robust, so I started working on an htmlwidget for it and was quickly joined by the Widget Master himself, @timelyportfolio. TauCharts definitely has a “grammar of graphics” feel about it, and the default aesthetics are super-nifty. While the developers are actively adding new features and “geoms”, the core points (think scatterplot), lines, and bars (including horizontal bars!) geoms are quite robust and definitely ready for your dashboards. Between the two of us, we have a substantial part of the charting library API covered. I think the only major thing left unimplemented is composite charts (i.e. lines + bars + points on the same chart) and some minor tweaks around the edges.
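From memory of the package README, usage looks roughly like the sketch below; treat the function names (tauchart(), tau_point(), tau_tooltip()) as an assumption and check the hrbrmstr/taucharts README for the current API:

```r
# Rough sketch (API names assumed from the package README, may differ):
# build a TauCharts scatterplot htmlwidget from an R data frame.
# devtools::install_github("hrbrmstr/taucharts")
library(taucharts)

tauchart(mtcars) %>%
  tau_point("wt", "mpg") %>%   # scatterplot geom: x = wt, y = mpg
  tau_tooltip()                # hover tooltips
```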

Patterns for Streaming Realtime Analytics

We presented a tutorial at DEBS 2015 (the 9th ACM International Conference on Distributed Event-Based Systems) describing a set of realtime analytics patterns. What follows is a summary of that tutorial.