“1. The nature of statistics
Statistics is the original computing with data. It is the field that deals with data with the most portability (it isn’t dependent on one type of physical model) and rigor. Statistics can be a pessimal field: statisticians are the masters of anticipating what can go wrong with experiments and what fallacies can be drawn from naive uses of data. Statistics has enough techniques to solve just about any problem, but it also has an inherent conservatism to it.
I often say the best source of good statistical work is bad experiments. If all experiments were well conducted, we wouldn’t need a lot of statistics. However, we live in the real world; most experiments have significant shortcomings and statistics is incredibly valuable.
Another aspect of statistics is it is the only field that really emphasizes the risks of small data. There are many other potential data problems statistics describes well (like Simpson’s paradox), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets. This is actually very important: datasets that are expensive to produce (such as drug trials) are necessarily small.
It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to GIGO). And in some cases big data is promoted as valuable only because it is the cheapest to produce. Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because they are a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross section of search engine users as to their actual intent).
If your business is directly producing truly valuable data (not just producing useful proxy data) you likely have small data issues. If you have any hint of a small data issue, you want to consult with a good statistician.
2. The nature of machine learning
In some sense machine learning rushes where statisticians fear to tread. Machine learning does have some concept of small data issues (such as knowing about over-fitting), but it is an essentially optimistic field.
The goal of machine learning is to create a predictive model that is indistinguishable from a correct model. This is an operational attitude that tends to offend statisticians who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).
My opinion is the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example: Bennett, K. P., & Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265/1281). Good machine learning papers use good optimization techniques and bad machine learning papers (most of them in fact) use bad out of date ad-hoc optimization techniques.
3. The nature of data mining
Data mining is a term that was quite hyped and now somewhat derided. One of the reasons more people use the term “data science” nowadays is they are loath to say “data mining” (though in my opinion the two activities have different goals).
The goal of data mining is to find relations in data, not to necessarily make predictions or come up with explanations. Data mining is often what I call “an x’s only enterprise” (meaning you have many driver or “independent” variables but no pre-ordained outcome or “dependent” variables) and some of the typical goals are clustering, outlier detection and characterization.
There is a sense that when it was called exploratory statistics it was considered boring, but when it was called data mining it was considered sexy. Actual exploratory statistics (as defined by Tukey) is exciting and always an important “get your hands into the data” step of any predictive analytics project.
4. The nature of informatics
Informatics and in particular bioinformatics are very hot terms. A lot of good data scientists (a term I will explain later) come from the bioinformatics field.
Once we separate out the portions of bioinformatics that are in fact statistics and the ones that are in fact biology we are left with data infrastructure and matching algorithms. We have the creation and management of data stores, data bases and design of efficient matching and query algorithms. This isn’t meant to be a left handed compliment: algorithms are a first love of mine and some of the matching algorithms bioinformaticians uses (like online suffix trees) are quite brilliant.
5. The nature of big data
Big data is a white-hot topic. The thing to remember is: it is just the infrastructure (MapReduce, Hadoop, noSQL and so on). It is the platform you perform modeling (or usually just report generation) on top of.
6. The nature of predictive analytics
The Wikipedia defines Predictive analytics as the variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. It is a set of goals and techniques emphasizing making models. It is very close to what is also meant by data science.
I don’t tend to use the term predictive analytics because I come from a probability, simulation, algorithms and machine learning background and not from an analytics background. To my ear analytics is more associated with visualization, reporting and summarization than with modeling. I also try to use the term modeling over prediction (when I remember) as prediction often in non-technical English implies something like forecasting into the future (which is but one modeling task).
7. The nature of data science
The Wikipedia defines data science as a field that incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.
Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production.
8. Conclusion
Machine learning and statistics may be the stars, but data science the whole show.”
John Mount