The blog posts in the category “ArXiv Papers” titled “Whats new on arXiv” are based on the (daily) publications of the arXiv server.

Because the papers fall into a lot of different categories, it makes sense to filter on the specific categories relevant to our context.

The current categories of papers considered for the “ArXiv Papers” blog category are:

stat.AP | Statistics | Applications
stat.ML | Statistics | Machine Learning
stat.ME | Statistics | Methodology
stat.TH | Statistics | Theory
stat.CO | Statistics | Computation
cs.AI | Computer Science | Artificial Intelligence
cs.CL | Computer Science | Computation and Language
cs.CC | Computer Science | Computational Complexity
cs.GT | Computer Science | Computer Science and Game Theory
cs.CV | Computer Science | Computer Vision and Pattern Recognition
cs.DS | Computer Science | Data Structures and Algorithms
cs.DC | Computer Science | Distributed, Parallel, and Cluster Computing
cs.IR | Computer Science | Information Retrieval
cs.IT | Computer Science | Information Theory
cs.LG | Computer Science | Learning
cs.MA | Computer Science | Multiagent Systems
cs.NE | Computer Science | Neural and Evolutionary Computing
cs.NA | Computer Science | Numerical Analysis
nlin.AO | Nonlinear Sciences | Adaptation and Self-Organizing Systems
nlin.CG | Nonlinear Sciences | Cellular Automata and Lattice Gases
math.CO | Mathematics | Combinatorics
math.OC | Mathematics | Optimization and Control
math.PR | Mathematics | Probability
math.ST | Mathematics | Statistics
cond-mat.dis-nn | Physics | Disordered Systems and Neural Networks
physics.data-an | Physics | Data Analysis, Statistics and Probability
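As a minimal sketch of such a filter (not the actual pipeline behind these posts), the following Python snippet keeps only the papers whose category list intersects the selected set. The record layout (`title`, `categories`), the function name `filter_papers` and the sample papers are assumptions made up for illustration, and only a few of the category codes listed above are included.

```python
# Hypothetical subset of the category codes listed above.
SELECTED_CATEGORIES = {"stat.ML", "stat.ME", "cs.AI", "cs.LG", "math.ST"}


def filter_papers(papers, selected=SELECTED_CATEGORIES):
    """Return the papers that carry at least one selected category."""
    return [p for p in papers if selected.intersection(p["categories"])]


# Made-up example records; real arXiv metadata has many more fields.
sample_papers = [
    {"title": "Paper A", "categories": ["stat.ML", "cs.LG"]},
    {"title": "Paper B", "categories": ["hep-th"]},
    {"title": "Paper C", "categories": ["math.ST", "astro-ph.CO"]},
]

kept = filter_papers(sample_papers)
print([p["title"] for p in kept])
```

A paper is kept as soon as one of its categories matches, so cross-listed papers are not lost.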

I go through the resulting list of papers and choose the ones which are of general interest to a broader audience. For those papers the abstract is added; the rest of the papers are just listed by title.

What is the definition of Data Science? The first thing I did was to read as many posts on the topic as I could get my hands on and to look up definitions of related topics such as Data Mining and Machine Learning. Looking at the discussions and posts around Data Science, it seems to span everything needed to understand data, to derive something out of data and to communicate the findings. This does not really help to answer the original question, so let's take a closer look.

Data Science Venn Diagrams

There have been discussions about whether a Data Scientist is somebody who knows everything needed to analyze data or whether this task has to be done by teams of specialists. This can be found in the “Unicorn-Discussion” or the “Venn-Diagram-Thread”. Let's have a look at some Venn Diagrams:

The first two of them may appear similar at first glance, but they actually make totally different statements. In the first one (Data Science Venn Diagram 2.0), for example, mathematics and statistics are a proper subset of Data Science, whereas in the second one Data Science is a proper subset of Mathematics, which is a completely different statement. Another question raised by the first one is whether Data Science really covers all of computer science, mathematics, statistics etc. This does not sound reasonable. The third diagram tries to explain what exactly is meant by mathematics and statistics. It lists specific tasks, properties, questions etc. to specify the areas in the diagram. The most interesting question, what is Data Science, is only partially answered by listing some aspects which are assigned to it. A lot of questions cannot be answered. For example, to which region or set would one assign an algorithm found to predict earthquakes? Is it mathematics, IT skills, business skills, or traditional research?

This could be a starting point to find a comprehensive Venn Diagram and you could start collecting areas to add to the Venn Diagram and to analyze their set relation (e.g. Machine Learning ⊆ Data Science or ∃ T ∈ Data Science \ Machine Learning). A list of areas from which you could learn something to analyze your data could be very long. This is because a good model for predicting earthquakes could also be useful to predict burglaries. There is also the idea of learning on a character basis rather than on a word basis in “Text Understanding from Scratch” inspired by image analysis based on pixels. What about additional techniques to clean and maintain data such as ETL? Where are they to be assigned in a Venn diagram? With thoughts like these, we end up with a large list of topics:

{Actuarial Science; Advanced Analytics; Algebraic Statistics; Analytics; Anticipatory Analytics; Applied Mathematics; Artificial General Intelligence; Artificial Intelligence; Aspect-Oriented Software Development; Big Data; Big Data Analytics; Big Data Management; Business Analytics; Business Intelligence; Cognitive Analytics; Competitive Intelligence; Computational Intelligence; Computational Linguistics; Computer Science; Concept Mining; Condition Monitoring; Data Acquisition; Data Aggregation; Data Analysis; Data Analytics; Data Archaeology; Data Impartment; Data Journalism; Data Driven Journalism; Data Mining; Data Pattern Processing; Data Profiling; Data Science; Data Stream Mining; Data Visualization; Data-to-Decisions; Decision Support System; Decision Theory; Descriptive Statistics; Econometrics; Exploratory Data Analysis; Game Theory; Hybrid Intelligent System; Inferential Statistics; Information Extraction; Information Harvesting; Information Retrieval; Information Visualization; Knowledge Discovery; Knowledge Extraction; Knowledge Management; Knowledge Worker; Machine Learning; Mathematical Statistics; Mathematics; Multivariate Statistics; Neuro-Informatics; Operational Analytics; Operational Intelligence; Operations Research; Optimization; Predictive Analytics; Predictive Analysis; Prescriptive Analytics; Probability Theory; Reporting; Semantic Analysis; Semantic Analytics; Soft Computing; Statistical Decision Theory; Statistical Learning; Statistical Relational Learning; Statistical Theory; Statistics; Stochastic; Visual Analytics; … }

If you would like to see the current, daily growing, detailed list of terms I collect, you can have a look at it at http://advanceddataanalytics.net/what-is/, which currently has a size of |What is …| > 1600.
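The set relations mentioned above can be checked mechanically once areas are written down as sets of topics. The sketch below uses Python's built-in sets; the topics assigned to each area are assumptions chosen purely for illustration, not a claim about how the fields really relate.

```python
# Illustrative, made-up topic assignments (assumptions, not definitions).
machine_learning = {"Neural networks", "SVM", "Boosting"}
data_science = machine_learning | {"ETL", "Data Visualization", "Reporting"}

# Machine Learning ⊆ Data Science
is_subset = machine_learning <= data_science

# ∃ T ∈ Data Science \ Machine Learning
only_data_science = data_science - machine_learning
exists_extra_topic = len(only_data_science) > 0

print(is_subset, exists_extra_topic, sorted(only_data_science))
```

With more than a thousand terms, every such assignment is a judgment call, which is exactly where the Venn diagram approach starts to struggle.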

This should tell us that the Venn-Diagram-Approach is not really suitable for defining Data Science. Even the attempt to find an exact definition might not be reasonable. Artificial Intelligence could be an example of this.

Data Science Aspects

Looking at the discussions and posts in the Data Science community we could use a statement like “Everything needed to derive something out of your data, with whatever tool, algorithm, technique, method or programming language necessary or appropriate to achieve this.” as a starting point to outline Data Science and to try to answer the question, how do I become a Data Scientist?

What is beyond question is that the theoretical language of data analysis is mathematics; that the practical language is, at least for another few decades, programming in whatever dialect (programming language); that the execution platform is the computer; and that the subject of analysis is data, in whatever form and wherever and however it is produced. This holds regardless of whether you assign the processing of a matrix to mathematics, programming or statistics, and with this we should be able to avoid the problems of the Venn diagram approach. The assignment depends on your point of view. The theoretical or abstract description of problems and their solutions is mathematics. The practical side, for example the usage of BLAS, is a computer program. The statistical aspect could be the algorithm which led you to process the matrix, and the subject matter expertise is the meaning of the algorithm and what the entries of the matrix stand for. So the question now reads: what are the aspects of Data Science? Let's summarize the above thoughts in a table:

Aspect | Description | Keywords
Mathematics | The Lingua Franca of the related science | (Linear) algebra, probability theory, statistics, analysis etc.
Algorithms & Abstract Knowledge | Abstract derivations, results and algorithms usually formulated in the language of mathematics. | Neural networks, SVM, distribution, mean, stochastic process, probabilistic graphical model, boosting, data augmentation & training set strategies etc.
Data & Knowledge Representation | The data and the form the knowledge is given or expected by algorithms. | Matrix, table, graph, time series, CSV etc.
Programming & Tools | Programming languages & tools to use to derive findings from data and to implement and deploy the abstract concepts. | R, Python, Julia, C++, Java, SQL etc.; RStudio, GraphLab, DataBase, IMSL, BLAS etc.
Visualization & Conveyance | How to display findings and convey the results of analysis to the target group for which the results are of interest and to document the process and the results. | Diagrams, charts, infographics, reporting, markdown, d3, reproducibility, simulation, animation etc.
Subject Area(s) | The specific business you are trying to generate findings for / of. | CRM, retail, supply chain, logistics, maintenance, social media etc.

(Topics like “Data Quality”, “Performance”, “Efficiency”, “Product Standards”, “Reproducibility” I would treat as the “Quality” of how you take care of the listed aspects / concepts.)

If we put these aspects together graphically, it could look like this “Data Analytics Aspect Diagram”:

The difference from the above Venn diagrams is that we do not have sets representing areas, but icons representing aspects. One important aspect when we want to answer the question “how do I become a Data Scientist?” is the person who wants to become a Data Scientist. This person stands in the middle of all the concepts because this person has to combine the aspects to derive “findings”. Here you might note that I did not use the term Data Scientist, but Data Analyst instead. The reason is that, in my opinion, a scientist produces abstract knowledge, whereas we are mainly talking about people using this kind of knowledge to produce findings, which we could also call content.

So the difference is in the kind of output you produce.

“The process we are interested in is the deployment of useful data driven models into production.” John Mount (April 19, 2013)

The overlapping regions could be interpreted as the knowledge of the person regarding the related aspect, and the Data Analyst represents the need to combine the aspects. Take for example “Non-Negative Matrix Factorization (NMF)”: when you know about this algorithm, you might be able to formulate it mathematically and know about it from an abstract point of view. Next, it is important to really comprehend what you can do with it. That means being able to formulate subject area problems in an abstract way too, and being able to recognize whether a specific problem can be solved using NMF, hence mapping your abstract knowledge to your knowledge about the subject area and thereby combining the aspects. If you can then lay down this knowledge and write a computer program, visualize the problem as well as the findings, and also communicate and document what you found and did, then all aspects have finally come together.
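To make the step from the abstract formulation to a computer program concrete: NMF approximates a non-negative matrix V by a product WH of two smaller non-negative matrices. The following is a minimal sketch in plain Python using the multiplicative update rules of Lee and Seung, which is one common way to compute an NMF and not necessarily the variant you would use in practice. The example matrix, the rank of 2 and the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) are assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V into W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Strictly positive random start values.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


# Made-up example: rows could be customers, columns products (an assumption).
V = [
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
]
W, H = nmf(V, rank=2)
print(round(frobenius_error(V, W, H), 3))
```

Mapping this back to the subject area is the part the code cannot do for you: deciding what the rows, the columns and the two hidden factors mean in your business.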

That said, we can now come back to the original question: how do I become a Data Scientist? Instead, I would like to ask: how do I become a Data Analyst? This depends on where you come from, that is, on your main area of expertise and how much of the Data Analytics aspects in the diagram above you already cover. This determines the best way to start the journey towards Data Science / Data Analytics. For example, if you are a programmer, you have programming knowledge and knowledge about the aspect Data & Knowledge Representation. So you would ideally expand your knowledge by learning some new programming languages like R and Julia, for example how they are used for statistics and how they relate to data formats like graphs, and additionally by looking at tools discussed in the Data Science / Data Analytics communities. From there you might expand your knowledge into the area of algorithms, using data analytics tools, programming languages and more, until you cover enough data analytics aspects to get your hands on the solution of real data analytics issues and questions.

You might start from a strong knowledge of a specific business area. Then you might try to formulate your problems mathematically, or play around with illustrations of your data and produce your own data visualizations using programming. But if it gets difficult, don't hesitate to go into details.

“You do not know how to model: Learn it dude! There is no short-cut to learning. Your organization needs to learn it, even yourself. Leverage every single person of your organization that has any glimpse of experiences in dealing with the data. Combine that quant dude with a domain expert, let them fight and muddle through the journey. The organization needs it. So does you to learn how it brings value exactly to different internal clients.” Jeffrey Ng (December 26, 2014)

Data Scientist vs. Software Developer

Looking again at the “Data Science Venn Diagram v2.0”, where Data Science covers, for example, all of mathematics, the question is: do I have to cover all of the programming languages, algorithms etc.? Let's take another point of view and have a look at the questions “what is software development?” and “how do I become a Software Developer?” Are these questions easy to answer? When is somebody a Software Developer? When he can write code on a specific operating system and in at least one programming language? What about knowledge of user interfaces, databases, other programming languages, or data formats like XML? When is somebody really a Software Developer? The question is not so easy to answer. Some people argue that if you do not have a strong knowledge of software development processes, product standards and guidelines, they would not call you a Software Developer, but maybe a hacker. What we could say is: “The more knowledge one has of the different aspects of software development, the more we would call him a Software Developer”. So if somebody knows how to write a software program in whatever language on whatever platform in whatever area, he might already be a Software Developer. We could now think similarly about Data Science or Data Analytics. There are people in Data Science who are specialized, for example in Retail Analytics, where they only need a subset of the tools and concepts discussed around Data Science, but they are still Data Scientists.

“I prefer somebody who has done ten different things in ten different domains because they will have hopefully learned something new about data from each of different places and domains.” Claudia Perlich

The Languages of Data Science

If we look at the aspects from a “Language” point of view and treat the languages as follows:

Mathematics as the language of abstraction (with dialects like algebra, probability theory etc. and with vocabulary like “AA^{T} = Y” or “P(B|A) = P(B)”),

Programming languages as the language of computers (with dialects like R, Python, Java, Julia etc. and with vocabulary like “for each”, “if” etc.),

Data representations as the language of object or fact representation (like XML, CSV, relational DB, Graph etc.),

Human percipience as the language to convey information to humans (“Story Telling”, “Dashboards” etc.),

Subject matter languages as the language of the business area or the area representing the meaning (with vocabulary like “Cobb-Douglas-Function”, “Revenue” etc.).

Then you could say that a Data Scientist is a person who is able to speak at least one language in each of these language areas and to translate the dialects he knows into one another.

AdvancedDataAnalytics.net ~ Broaden your Horizon

We can now come back again to the original question: how do I become a Data Scientist? The above Data Analytics aspects just give you a guideline for where you can start and what might be the next area to dive into, but they do not tell you what exactly to look at, or even give you a notion of what could be looked at. For example, you might ask: what are exemplary keywords, books, feeds etc. related to Data Science to start an investigation with? What already exists out there related to the Data Science aspects?

For this purpose I created the site “AdvancedDataAnalytics.net”, which contains lists that give you these kinds of entry points to Data Science. Currently there are:

Site | Description | Link
Books | A list of books, book series collections etc. to categories like Analytics, Semantic Analytics, R, Statistics, Aggregation, Mathematics, Visualization, Miscellaneous, Programming, BI |
 | A list of terms, algorithms, topics, keywords, buzzwords etc. including their official abbreviation. One idea here is to enable you to look up abbreviations used in the context of Data Science. Another idea is to provide entry points for algorithms and to good explanations. |
 | A link collection for Use Cases, exemplary Analysis, Resources for Data, Code Snippets, Programming Languages, Visualizations etc. | planned
Tools | A list of tools worth a look at. E.g. RapidMiner, Rattle, Circos, WEKA etc. | planned
R Packages | A list of packages related to entries on other sites, e.g. “What-Is” | planned
etc. | E.g. Videos |

With this at hand you should be able to find your path to becoming a Data Scientist. But curiosity is still needed. If you, for example, have to predict server downtimes, you need at least a glimpse of the available concepts, tools, algorithms etc. to find entry points for answering your question. In this case these might be Survival Analysis, Time Series Analysis, Anomaly Detection, Outlier Detection etc.
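As one possible entry point, a very simple form of outlier detection flags measurements that lie far from the mean. The sketch below uses a z-score rule built on Python's standard `statistics` module; the function name `zscore_outliers`, the response-time values and the threshold of 2 standard deviations are made-up assumptions, and real downtime prediction would need considerably more than this.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values identical: nothing can be an outlier.
        return []
    return [(i, x) for i, x in enumerate(values) if abs(x - mu) / sigma > threshold]


# Hypothetical server response times in milliseconds.
response_times_ms = [102, 98, 101, 99, 100, 103, 97, 100, 350, 101]
print(zscore_outliers(response_times_ms))
```

Here only the 350 ms measurement is flagged. Note that a single extreme value also inflates the standard deviation itself, which is one reason why more robust methods exist and are worth looking up.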

In addition to the sites, I make daily posts in several categories. The idea behind this is that you cannot learn everything in one day and you have to get to know points of entry. Ideally you are permanently following the path to Data Science, and these posts provide something to look at on a daily basis: a quote, some R Packages, a document, a term, some news. Therefore there are the following daily posts to increase your knowledge base and inspire you to investigate further:

Blog / Phrase / Subject | Blog Category | Related Site
Magister Dixit | Magister Dixit | Quotes
R Packages worth a look | R Packages | R Packages
Document worth reading | Documents | Documents
If you did not already know | What is … | What-Is
Distilled News | Distilled News | –

Finally, because Data Analytics is crucial for the coming decades dominated by data, it is important that even those people who do not explicitly want to be Data Scientists understand the concepts. For example, if you are confronted with the results of a neural network, you should know that you have to ask which training data was used to train the model. Only then can your company become Data Analytics driven, or data driven, and stay competitive.

“It is important to understand data science even if you never intend to do it yourself, because data analysis is now so critical to business strategy. Businesses increasingly are driven by data analytics, so there is great professional advantage in being able to interact competently with and within such businesses. Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision-making, or to see data-oriented competitive threats.” Foster Provost & Tom Fawcett (2013)

Stay curious: Broaden your horizon.

“Current employees who are able to develop an analytics skill set and combine that with their knowledge of the business can be invaluable when moving analytical insights across the ‘last mile’ to decision makers.” Robert Holland (May 1, 2015)