|Quotes| = 959
Jake Porway Data is new eyes.
Yoshua Bengio The data decides.
Kamil Bartocha
(26. Apr 2015)
Data is never clean.
Satya Nadella
(February 21, 2017)
Bots are the new apps.
Kamil Bartocha
(26. Apr 2015)
Big Data is just a tool.
Martyn Jones
(March 12, 2015)
Big Data is a dopey term.
Daniel Tunkelang Failure is a great teacher.
Brian Mitchell
(08 Dec 15)
Work Smarter and Not Harder.
David Hilbert
(8 Sept. 1930)
We must know – we will know!
Edward R. Tufte
Above all else show the data.
Jonathan Lenaghan Having no competitors is bad.
Robin Bloor
Data Analysis is Business R&D.
James M. Connolly
Analytics won’t cure Stupidity.
Anna Smith Work on positivity and patience.
Mark Barrenechea
(September 11, 2015)
Digital leaders know their data.
Pete Werner
(March 14, 2015)
R will not teach you statistics.
Riley Newman
(July 07, 2015)
Data Isn’t Numbers, It’s People.
Jonathan Lenaghan Being self-critical is important.
Martyn Jones
(March 9, 2015)
Big Data Is Not Data Warehousing.
Miss Piggy Never eat more than you can lift.
Adrian Colyer
(May 25, 2017)
Moving data doesn’t come for free.
Anna Smith It’s okay to ask people questions.
Chris Taylor
IoT Is About Analysis, Not Things.
John W. Tukey
Statistics work is detective work!
Michael Bryan
(October 16, 2014)
Face it – ‘visual’ sells software.
Rapid Insight
Models are mathematical equations.
Dr. Pradeep Mavuluri
(October 14, 2014)
Data-Driven Strategies = Analytics.
Eric Colson, Brad Klingenberg, Jeff Magnusson
(March 31, 2015)
Facts speak louder than statistics.
Mr. Justice Streatfield
Facts speak louder than statistics.
Good data analysis is reproducible.
Rob Douglas
A paper a day keeps ignorance away.
Daniel Tunkelang Always put talent before technology.
Isabelle Nuage
(February 24, 2015)
Big Data by itself is of little use.
n.n. A picture is worth a thousand words.
Ray Major
(November 6, 2014)
Better Predictions = Higher Profits.
Arthur Choi, Ruocheng Wang, Adnan Darwiche
A neural network computes a function.
Aristotle Quality is not an act, it is a habit!
Ed Burns
(June 2015)
Machine learning automates analytics.
Karl Pearson Statistics is the grammar of science.
John Langford
Prefer simplicity in algorithm design.
Simeon Strunsky Statistics are the heart of democracy.
Eric Jonas Life is too short to not be having fun.
Kira Radinsky Working with data is like an adventure.
William Edwards Deming
In God we trust, all others bring data.
Jon Greenberg
(January 2, 2015)
It’s a good time to be a data scientist!
Zachary Hay
AI ‘thinks’ like those who designed them.
Exegetic Analytics Sapientia ex numero. Wisdom from numbers.
Justin Locke
(January 30, 2015)
In Management, Stupidity Is An Advantage.
Kamil Bartocha
(26. Apr 2015)
You should embrace the Bayesian approach.
Manoj Sharma
(December 30, 2014)
A picture is worth a thousand statistics.
David Cearley
Every app now needs to be an analytic app.
Dwight David Eisenhower The uninspected (inevitably) deteriorates.
Henry Clay Statistics are no substitute for judgment.
Kamil Bartocha
(26. Apr 2015)
95% of tasks do not require deep learning.
Caitlin Smallwood It’s very obvious how different people are.
Erin Shellman Data’s just the world making noises at you.
Claudia Perlich Intuition develops with a lot of experience.
Kira Radinsky A data scientist should be a great engineer.
Patrick Gerald
(May 22, 2015)
No good business succeeds without analytics.
Waqar Hasan Predictive is the ‘killer app’ for Big Data.
Erin Shellman Presentation is the ability to craft a story.
Joey Zwicker
(12. February 2015)
Hadoop has an irreparably fractured ecosystem.
n.n. You sell generalization at the end of the day.
Timothy E. Carone
(January 30, 2015)
Big Data is the oxygen for autonomous systems.
Amy Heineike Data science is already kind of a broad church.
Daniel Tunkelang Where things get interesting is in the details.
John Finley Maturity is the capacity to endure uncertainty.
Kamil Bartocha
(26. Apr 2015)
Academia and business are two different worlds.
Ryan Irwin
(August 20, 2014)
Have a sense of humor, and never stop learning!
Jake Porway Data scientists can apply their skills for good.
Bill Schmarzo
(November 18, 2017)
Do you want to report, or do you want to prevent?
Rick Collins Every modern enterprise will need an AI strategy.
Eric Jonas I basically can’t hire people who don’t know Git.
John Keats Nothing ever becomes real till it is experienced.
Ilya Kashnitsky
(November 07, 2017)
R is an incredible tool for reproducible research.
Chris Wiggins The key is usually to just keep asking ‘So what?’
Daniel Tunkelang Anything that looks interesting is probably wrong.
Emil Gumbel It’s impossible for the improbable to never occur.
John Foreman I find it tough to find and hire the right people.
One of Barbara Doyle’s Professors If you need statistics to prove it, it isn’t true.
Alan Kay
The best way to predict the future is to invent it.
Caitlin Smallwood You should always have more questions than answers.
Kira Radinsky The person you hire has to understand the business.
Miguel de Cervantes from Don Quixote By a small sample, we may judge of the whole piece.
Pradyumna S. Upadrashta
(February 13, 2015)
Not all paths to good Data Strategy lead to Hadoop.
Kiyoto Tamura
(April 21, 2015)
The value of log data for business is unimpeachable.
Mitch Barns
(June 22, 2015)
The secret is not just ‘big’ data, but ‘right’ data.
Christoph Nieberding
Intuition and gut feeling alone are no longer enough.
Caitlin Smallwood You should never quite feel satisfied with an answer.
Joel Cadwell
So when I turn to R, I am looking for more than code.
Kirk Borne
Data science: It’s greater than the sum of the parts.
Light, Singer and Willett You can’t fix by analysis what you bungled by design.
Plato The beginning is the most important part of the work.
Ronald Coase If you torture the data long enough, it will confess.
Victor Hugo Nothing is stronger than an idea whose time has come.
Yann LeCun Diversity of point of view is a very important thing.
H. Barlow The brain is nothing but a statistical decision organ.
Larry Hardesty
(August 15, 2014)
In the age of big data, visualization tools are vital.
Scott Zeger A model is a lens through which to look at your data.
Brillat-Savarin Tell me what you eat, and I will tell you what you are.
Daniel Tunkelang Intuition is really a well-trained association network.
George Box Essentially, all models are wrong, but some are useful.
Herbert Simon A wealth of information creates a poverty of attention.
James Gleick When information is cheap, attention becomes expensive.
Joel Cadwell
(August 21, 2014)
R makes it so easy to fit many models to the same data.
Mitchell A. Sanders
(August 27, 2013)
If you can’t use the tools, you can’t analyze the data.
William S. Cleveland
Data are the heat engine for invention in data science.
Aditi Joshi
(August 4, 2015)
Behind every successful person, there is tons of coffee.
Attributed to Einstein Models should be as simple as possible, but not more so.
DataScience+ Visualizing the data is a key function of R programming.
Jeff Bezos It is not an experiment if you know it is going to work.
Rob Kitchin Big data should complement small data, not replace them.
Sherlock Holmes (Arthur Conan Doyle) It is a capital mistake to theorize before one has data.
Geoffrey Hinton If you’re not overfitting, your network isn’t big enough.
James Gleick Information is not knowledge and knowledge is not wisdom.
John Naisbitt
We are drowning in information but starved for knowledge.
Andre Karpistsenko Everyone is right, depending on the situation and context.
Ed Burns
(September 2015)
Big data analytics architecture requires integration push.
James Faghmous
(Nov 7, 2013)
Think of overfitting as memorizing as opposed to learning.
Niels Bohr Prediction is very difficult, especially about the future.
Chris Wiggins The great thing about predictions is that you can be wrong.
John McCarthy He who refuses to do arithmetic is doomed to talk nonsense.
Werner Vogels You can never have too much data – bigger is always better.
Chris Wiggins Scientists are human beings too whether they know it or not.
Daniel Tunkelang As data scientists, our job is to extract signal from noise.
Pete Werner
(March 14, 2015)
R has really opened up a new world of statistical computing.
Christopher Bishop
Half of what we do at Microsoft Research is Machine Learning.
Hannu Toivonen, Oskar Gross
(2. October 2015)
Data mining and machine learning in computational creativity.
Kamil Bartocha
(26. Apr 2015)
You will spend most of your time cleaning and preparing data.
Robin Bloor
(April 13, 2015)
The ‘great age of silo building’ ended some time around 2005.
Daniel Tunkelang Search is the problem at the heart of the information economy.
Daniel Tunkelang The best way to become a data scientist is to do data science.
Eric Jonas What really matters is who’s actually using and paying for it.
Erin Shellman If you talk to somebody who has something you want, follow up.
Foster Provost & Tom Fawcett
Computing similarity is one of the main tools of data science.
Victor Hu It is hard to know what you really need until you dig into it.
Hans Rosling The idea is to go from numbers to information to understanding.
Ed Burns
Visualization makes sense of messy data – if you don’t mess up.
Isaiah, XXX 8 Now go, write it before them in a table, and note it in a book.
Miguel A. Hernán, John Hsu, Brian Healy
(July 12, 2018)
There will be no AI worthy of the name without causal inference.
Brian Liou
(March 2, 2015)
Why Does Everyone Need to Learn How to Code? To Understand Data.
Daniel Tunkelang Our goal is to fail fast. Most crazy ideas are just that: crazy.
Kamil Bartocha
(26. Apr 2015)
In 90% of cases generalized linear regression will do the trick.
n.n. Generalization is the sole product of a machine learning system.
Rishi Shah
(September 24, 2014)
Big data profitability depends on your employee’s data literacy.
Bill Gates A breakthrough in machine learning would be worth ten Microsofts.
Gandhi If one takes care of the means, the end will take care of itself.
Caitlin Smallwood The data speaks for itself. That’s the easiest measure of success.
I wish I could use data analytics more, but I don’t have Big Data.
Chris Wiggins The main driver of my ideas has been seeing people doing it ‘wrong’.
Jake Porway Every company has data that can help make the world a better place.
Rasmus Bååth
(Jan 8th, 2015)
Behind every great point estimate stands a minimized loss function.
Satya Nadella
I think we are at the dawn of a new generation of business systems.
Albert Einstein If you can’t explain it simply, you don’t understand it well enough.
Caitlin Smallwood Mix of algorithms is almost always better than one single algorithm.
Geoffrey Moore Without big data, you are blind and deaf in the middle of a freeway.
Mark van Rijmenam
Great insights are achieved when you combine different data sources.
Newspaper headline posted on Maya Bar Hillel’s board. Every third person in Israel saw 1.8 public theater shows last year.
Gil Allouche
(August 26, 2014)
Real time data (Analytics) isn’t just a good idea, it is a necessity.
Jake Porway
(October 1, 2015)
To tackle sector-wide challenges, we need a range of voices involved.
Lewis Platt If we only knew what we know, we would be 30 percent more productive.
Ralph Waldo Emerson All life is an experiment. The more experiments you make, the better.
All analysis starts with an understandable set of data and algorithms.
Jake Porway There’s almost no limit to where data and data science can be applied.
James Faghmous
(Nov 7, 2013)
The simplest hypothesis that fits the data is also the most plausible.
John Foreman You don’t have to know everything, but you should have a general idea.
Lara Ponomareff Sometimes, what customers say they want isn’t what they actually want.
It’s no longer good enough to make decisions based on intuition alone.
Shayne Miel You have to turn your inputs into things the algorithm can understand.
Matthew Zeiler
Google is not really a search company. It’s a machine-learning company.
Tomasz Malisiewicz Feature engineering is manually designing what the input x’s should be.
Andre Karpistsenko There is a big part of intuition in choosing the most important problem.
Caitlin Smallwood The top things for people in hiring are hunger and insatiable curiosity.
Chris Howden Ignoring Statistics means you aren’t learning from the wisdom of others.
Josh Bloom
The first rule of data science is: don’t ask how to define data science.
Jake Porway Data scientists in the business world are all generally well-compensated.
John Foreman Talking to users is crucial because they point you in the right direction.
Michael L. Brodie
Doubt everything. Use evidence-based methods to verify things that matter.
Tim Berners-Lee Data is a precious thing and will last longer than the systems themselves.
Unix Philosophy Do one thing. Do it really well. Work together with everything around you.
Caitlin Smallwood You imagine a data set & you salivate at just thinking about that data set.
Charles Babbage Errors using inadequate data are much less than those using no data at all.
Claudia Perlich Learning how to do data science is like learning to ski. You have to do it.
Daniel Tunkelang Experience is not only the best teacher, but also perhaps the only teacher.
Hans Reichenbach If an improbable coincidence has occurred, there must exist a common cause.
Kamil Bartocha
(26. Apr 2015)
There is no fully automated Data Science. You need to get your hands dirty.
Steve Jobs
(May 25, 1998)
A lot of times, people don’t know what they want until you show it to them.
G S Praneeth Reddy
Statistics is one of the most important skills required by a data scientist.
Caitlin Smallwood Our closest reliable measure of customer satisfaction is customer retention.
Caitlin Smallwood People’s intuition is wrong so often, even when you’re an expert in an area.
Claudia Perlich In the real world data is not like data they saw in classrooms and in books.
Daniel Tunkelang One thing we’ve learned is that there’s no such thing as over-communicating.
Martyn Jones Without a grounding in statistics, a Data Scientist is a Data Lab Assistant.
Michael Walker
(October 14, 2014)
If something appears too good to be true, it usually is too good to be true.
Karen Grace-Martin
We all tend to think of ‘Statistical Analysis’ as one big skill, but it’s not.
Caitlin Smallwood Discovering things about people through their data was a really a cool thing.
Jeff Dean
(November 2014)
Anything humans can do in 0.1 sec, the right big 10-layer network can do too.
Jeffrey Fry Having more data does not always give you the power to make better decisions.
John Foreman Twitter is probably the best place to start conversations about data science.
n.n. Any fool can make something complicated. It takes a genius to make it simple.
Peter Drucker Today knowledge has power. It controls access to opportunity and advancement.
Victor Hu People are more interested in their projects because they have selected them.
Vinod Khosla In the next 20 years, machine learning will have more impact than mobile has.
Kaiser Fung Their ‘data’ have done no arguing; it is the humans who are making this claim.
Scott Mongeau
(December 17)
Successful analytics depends upon a unified people-process-technology context.
Valentine Svensson
If you are expecting clusters in your data, PCA is probably not the right tool.
Erik Brynjolfsson, Andy McAffee
(January 16, 2018)
AI won’t replace managers, but managers who use AI will replace those who don’t.
Brandon Rohrer
(Dec 19, 2015)
Data science is a fancy way to say using numbers and names to answer a question.
Deepak Mohapatra
Anytime you can correlate a person, location and time, you can identify schemes.
Evans and Richardson
Powerful language of graphs might make formulations and models more transparent.
Kaiser Fung One of the biggest myths of Big Data is that data alone produce complete answers.
Lana Klein
Analytics today is at the point of high awareness and very little understanding.
Lord Ernest Rutherford If your experiment needs statistics, you ought to have done a better experiment.
Andre Karpistsenko The idea or the initial enthusiasm is just a small part of doing something great.
Bob McDonald Data modeling, simulation, and other digital tools are reshaping how we innovate.
Jonathan Lenaghan It’s your location history that is important, not necessarily where you are right now.
Mirko Krivanek
(March 5, 2015)
We need to remain constantly open to new mathematical models of machine learning.
n.n. Multi-Criteria Decision Making is the aim to order multidimensional alternatives.
Nikhil Buduma
(29 December 2014)
The algorithm’s effectiveness relies heavily on how insightful the programmer is.
Philip Russom
When you can, bring the analytic algorithm to big data, not the other way around.
ATKearney Is Big Data the 21st century equivalent of the Industrial Revolution? We think so.
Carl Sagan If you wish to make an apple pie from scratch, you must first invent the universe.
R is in the process of becoming the multi-platform lingua franca of data analysis.
Foster Provost & Tom Fawcett
Increasingly, business decisions are being made automatically by computer systems.
Kirk Borne
Extracting Knowledge, Insights, and Data-to-Decisions (D2D) from Big Data is hard!
R. Bhargava, C. D’Ignazio
Data literacy includes the ability to read, work with, analyze and argue with data.
John Hayes We tend to overvalue the things we can measure and undervalue the things we cannot.
Larry Page
If you’re not doing some things that are crazy, then you’re doing the wrong things.
Manoj Sharma
(December 30, 2014)
One of the most important steps in the Data Analytics process is Feature Selection.
Peter Sondergaard Information is the oil of the 21st century, and analytics is the combustion engine.
Richard Pugh
(25 June 2015)
In my opinion, the single most important skill for a data scientist is … Empathy.
Andre Karpistsenko The core lesson from tool-and-method explorations is that there is NO silver bullet.
Daniel Keys Moran You can have data without information, but you cannot have information without data.
Foster Provost & Tom Fawcett
… that the more data-driven a firm is, the more productive it is (statistically) …
William James We must be careful not to confuse data with the abstractions we use to analyze them.
Brian Caffo Like nearly all aspects of statistics, good modeling decisions are context dependent.
Jake Porway
(October 1, 2015)
We must convey what constitutes data, what it can be used for, and why it’s valuable.
Joel Cadwell
(August 21, 2014)
Naming is an art, yet be careful not to add surplus meaning by being overly creative.
Stephen William Hawking God not only plays dice. He also sometimes throws the dice where they cannot be seen.
To become valuable for statistics and machine learning the data has to be centralized.
Jake Porway The world will be more effective if everyone can at least converse about data science.
James Faghmous
(Nov 7, 2013)
For every Apple that became a success there were 1000 other startups that died trying.
Jonathan Lenaghan Keeping your eyes on the final deliverable is essential to solving the right problems.
Manish Saraswat
(January 13, 2016)
Data can’t make your past better. However, with it you surely can create an awesome future.
Tess Nesbitt Having more data for testing and validating your models is only a good thing.
Tyler Brûlé It’s a daily effort to adhere to set standards. (But) you need to aspire to something.
Frank Lloyd Wright You can use an eraser on the drafting table or a sledgehammer on the construction site.
Jonathan Lenaghan Losing somebody else’s money is one of the most horrible sinking feelings in the world.
Kiyoto Tamura
(April 21, 2015)
Over the last decade, the primary consumer of log data shifted from humans to machines.
Pete Werner
(March 14, 2015)
R is a niche language, but if your work falls within that niche it is a wonderful tool.
John Boyle O’Reilly The organized charity, scrimped and iced, In the name of a cautious, statistical Christ.
John Tukey Numerical quantities focus on expected values, graphical summaries on unexpected values.
n.n. Give a man a fish – feed him for a day. Teach him how to fish – feed him for a lifetime!
Tony Fisher
(May 15, 2015)
Today, big data is considered a differentiator. Soon, it will be considered a commodity.
Tim O’Reilly
(Apr 27, 2018)
In 2018, we still think that we can organize our companies in old ways but do new things.
Data Science Heroes The future is undoubtedly attached to uncertainty, and this uncertainty can be estimated.
Jeff Stanton
Employers are looking for generalists who can create bridges between the different areas.
Peter Bühlmann
(June 2015)
Causal components remain the same for different sub-populations or experimental settings.
Kevin Gray ‘Model’ means different things to different people and different things at different times.
Andrew Lang He uses statistics as a drunken man uses lamp-posts – for support rather than illumination.
Confucius Tell me, and I will forget. Show me and I may remember. Involve me, and I will understand.
Pete Werner
(March 14, 2015)
Much of R is built around the assumption you are working with a table-like data structure.
Randy Au
(15.02.2019 19:27)
Know your data, where it comes from, what’s in it, what it means. It all starts from there.
George Bernard Shaw You see things and you say ‘why’. But I dream things that never were; and I say, ‘why not’?
Andre Karpistsenko These days, if you build a community around yourself, the news and people start to find you.
Erin Shellman The most interesting types of data are those collected for one purpose and used for another.
Jeffrey Heer It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.
Kristian Hammond
(November 25, 2015)
The point of the technology is to humanize the machine so we don’t have to mechanize people.
Mitch Barns
(June 22, 2015)
Making big data useful requires calibration with data that is both clean and representative.
Mohammad Pezeshki Actually the success of all Machine Learning algorithms depends on how you present the data.
R. A. Fisher Natural selection is a mechanism for generating an exceedingly high degree of improbability.
For deep learning to work well, it needs lots of data. So first of all, you should have that.
Fayyad, Piatetsky-Shapiro, Smyth Complexity arises from anomalies such as discontinuity, noise, ambiguity, and incompleteness.
Google Analytics Team
(August 27, 2015)
We’ve said it before and we’ll say it again: great analytics can only happen with great data.
John W. Tukey
The greatest value of a picture is when it forces us to notice what we never expected to see.
Tamara Dull
(March 20, 2015)
The data lake is essential for any organization who wants to take full advantage of its data.
Thomas J. Watson The great accomplishments of man have resulted from the transmission of ideas and enthusiasm.
Daniel Tunkelang Data scientists need to have strong critical-thinking skills and a healthy dose of skepticism.
Davis Buckingham Data is the new oil. We need to find it, extract it, refine it, distribute it and monetize it.
Zachary Chase Lipton
(January 2015)
The problems that can be solved with machine learning are often not solvable by other methods.
John Mount
(April 19, 2013)
Machine learning and statistics may be the stars, but data science orchestrates the whole show.
Kirk Borne
(January 9, 2015)
Randomness refers to the absence of patterns, order, coherence, and predictability in a system.
Nate Silver Distinguishing the signal from the noise requires both scientific knowledge and self-knowledge.
Big data can significantly increase top-line revenues and markedly reduce operational expenses.
Hal Varian I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.
John Mount
(April 19, 2013)
The process we are interested in is the deployment of useful data driven models into production.
Mark van Rijmenam
(September 2, 2014)
A prerequisite for a successful Big Data strategy is a data-driven, information-centric culture.
Max Kuhn, Kjell Johnson Unfortunately, the predictive models that are most powerful are usually the least interpretable.
Vladimir N. Vapnik
A good minimization of the empirical risk depends in many respects on the art of the researcher.
When you staff a project with people who are skilled and fascinated by the problem, you get gold.
Andre Karpistsenko Financial gain is a second-order result: if you do the right things, everything else will follow.
Henri Poincaré
With logic one can construct proofs, but one cannot gain new insights; that requires intuition.
John Cook
(26 March 2015)
Statistics aims to build accurate models … Machine learning aims to solve problems more directly.
Mike Barlow
Most large software vendors […] are convinced that AI will become a major force in the economy.
Pierre Simon, Marquis de Laplace The most important questions of life are, for the most part, really only problems of probability.
Rollin Ford Every day I wake up and ask, how can I flow data better, manage data better, analyse data better?
Shivon Zilis
Yes, it’s true, machine intelligence is transforming the enterprise, industries and humans alike.
Focus on getting talented people to work together; when you do, you’ll see truly inspired results.
Erin Shellman What companies want is a person who can rigorously define problems and design paths to a solution.
Kaushik Pal
(August 2015)
Big data analytics is still in infancy, and we haven’t yet embraced data-driven decision making.
Anna Smith Being patient with other people and really respecting other people is very important in succeeding.
Daniel Tunkelang Our computers, mobile devices, and web-based services are witnesses to many of our daily decisions.
Daniele Medri Statisticians are data scientists by definition but data scientists do not [all] have … training.
Jacob Harris
(Dec. 2014)
The wave of bullshit data is rising, and now it’s our turn to figure out how not to get swept away.
n.n. Just because you have a ‘hammer’, doesn’t mean that every problem you come across will be a ‘nail’.
It’s hard to overstate the importance of data for businesses today. It’s the lifeline of any company.
Pradyumna S. Upadrashta
(February 13, 2015)
You shouldn’t be collecting Big Data under the premise that more data is better, cooler, sexier, etc.
Yann LeCun It’s useful for a company to have its scientists actually publish what they do. It keeps them honest.
BI Community What is the most used feature in any business intelligence solution? It is the Export to Excel button.
Elizabeth D. Liddy
My goal with big data is to learn something to improve the next version of whatever it is we are doing.
Brian Fanzo
(January 9, 2015)
If you’re not doing more listening and analytics than pushing or posting, then you’re doing social wrong!
Chris Wiggins The most exciting thing is realizing that something everybody thinks is new is actually really damn old.
David Hilbert Mathematics knows no races or geographic boundaries; for mathematics, the cultural world is one country.
Gabriel Lowy
(February 24, 2015)
Big data does not change the relationship between data quality and decision outcomes. It underscores it.
John Foreman What we focus on, and this is going to sound goofy for a data scientist – is the happiness of our users.
Milton Friedman The only relevant test of the validity of a hypothesis is comparison of its predictions with experience.
Randy Bartlett
(July 2015)
Data science is nothing but old wine in new bottles: statistics applied to different fields.
… Note that playing back data from Hadoop into SAP Sybase ESP can occur much faster than in real time, …
W. Edwards Deming The only useful function for a statistician is to make predictions, and thus provide a basis for action.
(February 6, 2019)
By 2020, 50% of organizations will lack sufficient AI and data literacy skills to achieve business value.
Anon If all the statisticians in the world were laid head to toe, they wouldn’t be able to reach a conclusion.
Armand Trousseau You should treat as many patients as possible with the new drugs while they still have the power to heal.
Krzysztof Zawadzki
(August 30, 2014)
Finding a data scientist is hard. Finding people who understand who a data scientist is, is equally hard.
Larry Greenemeier
(March 13, 2014)
Context is often lacking when info is pulled from disparate sources, leading to questionable conclusions.
Southard Jones
We’re at a tipping point where using data to inform all types of business decisions is becoming the norm.
William Vorhies
(August 19, 2014)
Predictive analytic suites are not tools that your middle managers are going to be able to learn quickly.
The hallmark of being a good manager or collaborator is that everyone wants you involved in their project.
Abraham Maslow
I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.
Eric Jonas The biggest thing people should be working on is problems they find interesting, exciting, and meaningful.
Foster Provost & Tom Fawcett
… the day-to-day work of a data scientist – especially an entry-level one – may be largely data processing.
Linus Torvalds Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
A. N. Whitehead The aim of science is to seek the simplest explanation of complex facts… Seek simplicity and distrust it.
Amy Heineike In general, it’s very hard to hire people who are a complete package, who know what to do and how to do it.
Eran Levy
Mashing up multiple data sources to generate a single source of truth is an integral part of data analysis.
Alexander Linden A data scientist must possess the knack of being able to ‘identify business value from mathematical models.’
Geoffrey Moore Without big data analytics, companies are blind and deaf, wandering out onto the Web like deer on a freeway.
TJ Laher
(November 14, 2014)
Leading organizations have already begun to see serious returns on deploying a pervasive analytics strategy.
If you separate out your science team, they can’t do their best work. They’ll get bored or spin their wheels.
Atul Butte Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.
H.G. Wells/Samuel S. Wilks
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
Michael Greene
To find new trends and strong patterns from large complex data sets, a strong analytics foundation is needed.
Michael Jordan [Machine Learning is simply a] ‘loose confederation of themes in statistical inference (and decision-making)’
Yann LeCun Management skills are a little overrated in the sense that managing research scientists is like herding cats.
Andrew Maynard
Ethics are essential to establishing a guiding basis for how powerful new technologies are developed and used.
John Foreman If you’re solving problems appropriately and you can explain yourself well, you’re not going to lose your job.
Jonathan Lenaghan It is very important to be self-critical: always question your assumptions and be paranoid about your outputs.
Andre Karpistsenko Maybe the most important thing is to surround yourself with people greater than you are and to learn from them.
Don’t Be Overwhelmed by Big Data. Focus Only on the Right Data Delivered to the Right People at the Right Time.
Andrew Gelman
(28 April 2015)
Measurement, measurement, measurement. It’s central to statistics. It’s central to how we learn about the world.
Hadley Wickham
Writing well and describing things well is very valuable to a good programmer, and even more to a data scientist.
Ivan Vasilev The hidden layer is where the (neural) network stores its internal abstract representation of the training data.
John Foreman Data scientists are kind of like the new Renaissance folks, because data science is inherently multidisciplinary.
Mike Urbonas
(April 11, 2015)
The best technology tools will be those that empower subject matter experts to quickly apply intuitive reasoning.
P. Dawid
Causal inference is one of the most important, most subtle, and most neglected of all the problems of Statistics.
R. A. Fisher … the actual and physical conduct of an experiment must govern the statistical procedure of its interpretation.
Xavier Conort The algorithms we used are very standard for Kagglers. […] We spent most of our efforts in feature engineering.
Roy Amara We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.
Edward Tufte Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.
Sean McClure
(July 2015)
Understanding the distinction between Data Science and Big Data is critical to investing in a sound data strategy.
Yann LeCun Most of the knowledge in the world in the future is going to be extracted by machines and will reside in machines.
Yuhang Song, Jianyi Wang, Thomas Lukasiewicz, Zhenghua Xu, Mai Xu, Zihan Ding, Lianlong Wu
(17 May 2019)
Learning agents that are not only capable of taking tests but also innovating is becoming the next hot topic in AI.
Mark van Rijmenam
(September 2, 2014)
Developing a Big Data strategy is not all about the required IT, but also about the culture within an organisation.
(25 April 2013)
Too few budding scientists receive adequate training in statistics and other quantitative aspects of their subject.
Thomas Carlyle A judicious man looks on statistics not to get knowledge, but to save himself from having ignorance foisted on him.
Aimée Gott
At Mango we define data science as the proactive use of data and advanced analytics to drive better decision making.
Dikesh Jariwala
There are four basic presentation types for charts:
1. Comparison
2. Composition
3. Distribution
4. Relationship
John Tukey An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.
Nina Zumel
(January 5, 2015)
The true purpose of a test procedure is to estimate how well a classifier will work in future production situations.
Big data is about infrastructure, while analytics is about enabling informed decisions and measuring business impact.
Pelin Thorogood
(August 21, 2014)
We really need people who have the left brain and right brain working in balance, while also being knowledgeable of the business.
Taylor Meadows
(December 17, 2015)
Yoda might need The Force to see the future – but data analysts can also predict it in an efficient and reliable way.
Buddha Wise ones do not rely on theories. Wise ones do not tie themselves up. They only watch carefully and listen carefully.
Cassius J. Keyser Absolute certainty is a privilege of uneducated minds and fanatics. It is, for scientific folk, an unattainable ideal.
Michael Gilliland
(March 6, 2015)
Whenever you adjust a forecast two things can happen – you can improve the accuracy of the forecast, or make it worse.
Mike Pluta
Moore’s Law for BigData: The amount of nonsense packed into the term ‘BigData’ doubles approximately every two years.
Mithun Sridharan
(September 4, 2014)
The number of companies that are delivering Analytics as a Service via cloud-based platforms is increasing day-by-day.
Anima Anandkumar
[For] topic models and social network analysis tensor methods proved to be faster and more accurate than other methods.
Arthur Samuel
[Machine learning is the] field of study that gives computers the ability to learn without being explicitly programmed.
Chris Lynch Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.
Hadley Wickham Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.
n.n. Data does replace heuristics, hard-coded rules, assumptions and beliefs. Machine learning only enables data to do that.
Norm Matloff Finding your bug is a process of confirming the many things you believe are true, until you find one which is not true.
Eric Colson, Brad Klingenberg, Jeff Magnusson
(March 31, 2015)
A Data Scientist should look for a company that actually uses data science to set themselves apart from the competition.
John Foreman
Whether or not you increase complexity for additional accuracy is not a data science decision. It’s a business decision.
Sharmistha Sarkar
(December 16, 2017)
Extreme automation and global connectivity are the needs of the hour and therefore artificial intelligence is imperative.
Alex Jones
Creating a hodge-podge of pretty pictures of every datapoint is a guaranteed way to destroy the value of a visualization.
Hilaire Belloc Statistics are the triumph of the quantitative method, and the quantitative method is the victory of sterility and death.
Elizur Wright While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.
John Foreman E-mail data is powerful, because as a communications channel it generates more revenue per recipient than Social Channels.
Kluge et al.
Knowledge is at the heart of much of today’s global economy and managing knowledge has become vital to companies’ success.
Ed Burns
(August 2014)
One of the keys to success in big data analytics projects is building strong ties between data analysts and business units.
Jack Vaughan
(January 2015)
Don’t let the buzz around analytics fool you. Solid data management capabilities are required to enable advanced analytics.
R20 Consultancy Big data silos lead to data duplication, high data latency, complex data replication solutions, and data quality problems.
Yann LeCun The data sets are truly gigantic. There are some areas where there’s more data than we can currently process intelligently.
Akshat Thakar
Big Data analytics is now more towards accurately defining data, uniform handling and developing data driven smart products.
(June 2014)
Traditional BI looks at data through a soda straw. Big data analytics looks at data through powerful, wide-angle binoculars.
Joel Cadwell
We use what we want and need, and our wants and needs flow from who we are and the limitations imposed by our circumstances.
Amy Heineike The key is figuring out how you get those three things: the right problem, the right data, and the right methodology to meld.
Mike Barlow
Thanks to a perfect storm of recent advances in the tech industry, AI has risen from the ashes and regained its aura of cool.
Tess Nesbitt To work on our team, where we’re building models all the time and we’re knee-deep in data, you have to have technical skills.
Andrew Long
Wow! So much work to get ready for a model. This is always true in data science. You spend 80-90% of your time cleaning and preparing data.
Daniel Tunkelang Query understanding offers the opportunity to bridge the gap between what the searcher means and what the machine understands.
Joe Caserta
Understanding exactly what technology can and cannot do is essential to the success of big data and deep analytic initiatives.
Kelly Sheridan
Companies see the analysis of more and different data streams as their route to boosting efficiency and creating new products.
Mithun Sridharan
(September 4, 2014)
The strategic goal of a Big Data Strategy is to derive insights and make better data-driven decisions faster than the competition.
Antoine de Saint-Exupéry A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.
Michael Walker
(October 14, 2014)
Beware of tech firms selling you data tech with fantastic claims of finding meaning in data and creating competitive advantage.
Donald Feinberg
Whether we like it or not, algorithmic decision making is coming. How much we should fear or rejoice is a matter of great debate.
Jason Brownlee Whenever you are modeling a problem, you are making a decision on the trade-off between model accuracy and model interpretation.
P.J. Flynn
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters).
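Flynn's definition above can be made concrete with a minimal k-means sketch (a hypothetical illustration, not Flynn's own formulation; the deterministic "spread" initialization is a simplification for the sake of the example):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    # Simplistic deterministic init: k points spread evenly through X.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Distance of every point to every centroid, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs: unsupervised grouping should recover them.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
labels, centroids = kmeans(X, k=2)
```

Note that no class labels are supplied anywhere, which is exactly what makes the classification "unsupervised" in Flynn's sense.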
Vincent Granville
(August 26, 2014)
The success of any big data or data science initiative is determined by the kind of data that you collect, and how you analyze it.
Wojciech Bolanowski
Big Data is the concept that the gatherability, storability and usability of enormous data can be turned into profitability.
Daniel Tunkelang Technology is like exercise equipment in that buying the fanciest equipment won’t get you in shape unless you take advantage of it.
Paul Roehrig, Ben Pring
It’s a new era in business, one in which growth will be driven as much by insight and foresight as by physical products and assets.
Eric Jonas My list of goals is to learn everything, be able to build anything, save everyone, and have fun doing it. That’s a nice simple list.
Eric Jonas Startup culture teaches you to be like Steve Jobs, in that you’re right, everyone else is wrong, and your vision will power through.
Suetonia Palmer
‘Innovative statistical techniques’ are important, but the key to getting good results here is a mind-boggling amount of actual work.
Claudia Perlich ‘Data scientist’ is a completely undefined job description. Today, if you hire a data scientist, you do not know what you are getting.
Foster Provost & Tom Fawcett
Data-driven decision making (DDD) refers to the practice of basing decisions on the analysis of data rather than purely on intuition.
The mindset shift required for AI can lead to ‘cultural anxiety’ because it calls for a deep change in behaviors and ways of thinking.
Diego Kuonen The key element for a successful (big) data analytics and data science future is statistical rigor and statistical thinking of humans.
Donnie Berkholz
(March 13, 2015)
Regardless, it’s clear that Spark is a technology you can’t afford to ignore if you’re looking into modern processing of big datasets.
Foster Provost & Tom Fawcett
Take big data to mean datasets that are too large for traditional data-processing systems and that therefore require new technologies.
Jake Porway
(October 1, 2015)
Data is not truth, and tech is not an answer in-and-of-itself. Without designing for the humans on the other end, our work is in vain.
Claudia Perlich The conversation is based around how to properly deal with even more sensitive information about where exactly people spend their lives.
Gary King Big data is not about the data. (Making the point that while data is plentiful and easy to collect, the real value is in the analytics.)
Mark Smith
(February 3, 2015)
Big data has great promise for many organizations today, but they also need technology to facilitate integration of various data stores.
Nikhil Buduma
(29 December 2014)
Feeding the algorithm raw data rarely ever works, so feature extraction is a critical part of the traditional machine learning workflow.
Pedro Domingos …some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.
Claudia Perlich My primary challenge as a data scientist is to use the right algorithm to connect the right data to the problem you actually want solved.
Mike Urbonas
(April 11, 2015)
The best analytic tools will directly empower those with high intuitive reasoning, who probably will not have formal data science skills.
Franz Färber
The goal of the SAP HANA database is the integration of transactional and analytical workloads within the same database management system.
Matthias Feurer et al.
The success of machine learning relies heavily on selecting the right algorithm for a problem at hand, and on setting its hyperparameters.
Pradyumna S. Upadrashta
(February 13, 2015)
Before jumping on the Big Data bandwagon, I think it is important to ask the question of whether the problem you have requires much data.
David Puglia, FrontRange
(30. December 2014)
In comparison to IPv4’s 4.3 billion IP addresses, IPv6 can assign about 340 trillion trillion trillion addresses and corresponding devices.
Rob J Hyndman
Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on.
Yann LeCun You don’t want to just hire clones of the same person, because then they will all want to explore the same things. You want some diversity.
Andre Karpistsenko We are positioning ourselves to be lucky. We follow the adage that luck is being prepared for an opportunity and seizing it when it appears.
Josh Wills A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
Ross Perot Business is not just doing deals; business is having great products, doing great engineering, and providing tremendous service to customers.
Andrew Ng Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.
Andrew A. Kramer
(Jan 23, 2014)
But at the bottom line is the fact that if you want to do serious data mining, then rolling up one’s sleeves and writing code is unavoidable.
Christophe Bourguignat
(Sep 16, 2014)
Complex models are to Data Science what “haute couture” is to the clothing industry: they are not made to be used daily, but they are necessary.
Jake Porway
(October 1, 2015)
We must foster environments in which people can speak openly, honestly, and without judgment. We must be constantly curious about each other.
Joel Cadwell
(November 14, 2014)
R provides a common language through which we can visit foreign disciplines and see the same statistical models from a different perspective.
Allen Downey
(December 8, 2015)
There are two kinds of people who violate the rules of statistical inference: people who don’t know them and people who don’t agree with them.
Andre Karpistsenko Success is when what you do is adopted by someone else. The ability to actually deliver something tangible – that’s the main index of success.
John Foreman
If your goal is to positively impact the business, not to build a clustering algorithm that leverages Storm and the Twitter API, you’ll be OK.
Joyce Jackson The need to navigate the rapidly growing universe of digital data will rely heavily on the ability to effectively manage and mine the raw data.
Yann LeCun Imagine a box with 500 million knobs, 1,000 light bulbs, and 10 million images to train it with. That’s what a typical Deep Learning system is.
Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, Cass R. Sunstein
Algorithms are not only a threat to be regulated; with the right safeguards in place, they have the potential to be a positive force for equity.
Gabriel Lowy
(July 28, 2014)
Execution, not strategy creates value. Execution is not possible without measurable KPIs. And good quality data provide the context behind KPIs.
Joyce Jackson Actors in Data Mining:
• the Project Leader
• the Data Mining Client
• the Data Mining Analyst
• the Data Mining Engineer
• the IT Analyst.
Markus Vattulainen
Preprocessing is often the most time-consuming phase in knowledge discovery, and preprocessing transformations are interdependent in unexpected ways.
Pallab Ghosh
(16 February 2019)
Machine-learning techniques used by thousands of scientists to analyse data are producing results that are misleading and often completely wrong.
Chris Wegrzyn It’s one thing to build up some technology and hire some people. It’s another thing entirely to transform how your operation works fundamentally.
Ivan Vasilev A linear composition of a bunch of linear functions is still just a linear function, so most neural networks use non-linear activation functions.
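Vasilev's point, that stacking linear layers without a non-linearity buys nothing, can be verified in a few lines (the matrices and the choice of ReLU are my own illustration, not Vasilev's):

```python
import numpy as np

# Two "layers" as linear maps: y = W2 @ (W1 @ x).
W1 = np.array([[1.0, -1.0],
               [2.0,  0.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 3.0])

# Composing the two linear layers...
two_layers = W2 @ (W1 @ x)

# ...collapses to ONE linear layer with weight matrix W2 @ W1,
# so the extra layer adds no expressive power on its own.
one_layer = (W2 @ W1) @ x
assert np.allclose(two_layers, one_layer)

# Inserting a non-linear activation (here ReLU) between the layers
# breaks the collapse: the network is no longer a single linear map.
relu = lambda v: np.maximum(v, 0.0)
nonlinear = W2 @ relu(W1 @ x)  # differs from one_layer for this x
```

This is the standard argument for why networks interleave affine maps with activations such as ReLU, tanh, or sigmoid.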
Ivan Vasilev (Autoencoder Neural Network) We want a few small nodes in the middle to learn the data at a conceptual level, producing a compact representation.
Joachim Sauter In Information visualization it is often helpful to see data from a different angle and compare it with other information, update it, network it.
McAfee, Brynjolfsson The more companies characterized themselves as data-driven, the better they performed on objective measures of financial and operational results.
O. G. Sutton A technique succeeds in mathematical physics, not by a clever trick, or a happy accident, but because it expresses some aspect of physical truth.
Yann LeCun There are just not enough brain cells on the planet to even look or even glance at that data, let alone analyze it and extract knowledge from it.
Eric Schmidt
There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.
Jake Porway
(October 1, 2015)
We must scale the process of problem discovery through deeper collaboration between the problem holders, the data holders, and the skills holders.
James Guszcza
(January 26, 2015)
If we want to act on data to get fit or reduce heating bills, we need to understand not just the analytics of the data, but how we make decisions.
Randy Bartlett
The choice of tools in applied statistics is driven by the objective, the structure of the data, and the nature of the uncertainty in the numbers.
André Sionek
(18.12.2018 15:39)
Just ‘plugging in’ a data scientist in your databases won’t deliver the expected results. First, you need to ensure your data is actually valuable.
Claudia Perlich Data is the footprint of real life in some form, and so it is always interesting. It is like a detective game to figure out what is really going on.
Erin Shellman As a data scientist, even if you don’t have the domain expertise you can learn it, and can work on any problem that can be quantitatively described.
John W. Tukey The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
Eric Jonas You don’t really know if what you’re working on is the right problem to be solving sometimes until years out, but you hope you’re on the right track.
Joel Cadwell
(November 14, 2014)
R also becomes the interface to a diverse range of applications. This is R’s unique selling proposition. It is where one goes for new ways of seeing.
John Langford
The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems.
John Straw The value is not in software, the value is in data, and this is really important for every single company, that they understand what data they’ve got.
Michele Nemschoff
(August 30, 2014)
Big data isn’t just for developers and analysts in the technical arena. In today’s digital age, big data has become a powerful tool across industries.
Jiawei Han, Micheline Kamber, Jian Pei
The capability of OLAP to provide multiple and dynamic views of summarized data in a data warehouse sets a solid foundation for successful data mining.
European Union’s General Data Protection Regulation (GDPR)
(Dec. 2016)
Organizations that use ML to make user-impacting decisions must be able to fully explain the data and algorithms that resulted in a particular decision.
George Udny Yule In our lust for measurement, we frequently measure that which we can rather than that which we wish to measure… and forget that there is a difference.
Joyce Jackson Data mining is only tool; it does not eliminate the need to know the business, to understand the data, or to understand the analytical methods involved.
Kuhn, Johnson As long as complex models are properly validated, it may be improper to use a model that is built for interpretation rather than predictive performance.
Yann LeCun The only way to make intelligent machines was to get into learning, because every animal is capable of learning. Anything with a brain basically learns.
(07 May 2018)
We all know that machine learning is about handling data, but it also can be seen as: The art of finding order in data by browsing its inner information.
Erin Shellman Presentation skills are undervalued, but they are actually one of the most important factors contributing to personal success and creating successful projects.
Foster Provost & Tom Fawcett
At a high level, data science is a set of fundamental principles that support and guide the principled extraction of information and knowledge from data.
H. James Harrington
If you can’t measure something, you can’t understand it. If you can’t understand it, you can’t control it. If you can’t control it, you can’t improve it.
John W. Tukey Far better an approximate answer to the right question which is often vague, than an exact answer to the wrong question which can always be made precise.
(August 20, 2015)
Every number has a story. As a data scientist, you have the incredible job of digging in and analyzing massive sets of numbers to find what that story is.
RStudio Data science is the process of turning data into understanding and actionable insight. Two key data science tools are data manipulation and visualization.
Jake Porway Data science is a way to see the world through the lens of this new macroscope to learn the patterns of society and nature so we can all live better lives.
Michael Young
For many organisations, the accessibility of the tools and products to deliver analytics and data mining has led to an increased awareness of the benefits.
Ashish Jain
(March 29, 2015)
The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms.
Assaf Resnick We believe that only through leveraging data science can IT teams tackle the scale of machines, events and dependencies that must be understood and managed.
Eric Jonas Graduate students, perhaps because of an adherence to sunk cost fallacy, often write really great surveys of the field at the beginning of their PhD thesis.
Foster Provost & Tom Fawcett
The volume and variety of data have far outstripped the capacity of manual analysis, and in some cases have exceeded the capacity of conventional databases.
Why do we use exploratory graphs in data analysis?
• Understand data properties
• Find patterns in data
• Suggest modeling strategies
• “Debug” analyses
Simon Moss
In today’s competitive environment, enterprises require fast, effective and agile business solutions. In other words, not traditional business intelligence.
Andre Karpistsenko Getting through life, through those uncertainties in a way, when you look back and see things still connect and exist, that’s the biggest measure of success.
Jonathan Lenaghan People under pressure to find patterns are prone to fall into the common human fallacies of over-fitting insufficient data and over-reading correlation as causation.
Boris Tvaroska
The most important step for applying machine learning to DevOps is to select a metric (accuracy, F1, or other), define the expected target, and its evolution.
Kelly Sheridan
Before implementing an advanced analytics strategy, you might have plenty of questions. The key is to be sure you’re asking, and addressing, the correct ones.
Yann LeCun Knowledge is some compilation of data that allows you to make decisions, and what we find today is that computers are making a lot of decisions automatically.
Foster Provost, Tom Fawcett
However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz.
Third Nature
Data warehouses have not been able to keep up with business demands for new sources of information, new types of data, more complex analysis and greater speed.
Eric Jonas When I evaluate machine learning papers, what I am looking to find out is whether the technique worked or not. This is something that the world needs to know …
Tess Nesbitt Sampling – analyzing representative portions of the available information – can help speed development time on models, enabling them to be deployed more quickly.
GE is investing more than $1 billion in building up its data science capabilities to provide data and analytics services across business functions and geographies.
William Jackson
The point of big data is to be able to extract usable information – knowledge – from large volumes of data that do not have any immediately apparent relationships.
Zachary Chase Lipton
The whole reason we turn to machine learning and not handcraft decision rules is that for many problems, simple, easily understood decision theory is insufficient.
John Foreman It’s essential for a data science team to hire people who can really speak about the technical things they’ve done in a way that nontechnical people can understand.
David Lewis-Williams Scientists do not collect data randomly and utterly comprehensively. The data they collect are only those that they consider ‘relevant’ to some hypothesis or theory.
Fred Brooks Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious.
Joseph Schumpeter Innovations imply, by virtue of their nature, a big step and a big change … and hardly any ‘ways of doing things’ which have been optimal before remain so afterward.
Ronald Fisher To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem – he may be able to say what the experiment died of.
Kaan Turnali
(Feb 21, 2015)
Passion Matters: Some people go to work. Others get up each morning and work with a desire to make a difference. We don’t need to save the world to make a difference.
Kaiser Fung Before getting into the methodological issues, one needs to ask the most basic question. Did the researchers check the quality of the data or just take the data as is?
Brian Hopkins
(June 27, 2015)
I saw it coming last year. Big data isn’t what it used to be. Not because firms are disillusioned with the technology, but rather because the term is no longer helpful.
Eric Jonas I have a clock that shows the estimated number of days until I’m 80, which is a reasonable life expectancy. It helps to remind me that each day actually really matters.
Philip Russom
The point is to embrace Big Data Management for analytics as a unique practice that doesn’t follow all the strict rules we’re taught for reporting and data warehousing.
Sir Francis Galton [Statistics are] the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the science of man.
(18.02.2019 10:28)
The imminent danger with Artificial Intelligence has nothing to do with machines becoming too intelligent. It has to do with machines inheriting the stupidity of people.
Dan Ariely Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.
Gabriel Lowy
(January 22, 2015)
Success with predictive analytics relies on choosing the right data set, assuring the quality of the data, and validating models and algorithms used to analyze the data.
W. H. Auden Thou shalt not answer questionnaires / Or quizzes upon world affairs, / Nor with compliance / Take any test. Thou shalt not sit / With statisticians nor commit / A social science.
Hadley Wickham I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.
Justin Washtell
(November 3, 2014)
The central premise of predictive modeling is precisely that one size does not fit all – otherwise we would just assign the same outcome to all cases and be done with it.
Eugene Dubossarsky
(February 15, 2018)
There’s only one purpose to data science, and that is to support decisions. And more specifically, to make better decisions. That should be something no one can argue with.
Brian Liou
(March 2, 2015)
Learning to code for the purpose of analyzing data is a more practical and employable application of coding skills for the majority of those interested in learning to code.
Gabriel Lowy
(January 22, 2015)
The ability to integrate historical data with newer sources of information to predict market trends and rapidly implement strategies is becoming the primary differentiator.
Mike Barlow
In the near future, most companies will rely increasingly on AI-supported agents and bots to manage customer relationships, leading to a radical transformation of commerce.
Naima Chouikhi, Boudour Ammar, Adel M. Alimi
It is a widely accepted fact that data representations intervene noticeably in machine learning tools. The better defined they are, the better the performance results.
Davenport & Beck
Attention is focused mental engagement on a particular item of information. Items come into our awareness, we attend to a particular item, and then we decide whether to act.
Inside Analysis … big data can only provide value if organizations employ the proper technologies and processes, and figuring out exactly which tools you’ll need can be a serious challenge.
John Foreman Vendors are there to sell you a tool for a problem you may or may not have yet, and they’re very good at convincing you that you need it whether you actually need it or not.
Victor Hu Hiring data scientists is very exciting at this time because in some ways there are no established guidelines on how to do it. People have skills in so many different areas.
William S. Cleveland
Data analysis needs to be part of the blood stream of each department and all should be aware of the workings of subject matter investigations and derive stimulus from them.
Foster Provost & Tom Fawcett
Every good chemist had to be a competent lab technician. Similarly, it is hard to imagine a working data scientist who is not proficient with certain sorts of software tools.
Stephen Pratt Whether you know it or not, as an executive in a company, your company is doing machine learning and artificial intelligence. It’s just in hidden pockets within your company.
Daniel Tunkelang It’s easy to be lazy and look at aggregates. Drilling down into the differences and looking at specific examples is often what gives us a real understanding of what’s going on.
Istvan Hajnal
(February 23, 2015)
My advice to the market research world is to stop conceptualizing so much when it comes to Big Data and Data Science and simply apply the new techniques where appropriate.
Kune, Konugurthi, Agarwal, Chillarige, Buyya
Big Data computing is an emerging data science paradigm of multi-dimensional information mining for scientific discovery and business analytics over large-scale infrastructure.
‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.
Dr. Olivier Lichtarge A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge.
ATKearney Although Big Data processes large, diverse data sets to reveal complex relationships, humans are the crucial ingredient for interpreting the data and relationships into insights.
Tom Davenport The application of big data now encompasses not only powerful data collection and analytics to support operations but also the embedding of ‘smartness’ in products and services.
Colorado Reed
What should I do if I want to get ‘better’ at machine learning, but I don’t know what I want to learn? Excellent question! My answer: consistently work your way through textbooks.
Prasant Misra, Yogesh Simmhan, Jay Warrior
Analytics and decision making have to be probabilistic; and the system and application have to be conscious of what is “good enough” and not fail in the absence of perfect behavior.
William C. Blackwelder … a hypothesis test tells us whether the observed data are consistent with the null hypothesis, and a confidence interval tells us which hypotheses are consistent with the data.
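The duality Blackwelder describes can be checked numerically. A hedged sketch with made-up data, assuming a known population sigma so a simple z-based interval applies (alpha = 0.05):

```python
import math

# Made-up sample; sigma is assumed known for the z-based illustration.
data = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.4]
n = len(data)
mean = sum(data) / n           # 5.05
sigma = 0.25
se = sigma / math.sqrt(n)      # standard error of the mean
z = 1.96                       # two-sided 95% critical value

# 95% confidence interval: every mu inside it is "consistent with the data".
ci = (mean - z * se, mean + z * se)

def z_test_rejects(mu0):
    """Two-sided z-test at alpha = 0.05 for H0: mu = mu0."""
    return abs(mean - mu0) / se > z

# Duality: the test rejects mu0 exactly when mu0 lies outside the CI.
assert not z_test_rejects(mean)            # hypotheses inside the CI survive
assert z_test_rejects(ci[1] + 0.01)        # just outside the CI: rejected
assert not z_test_rejects(ci[1] - 0.01)    # just inside the CI: not rejected
```

The confidence interval is thus the set of null hypotheses the test would fail to reject, which is Blackwelder's point restated in code.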
Andreas Blumauer
(October 28, 2014)
Whenever relational databases cannot fulfill requirements about performance and simplicity, due to the complexity of database queries, graph databases can be used as an alternative.
Eric Jonas In academics or industry, if you’re not actually speaking in a language that your customers understand, then you will have a nice time talking, but no one will really listen to you.
S. N. D. North The science of statistics is the chief instrumentality through which the progress of civilization is now measured, and by which its development hereafter will be largely controlled.
Valdis Krebs Innovation happens at the intersection of two or more different, yet similar, groups. Where one technology meets another, one discipline meets another, one department meets another.
Mark Cuban Artificial Intelligence, deep learning, machine learning – whatever you’re doing if you don’t understand it – learn it. Because otherwise you’re going to be a dinosaur within 3 years.
Mrs. Dillman
We didn’t know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane, and the pre-hurricane top-selling item was beer.
Claudia Perlich I prefer somebody who has done ten different things in ten different domains because they will have hopefully learned something new about data from each of different places and domains.
Eric Jonas The hope is that if we can start building the right models to find the right patterns using the right data, then maybe we can start making progress on some of these complicated systems.
(June 2015)
IBM will educate one million data scientists and data engineers on Apache Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.
Martyn Jones
(March 12, 2015)
Is Big Data really about high volumes, high velocity and high variety, or is it in fact about much noise, too much pomposity and abundant similarity leading to unnecessary high anxiety?
n.n. We Learn . . .
10% of what we read
20% of what we hear
30% of what we see
50% of what we see and hear
70% of what we discuss
80% of what we experience
95% of what we teach others.
n.n. Learning new tools and techniques in data science is sort of like running on a treadmill – you have to run continuously to stay on top of it. The minute you stop, you start falling behind.
Thomas Speidel I almost feel that folks in data science [excluding statisticians] are suddenly realizing that this kind of work is not new and are desperately looking for ways to justify a distinction.
Gary Orenstein
Given enough time in computing, you can do just about anything, but only recently have people been able to apply these machine learning models in real time to critical business processes.
Foster Provost & Tom Fawcett
Success in today’s data-oriented business environment requires being able to think about how these fundamental concepts apply to particular business problems – to think data-analytically.
Tavish Srivastava
(May 19, 2015)
Machine Learning algorithms are like solving a Rubik’s Cube. You grapple at the beginning to figure out the hidden algorithm, but once learnt, some can even solve it in less than 7 seconds.
Tom Gilley
(April 21, 2015)
It’s easy to underestimate your data. And even more so – not think of your IoT data as valuable. However, if you examine your data, you never know the sort of insights you could discover.
Yann LeCun The idea that somehow you can put a bunch of research scientists together and then put some random manager who’s not a scientist directing them doesn’t work. I’ve never ever seen it work.
Christine L. Borgman
Data have no value or meaning in isolation; they exist within a knowledge infrastructure – an ecology of people, practices, technologies, institutions, material objects, and relationships.
Sundar Pichai Machine learning is a core, transformative way by which we’re rethinking everything we’re doing. We’re thoughtfully applying it across all our products, be it search, ads, YouTube or Play.
Anjul Bhambhri A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.
To quickly detect and respond to issues, organizations need an analytics platform that offers rich statistical process control (SPC) functionality as well as real-time monitoring and alerting.
Julie Hunt
(April 7, 2015)
The need to analyze data is at the foundation of every effective data management strategy, whether the analysis is handled from the business perspective or the technology side of the equation.
Nikhil Buduma
(29 December 2014)
In general, choosing smart training cases is a very good idea. There’s lots of research that shows that by engineering a clever training set, you can make your neural net a lot more effective.
Scott Mongeau
(December 17)
Analytics professionals should keep a persistent eye open for opportunities to create cross-functional insights from data. Insights can frequently be leveraged across multiple business domains.
Sebastian Raschka
(August 24, 2014)
When we are dealing with a new dataset, it is often useful to employ simple visualization techniques for exploratory data analysis, since the human eye is very powerful at discovering patterns.
David McCandless By visualizing information, we turn it into a landscape that you can explore with your eyes, a sort of information map. And when you’re lost in information, an information map is kind of useful.
Kaiser Fung
(May 2015)
Story time is the moment in a report on data analysis when the author deftly moves from reporting a finding of data to the telling of stories based on assumptions that do not come from the data.
Robert Neuhaus Feature engineering and feature selection are not mutually exclusive. They are both useful. I’d say feature engineering is more important though, especially because you can’t really automate it.
PP Zhu
Data collection and compression are performed at the device level while advanced processes such as semantic analysis, natural language processing, and machine learning are performed in the cloud.
William S. Cleveland
A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use.
William S. Cleveland
Model building is complex because it requires combining information from exploring the data and information from sources external to the data such as subject matter theory and other sets of data.
Yann LeCun The amount of human brainpower on the planet is actually increasing exponentially as well, but with a very, very, very small exponent. It’s a very slow growth rate compared to the data growth rate.
David Lillis
One data manipulation task that you need to do in pretty much any data analysis is recode data. It’s almost never the case that the data are set up exactly the way you need them for your analysis.
SBS documentary “The Age of Big Data” Data is becoming a powerful and most valuable commodity in 21st century. It is leading to scientific insights and new ways of understanding human behaviour. Data can also make you rich. Very rich.
Seth Mottaghinejad
(December 15, 2015)
We will intentionally avoid using proper case for ‘big data’, because (1) the term has been somewhat hackneyed, and (2) … we can think of big data as any dataset too large to fit into memory …
n.n. Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
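The definition above can be made concrete with a minimal sketch (the timestamp format and the chosen features are illustrative assumptions, not a prescription): raw timestamp strings become features that expose time-of-day, weekend, and seasonal structure to a model.

```python
from datetime import datetime

# Illustrative sketch of feature engineering: turn a raw timestamp string
# into model-ready features. The feature choices are assumptions made for
# the sake of example.
def engineer_features(raw_timestamp):
    dt = datetime.strptime(raw_timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,                  # time-of-day effect
        "is_weekend": dt.weekday() >= 5,  # Saturday=5, Sunday=6
        "month": dt.month,                # seasonality
    }
```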
Pete Werner
(March 14, 2015)
If you can agree with this sentence: “I have some data that makes sense to represent in a tabular like structure, and I want to do some cool statistics stuff with it” R is definitely a good choice.
Timo Elliott
(January 7, 2015)
The layer-cake best-practice model of analytics (operational systems and external data feeding data marts and a data warehouse, with BI tools as the cherry on the top) is rapidly becoming obsolete.
Hal R. Varian
The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades.
Lord Kelvin When you can measure what you are speaking about and express it in numbers, you know something about it. When you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.
Michael Berthold
(Dec 2014)
Data Warehouses were supposed to solve the need for, I guess all of ETL, once and for all; they ended up being a solution for fairly static data structures but never really captured all of the data.
Roger Ehrenberg The biggest lesson is to have a very clear set of customers that you’re going to serve, notwithstanding the fact you may be building something that can ultimately help many different types of customers.
Tony Fischetti
Model building is very often an iterative process that involves multiple steps of choosing an algorithm and hyperparameters, evaluating that model / cross validation, and optimizing the hyperparameters.
Fei-Fei Li
Little by little, we’re giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won’t be the only ones pondering & exploring our world.
Enric Junqué de Fortuny, David Martens, Foster Provost
… big data thinking opens our view to nontraditional data for predictive analytics – datasets in which each data point may incorporate less information, but when taken in aggregate may provide much more.
John D. Cook
(15 December 2010)
Enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.
Eric Jonas Some of the best scientists out there are the ones who are extremely opportunistic – when they see novel ideas and how things suddenly fit together, they drop everything else and work on that for a while.
European Union’s General Data Protection Regulation (GDPR)
(Dec. 2016)
How could a result be explained, especially a result of a machine learning model, without a versioned record of what data was input to generate the result and what data was output representing the result?
Lyndsay Wise
(February 21, 2015)
One of the benefits of cloud analytics and computing in general is the ability for small and mid-sized companies to take advantage of technology and applications that may have previously been out of reach.
Neal Ford
Applications of the future will take advantage of the polyglot nature of the language world. … We should embrace this idea. … It’s all about choosing the right tool for the job and leveraging it correctly.
John Foreman
What’s better: a simple model that’s used, updated, and kept running? Or a complex model that works when you babysit it, but the moment you move on to another problem no one knows what the hell it’s doing?
Martin Blumenau Visualization now allows everyone to see, understand and especially communicate data clearly in order to make the right business decisions and align the entire company with regards to the target and vision.
William S. Cleveland
Theory, both mathematical and non-mathematical theory, is vital to data science. … Tools of data science – models and methods together with computational methods and computing systems – link data and theory.
Andrew Pease
(November 3, 2014)
Without statistics, big data is never going to reach its full potential. And perhaps even more importantly, without informed consumers, big data can be used for misleading and downright dangerous conclusions.
Jeroen Janssens Data scientists love to create interesting models and exciting data visualizations. However, before they get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.
Mark Hammond
There are 18 million developers in the world, but only one in a thousand have expertise in artificial intelligence. To a lot of developers, AI is inscrutable and inaccessible. We’re trying to ease the burden.
Michal Klos
(January 28, 2015)
We are in the Golden Age of Data. For those of us on the front-lines, it doesn’t feel that way. Every step forward this technology takes, the need for deeper analytics takes two. We’re constantly catching up.
Enric Junqué de Fortuny, David Martens, Foster Provost
Our results suggest that it is worthwhile for companies with access to such fine-grained data, in the context of a key predictive task, to gather data both on more data instances and on more possible features.
Alex Jones
(September 18, 2014)
One of the advantages of open-source software (R, KNIME, Rapidminer etc.) is that when new methods come out, they are often quickly released. Whereas, on traditional software, development can be a long process.
Richard Bellman
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
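Bellman’s statement is the foundation of dynamic programming. A tiny cost-to-go recursion shows the structure (the state graph and its costs below are made up for illustration): the optimal cost from any state is the best first decision plus the optimal cost from the state it leads to.

```python
# Toy dynamic-programming sketch of Bellman's principle of optimality.
# Edges map (from_state, to_state) -> cost; the graph is illustrative.
costs = {("A", "B"): 1, ("A", "C"): 4, ("B", "C"): 1, ("B", "D"): 5, ("C", "D"): 1}

def cost_to_go(state, goal="D"):
    if state == goal:
        return 0
    # Bellman recursion: minimize over first decisions from this state,
    # each followed by an optimal continuation.
    return min(c + cost_to_go(t, goal)
               for (s, t), c in costs.items() if s == state)
```

Whatever the first move from A, the remaining moves must themselves be optimal, which is exactly why the recursion is valid.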
Jean-Paul Isson Being a data scientist is not only about data crunching. It’s about understanding the business challenge, creating some valuable actionable insights to the data, and communicating their findings to the business.
Robert Holland
(May 1, 2015)
Current employees who are able to develop an analytics skill set and combine that with their knowledge of the business can be invaluable when moving analytical insights across the ‘last mile’ to decision makers.
Simon Moss
A key differentiator between heterogeneous analytics and traditional BI is the ability to rapidly deploy ideas into solutions, adapt to changes in the environment and maintain flexibility of one’s various assets.
Tina Porter Data Analytics brings together components of memory function and interconnects the relations through holding patterns. So when we process Analytics and Act on them, we essentially create a ‘new permanent record.’
David Smith
(January 21, 2016)
Data Science is a strategic initiative for most companies today, who seek to understand the wealth of data now available to them to understand patterns, make forecasts, and build data-driven products and processes.
Fatih Hamurcu
(May 7, 2015)
On a sequential computer, the fast algorithm is the best algorithm, but for new science areas, I believe we need more creative approaches to algorithm design in order to extract more valuable insight in real time.
Kaiser Fung We are not saying that statisticians should not tell stories. Story-telling is one of our responsibilities. What we want to see is a clear delineation of what is data-driven and what is theory (i.e., assumptions).
Kaiser Fung
(July 2015)
One of the most misguided and dangerous ideas floated around by a group of Big Data enthusiasts is the notion that it is not important to understand why something happens, just because ‘we have a boatload of data’.
Kune, Konugurthi, Agarwal, Chillarige, Buyya
Big Data and traditional data warehousing systems have the similar goal of delivering business value through the analysis of data, but they differ in the analytics methods and the organization of the data.
Suman Malekani
(January 29, 2015)
While working on Big Data & planning to implement it for the benefit of business, it is very important to explain the insights & valuable knowledge in a way that a non-technical business user can actually understand.
Baiju Devani
Data Scientists are people who can reason through data using inferential reasoning, think in terms of probabilities, be scientific and systematic, and make data work at scale using software engineering best practices.
Ed Burns
(August 2014)
Before starting the analytical modeling process for big data analytics applications, organizations need to have the right skills in place and figure out how much data needs to be analyzed to produce accurate findings.
Dr. Olly Downs
(May 18, 2015)
Most of the big data investment focus to date has been on the underlying infrastructure, while development of the applications that make use of that infrastructure – and that deliver actual business value – has lagged.
Data integration features have gained prominence during the last year as companies struggled to incorporate new data sources in their analysis, a process that can consume a sizable percentage of the total project time.
Yanir Seroussi
(July 22, 2018)
We can define data science as a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.
Jeff Leek
To evaluate a person’s work or their productivity requires three things:
1. To be an expert in what they do
2. To have absolutely no reason to care whether they succeed or not
3. To have time available to evaluate them.
Michael Pearmain Naively, a lot of businesses think all it takes is to hit the big red button marked ‘Data Science’, get the answer of 42, and sit back and watch the $$$’s roll in; if you’ve been in this world, you know this isn’t the case.
Colin Strong
Data now feels so central to business success that without an ongoing, data-mediated relationship with their customer base, we may be looking at an environment where data-poor brands will struggle to compete effectively.
Jeroen Janssens
(August 20, 2014)
We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.
Christine L. Borgman Having the right data is usually better than having more data; little data can be just as valuable as big data. In many cases, there are no data – because relevant data don’t exist, cannot be found, or are not available.
Jeff Leek Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.
R. A. Fisher … the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only to give the facts a chance of disproving the null hypothesis.
Amir Hajian
A good data scientist in my mind is the person that takes the science part in data science very seriously; a person who is able to find problems and solve them using statistics, machine learning, and distributed computing.
Eric Jonas One way of understanding the data issue we face is by imagining that you are in a stadium and you can listen to 500 people at once. Your goal is to figure out what’s going on in the game just from listening to those people.
Gregory Piatetsky; Anmol Rajpurohit
(June 2014)
Good data science is on the leading edge of scientific understanding of the world, and it is data scientists’ responsibility to avoid overfitting data and to educate the public and the media on the dangers of bad data analysis.
William Vorhies
(October 8, 2014)
Many first time users of predictive models are happy to have the benefit of a good model with which to target their marketing initiatives and don’t ask the equally important question, is this the best model we can be using?
Daniel Gutierrez
(June 5, 2014)
One problem with machine learning is too much data. With today’s big data technology, we’re in a position where we can generate a large number of features. In such cases, fine-tuned feature engineering is even more important.
William Vorhies
(August 19, 2014)
Developing predictive analytic models is a different process from responding to requests for charts and graphs based on existing information. It is much more intuitive and creative and requires the mastery of different tools.
Julia Evans Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time.
Yanir Seroussi
Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions.
Gabriel Lowy
(January 22, 2015)
Efficiency is becoming as important a differentiator as speed and scale. As a result, firms are delving deeper into predictive analytics to realize faster time to value and improve operational performance and decision outcomes.
One robust way to determine if two time series, xt and yt, are related is to analyze whether there exists an equation like yt = βxt + ut such that the residuals (ut) are stationary (their mean and variance do not change when shifted in time).
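The check described in that quote can be sketched on simulated series. Everything below is illustrative: the data are generated with a made-up relationship, and the crude half-sample variance comparison merely stands in for a proper unit-root test on the residuals (e.g. an augmented Dickey-Fuller test).

```python
import random

# Toy sketch: regress y_t on x_t and check whether the residuals
# u_t = y_t - beta * x_t look stationary (stable variance across time).
random.seed(0)

x = [0.0]
for _ in range(499):
    x.append(x[-1] + random.gauss(0, 1))        # x_t: a random walk
y = [2.0 * v + random.gauss(0, 1) for v in x]   # y_t = 2 x_t + noise

# Ordinary least squares slope: beta = cov(x, y) / var(x)
mx, my = sum(x) / len(x), sum(y) / len(y)
beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
        / sum((a - mx) ** 2 for a in x))

resid = [b - beta * a for a, b in zip(x, y)]    # u_t = y_t - beta * x_t

def variance(s):
    m = sum(s) / len(s)
    return sum((v - m) ** 2 for v in s) / len(s)

# Crude stationarity check: the two halves should have similar variance.
half = len(resid) // 2
v1, v2 = variance(resid[:half]), variance(resid[half:])
looks_stationary = max(v1, v2) / min(v1, v2) < 3.0
```

Because the two simulated series really are related, the estimated slope lands near 2 and the residuals behave like stationary noise.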
Joyce Jackson With the current focus on technology and the automated techniques, rather than on the actual processes of exploration and analysis, many people perceive data mining as a product rather than as a discipline that must be mastered.
Eric Jonas Academic culture teaches you that you’re dumb and that you’re probably wrong because most things never work, nature is very hard, and the best you can hope for is working on interesting problems and making a tiny bit of progress.
James Kobielus
Companies may not be able to precisely predict the mix of workloads they’ll need to run in the future. But investing in the right family of big data cloud platforms and applications will give them the right foundation for change.
Diogo Ribeiro
Feature engineering is the process of transforming raw, unprocessed data into a set of targeted features that best represent your underlying machine learning problem. Engineering thoughtful, optimized data is the vital first step.
Simon Garland
(September 16, 2014)
To get meaningful results from big data, companies need to slice and dice their data in many different ways to work out which parts are worth using. Considering the enormity of the task, for many, this requires – above all – speed.
Miguel A. Hernán, John Hsu, Brian Healy
(July 12, 2018)
The validity of causal inferences depends on structural knowledge, which is fallible, to supplement the information in the data. As a consequence, no algorithm can quantify the accuracy of causal inferences from observational data.
Rick Delgado
(January 2015)
Myths change with understanding. Misunderstandings behind some of the current myths surrounding big data will fade away: that big data is made for big business, that big data adoption is high, and that machine learning overcomes human bias.
Donald Rumsfeld There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.
Marcia Kaufman, Daniel Kirsch
It is no longer sufficient for businesses to understand what has happened in the past; rather, it has become essential to ask what will happen in the future, to anticipate trends and to take actions that optimize results for the business.
Michael L. Brodie
In the end, there is no truth, no ultimate ground truth, no lie-free utterances, as everything is contextual based on incomplete facts and knowledge. All world models are flawed, but Data Science has 2 power tools: Doubt and Verify.
Claudia Perlich So a large part of how things are presented, communicated, and represented carry very different messages from very different angles, depending on what you are reading, so you probably need a very broad depth to understand the issues.
Eric Jonas The right thing to do is to not build a tool company but to build a consultancy based on the tools. Identify the company, identify the market, and build a consultancy. Later, if that works, you can then pivot to being a tool company.
Stephan Duquesnoy
Comment of a DeepLearning user: As a side-note, even though I’m good with pattern-based thinking, I do not have an academic background. I lack patience and feel the need to create, rather than to completely understand what I’m doing.
(November 16, 2017)
AI is when a machine performs a task that human beings find interesting, useful and difficult to do. Your system is artificially intelligent if, for example, machine-learning algorithms infer a customer’s need and recommend a solution.
Jeff Stanton
… even five years ago people would not label themselves as geeks because it was stigmatizing. The tide has turned and we are beginning to see a groundswell certainly among undergraduates who are okay identifying themselves as data geeks.
Kirk Borne
(14. Jan. 2015)
Now is the time to begin thinking of Data Science as a profession not a job, as a corporate culture not a corporate agenda, as a strategy not a stratagem, as a core competency not a course, and as a way of doing things not a thing to do.
Eric Colson, Brad Klingenberg, Jeff Magnusson
(March 31, 2015)
Data science can directly enable a strategic differentiator if the company’s core competency depends on its data and analytic capabilities. When this happens, the company becomes supportive to data science instead of the other way around.
Eric Jonas I actually think a lot of the future is in small data …. As the big data hype cycle crests, we’re going to see more and more people recognizing that what they really want to be doing is asking interesting questions of smaller data sets.
Unlike SAP HANA, Hadoop won’t help you understand your business at ‘the speed of thought.’ But it lets you store and access more voluminous, detailed data at lower cost so you can drill deeper and in different ways into your business data.
Steve Lohr
(Aug. 17, 2014)
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Jonathan Lenaghan I always try to look at the problem from the end. When you start from the beginning and everything is blue sky, there are hundreds of ideas to chase as well as thousands of ideas to try and, since everything is possible, nothing ever gets done.
Boris Tvaroska
The traditional software engineering approach is to run tests every time there is a change in the codebase. But when applying machine learning to DevOps, you need to remember that the quality of machine learning can change with no changes in the code.
Portability of code and environment is one of the challenges every data scientist faces. The code can be framework-dependent or machine-dependent. The end result: a model that works like a charm on one machine might not do so on another.
Christian Ward
In some cases this is accurate, but in many cases, data isn’t the answer. Actually, it doesn’t even describe the answer. Instead, it is a massive glut of content, numbers, and unstructured signals that require a lot more work to actually be useful.
Alex Jones
(September 18, 2014)
Over time, more industries will fundamentally change or be disrupted as companies begin to leverage analytics, enhance efficiency, and allow data to drive decisions. Simply put, the competitive environment will necessitate data science capabilities.
Amy Heineike There are a lot of different roles that are going under the name ‘data science’ right now, and there are also a lot of roles that are probably what you would think of as data science but don’t have a label yet because people aren’t necessarily using it.
Jonathan Symonds
(January 8, 2015)
The standard Machine Learning workflow is:
1. Get the data
2. Transform the data to create meaningful entities
3. Transform data for Machine Learning algorithms
4. Build supervised/unsupervised models/representations
5. Deploy the model in production.
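The five steps above can be sketched end to end in miniature. The nearest-centroid “model” and every name below are illustrative assumptions, not any particular library’s API.

```python
# A minimal pure-Python sketch of the quoted five-step workflow.

def get_data():
    # 1. Get the data: raw (feature, label) pairs.
    return [(1.0, "a"), (1.2, "a"), (3.8, "b"), (4.1, "b")]

def make_entities(rows):
    # 2. Transform the data into meaningful entities (here: dicts).
    return [{"x": x, "label": y} for x, y in rows]

def to_matrix(entities):
    # 3. Transform the entities into algorithm-ready features and targets.
    return [e["x"] for e in entities], [e["label"] for e in entities]

def fit(X, y):
    # 4. Build a supervised model: per-class mean, i.e. nearest centroid.
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        sums[yi] = sums.get(yi, 0.0) + xi
        counts[yi] = counts.get(yi, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict(model, x):
    # 5. "Deploy": serve predictions from the fitted model.
    return min(model, key=lambda c: abs(model[c] - x))
```

In production, step 5 would of course sit behind a service rather than a function call, but the shape of the pipeline is the same.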
Analise Polsky
Improving Visual Data Discovery:
1. Always have new data sources.
2. Always have new techniques.
3. Always have new tools and platforms.
Visual data discovery is not once and done. It is an iterative process that requires communication and exploration.
Nikhil Buduma
(29 December 2014)
So what’s the idea behind backpropagation? We don’t know what the hidden units ought to be doing, but what we can do is compute how fast the error changes as we change a hidden activity. Essentially we’ll be trying to find the path of steepest descent!
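The quantity the quote describes – how fast the error changes as we change a hidden activity – is just the chain rule. A one-hidden-unit network with made-up input, weights, and target makes it concrete:

```python
import math

# One-hidden-unit network (all numbers are made up for illustration).
# We compute how fast the squared error changes with the hidden activity,
# then chain back to the weight feeding the hidden unit.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1, w2, target = 0.5, 0.8, 1.5, 1.0

h = sigmoid(w1 * x)            # hidden activity
y_hat = sigmoid(w2 * h)        # network output
error = 0.5 * (y_hat - target) ** 2

# Chain rule: dE/dh = dE/dy_hat * dy_hat/dh
dE_dh = (y_hat - target) * y_hat * (1 - y_hat) * w2

# One more link gives the weight gradient; stepping against it is the
# "path of steepest descent": dE/dw1 = dE/dh * dh/dw1
dE_dw1 = dE_dh * h * (1 - h) * x
```

A finite-difference check (perturb `w1` slightly and re-measure the error) confirms the analytic gradient.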
Cheng-Tao Chu
(March 2015)
In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data. For Big Data, it pays off to analyze the data upfront and then design the modeling pipeline accordingly.
E. T. Jaynes
It would be very nice to have a formal apparatus that gives us some ‘optimal’ way of recognizing unusual phenomena and inventing new classes of hypotheses that are most likely to contain the true one; but this remains an art for the creative human mind.
Kunal Jain
(August 28, 2014)
Cleaning data sets is a very crucial step in any kind of data mining. However, it is many times more important while dealing with unstructured data sets. Understanding the data and cleaning the data consumes the maximum time of any text mining analysis.
Kune, Konugurthi, Agarwal, Chillarige, Buyya
Big Data technology is evolving and changing the present traditional databases, with effective data organization, large-scale computing and data-workload processing, and new innovative analytics tools bundled with statistical and machine learning techniques.
Ryan Gould
(27.02.2019)
The misconception about big data is that the value lies in collecting mass amounts of information and interpreting it. This really isn’t the case. The ultimate value derives from the specific actions businesses take after analyzing such data.
Nikhil Garg
Most would agree that the single biggest bottleneck for all machine learning is software engineering. We all collectively in the tech industry are still figuring out the best practices, tools, abstractions, and systems that can enable large organizations
Ray Major
(August 22, 2014)
You can’t predict anything with 100% certainty, and your predictive power wanes the farther out you gaze. The study of KPIs over time is all about finding patterns and signals, then applying intelligence in order to make better decisions and gain wisdom.
(March 4th, 2015)
Data Science has its own language. So, if you want to have at least a slight chance of surviving in the enterprise world of tomorrow, with its obsessive focus on collecting and analyzing data, you better have started yesterday with learning this terminology.
Higinio “H.O.” Maycotte, Umbel
Big data is an umbrella term. It encompasses everything from digital data to health data (including your DNA and genome) to the data collected from years and years of paperwork issued and filed by the government. And that’s just what it officially covers.
Kaiser Fung
(September 2014)
This is the norm in statistical analysis. Every time you sit down to write something up, you notice additional nuances or nits. Sometimes, the problem is severe enough I have to re-run everything. Other times, you just decide to gloss over it and move on.
Kevin Kelly Machines are for answers; humans are for questions. The world that Google is constructing – a world of cheap and free answers – having answers is not going to be very significant or important. Having a really great question will be where all the value is.
Sir Ronald Fisher
The Theory of Estimation discusses the principles upon which observational data may be used to estimate, or to throw light upon the values of theoretical quantities, not known numerically, which enter into our specification of the causal system operating.
Bill Franks
(December 10, 2015)
One of the legendary events in the history of analytics was the original Netflix prize. The event led to a terrific example of the need to focus on not only theoretical results, but also pragmatically achievable results, when developing analytic processes.
Brandon Rohrer
(Dec 19, 2015)
Before data science can build the solution to simplify your life or make you lots of money, you have to give it some high quality raw materials to work with. Just like making a pizza, the better the ingredients you start with, the better the final product.
Richard Fichera Part of Hadoop’s appeal is that it is not specifically optimized for any specific solution or data type but rather a general framework for parallel processing, so your developers and data scientists can add any relevant data, whatever its format or source.
E.S. Woolard
(Feb. 1991)
To compete and win, we must redouble our efforts – not only in the quality of our goods and services, but in the quality of our thinking, in the quality of our response to customers, in the quality of our decision making, in the quality of everything we do.
Richard A. Becker, William S. Cleveland
Making graphs is very basic to data analysis. Whether you use the leading edge of statistical methods, or whether you want to quickly see the main features of your data, graphs are a must. They are the single most powerful class of tools for analyzing data.
Enric Junqué de Fortuny, David Martens, Foster Provost
Modeling (especially linear modeling) tends to find the larger generalities first; modeling with larger datasets usually helps to work out nuances, ‘small disjuncts’, and other nonlinearities that are difficult or impossible to capture from smaller datasets.
Eric Jonas In reality, almost no one actually cares about predictive accuracy because in almost all the cases, their starting point is nothing. The number of industries where the difference between 85 versus 90 percent accuracy is the rate-limiting factor is very small.
Alon Hazan, Yoel Shoshan, Daniel Khapun, Roy Aladjem, Vadim Ratner
(29 May 2018)
Deep neural networks have demonstrated impressive performance in various machine learning tasks. However, they are notoriously sensitive to changes in data distribution. Often, even a slight change in the distribution can lead to drastic performance reduction.
Michael Palmer
Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc., to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.
Lisa Morgan
If you haven’t considered how intelligent automation will impact your industry and company, start now. Automation is going to impact every industry and every business in some way, first as a competitive differentiator and later as a matter of economic necessity.
Guerrilla Analytics
(July 21, 2015)
Data Science done well tells you:
• what you didn’t already know about the data
• what an appropriate algorithm should be, given what you now know about the data
• what the measurable expectations of that algorithm should be when it is automated in production
Tavish Srivastava
(May 19, 2015)
Boosted algorithms are used where we have plenty of data to make a prediction and we seek exceptionally high predictive power. Boosting is used for reducing bias and variance in supervised learning. It combines multiple weak predictors to build a strong predictor.
Ben Recht
We’re trying to put (machine learning systems) in self-driving cars, power networks … If we want machine learning models to actually have an impact in everyday experience, we’d better come out with the same guarantees as one of these complicated airplane designs.
T. Alan Keahey Analytics plays a key role by helping to reduce the size and complexity of big data to a point where it can be effectively visualized and understood. In the best scenario, the visualization and analytics are integrated so that they work seamlessly with each other.
Michael L. Brodie
Data Science is a body of principles and techniques for applying data-intensive analysis to investigate phenomena, acquire new knowledge, and correct and integrate previous knowledge with measures of correctness, completeness, and efficiency of the derived results.
Marc Botha
(21.01.2019 18:07)
The hype about the possibilities and possible applications of artificial intelligence (AI) seems currently unlimited. The AI procedures and solutions are praised as true panaceas. However, when viewed soberly, they are just another tool in the toolbox of IT experts.
Kaiser Fung The standard claim is that the observed effect is so large as to obviate the need for having a representative sample. Sorry – the bad news is that a huge effect for a tiny non-random segment of a large population can coexist with no effect for the entire population.
Nathan Yau What is good visualization? It is a representation of data that helps you see what you otherwise would have been blind to if you looked only at the naked source. It enables you to see trends, patterns and outliers that tell you about yourself and what surrounds you.
Steve Ballmer I think it’s the dawn of an exciting new era of info and computer science … It’s a new world in which the ability to understand the world and people and draw conclusions will be really quite remarkable… It’s a fundamentally different way of doing computer science.
Robert Kosara
Raw numbers are easy to report and analyze, but without the proper context, they can be misleading. Is the effect you’re seeing real, or a simple result of the underlying, obvious distribution? Too many analyses and news stories end up reporting things we already know.
Vladimir N. Vapnik
After the success of the SVM in solving real-life problems, the interest in statistical learning theory significantly increased. For the first time, abstract mathematical results in statistical learning theory have a direct impact on algorithmic tools of data analysis.
Zachary Chase Lipton
(January 2015)
Generally, the systems implementation of machine learning methodology and ongoing software maintenance challenges are an understudied area that will continue to grow in importance as machine learning systems become more commonplace in commercial and open source software.
Kevin Hanegan
The potential for data-informed decision making across all roles and business functions is massive. Data Literacy has been proven to positively impact organizations' enterprise value by up to 5 percent, so it is little surprise that this skill set is becoming highly valued.
Big data isn’t merely an extension of BI and it isn’t necessarily about sifting through huge volumes of data. What sets organizations that are extracting meaningful insights from their big data apart from others is their ability to tap the right data in the right situation.
Speier, Cheri; Valacich, Joseph; Vessey, Iris
Information overload occurs when the amount of input to a system exceeds its processing capacity. Decision makers have fairly limited cognitive processing capacity. Consequently, when information overload occurs, it is likely that a reduction in decision quality will occur.
Foster Provost & Tom Fawcett
Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision making or to see data-oriented competitive threats.
William Vorhies
(August 11, 2015)
Data Scientist may be a prestigious title but it doesn’t reflect our area of specialization or the depth of our experience. As legions of newly minted Data Scientists are granted degrees over the next few years the problem for both employee and employer will only grow worse.
R. A. Fisher If … we choose a group of social phenomena with no antecedent knowledge of the causation or absence of causation among them, then the calculation of correlation coefficients, total or partial, will not advance us a step toward evaluating the importance of the causes at work.
n.n. A Data Scientist is someone with a deliberate dual personality who can first build a curious business case defined with a telescopic vision and can then dive deep with a microscopic lens to sift through DATA to reach the goal, while defining and executing all the intermediate tasks.
Christian Madsbjerg The truth is, that we need more, not less, data interpretation to deal with the onslaught of information that constitutes big data. The bottleneck in making sense of the world’s most intractable problems is not a lack of data, it is our inability to analyse and interpret it all.
Jasper Snoek, Hugo Larochelle and Ryan P. Adams
(29 Aug 2012)
Machine learning algorithms frequently require careful tuning of model hyperparameters, regularization terms, and optimization parameters. Unfortunately, this tuning is often a ‘black art’ that requires expert experience, unwritten rules of thumb, or sometimes brute-force search.
Scott Locklin
Feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.
William McKnight
(October 13, 2014)
Depending on your architecture, the data in Hadoop, the data warehouse and other stores will need to be brought together and analyzed as a whole. Combining Hadoop data with the data in your warehouse and relational data stores greatly increases the effectiveness of your analytics.
Robin Fray Carey Predictive analytics and cloud solutions are separately changing the way organizations do business, and bringing these two technologies together opens up new horizons. This timely research study will help organizations make investment decisions and effectively plan for the future.
Andre Karpistsenko To build successful teams and projects, I strongly believe in the Kaizen approach. Kaizen was made famous in part by Japanese car manufacturers involved in continuous improvement. I believe you should always be looking for ways to improve things, just small things. Just try it out.
Ferris Jumah
(Sep 3, 2014)
We see that machine learning, data mining, data analysis and statistics are all highly ranked skills in the (Data Science Skill) network. This indicates that being able to understand and represent data mathematically, with statistical intuition, is a key skill for data scientists.
Foster Provost & Tom Fawcett
Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data. For the perspective of this article, the ultimate goal of data science is improving decision making, as this generally is of paramount interest to business.
Gregory Yankelovich
(October 20, 2014)
Big Data is not about volume, size or velocity of data – none of which is easily translated into financial results for most businesses. It is about the integration of external sources of information and unstructured data into a company’s IT infrastructure and business processes.
(March 2011)
Analytics is the use of data and related business insights developed through applied analytical disciplines (e.g., statistical, contextual, quantitative, predictive, cognitive and other models) to drive fact-based planning, decisions, execution, management, measurement and learning.
Kune, Konugurthi, Agarwal, Chillarige, Buyya
Big Data technologies are being adopted widely for information exploitation with the help of new analytics tools and large scale computing infrastructure to process huge variety of multi-dimensional data in several areas ranging from business intelligence to scientific explorations.
H. Simon The aim … is to provide a clear and rigorous basis for determining when a causal ordering can be said to hold between two variables or groups of variables in a model . . . . The concepts refer to a model-a system of equations-and not to the ‘real’ world the model purports to describe.
Ajay Kelkar
(September 2, 2014)
So consumers are happy to share personal information as long as they see a “value add” for themselves. And organisations with trust-based information sharing relationships with customers will have significant competitive advantage over those with traditional data gathering relationships.
Mike Barlow
To be fair, it’s likely that small devices will eventually have enough processing power and data storage capacity to run AI programs “off the grid,” but that day is still far in the future. For the meantime, we’ll need the cloud to take advantage of AI’s potential as a tool for progress.
John Foreman
There is this idea endemic to the marketing of data science that big data analysis can happen quickly, supporting an innovative and rapidly changing company. But in my experience and in the experience of many of the analysts I know, this marketing idea bears little resemblance to reality.
Martyn Jones
(May 16, 2015)
Big Data is not for everyone, not everyone needs it, and even if some businesses benefit from analysing their data, they can do smaller Big Data using conventional rock-solid, high-performance and proven database technologies, well-architected and packaged technologies that are in wide use.
Paul Hudak
Programs written in a DSL [domain specific language] also have one other important characteristic: they can often be written by non-programmers … a user immersed in a domain already knows the domain semantics. All the DSL designer needs to do is provide a notation to express that semantics.
The ability of deep learning to create features without being explicitly told means that data scientists can save sometimes months of work by relying on these networks. It also means that data scientists can work with more complex feature sets than they might have with machine learning tools.
Jane Griffin
Analytics are hot. There is really no debate anymore on whether to add or not to add analytics to the information technology (IT) and business activities within an organization. Instead the debate centers on how to make the best use of the myriad of analytical opportunities that are out there.
Michael Walker
(November 19, 2014)
The goal of data science is to consider multiple scenarios, create meaning from data and provide decision-makers with high value information to make the best possible decisions. Data diversity and integration trumps big data to create valuable, actionable intelligence to make better decisions.
Dean Abbott
(December 06, 2015)
This kind of mindset is not learned in a university program; it is part of the personality of the individual. Good predictive modelers need to have a forensic mindset and intellectual curiosity, whether or not they understand the mathematics enough to derive the equations for linear regression.
Ed Burns
(October 23, 2014)
Predictive modeling can be a powerful tool to help businesses see problems and opportunities that are coming their way, but when done poorly, it can lead them down a path of error and uncertainty. Understanding where the pitfalls lie is a must for getting the most out of your analytical models.
Rao Naveen
There’s been a lot of talk about trying to make AI work on existing infrastructure. But the sad reality is that you’re always going to end up with something that’s far less than state-of-the-art. And I don’t mean it will be 30 or 40 percent slower. It’s more likely to be a thousand times slower.
Gilles Louppe
(July 2014)
There is often no need to build single models over immensely large datasets. Good performance can often be achieved by building models on (very) small random parts of the data and then combining them all in an ensemble, thereby avoiding all practical burdens of making large data fit into memory.
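Louppe's idea can be sketched in a few lines of NumPy: fit many models, each on only a small random part of the data, then average them into an ensemble. The dataset, sizes, and least-squares model below are purely illustrative choices, not anything from the quote itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: 100,000 rows, 5 features, known linear signal.
X = rng.normal(size=(100_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100_000)

def fit_ols(Xs, ys):
    """Ordinary least squares on one (small) random part of the data."""
    w, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return w

# Ensemble: fit 20 models, each on just 1% of the rows, then average them.
models = []
for _ in range(20):
    idx = rng.choice(len(X), size=1_000, replace=False)
    models.append(fit_ols(X[idx], y[idx]))
w_ensemble = np.mean(models, axis=0)

# The averaged model recovers the signal without ever fitting the full data.
print(np.max(np.abs(w_ensemble - true_w)))
```

No single fit ever touches more than 1,000 of the 100,000 rows, yet the averaged coefficients land very close to the true ones – the practical burden of fitting all the data at once is avoided, as the quote describes.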
Kristen Paral
(August 23, 2014)
The trouble with many companies today, however, is they don’t know or fully understand how data science can benefit their business. Even the definition and utilization of big data itself can be a mystery to many, because of the confusion caused in the marketplace by all the “big data solutions”.
Mark Barrenechea
(September 11, 2015)
Digital leaders know their data. They convert their information into actionable business insight. Considering that more data is shared online every second today than was stored in the entire Internet 20 years ago, it’s no wonder that differentiating products and services requires advanced tools.
Jonas Salk Reason alone will not serve. Intuition alone can be improved by reason, but reason alone without intuition can easily lead the wrong way … both are necessary. For myself, that’s how my mind works, and that’s how I work … It’s this combination that must be recognized and acknowledged and valued.
Shawn Masters
(April 2, 2015)
As organizations adopt machine-learning techniques, they will see immediate competitive advantages in automation, workflow efficiency and human augmentation. Now is the golden time to consider how machine learning could help your organization and implement this science to improve overall efficiency.
Victor Hu One of the big challenges of being a data scientist that people might not usually think about – is that the results or the insights you come up with have to make sense and be convincing. The more intelligible you can make them, the more likely it is that your recommendations will be put into effect.
Zygmunt Z.
Distributed computation generally is hard, because it adds an additional layer of complexity and communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.
Manish Saraswat
(December 11, 2015)
Data Manipulation is an inevitable phase of predictive modeling. A robust predictive model can’t just be built using machine learning algorithms; it requires an approach that understands the business problem and the underlying data, performs the required data manipulations, and then extracts business insights.
Jeffrey Heer, Michael Bostock, Vadim Ogievetsky
Graphical Perception Experiments find that spatial position (as in a scatter plot or bar chart) leads to the most accurate decoding of numerical data and is generally preferable to visual variables such as angle, one-dimensional length, two-dimensional area, three-dimensional volume, and color saturation.
Lavastorm Analytics
Interestingly, the number one concern for people in organizations that are just experimenting or planning for big data is the shortage of analytic professionals, which could indicate that organizations yet to take the big data plunge may be held back because of the lack of big data skills available to them.
Shelly Farnham
I do not know how you teach someone to love to learn, but being self-motivated is integral to this field (Data Science). Once you have the core concepts, to be able to be really excited about, and continue to seek out, new information is something that I look for, for example, when we are recruiting people.
Hernán Resnizky
(May 15, 2015)
Sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations but, if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless.
Kevin Daly
Big data is not for the faint of heart; you and your team must be willing to master many disciplines in order to be successful. You’ll need an understanding of code, hardware, virtualization, networking, databases (SQL & NoSQL), ETL, cloud, and more. Don’t fool yourself, you’ll need some serious skills on-board.
Suman Malekani
(January 29, 2015)
Data Scientist communities have their own complex jargon: multivariate regression models, Big Data engineering, Hadoop, MapReduce, Deep Learning etc. But, unfortunately, businesses do not seem to care about how complex the term is or how impressive the math is! They want the results explained in non-tech terms.
Lana Klein
Remember that the most critical thing is not building analytic solution but making sure that your organization starts using it: that means creating buy-in, working to build adoption, educating and training, redesigning processes to include analytics. Give it time, be persistent, improve and results will follow!
The SAP Real-Time Data Platform, with SAP HANA at its core, combines Hadoop with SAP Sybase IQ and other SAP technologies to provide a single platform for OLTP and analytics, with common administration, operational management, and lifecycle management support for structured, unstructured, and semi-structured data.
Andre Karpistsenko Conversations that happen in machines are different from the ones that happen in the physical world. In the physical world, it lasts a long time and we are able to use a lot of cues other than just text or audio. In computers, interactions are usually very short and many times there are many more people involved.
Christopher Bishop
(September 2013)
Pretty much every application in machine learning will constantly be dealing with uncertainty, and that’s increasingly a trend of computing. Computing is increasingly moving away from computing with logic to computing with uncertainty, and moving away from hand-crafted solutions to solutions which are learned from data.
Marc Andreessen
Algorithms are everywhere, starting with what you receive in your mail, to the ads that follow you online. Recommendation engines are just small evidences of algorithms at work – if you look around, you will realize that a complete transition of the world’s functioning is in full swing, from humans to algorithms.
Randy Bartlett
The ‘information rush’ is producing a sense of urgency; a great deal of opportunity; and spectacular breakthroughs coming from everywhere. Meanwhile, the combination of low statistics literacy and overzealous promotional hype is facilitating dysfunctional data analysis, which is more detrimental than UFO sightings.
Daniel Gutierrez
(December 30, 2014)
Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and problem we can not only make better predictions, but also fill in gaps in our knowledge, and even find answers to questions we hadn’t even thought of yet.
Andrej Karpathy
People refer to neural networks as just ‘another tool in your machine learning toolbox’…. Unfortunately, this interpretation completely misses the forest for the trees. Neural networks are not just another classifier; they represent the beginning of a fundamental shift in how we write software. They are Software 2.0.
Tamara Dull
(September 24, 2014)
Today, we live in an always-on digital world. We work online. We socialize online. We shop online. We bank online. We support causes online. Not to mention, we drive on toll roads with our EZPasses, go to Disney World with our MagicBands, and check our personal stats with our Fitbits. We are living in a big data world.
Farhad Malik
(December 28, 2018)
Data Science Is Trial And Error, It’s Research And Recursive, It’s Practical And Theoretical, It Requires Domain Knowledge, It Boosts Your Strategic Skills, You Learn About Statistics And Master Programming Skills. But Most Importantly, It Teaches You To Remain Patient As You Are Always Close To Finding A Better Answer.
Dr. Justin Washtell
(February 26, 2015)
Predictive modeling tools and services are undergoing an inevitable step-change which will free data scientists to focus on applications and insight, and result in more powerful and robust models than ever before. Amongst the key enabling technologies are new hugely scalable cross-validation frameworks, and meta-learning.
Enric Junqué de Fortuny, David Martens, Foster Provost
This study provides a clear illustration that larger data indeed can be more valuable assets for predictive analytics. This implies that institutions with larger data assets – plus the skill to take advantage of them – potentially can obtain substantial competitive advantage over institutions without such access or skill.
Eran Levy
(December 17, 2014)
In the near future we might see BI take another direction: Rather than companies merely purchasing dashboard reporting software for the purposes of internal usage, we’ll be seeing a surge in companies looking to integrate advanced analytics and reporting into their own products. Welcome to the world of embedded analytics.
Nikhil Buduma
(29 December 2014)
[In Neural Networks] It is not required that a neuron has its outlet connected to the inputs of every neuron in the next layer. In fact, selecting which neurons to connect to which other neurons in the next layer is an art that comes from experience. Allowing maximal connectivity will more often than not result in overfitting.
Rahul Kavale
(Nov 16, 2014)
Welcome to the Spark model. The beautiful thing about Spark is that it does not restrict us to just the traditional Map and Reduce operations. It allows us to apply collection-like operations on an RDD, giving us another RDD. And since it’s just an RDD, it can be queried via a SQL interface, have machine learning algorithms applied to it, and lots of other fancy stuff.
Advanced Analytics is “the analysis of all kinds of data using sophisticated quantitative methods (for example, statistics, descriptive and predictive data mining, simulation and optimization) to produce insights that traditional approaches to business intelligence (BI) – such as query and reporting – are unlikely to discover.”
Simon Garland
(September 16, 2014)
Big data is noisy, messy and, frequently, inconsistent. To make sense of it, companies are investing time and money in an array of technology solutions. However, some much-touted approaches are actually quite costly and in many ways, have not yet addressed the expanding set of historical information that companies want to mine.
Bill Schmarzo
(June 17, 2019)
For many companies, their technology architecture is becoming more of a hindrance than an enabler of value creation – it resembles an archeological dig, with layers of ancient technologies layered on top of even more ancient technologies. The result: added weight, obstructed vision, and lack of flexibility, agility and mobility.
Christophe Bourguignat
(Sep 16, 2014)
In real organizations, people need dead simple story-telling – Which features are you using? How do your algorithms work? What is your strategy? etc. … If your models are not parsimonious enough, you risk losing the audience’s confidence. Convincing stakeholders is a key driver for success, and people trust what they understand.
David Smith
(August 18, 2014)
While there are projects underway to help automate the data cleaning process and reduce the time it takes, the task of automation is made difficult by the fact that the process is as much art as science, and no two data preparation tasks are the same. That’s why flexible, high-level languages like R are a key part of the process.
Vladimir N. Vapnik
If it happens that at the stage of designing the network one constructs a network too complex (for the given amount of training data), the confidence interval will be large. In this case, even if one could minimize the empirical risk down to zero, the amount of errors on the test set could be big. This case is called overfitting.
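A toy illustration of the overfitting Vapnik describes, with polynomial degree standing in for network complexity (the data, noise level, and degrees below are arbitrary choices for the demonstration): the high-capacity model drives the empirical risk to essentially zero on the training set, yet makes larger errors on the test set.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small, noisy training sample from a simple linear signal.
x_train = np.linspace(-1, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)
x_test = np.linspace(-1, 1, 200)
y_test = 2 * x_test  # noiseless ground truth

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

simple_train, simple_test = train_test_mse(1)    # capacity matched to the data
complex_train, complex_test = train_test_mse(9)  # interpolates all 10 points

# Degree 9 fits the training noise (near-zero empirical risk)
# but generalizes worse than the simple line: overfitting.
print(simple_train, simple_test)
print(complex_train, complex_test)
```

The degree-9 polynomial is the "network too complex for the given amount of training data": empirical risk near zero, test error large.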
Mark van Rijmenam
(October 16, 2014)
Although such Business Intelligence is still quite common and does give you at least some insights, the fast-changing world of today requires a different approach. Organisations today should strive for a holistic overview of their internal and external data that is analysed on the spot and returned graphically via live storylines.
Michael Yamnitsky
(Feb. 5, 2015)
Software Goes Invisible: Software is getting smarter, thanks to predictive analytics, machine learning, and artificial intelligence (AI). Whereas the current generation of software is about enabling smarter decision-making for humans, we’re starting to see “invisible software” capable of performing tasks without human intervention.
Patrick Marshall
The key to a successful big data project isn’t the bigness of the data or the slickness of the dashboard a given tool provides. It’s the quality of the selection and analysis of the data. Unfortunately, in many cases, those who use the big data tools may not even be aware of the underlying logic of the data selection and analysis.
Will Kurt
(February 24, 2015)
There is a very important reason we use math. Our intuitions are often trained from a lifetime of experience; throwing that out in the name of ‘objectivity’ is foolishly excluding important information. But intuition also likes to skip steps, making it very easy to make errors in our reasoning if we don’t pull apart our intuition.
F. Mosteller and J. Tukey
George Box has [almost] said ‘The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.’ These words of caution about ‘natural experiments’ are uncomfortably strong. Yet in today’s world we see no alternative to accepting them as, if anything, too weak.
John Von Neumann The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
(10 Nov 2016)
Indeed, in neural networks, we almost always choose our model as the output of running stochastic gradient descent. Appealing to linear models, we analyze how SGD acts as an implicit regularizer. For linear models, SGD always converges to a solution with small norm. Hence, the algorithm itself is implicitly regularizing the solution.
B. S. Everitt Gauss-Markov Theorem: A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.
Daniel Gutierrez
(December 30, 2014)
The human brain is comprised of 300 million pattern matchers fed with data from our five primary senses and memories. In this age of distributed computing and cheap storage in the cloud, “thinking” without a biological brain is possible for the first time in history. The sensory input into this new, extra-corporeal brain is big data.
Scott Mongeau
(October 7, 2014)
Advances in analytics and artificial intelligence increasingly encompass decision making traditionally associated with middle management. With business analytics systems increasingly able to make complex decisions, the traditional scope of management is being squeezed, forcing contemplation of the future scope of management practice.
The Economist
The end of data scientists. Data science moves from the specialist to the everyman. Familiarity with data analysis becomes part of the skill set of ordinary business users, not experts with “analyst” in their titles. Organizations that use data to make decisions are more successful, and those that don’t use data begin to fall behind.
But with the wider application of advanced analytics comes a new burden: The organizations that use them must also prove that their algorithms are trustworthy. Data, analytics, trust and business relationships are interdependent, so any company that sets store by and predicates decisions on its analytics must also govern their impact.
Guoyun Tu, Yanwei Fu, Boyang Li, Jiarui Gao, Yu-Gang Jiang, Xiangyang Xue
Artificial Intelligence (AI) technologies could be broadly categorised into Analytics and Autonomy. Analytics focuses on algorithms offering perception, comprehension, and projection of knowledge gleaned from sensorial data. Autonomy revolves around decision making, and influencing and shaping the environment through action production.
(December 31, 2014)
An estimated 11 million people are involved in the software industry today (according to IDC). They write complicated, brittle software using a declarative approach that often fails to recognize the complexity of the real world and the intricacy of user behavior. The future of software engineering is not declarative, it’s probabilistic!
Andreas Schmitz, Alexander Linden
(March 11, 2015)
A data scientist must possess the knack of being able to ‘identify business value from mathematical models.’ But that vital business value can only materialize if the data scientist also networks with other departments, understands their objectives, is familiar with their data and processes – and can spot the analysis options they provide.
Daniel Goldman
True AI, also known as artificial general intelligence or AGI, is an artificial system that acts essentially the way a human acts. On the other hand, most of the forms of artificial intelligence in existence these days are simply machine learning systems, and in many cases all the system is doing is fine-tuning a complex non-linear function.
Foster Provost & Tom Fawcett
On a scale less grand, but probably more common, data-analytics projects reach into all business units. Employees throughout these units must interact with the data-science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business.
Dan Hirpara
What data fusion brings to the table is the idea that end-users, whether they are humans or machines, are brought into the data processing loop as collaborators. By iteratively combining multiple data streams in new and interesting ways, driven by the changing needs of users, data fusion produces a wide variety of ways to aggregate data streams.
Jonas Salk
At one time we had wisdom, but little knowledge. Now we have a great deal of knowledge, but do we have enough wisdom to deal with that knowledge? I define wisdom as the capacity to make retrospective judgments prospectively. I think these are human qualities, human attributes that need to be brought out, need to be drawn upon, need to be valued.
Arijit Sengupta
(April 2019)
Imagine you are selling cookies. Data Science asks a prediction question: ‘Who is most likely to buy?’ However, the business user actually asks a business optimization question: ‘How do I make the most money?’ The answers to these two questions are not identical and that is the entire problem with evaluating AI without considering business impact.
Gil Allouche
(January 9, 2015)
Improvements in technology and big data trends have given rise to improvements in machine learning. The sheer volume of data is growing exponentially, and companies are looking for faster speeds and real-time analytics. Cognitive computing combines machine learning and artificial intelligence to go beyond data mining and provide actionable insights.
Stephen Pratt When you’re thinking about artificial intelligence and machine learning at the enterprise level, it’s very important to make a road map. We’re making the same mistakes that we made in the advent of CRM systems, ERP systems, in the introduction of client-server technology – even the introduction of the computer into businesses. We need to learn from that.
Francis Diebold
(October 4, 2015)
Why so much PCR, and so little ridge regression? Ridge and PCR are both shrinkage procedures involving PCs. The difference is that ridge effectively includes all PCs and shrinks according to the sizes of the associated eigenvalues, whereas PCR effectively shrinks some PCs completely to zero (those not included) and doesn’t shrink others at all (those included).
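Diebold's contrast can be made concrete. A small numpy sketch (synthetic correlated data; λ and the component count k are illustrative choices, not from the quote) of the per-component shrinkage factors the two methods apply:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]  # PC variances, descending

lam = 10.0   # ridge penalty (illustrative)
k = 2        # components kept by PCR (illustrative)

# ridge: smooth shrinkage of EVERY component, graded by eigenvalue size
ridge_shrink = eigvals / (eigvals + lam)
# PCR: hard 0/1 shrinkage -- keep the first k components fully, drop the rest
pcr_shrink = np.where(np.arange(5) < k, 1.0, 0.0)

print(ridge_shrink)  # every factor strictly between 0 and 1
print(pcr_shrink)    # [1. 1. 0. 0. 0.]
```

Large-eigenvalue components are barely shrunk by ridge while small ones are shrunk heavily; PCR instead makes an all-or-nothing choice per component.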
Mitchell A. Sanders
(August 27, 2013)
Understanding correlation, multivariate regression and all aspects of massaging data together to look at it from different angles for use in predictive and prescriptive modeling is the backbone knowledge that’s really step one of revealing intelligence…. If you don’t have this, all the data collection and presentation polishing in the world is meaningless.
Suresh Kumar Gorakala
(December 26, 2014)
What is a Prediction Problem? A business problem which involves predicting future events by extracting patterns in the historical data. Prediction problems are solved using Statistical techniques, mathematical models or machine learning techniques. For example: Forecasting stock price for the next week, predicting which football team wins the world cup, etc.
Damian Mingle
(September 15, 2015)
It is important to remember that Data Science techniques are tools that we can use to help make better decisions within an organization, and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it.
Miguel A. Hernán, John Hsu, Brian Healy
(July 12, 2018)
Sciences are primarily defined by their questions rather than by their tools. We define astrophysics as the discipline that learns the composition of the stars, not as the discipline that uses the spectroscope. Similarly, data science is the discipline that describes, predicts, and makes causal inferences, not the discipline that uses machine learning algorithms.
Bob Horton
(January 20, 2015)
Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts.
Andrew Long
Depending on the learning curve, there are a few strategies we can employ to improve models:
High Bias:
• Add new features
• Increase model complexity
• Reduce regularization
• Change model architecture
High Variance:
• Add more samples
• Add regularization
• Reduce number of features
• Decrease model complexity
• Add better features
• Change model architecture
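Long's split can be wired into a crude diagnostic. The function and thresholds below are illustrative assumptions (not a standard API): compare training and validation scores to decide which list of remedies applies:

```python
def diagnose(train_score, val_score, good=0.9, gap=0.1):
    """Crude learning-curve diagnosis (illustrative thresholds)."""
    if train_score < good:
        # underfitting: add features, increase complexity, reduce regularization
        return "high bias"
    if train_score - val_score > gap:
        # overfitting: more samples, more regularization, fewer features
        return "high variance"
    return "ok"

print(diagnose(0.70, 0.68))  # high bias
print(diagnose(0.98, 0.75))  # high variance
print(diagnose(0.95, 0.93))  # ok
```

A poor training score points at the high-bias remedies; a large train/validation gap points at the high-variance ones.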
Jim Harris
(October 14, 2014)
Just as it has always been, in between data acquisition and data usage there’s a lot that has to happen. Not just data integration, but data quality and data governance too. Big data technology doesn’t magically make any of these things happen. In fact, big data just makes us even more painfully aware there’s no magic behind data management’s curtain, just a lot of hard work.
Mark van Rijmenam
(September 2, 2014)
All these new Big Data applications require a new way of working. As a result, General Motors is currently undergoing a massive cultural change to become data-driven; hiring thousands of new employees will have a profound effect on the company culture, but in the end all existing and new employees must learn and adapt to this new, data-driven and information-centric culture.
Hal Varian If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data.
Joyce Jackson In many applications, particularly in the business domain, the data is not stationary, but rather changing and evolving. This changing data may make previously discovered patterns invalid and as a result, there is clearly a need for incremental methods that are able to update changing models, and for strategies to identify and manage patterns of temporal change in knowledge bases.
Martin Fowler
I’m confident to say that if you are starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. … One of the interesting consequences of this is that we are gearing up for a shift to polyglot persistence where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data.
Miguel A. Hernán, John Hsu, Brian Healy
(July 12, 2018)
Data science is a component of many sciences, including the health and social sciences. Therefore, the tasks of data science are the tasks of those sciences – description, prediction, causal inference. A sometimes overlooked point is that a successful data science requires not only good data and algorithms, but also domain knowledge (including causal knowledge) from its parent sciences.
Within the next five years, big data will become the norm, enabling a new horizon of personalization for both products and services. Wise leaders will soon embrace the game-changing opportunities that big data afford to their societies and organizations, and will provide the necessary sponsorship to realize this potential. Skeptics and laggards, meanwhile, look set to pay a heavy price.
Kevin Daly
So we’ve defined that big data is big, comes from many sources, and is growing at a staggering rate. This is the problem that a data platform is trying to solve. It’s more than just a large storage pool, NoSQL database and map reduce. A data platform is a set of tools that is designed to solve the inherent problems of handling, analyzing and getting actionable intelligence from big data.
Pete Werner
(March 14, 2015)
R is certainly becoming more and more popular, and seems to have found widespread adoption within many statistical research communities. This is a great thing as it means as new statistical methods or practices come out of the research world, they are often implemented and available in R. In many cases they have been written by the person who “wrote the book” (or paper) on a given topic.
Richard D. Quodomine
(May 19, 2015)
Every morning, the analyst, from junior to senior to principal, has a choice to make among three possible paths. Do I: increase my technical proficiency by learning to program or some new statistical methodology, or do I improve my knowledge level of the industry I am working with because of a project that is upcoming, or do I work on soft skills to improve reporting and client relations?
Rohit Adlakha
(November 3, 2017)
The question is no longer whether AI is going to fundamentally change the workplace. According to a recent survey, 85 percent of executives believe that AI will be transformative for their companies, enabling them to enter new businesses or gain a competitive advantage. Now, the true question lies in how companies can successfully leverage AI in ways that join, not replace, the human workforce.
Mark Dickson
The ‘Age of Automation’ is upon us. Companies strive to reduce their costs by using technology to replace humans at every opportunity. Business executives fight over experts in artificial intelligence and data science in hopes of attaining a competitive edge over their rivals. Even the wary flock to siren calls of ever-greater efficiency via investments in computers, robotics and software.
Rachel Clinton
(January 7, 2015)
1. Think carefully about which projects you take on.
2. Use as much data as you can from as many places as possible.
3. Don’t just use internal customer data.
4. Have a clear sampling strategy.
5. Always use a holdout sample.
6. Spend time on ‘throwaway’ modelling.
7. Refresh your model regularly.
8. Make sure your insights are meaningful to other people.
9. Use your model in the real world.
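Rule 5 above (always use a holdout sample) is easy to sketch. A minimal illustration with synthetic data and an assumed 80/20 split ratio:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)  # stand-in for 1000 labeled records

# shuffle, then set aside a holdout the model never sees during fitting
idx = rng.permutation(len(data))
holdout_size = int(0.2 * len(data))
holdout, train = data[idx[:holdout_size]], data[idx[holdout_size:]]

print(len(train), len(holdout))  # 800 200
```

Evaluating only on the holdout gives an honest estimate of how the model will behave on data it has not memorized.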
Chris Diehl
(February 18, 2015)
The ability to collect, store, and process large volumes of data doesn’t confer advantage by default. It’s still common to fixate on the wrong questions and fail to recover quickly when mistakes are made. To accelerate organizational learning with data, we need to think carefully about our objectives and have realistic expectations about what insights we can derive from measurement and analysis.
Paul Kent A lot of existing Big Data techniques require you to really get your hands dirty; I don’t think that most Big Data software is as mature as it needs to be in order to be accessible to business users at most enterprises. So if you’re not Google or LinkedIn or Facebook, and you don’t have thousands of engineers to work with Big Data, it can be difficult to find business answers in the information.
Pat Bakey
(December 31, 2014)
Today, an individualized customer experience isn’t a nice-to-have – it’s a must-have. People don’t want to be treated as just a number in a big group. They don’t want to be subjected to mass, impersonal marketing messages – especially when they’ve given away personal information. To remain competitive, retailers must meet these expectations. The message is to not abuse data – but to use it wisely.
Wojciech Bolanowski
Numerous changes and innovations have come to life recently. The pace of the digital revolution is unimaginable, and it keeps on increasing. There is no doubt that most of the approaching digital changes are potentially disruptive to older habits, businesses, and beliefs. They are unconditionally changing the former way of life on the globe, pushing all of humanity into something very new and completely unknown.
John Geer
(May 6, 2015)
There is predictable data as far as the eye can see. Millions of variables quietly tracing the path we thought, and perhaps hoped, they would. Because there are so many, noticing when one of these variables does something unexpected is a task that is unsolvable by diligence alone. In order to spot these rare unexpected observations, we need an often-overlooked statistical analysis: anomaly detection.
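The simplest version of the analysis Geer describes flags points that sit many standard deviations from the mean. A minimal sketch (synthetic series, one injected anomaly, illustrative threshold):

```python
import numpy as np

rng = np.random.default_rng(7)
series = rng.normal(10, 1, 500)  # 500 well-behaved observations
series[123] = 25.0               # one injected unexpected observation

# z-score each point against the whole series; flag the extreme ones
z = np.abs(series - series.mean()) / series.std()
anomalies = np.flatnonzero(z > 5)
print(anomalies)  # [123]
```

No amount of eyeballing 500 values finds that point as reliably as the two-line test does, which is the quote's point about diligence alone being insufficient.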
Michael Walker
(October 14, 2014)
Unfortunately, like premature claims of real artificial intelligence (not fake AI like brute-force calculations), such tech has not yet (knowingly) been invented, though it may become reality in the future. While sophisticated machine learning algorithms can add significant value, they are not artificial intelligence (yet) and still require data scientists to design, execute, constantly modify and interpret meaning.
Pete Werner
(March 14, 2015)
R is a niche language, but if your work falls within that niche it is a wonderful tool. There are some great packages, cutting edge methods, and a generally enthusiastic and welcoming community. You can also find many of these things in other languages like Python and Java, but I still find myself turning to R when I have a bunch of data I want to explore and just try out a bunch of different things.
Tony Baer
(Graph Databases:) No longer need enterprises make tradeoffs between rich and reach; they can submit highly complex queries against enormous, highly varied data sets without concern that new data with varying structure will break their queries. These innovations are also bringing the capability for asking questions that, in the SQL environment, would otherwise require tens or hundreds of table joins.
Wild & Seber In analyzing data, the main criterion for deciding whether to treat a variable as discrete or continuous is whether the data on that variable contains a large number of different values that are seldom repeated or a relatively small number of distinct values that keep reappearing. Variables with few repeated values are treated as continuous. Variables with many repeated values are treated as discrete.
Foster Provost & Tom Fawcett
One of the most critical aspects of data science is the support of data-analytic thinking. Skill at thinking data-analytically is important not just for the data scientist but throughout the organization. For example, managers and line employees in other functional areas will only get the best from the company’s data-science resources if they have some basic understanding of the fundamental principles.
Yann LeCun
We now have unsupervised techniques that actually work. The problem is that you can beat them by just collecting more data, and then using supervised learning. This is why in industry, the applications of Deep Learning are currently all supervised. I agree with you that for the search and advertising industry, supervised learning is used because of the vast amounts of data being generated and gathered.
Gabriel Lowy
(January 22, 2015)
Predictive analytics is an iterative process that begins with an understanding of the question the user wants to answer. By exploring the relationships among different variables using correlation analysis, users can build sophisticated mathematical models that can cut through the complexity of modern computing systems to uncover previously hidden patterns, identify classifications and make associations.
Jeff Leek
(Feb. 14, 2014)
Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many “researcher degrees of freedom”. If a naive analyst can pick any of a range of methods and does not understand how they work, they will generally pick the one that gives them maximum benefit.
John Foreman
If you want to move at the speed of “now, light, big data, thought, stuff,” pick your big data analytics battles. If your business is currently too chaotic to support a complex model, don’t build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.
Daniel Kirsch
R, an open source programming language for computational statistics, visualization and data, is becoming a ubiquitous tool in advanced analytics offerings. Nearly every top vendor of advanced analytics has integrated R into their offering so that they can now import R models. This allows data scientists, statisticians and other sophisticated enterprise users to leverage R within their analytics package.
Frank Alfieri
Data analytics is not magic, or voodoo. Data analytics is a business vertical like any other that has its own supply chain dynamics, marketplace, and laterally integrated processes and operations parameters. If your business is not working at capacity or delivering the ROI that it should be, re-design and re-align your business strategy. Data alone will not solve the underlying business strategy problems.
Data scientists know that it is futile to impose raw math and statistics on people who are not adept at them. The goal is to get an analytics platform into the hands of people who can build the models for use all around the organization. Every analytics platform claims ease of use, but that is not enough. It must be sufficiently powerful to meet the needs of data scientists yet easy enough for LOB staff to use.
Mark Zuckerberg
(September 11, 2013)
People share and put billions of connections into this big graph every day. We don’t want to just add incrementally to that. We want, over the next five or ten years, to take on a road map to try to understand everything in the world semantically and map everything out. These are the big themes for us, and this is what we are going to try to do over the next five or ten years. That is what I have tried to focus us on …
Daniel Gutierrez
(December 31, 2014)
What hiring companies consider requirements for being a data scientist. Here is a short list for an honest assessment:
– Are you really good at math – undeterred with calculus, differential equations, and linear algebra? Are you also strong in statistics and probability theory?
– Do you also know R and/or Python for developing machine learning algorithms?
– Do you have deep domain knowledge of a particular industry?
Herbert Simon
…in an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.
Marissa Mayer
(January 24, 2013)
The Web is so vast … you need to extend categorization and make sense of the content and have a Web ordered for you … One of the key pieces is you have to understand and decide what the Ontology of entities is. Meaning how things are named and how they are organized into hierarchies … By mapping people’s search habits you pull all their content together and have a feed of information that is the web ordered for you.
Some decisions you need to make are big enough to change the course for your business. And your past experiences may not be good predictors of the future. More data are within your reach to understand what was previously unknown. Sophisticated analytical tools are available to you to ‘see’ a wider range of possibilities and evaluate them quickly. Now is a good time for an upgrade in your decision making capabilities.
Neera Talbert
(March 16, 2015)
Unlike a pure statistician, a data scientist is also expected to write code and understand business. Data science is a multi-disciplinary practice requiring a broad range of knowledge and insight. It’s not unusual for a data scientist to explore a fresh set of data in the morning, create a model before lunch, run a series of analytics in the afternoon and brief a team of digital marketers before heading home at night.
R. A. Fisher No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed.
Judy Selby
(April 20, 2015)
Big Data’s undeniable impact on companies’ goodwill and reputation has permeated the landscape of corporate valuation. Recent research confirms that companies need to face the new normal whereby corporate reputations suffer after mishaps with data under their control. Today’s companies must appreciate that their use, misuse and governance of Big Data can have a significant effect on their goodwill and resulting valuation.
Michael Walker
(September 19, 2014)
As many organizations are now learning, it is very difficult to get any value out of large data sets without clear goals, employing sophisticated data science techniques (e.g., machine learning and algorithms) and the right data crunching and analytical technologies. Getting value from data requires special data science methods to differentiate signal from noise to extract meaning and requires special compute systems and power.
TJ Laher
(November 14, 2014)
The end goal of pervasive analytics is simple and will change the way in which the world operates today. By feeding individuals the right information, at the right time, analytics become invisible and embedded into every application and workflow of every user. It is a vision, a goal, a strategy that every individual across every industry can rally around in order to drive the business metrics that matter through the use of data.
Lillian Pierson
(May 11, 2015)
The 4 Types of Data Analytics
• Descriptive: Answers the question, ‘What Happened?’.
• Diagnostic: Commonly used in engineering and sciences to diagnose ‘what went wrong?’.
• Predictive: Used to predict for future trends and events based on statistical or mathematical modeling of current and historical data.
• Prescriptive: Used to tell you what to do to achieve a desired result. Based on the findings of predictive analytics.
Ted Dunning, Ellen Friedman
There is a dizzying array of algorithms from which to choose, and just making the choice between them presupposes that you have sufficiently advanced mathematical background to understand the alternatives and make a rational choice. The options are also changing, evolving constantly as a result of the work of some very bright, very dedicated researchers who are continually refining existing algorithms and coming up with new ones.
James Robert Lloyd
Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data.
Michael Walker
(November 19, 2014)
Simply looking at big data (e.g., total offensive or defensive yards) will not provide the right information – and only focusing on the single data point of pass completion percentage will not provide the valuable intelligence to help reach the goal of improving pass completion percentage. Only integrating and analyzing a variety of smaller smart data points will provide the actionable knowledge to make the best possible decisions.
Brad Hedlund
(September 10, 2011)
Why did Hadoop come to exist? What problem does it solve? Simply put, businesses and governments have a tremendous amount of data that needs to be analyzed and processed very quickly. If I can chop that huge chunk of data into small chunks and spread it out over many machines, and have all those machines process their portion of the data in parallel – I can get answers extremely fast – and that, in a nutshell, is what Hadoop does.
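The chop-and-spread idea Hedlund describes is the MapReduce pattern. A toy single-process word-count sketch (the four list slices stand in for four machines; `Counter` stands in for the per-machine work):

```python
from collections import Counter
from functools import reduce

text = ("big data needs to be chopped into small chunks and "
        "processed in parallel to get answers fast")

# "map": split the data into chunks and count each chunk independently
chunks = [text.split()[i::4] for i in range(4)]   # 4 pretend machines
partial = [Counter(chunk) for chunk in chunks]    # done in parallel on Hadoop

# "reduce": merge the partial results into one answer
total = reduce(lambda a, b: a + b, partial)
print(total["to"])  # 2
```

Because each chunk is counted independently, the map step parallelizes perfectly; the reduce step only has to merge small partial results.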
Paul Barsch
(February 23, 2015)
It is often thought that Apache Hadoop based data lakes are a potential panacea to thorny data management issues long germane to relational databases. After all, the (mistaken) belief goes, you can simply dump all your data into Hadoop’s file system, and via schema on read magic, your desired result sets will appear with very little effort. However, data management – even for Hadoop – isn’t going away and in fact, probably never will.
John Mount
(November 13, 2014)
One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables, or the relations between variables, change over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This is one of the hard lessons that statistics tries so hard to quantify and teach.
Larry Hardesty
(October 16, 2015)
Big-data analysis consists of searching for buried patterns that have some kind of predictive power. But choosing which “features” of the data to analyze usually requires some human intuition. In a database containing, say, the beginning and end dates of various sales promotions and weekly profits, the crucial data may not be the dates themselves but the spans between them, or not the total profits but the averages across those spans.
Judea Pearl
(July 2018)
Current machine learning systems operate, almost exclusively, in a statistical, or model-free mode, which entails severe theoretical limits on their power and performance. Such systems cannot reason about interventions and retrospection and, therefore, cannot serve as the basis for strong AI. To achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks.
Diego Galar Pascual
Many balanced scorecards and key performance indicator solutions are available in the market, and all of them make similar claims – their product will make a manufacturing process run better, faster, more efficiently, and with greater returns. Yet, their efficacy is questionable as the necessary information is often scattered across disconnected silos of data in each department of an industry, and it is difficult to integrate these silos.
Julie Hunt
(April 7, 2015)
Within each business there are many people who possess the knowledge of how data needs to be used for business processes and particularly for how the right data can improve business agility, revenue growth and help meet customer expectations. Such employees are invaluable for examining business data: first, to determine its value in context, and second, to understand how it contributes to intelligence being gathered for the organization.
Jeffrey Ng
(December 26, 2014)
You do not know how to model: Learn it dude! There is no short-cut to learning. Your organization needs to learn it, even yourself. Leverage every single person of your organization that has any glimpse of experience in dealing with the data. Combine that quant dude with a domain expert, let them fight and muddle through the journey. The organization needs it. So do you, to learn how it brings value exactly to different internal clients.
Joyce Jackson Monitoring and maintenance are important issues if the data mining result becomes part of the day-to-day business and its environment. A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods of incorrect usage of data mining results. To monitor the deployment of the data mining result(s), the project needs a detailed plan on the monitoring process. This plan takes into account the specific type of deployment.
R. A. Fisher Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves, and cannot be increased by any amount of ingenuity expended in their statistical examination: that the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue.
Vitaly Shmatikov
(July 26, 2018)
When training ML models, it is not enough to ask if the model has learned its task well. Creators of ML models must ask what else their models have learned. Are they memorizing and leaking their training data? Are they discovering privacy-violating features that have nothing to do with their learning tasks? Are they hiding backdoor functionality? We need least-privilege ML models that learn only what they need for their task – and nothing more.
Sebastian Raschka
(August 24, 2014)
Distinguishing between feature selection and dimensionality reduction might seem counter-intuitive at first, since feature selection will eventually lead (reduce dimensionality) to a smaller feature space. In practice, the key difference between the terms “feature selection” and “dimensionality reduction” is that in feature selection, we keep the “original feature axis”, whereas dimensionality reduction usually involves a transformation technique.
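Raschka's distinction shows up in a few lines of numpy (synthetic data; PCA via SVD is used here as a representative transformation technique): selection keeps original columns, while reduction projects onto new axes that mix them.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))

# Feature selection: keep a subset of the ORIGINAL columns (axes unchanged)
selected = X[:, [0, 2]]

# Dimensionality reduction (PCA): project onto NEW, transformed axes
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:2].T

print(selected.shape, reduced.shape)  # (100, 2) (100, 2)
# selected columns are literally columns of X; reduced columns are mixtures
print(np.allclose(selected[:, 0], X[:, 0]))  # True
```

Both end up with a smaller feature space, but only the selected columns retain their original meaning and units.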
(November 10, 2014)
The success of analytics lies in finding insights fast. Dashboards and discovery tools can deliver meaningful information, but in order to extend this to a larger audience, organizations need to infuse visual analytics into existing applications and create the right user experience for data. The key is to build an analytical framework that empowers users with seamless access to actionable data and provides powerful visualizations to drive decisions.
Gordon S. Linoff
(September 15, 2014)
In any case, I come to the conclusion that Data Science is just another term in a long line of terms. Whether called statistics or customer analytics or data mining or analytics or data science, the goal is the same. Computers have been and are gathering incredible amounts of data about people, businesses, markets, economies, needs, desires, and solutions – there will always be people who take up the challenge of transforming the data into solutions.
Anmol Rajpurohit
(May 15, 2014)
For a long time, Predictive Analytics has been primarily the responsibility of the Data Science and Analytics team, but this outlook is changing fast. While the Data Science team still remains the primary contributor, the responsibility is increasingly being shared with database management, BI, LOB (Line of Business) analysts and others. This clearly demonstrates the need for better training and support for the non-technical users of Predictive Analytics.
Avi Kalderon
(JAN 27, 2015)
Without effective data governance and data management, big data can mean big problems for many organizations already struggling with more data than they can handle. That ‘lake’ they are building can very easily become a ‘cesspool’ without appropriate data management practices that are adapted to this new platform. The solution? Firms need to actively adapt their data governance and data management capabilities – from initial implementation to ongoing maintenance.
Scott W. Strong
(April 10, 2018)
Data science, surprisingly perhaps, is not about designing the most advanced machine learning algorithms and training them on all of the data (and then having Skynet). It’s about finding the right data, becoming a quasi-expert on the process, system, or event you are trying to model, and crafting features that will help quirky and sometimes frail statistical algorithms make accurate predictions. Very little time is actually spent on the algorithm itself.
Start small and go big: Analytical projects should not be planned across an entire company, or even division-wide. Initial pilots should focus on small, identifiable challenges and work to resolve those challenges. Once a project has been successfully piloted and measured, other teams within the organization will see the value in the new analytical technologies, and also understand the organizational changes required to adopt a new mindset and technology.
Jeffrey P. Bigham
A machine isn’t a human. It’s not necessarily going to incorporate bias, even from biased training data, in the same way that a human would. Machine learning isn’t necessarily going to adopt, for lack of a better word, a clearly racist bias. It’s likely to have some kind of much more nuanced bias that is far more difficult to predict. It may, say, come up with very specific instances of people it doesn’t want to hire that may not even be related to human bias.
Mark van Rijmenam
In fact, research by Bain has shown that companies using analytics are 2x more likely to have a high financial performance than organizations not using analytics. Even more, data-driven organizations that have successfully implemented a Big Data Strategy are 5x more likely to make faster decisions and are 3x more likely to have a better execution. Of course, these alone would be sufficient reasons to start developing a Big Data strategy for your organization.
Tony Baer
Data sources such as social media text, messaging, log files (such as clickstream data), and machine data from sensors in the physical world present the opportunity to pick up where transaction systems leave off regarding underlying sentiment driving customer interactions; external events or trends impacting institutional financial or security risk; or adding more detail regarding the environment in which supply chains, transport or utility networks operate.
Foster Provost & Tom Fawcett
We should expect a ‘Big Data 2.0’ phase to follow ‘Big Data 1.0’. Once firms have become capable of processing massive data in a flexible fashion, they should begin asking: ‘What can I do that I couldn’t do before, or do better than I could do before?’ This is likely to be the golden era of data science. The principles and techniques (introduced currently e.g. due to ‘Predictive Analytics’ and HANA) will be applied far more broadly and deeply than they are today.
Marilyn Matz
(September 2, 2014)
Hadoop is well suited for simple parallel problems but it comes up short for large-scale complex analytics. A growing number of complex analytics use cases are proving to be unworkable in Hadoop. Some examples include recommendation engines based on millions of customers and products, running massive correlations across giant arrays of genetic sequencing data and applying powerful noise reduction algorithms to finding actionable information in sensor and image data.
Nate Oostendorp
(Mar 1, 2019)
Within 10 years, data science will be so enmeshed within industry-specific applications and broad productivity tools that we may no longer think of it as a hot career. Just as generations of math and statistics students have gone on to fill all manner of roles in business and academia without thinking of themselves as mathematicians or statisticians, the newly minted data scientist grads will be tomorrow’s manufacturing engineers, marketing leaders and medical researchers.
Shubhadeep Roychowdhury
(Fri, 08.02.2019, 22:00)
Regardless of whatever we think about the mysterious subject of Probability, we live and breathe in a stochastic environment. From the ever-elusive Quantum Mechanics to our daily life (‘There is 70% chance it will rain today’, ‘The chance of getting the job done in time is less than 30%’ … ) we use it, knowingly or unknowingly. We live in a ‘Chancy, Chancy, Chancy world’. And thus, knowing how to reason about it is one of the most important tools in the arsenal of any person.
Tomasz Malisiewicz Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. You can see the dependencies in this definition:
• The performance measures you’ve chosen (RMSE? AUC?)
• The framing of the problem (classification? regression?)
• The predictive models you’re using (SVM?)
• The raw data you have selected and prepared (samples? formatting? cleaning?)
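A minimal sketch of what “transforming raw data into features” can look like in practice, using hypothetical timestamp data (the feature choices below are illustrative assumptions, not part of the definition above):

```python
from datetime import datetime

def engineer_features(raw_timestamps):
    """Turn raw ISO timestamps into model-ready features.

    Hour, weekday and a weekend flag are illustrative choices that
    might "better represent the underlying problem" for, say, a
    demand-forecasting model.
    """
    rows = []
    for ts in raw_timestamps:
        dt = datetime.fromisoformat(ts)
        rows.append({
            "hour": dt.hour,                    # captures daily cycles
            "weekday": dt.weekday(),            # 0 = Monday
            "is_weekend": int(dt.weekday() >= 5),
        })
    return rows

features = engineer_features(["2015-03-07T14:30:00", "2015-03-09T09:00:00"])
```

The raw string alone tells a model little; the derived columns expose the structure the model actually needs.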
Gregor Heinrich
The intuition behind ‘latent semantic analysis’ (LSA) is to find the latent structure of ‘topics’ or ‘concepts’ in a text corpus, which captures the meaning of the text that is imagined to be obscured by “word choice” noise. The term ‘latent semantic analysis’ was coined by Deerwester et al., who empirically showed that the co-occurrence structure of terms in text documents can be used to recover this latent topic structure, notably without any usage of background knowledge.
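The core mechanism behind LSA, recovering topic structure from term co-occurrence with a truncated SVD, can be sketched on a toy term-document matrix (the matrix and vocabulary are invented for illustration):

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Documents 0-1 share "data/model" vocabulary; documents 2-3 share "cat/pet".
A = np.array([
    [3, 1, 0, 0],   # "data"
    [1, 3, 0, 0],   # "model"
    [0, 0, 2, 1],   # "cat"
    [0, 0, 1, 2],   # "pet"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # number of latent "topics"
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T     # documents in topic space

def cos(a, b):
    """Cosine similarity between two topic-space vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

In the reduced space, documents about the same latent topic land close together even when they never share an exact word count pattern.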
Shawn Masters
(April 2, 2015)
The amount of data corporations have to confront is so immense we have yet to settle on an environment-wide euphemism to describe it. Whether it’s big data, data lakes, data tsunamis or even the datapocalypse, the message is clear – we must modify our business practices to navigate and survive the data flood. The bad news is data complexity will only continue to grow. The good news is that technology has provided us with a tool we need to keep our head above water – machine learning.
Don Boyd
(September 02, 2014)
We have chosen R because:

– It is extremely flexible, allowing us to do data collection, data management, exploratory data analysis, and other essential non-modeling tasks.

– Manipulating matrices is easy.

– It has sophisticated tools for modeling investment returns and for analyzing and presenting results of simulations. And it has great tools for visualizing results.

– The work can be completely open and reproducible, which is essential to the success of this project.
Neil Lawrence
People now leave a trail of data-crumbs wherever we travel. Supermarket loyalty cards, text messages, credit card transactions, web browsing and social networking. The power of this data emerges, like that of capital, when it’s accumulated. (…) Where does this power come from? Cross linking of different data sources can give deep insights into personality, health, commercial intent and risk. The aim is now to understand and characterize the population, perhaps down to the individual level.
Martin Doyle
(October 23, 2014)
Solving data quality problems requires investment, and it’s an investment no business can avoid. Poor data quality can lead to compliance issues, legal challenges and increased effort from all sides. Over time, as data (or the information derived from it) decays, inefficiency and inaccuracy becomes a serious issue that can hamper progress throughout the organisation. Customers also have higher expectations and feel more empowered to complain if they don’t feel their data is being handled correctly.
Matt Ritter
(January 5, 2016)
Be the Google of Your Organization: Everybody understands that data is key to Google’s dominance. Companies across every industry are trying to establish themselves as the owners and users of the best data available. Watching this data gold rush, it can be easy to forget about the opportunities much closer to home. Within your company, some people are using data to create and advocate effective strategic arguments. How do you ensure that you are a Google within your company, instead of an Alta Vista?
Q Ethan McCallum, Ken Gleason
Analysts will need a proper understanding of math, statistics, algorithms, and other related sciences in order to deliver meaningful results. They must pair that theoretical knowledge with a firm grasp of the modern-day tools that make the analyses possible. That means having an ability to express queries in terms of MapReduce or some other distributed system, an understanding of how to model data storage across different NoSQL-style systems, and familiarity with libraries that implement common algorithms.
• Big data is about the infrastructure – ensuring that the underlying hardware, software and architecture have the ability to enable analytics. Big data is about storage, speed, performance and functionality.
• Analytics is about the data and its impact on the business, about putting in the proper data models, algorithms and tools in place to manipulate and understand the data in a way that drives effective decision-making. In other words, analytics is about enabling informed decisions and measuring impact.
Albert Einstein You believe in a God who plays dice, and I in complete law and order in a world which objectively exists, and which I, in a wildly speculative way, am trying to capture. I firmly believe, but hope that someone will discover a more realistic way, or rather a more tangible basis than it has been my lot to do. Even the great initial success of the quantum theory does not make me believe in the fundamental dice game, although I am well aware that your younger colleagues interpret this as a consequence of senility.
Mkhuseli Mthukwane
(August 27, 2015)
Data Science forms the very substratum of an Analytics Practitioner’s work; it’s what sets us apart from Statisticians or Mathematicians. However, in some instances we cannot rely on it alone; we need to employ other measures to increase its definitiveness. In any event, I am sure many Data Scientists use math and other means to augment the potency of their Analytics, some not even scientific at all. It is undeniably prudent to do so where necessary, especially in fields that demand a higher standard of accuracy and care.
Mirko Krivanek
(October 1, 2014)
‘The end of the Data Scientist Bubble’. This was the subject of a provocative article posted on Oracle’s blog, two days ago. It certainly shows how far from reality some big companies are. They confuse people who call themselves data scientists (or get assigned that job title) with those who are true data scientists, who might use a different job title. Many times the issue is internal politics that creates the confusion: not recognizing a real data scientist with success stories to share, or not leveraging them.
Lyndsay Wise
(December 22, 2014)
Analytics no longer needs to be a separate application but can be accessed within operational systems to help business users gain the visibility they require. Businesses want to have a cohesive view of information. The ability to embed analytics within an application gives them a new way of deploying analytics without limiting who can access data. Better access to data in general, also helps sell its value overall, providing organizations with the basis to justify budgetary allocations to manage their data more effectively.
Seth Grimes
I’ve been describing term usage: A bit of language analysis. Analysis is an examination of structure, composition, and meaning that provides insight to advance some purpose. Analysis may be heuristic, informal, and/or qualitative.
Contrast with analytics, which is algorithmic rather than heuristic. I define analytics as the systematic application of numerical and statistical methods that derive and deliver quantitative information, whether in the form of indicators, tables, or visualizations. Analytics is formal and repeatable.
Patrick Marshall
When a new, powerful tool comes along, there’s a tendency to think it can solve more problems than it actually can. Computers have not made offices paperless, for example, and Predator drones haven’t made a significant dent in the annual number of terrorist acts. … While big data tools are, indeed, very powerful, the results they deliver tend to be only as good as the strategy behind their deployment. A closer look at successful big data projects offers clues as to why they are successful … and why others fall short of the mark.
Anne Russell
(October 27, 2014)
I’ve been thinking a lot about data, where it comes from, and what it looks like. I can’t help it. I’ve been a data geek for almost 15 years. And I find data beautiful. Not necessarily in its raw form, mind you. Then it’s just messy and more often than not a pain to deal with, especially when it gets really, really big. But when smart, creative people start to clean it up and use it in different ways to find the hidden stories that make sense, it can help us learn things in ways that we never expected. And that can be an exceptional thing.
Chris Riley It takes a lot of will power, in our data-obsessed world, to say ‘too much!’ However, there are many ways in which too much information is destroying productivity and actually causing bad decision making, not good. But it is hard to avoid the world of opportunities that has been opened in data collection and analysis. So how do you balance the two? The first step is to understand there is a big difference between data collection and its utilization. While it seems subtle, the difference is key, and utilization is where many make mistakes.
Vikash K. Mansinghka
(Sep 28, 2015)
Probabilistic inference is a widely-used, rigorous approach for processing ambiguous information based on models that are uncertain or incomplete. However, models and inference algorithms can be difficult to specify and implement, let alone design, validate, or optimize. Additionally, inference often appears to be intractable. Probabilistic programming is an emerging field that aims to address these challenges by formalizing modeling and inference using key ideas from probability theory, programming languages, and Turing-universal computation.
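The idea of probabilistic inference that Mansinghka describes can be illustrated in miniature with exact Bayesian enumeration over a toy model (a hand-rolled sketch; real probabilistic programming systems automate this over arbitrary programs):

```python
from fractions import Fraction

# Which of two coins produced the observed flips? Exact inference by
# enumerating hypotheses and applying Bayes' rule. All numbers are
# illustrative assumptions.
priors = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}

def posterior(observations):
    # Unnormalized posterior: prior times likelihood of the data.
    weights = {}
    for coin, prior in priors.items():
        like = Fraction(1)
        for obs in observations:
            like *= p_heads[coin] if obs == "H" else 1 - p_heads[coin]
        weights[coin] = prior * like
    total = sum(weights.values())
    return {coin: w / total for coin, w in weights.items()}

post = posterior(["H", "H", "H"])   # three heads in a row
```

Three heads shift belief strongly toward the biased coin, yet the fair hypothesis keeps nonzero mass: exactly the kind of calibrated handling of ambiguity the quote is about.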
Paul Barsch
(October 13, 2014)
Forecasting is hard, and even those who sometimes get it right often fail on a continuous basis. But fear not: there are three steps you can take to drastically improve your forecast accuracy, but you’ll have to be willing to put in the work, and possibly put your ego aside, to get there.
1) First, understand that domain knowledge of a particular area doesn’t necessarily mean you’ll see the future better than anyone else.
2) Second, if you want better forecasts, run your expert opinions by others.
3) Third, bring your data – in fact, bring all of them.
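Step 2, running your opinions by others, has a well-known quantitative cousin: combining several independent forecasts often beats most of the individual ones. A toy sketch with invented numbers:

```python
# Four hypothetical expert forecasts of a quantity whose true value
# turns out to be 100. Numbers are illustrative, not from the quote.
actual = 100.0
experts = [80.0, 95.0, 120.0, 104.0]

def abs_error(forecast):
    return abs(forecast - actual)

combined = sum(experts) / len(experts)      # simple average of opinions
errors = [abs_error(f) for f in experts]
```

Here the individual errors are 20, 5, 20 and 4, while the combined forecast is off by only 0.25, because the experts' biases partly cancel. Averaging is not guaranteed to beat the single best expert, but it reliably beats the typical one.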
Daniel Gutierrez
(November 5, 2014)
Once the often laborious task of data munging is complete, the next step in the data science process is to become intimately familiar with the data set by performing what’s called Exploratory Data Analysis (EDA). The way to gain this level of familiarity is to utilize the features of the statistical environment you’re using (R, Matlab, SAS, Python, etc.) that support this effort – numeric summaries, aggregations, distributions, densities, reviewing all the levels of factor variables, applying general statistical methods, exploratory plots, and expository plots.
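The numeric-summary part of the EDA toolkit that Gutierrez lists can be sketched with the Python standard library alone (toy data, illustrative only):

```python
import statistics as st
from collections import Counter

# Numeric summaries over a toy numeric column.
values = [3.1, 4.7, 2.2, 5.0, 4.7, 3.8, 2.9, 4.1]
summary = {
    "n": len(values),
    "mean": st.mean(values),
    "median": st.median(values),
    "stdev": st.stdev(values),
    "min": min(values),
    "max": max(values),
}

# "Reviewing all the levels of factor variables" on a toy categorical column:
colors = ["red", "blue", "red", "green", "red"]
levels = Counter(colors)    # each distinct level with its frequency
```

Environments like R or pandas wrap all of this in one call (`summary()`, `describe()`), but the underlying quantities are exactly these.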
G. U. Yule and M. G. Kendall
It is with data affected by numerous causes that Statistics is mainly concerned. Experiment seeks to disentangle a complex of causes by removing all but one of them, or rather by concentrating on the study of one and reducing the others, as far as circumstances permit, to a comparatively small residuum. Statistics, denied this resource, must accept for analysis data subject to the influence of a host of causes, and must try to discover from the data themselves which causes are the important ones and how much of the observed effect is due to the operation of each.
Mark van Rijmenam
(October 16, 2014)
In the fast moving world of today, data is being created at lightning speed. Data comes from an infinite variety of sources and all this data can be used to discover valuable business insights. Combining internal and external data can enable organisations to beat the competition, as the analysis will provide valuable insights. The more business users that work with such insights, the better your organisation will become. Organisations should therefore strive for a data-driven, information-centric culture, where every business user makes decisions based on data.
Anders Arpteg, Björn Brinne, Luka Crnkovic-Friis, Jan Bosch
(29 October 2018)
A product owner has many responsibilities, including being able to push changes to production in time. As it may take time to, for instance, implement sufficient logging and to analyze the results, a product owner may be unwilling to spend sufficient time implementing necessary logging. Without proper logging in place, there will not be sufficient data to make accurate retention predictions. This can result in discussions between data scientists and product owners, who may not only have different goals but also different mindsets about how decisions should be made.
Durgesh Kaushik
(October 9, 2015)
Analytics, no matter how advanced, do not remove the need for human insights. On the contrary, there is a compelling need for skilled people with the ability to understand data, think from the business point of view and come up with insights. For this very reason, technology professionals with Analytics skills are finding themselves in high demand as businesses look to harness the power of Big Data. A professional with analytical skills can master the ocean of Big Data and become a vital asset to an organization, boosting the business and their career.
Mark van Rijmenam
For all its tremendous power and benefits, Hadoop does have drawbacks. How it moves data is complex, and it’s not always the most efficient execution with Big Data and unstructured data processing. The automatic association between Big Data and Hadoop is becoming looser as more alternatives to Hadoop are developed. Some have speed advantages, while others allow streaming processing or make more efficient use of hardware. Hadoop alternatives are emerging, and those who deal with Big Data or unstructured data are wise to scope them out when considering their own needs.
Andreas Blumauer
(October 28, 2014)
I can see at least two options where methods from Data Science will benefit from Linked Data technologies and vice versa:

– Machine learning algorithms benefit from the linking of various data sets by using ontologies and common vocabularies as well as reasoning, which leads to a broader data basis with (sometimes) higher data quality

– Linked Data based knowledge graphs benefit from Graph Data Analyses to identify data gaps and potential links (find an example for a semantic knowledge graph about ‘Data Science’ here: http://…/data-science )
Foster Provost & Tom Fawcett
It is important to understand data science even if you never intend to do it yourself, because data analysis is now so critical to business strategy. Businesses increasingly are driven by data analytics, so there is great professional advantage in being able to interact competently with and within such businesses. Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision-making, or to see data-oriented competitive threats.
The mechanical process by which data scientists and citizen data scientists make better use of data and analytics is underpinned by a deeper question about the organization as a whole: Does it have processes for sharing anything? This is not always a given in companies that have grown quickly, have grown through mergers and acquisitions or have begun to shrink. If the culture has never embraced or fostered the notion of transparency and sharing, then whatever process the company may put in place to use software to publish analytical models and the data they harvest is unlikely to succeed.
Robert Morison Robert Morison, lead faculty member for the International Institute for Analytics, provided three reasons businesses experience big data failures. Briefly, they are as follows:
1. As cited in the piece, clinging to a traditional IT project management style. Solution: Think R&D.
2. Businesses are taken in by the hype and make their first big data project a big deal. Solution: Businesses should start with a smaller project that will “move the proverbial needle.”
3. Reasonably good analytics are done, but they are not adopted. Solution: The business has to own the problem or the ambition to improve.
Eve the Analyst
(7th November 2017)
Business culture needs to go through a fundamental change to become data-driven. It’s not the new tools, more data, or a staff of PhD-holders that will make the change happen. A system, a program, or a robot designed to be most rational, capable, and failure-proof is put to action only if the environment it operates in allows it. Analytics have been available for years, but they continue being misused because the business models they aid are inherently incompatible. Centrally managed organisations by design impede change; it’s time that decision making becomes more democratic and distributed, as is data.
Shahbaz Ali
(DEC 24, 2014)
When data is locked in silos, organizations are unable to find and include all enterprise data for use with big data analytics tools. Planning to implement a data-centric data management strategy enables the distributed metadata repository to be a source for analytics tools, as it can be used to provide real-time insight without having to migrate data from silos to a separate analytics platform. It also enhances the quality of results, because having more relevant data often produces more accurate analysis. If organizations can harness all of their data, they will attain a greater competitive advantage.
Gregory Piatetsky
Here are 7 steps for learning data mining and data science. Although they are numbered, you can do them in parallel or in a different order.
1. Languages: Learn R, Python, and SQL
2. Tools: Learn how to use data mining and visualization tools
3. Textbooks: Read introductory textbooks to understand the fundamentals
4. Education: watch webinars, take courses, and consider a certificate or a degree in data science
5. Data: Check available data resources and find something there
6. Competitions: Participate in data mining competitions
7. Interact with other data scientists, via social networks, groups, and meetings.
Tom Phelan
(February 10, 2015)
Today’s enterprise IT teams, and the developers in their ranks, like to fail fast. They don’t have the luxury of spending months developing and testing a new application only to find out it does not, or will no longer, meet the needs of the business. That’s something they need to find out ASAP, and it requires agility. Big Data applications are no exception. To be successful, organizations need to extend the agile, DevOps model to Big Data and allow data scientists and the developers they work with to get to the answers they need as quickly as possible. A private cloud infrastructure can be the answer for many organizations.
Andrew Pease
(November 3, 2014)
When answering big business questions, the statistically-minded, yet communicative data scientist is first needed to translate the business problem into the required analytic approach, both in terms of the available data and the algorithms required.
• What data is available to answer the question?
• Do we need all of the data to answer the question?
• Does it require a complex algorithm with lots of variables?
• How often and how quickly does the analysis need updates?
• How ‘up-to-date’ do the inputs need to be?
• How complete are the inputs in predicting a particular event?
• How much variance remains unaccounted for?
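The last question, how much variance remains unaccounted for, is exactly what the coefficient of determination answers: 1 - R² is the unexplained fraction. A short sketch with invented numbers:

```python
# Toy actuals and model predictions (illustrative values only).
actual    = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.5, 14.5, 15.5, 18.5]

mean_y = sum(actual) / len(actual)
ss_tot = sum((y - mean_y) ** 2 for y in actual)                 # total variance
ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))   # residual variance
r2 = 1 - ss_res / ss_tot
unexplained = 1 - r2    # fraction of variance the model does not account for
```

With these numbers the model explains 95% of the variance, leaving 5% unaccounted for, which is the figure the data scientist would report back against the business question.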
Mark van Rijmenam
The era of simple bar graphs and pie charts is over. Big Data requires a different approach to visualizations; one that will significantly impact your organisation’s bottom-line if carried out correctly. Companies that have become data-driven and information-centric understand this. They have made visualizations of large data sets an integral part of their decision making. Visual data analytics goes beyond standard reporting and is all about rich interactive visualizations where anyone can explore data and discover new insights. Proper data visualizations are all about exploring, deducing and discovering data in an intuitive way.
Isak Bosman
The “Iceberg Secret” speaks to the apparent gap between technical and non-technical stakeholders when it comes to evaluating the quality and progress of building an AI-based solution. Often the solution is judged on the visualization of the data or the scalar output of the predictions (the 10%), and little regard is given to the bulk of the work that is spent on the data preparation (the 90%). Data and ML Engineers understand the exponential value of carefully focusing on understanding and preparing the data, and how important it is to ensure that a mature data pipeline is in place before any time is spent on the top 10%.
Strategy& Big data have the potential to improve or transform existing business operations and reshape entire economic sectors. Big data can pave the way for disruptive, entrepreneurial companies and allow new industries to emerge. The technological aspect is important, but insufficient to allow big data to show their full potential and to stop companies from feeling swamped by this information. What matters is to reshape internal decision-making culture so that executives base their judgments on data rather than hunches. Research already indicates that companies that have managed this are more likely to be productive and profitable than the competition.
Foster Provost & Tom Fawcett
Success in today’s data-oriented business environment requires being able to think about how these fundamental concepts (Data Mining, Predictive Analytics) apply to particular business problems – to think data-analytically. Data should be thought of as a business asset, and once we are thinking in this direction we start to ask whether (and how much) we should invest in data. Thus, an understanding of these fundamental concepts is important not only for data scientists themselves, but for anyone working with data scientists, employing data scientists, investing in data-heavy ventures, or directing the application of analytics in an organization.
Michael Cavaretta
(August 21, 2014)
The thing that I would say is, “be flexible.” There are appropriate times to be looking at enterprise-level SAS, and there are other times to be looking at something you can hack in Python or do a quick analysis in R. The tools and technologies all have their strengths and weaknesses. The biggest thing is you must “stick” to fundamentals before going forward, rather than just concentrating on a fixed tool. I believe you need to get the fundamentals down first. So learn the basic statistics, understand how basic programming works, get that stuff done, then look to specialize, whether that be in network analysis or visualizations or information systems.
Andrew Pease
(November 3, 2014)
And that’s where the statistician needs to take it easy:
• Start with the results, so the audience has a clear view on the outcome
• Proceed to explain the analysis simply and with a minimum of statistical jargon
• Describe what an algorithm does, not the specifics of your killer algo
• Visualize the inputs (e.g.: a correlation matrix showing an ‘influence heat map’)
• Visualize the process (e.g.: a regression line on a chief predictor variable)
• Visualize the results (e.g.: a lift chart to show how much the analysis is improving results)
• Always, always tie each step back to the business challenge
• Always be open to questions and feedback.
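The “influence heat map” bullet above boils down to computing a correlation matrix of the inputs before plotting it; a minimal numpy sketch with synthetic data:

```python
import numpy as np

# Three synthetic input variables: a predictor, a strongly related
# input, and an unrelated one. All values are invented for illustration.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
data = np.stack([
    x,                                          # predictor
    2 * x + rng.normal(scale=0.1, size=200),    # strongly related input
    rng.normal(size=200),                       # unrelated input
])

# Symmetric matrix of pairwise correlations in [-1, 1]; this is what a
# heat-map visualization would color cell by cell.
corr = np.corrcoef(data)
```

Handing the audience this matrix as a colored grid, rather than the algorithm's internals, is precisely the kind of input visualization the list recommends.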
n.n. Fi yuo cna raed tihs yuo hvae a sgtrane mnid too. Cna yuo raed tihs? Olny 55 plepoe out of 100 cuold. I could not blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cambridge Uinervtisy, it deos not mtaetr in waht oerdr the ltteres in a wrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer be in the rghit pclae.
The rset can be a taotl mses and you can sitll raed it whotuit a pboerlm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Azanmig? I awlyas tghuhot slpeling was ipmorantt! If you can raed tihs forwrad it.
Daniel Gutierrez
As I’ve worked in the data science arena long before that name was even used, and having gone through an aborted graduate program in computer science and mathematical statistics some years ago, I think I have a unique perspective for how Data Science manifests itself today. I see it as a confluence of disciplines: computer science, mathematical statistics, probability theory, machine learning, data analysis, and visualization. Being a theorist specializing in applications of machine learning, I can safely say that I’d be hard pressed to minimize the importance of any of these disciplines, especially statistics, in its overall impact on the field.
Tracey Wallace
(September 8, 2014)
The stunning achievements and advancements made on behalf of humanity within mathematics are paramount. After all, it is thanks to engineers, mathematicians and statisticians that we ever put a man on the moon, that we’ve ever mapped the floor of the ocean, that human eyes have ever been able to see out-of-this-world phenomena the likes of colliding galaxies or the rings of Jupiter. In fact, it’s arguable that mathematics is the very foundation of our physical world. The Fibonacci sequence, while often fodder for conspiracy theorists, can be found in every corner of the universe, from spiral galaxies, to sea shells, to the ratios of your facial features.
Kanishk Priyadarshi
Our conversations with chat bots and digital agents will be highly personalized. And they’ll already know about our problems, so every conversation won’t have to start from scratch. The bots will know that you’ve called five times and that you’re frustrated. They’ll have access to in-depth knowledge about everyone else who’s called with a similar issue, so they’ll know which answer is most likely to resolve your problem. And all of this will seem to happen instantaneously. For years, we’ve been talking about big data. This is where big data finally becomes useful to large numbers of people. The robot will know the answer to your question before you even ask.
Philipp Max Hartmann, Mohamed Zaki, Niels Feldmann, Andy Neely In the field of ‘big data’, Gartner identified five different types of data source used to ‘exploit big data’ in a company (Buytendijk et al., 2013): ‘Operational data comes from transaction systems, the monitoring of streaming data and sensor data; Dark data is data that you already own but don’t use: emails, contracts, written reports and so forth; Commercial data may be structured or unstructured, and is purchased from industry organisations, social media providers and so on; Social data comes from Twitter, Facebook and other interfaces; Public data can have numerous formats and topics, such as economic data, socio-demographic data and even weather data.’
Mike Urbonas
(April 11, 2015)
The best technology tools will be those that empower subject matter experts to quickly apply intuitive reasoning:
• Data visualization and advanced analytics that do not require programming or advanced technical skills
• Data integration and workflow tools to rapidly infuse existing data sets with new, untapped data sources that enable a more complete analytic picture – again with no programming required
• Agile application development tools that shield the user from coding, data connectivity and other programming challenges
• Tools for predictive analytics that support intuitive reasoning as to what data attributes and other conditions will impact future performance.
Tracey Wallace
(September 8, 2014)
Our Collective Data Science Duty: Here’s the thing, technology is empowering the public in never before seen ways, and data is the backbone of that shift. Between wearable tech and digital identity platforms, people are creating more data every day than has ever been created in decades, no, centuries past. Each of us is essentially our own personal data scientist, and those working in the digital space have very much been their own statisticians for quite some time. It’s why platforms like Google Analytics, Omniture and more are so popular across the industry. They put the power of analytics in the hands of users, requiring little training but returning lots of measurability.
Tom Phelan
(February 10, 2015)
An agile environment is one that’s adaptive and promotes evolutionary development and continuous improvement. It fosters flexibility and champions fast failures. Perhaps most importantly, it helps software development teams build and deliver optimal solutions as rapidly as possible. That’s because in today’s competitive market chock-full of tech-savvy customers used to new apps and app updates every day and copious amounts of data with which to work, IT teams can no longer respond to IT requests with months-long development cycles. It doesn’t matter if the request is from a product manager looking to map the next rev’s upgrade or a data scientist asking for a new analytics model.
Or Shani
(January 27, 2015)
What was once just a figment of the imagination of some of our most famous science fiction writers, artificial intelligence (AI) is taking root in our everyday lives. We’re still a few years away from having robots at our beck and call, but AI has already had a profound impact in more subtle ways. Weather forecasts, email spam filtering, Google’s search predictions, and voice recognition, such as Apple’s Siri, are all examples. What these technologies have in common are machine-learning algorithms that enable them to react and respond in real time. There will be growing pains as AI technology evolves, but the positive effect it will have on society in terms of efficiency is immeasurable.
Michel Bruley
Management gurus agree in distinguishing four levels of decision making for businesses. Strategic decisions define the relationships between the company and its environment (choice of activities, markets, technology clusters, …); they set axes and policies for the long term. Organic decisions define the organizational structures, processes and information systems (business, mission, delegation, responsibility, resources, …); they set the framework for the medium term. Management decisions set the direction and expected performance for the short term (objectives, resources, control, …). Finally, operational decisions concern day-to-day operation and the supervision of execution.
Shambhavi Vyas
(March 3, 2015)
We’ve all heard about Big Data, and for most of us, the concept of taming big data is overwhelming. In the average business enterprise, data is everywhere – hidden in spreadsheets, report writers, databases, ERP, CRM, SCM and other large systems as well as in legacy systems and small packages designed for accounting or other targeted departmental activities. The first order of business in taming big data is to pull all of that data together. Once the business has cataloged and compiled its data, it faces an even bigger task, namely, how to tame that data and make it accessible, clear and concise so that every user can leverage the data to effectively contribute to the business bottom line.
Randy Bartlett
It is going to take quants to deliver the real promises of Big Data. An understanding of statistics is necessary to assess how to analyze Big Data and to properly lead and organize the analytics resources handling it. As Deming said, ‘The nonstatistician cannot always recognize a statistical problem when he sees one.’ We should expect depictions of Big Data that are devoid of an understanding of statistics. Even if we can ‘get by’ without addressing statistical errors or reducing the data, we still need statistical thinking, statistical assumptions, and statistical techniques. We need to combine our knowledge about the business problem with a mastery of techniques from all three tool boxes.
Jeff Leek
Data science done well looks easy – and that is a big problem for data scientists. The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really ‘hard’ and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc. In the meantime, don’t be fooled by what looks like simple data science – it can often be pretty effective.
Guerrilla Analytics
(July 21, 2015)
Data Scientists and automation (data products, algorithms, production code, whatever) are complementary functions. Good Data Science supports automation. It quickly adds value by investigating, testing, and quantifying hypotheses about existing data and potential new data. Simply switching on software ignores the reality of working with data, regardless of the claims of that software. Data is full of nuances, errors and unknown relationships that are best discovered and tested by an expert Data Scientist. This takes time and does not scale but it does not have to scale. It is the necessary prudent investment that you make before spending months in product development and automation of the wrong algorithm on the wrong or broken data.
Anders Arpteg, Björn Brinne, Luka Crnkovic-Friis, Jan Bosch
(29 October 2018)
One clear conclusion of this work is that, although the DL technology has achieved very promising results, there is still a significant need for further research and development into how to easily and efficiently build high-quality, production-ready DL systems. Traditional SE has high-quality tools and practices for reviewing, writing tests, and debugging code. However, they are rarely sufficient for building production-ready systems containing DL components. If the SE community, together with the DL community, could make an effort in finding solutions to these challenges, the power of the DL technology could be made available not only to researchers and large technology companies, but also to the vast majority of companies around the world.
Kristen Paral
(August 23, 2014)
Most companies think traditional Business Intelligence (BI) in which data is collected in warehouses, models are created based on business criteria and results are visualized through reports is sufficient. While this is true if your only concern is to answer basic questions like which customers are more profitable, it is not enough to deliver transformative business change like Data Science can. Data Science takes a different approach than BI in that insights and models are derived from the data through the application of statistical and mathematical techniques by Data Scientists. The data drives the modeling and insights. When you let the data guide you – you are less likely to try to use the data to support wrong predispositions or conclusions.
Mike Barlow
Top takeaways from my interviews with experts from organizations offering AI products and services:
• AI is too big for any single device or system
• AI is a distributed phenomenon
• AI will deliver value to users through devices, but the heavy lifting will be performed in the cloud
• AI is a two-way street, with information passed back and forth between local devices and remote systems
• AI apps and interfaces will be designed and engineered increasingly for nontechnical users
• Companies will incorporate AI capabilities into new products and services routinely
• A new generation of AI-enriched products and services will be connected and supported through the cloud
• AI in the cloud will become a standard combination, like peanut butter and jelly
Miguel A. Hernán, John Hsu, Brian Healy
(July 12, 2018)
The component of prediction tasks that can be easily automated is the one that does not involve any expert knowledge. Prediction tasks require expert knowledge to specify the scientific question (what input and what outputs) and to identify/generate relevant data sources. (The extent of expert knowledge varies across different prediction tasks.18) However, no expert knowledge is required for prediction after the inputs and outputs are specified and measured in a particular dataset. At this point, a machine learning algorithm can take over the data analysis to deliver a mapping and quantify its performance. The resulting mapping may be opaque, as in many deep learning applications, but its ability to map the inputs to the outputs with a known accuracy is not in question.
Wen-wen Tung, Ashrith Barthur, Matthew C. Bowers, Yuying Song, John Gerth, William S. Cleveland
Data science is, at its foundation, centered on the analysis of data. The technical areas of data science are those that need direct study to make data analysis as effective as possible. The areas are:
1. Statistical theory,
2. Statistical models,
3. Statistical and machine-learning methods,
4. Visualization methods,
5. Algorithms for statistical, machine-learning, and visualization methods,
6. Computational environments for data analysis: hardware, software, and database management, as well as
7. Live analyses of data where results are judged by the subject-matter findings, not the methodology and systems that are used.
Of course, these areas can be divided into sub-areas, which in turn can have sub-sub-areas, and so forth. Also, research in an area can depend heavily on research in others.
Foster Provost & Tom Fawcett
On a scale less grand, but probably more common, data analytics projects reach into all business units. Employees throughout these units must interact with the data science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making. This requires a close interaction between the data scientists and the business people responsible for decision-making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.
Alice Zheng
If we think of training the model as a part of it, then even after you’ve trained a model and evaluated it and found it to be good by some evaluation metric standards, when you deploy it, where it actually goes and faces users, then there’s a different set of metrics that would impact the users. You might measure: how long do users actually interact with this model? Does it actually make a difference in the length of time? Did they used to interact less and now they’re more engaged, or vice versa? That’s different from whatever evaluation metric that you used, like AUC or per class accuracy or precision and recall. … It’s probably not enough to just say this model has a .85 F1 score and expect someone who has not done any data science to understand what that means. How good are the results? What does it actually mean to the end users of the product?
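Zheng’s point that a “.85 F1 score” means little to a non-specialist is easier to see once you know the F1 score is just the harmonic mean of precision and recall. A minimal sketch (the numbers are illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with 0.9 precision and 0.8 recall:
print(round(f1_score(0.9, 0.8), 3))  # 0.847
```

The harmonic mean punishes imbalance: a model with perfect precision but near-zero recall still scores near zero, which is exactly the kind of nuance end users of a product never see in the raw number.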
Sean McClure
(November 3, 2014)
The majority of organizations have barely moved beyond static BI reports and are unaware of the actual potential their data holds. Going from being ‘data unaware’ to investing in a big data architecture in one leap sets a company up for a bad ROI in analytics. A solid investment must begin with understanding what data is actually available and identifying the low hanging ‘data fruits’ that can lead to real value for the company. This can be used to build lightweight solutions that are imperfect but hugely beneficial. It can provide real-world tools for decision support via recommended actions or highlighted opportunities in real-time. It can offload much of the routine repetitive decision-making to algorithms so that professionals can operate with a more strategic view of the organization and bring their creative talents to their company’s challenges.
Robert Chang Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.
Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).
Philip Russom
Managing big data for analytics is not the same as managing DW data for reporting. In fact, the two are almost opposites … . For example, reporting is about seeing the latest values of the numbers that you track over time via a report. Obviously, you know the report, the business entities it represents, and the data warehouse that feeds the report. An analysis is more about discovering variables you don’t know, based on data that you probably don’t know very well. Also, a report requires a solid audit trail, so its data must be managed with well-documented metadata and possibly master data, too. Since most analyses have no expectation of an audit trail, there’s no need to manage one. That’s just a sampling of the differences. The point is to embrace Big Data Management for analytics as a unique practice that doesn’t follow all the strict rules we’re taught for reporting and data warehousing.
Vincent Granville
(November 15, 2014)
A different perspective on what data scientists are capable of:
• Imagine dozens of scenarios and rank them by chance of occurring
• Get siloed data from various departments (finance, sales, marketing, product, IT)
• Analyze the data in connection with the scenarios (including checking data validity)
• Get external data (competitive intelligence) as needed
• Find the causes (not just correlations)
• Find the remedies
• Detect issues well before anyone else can see them, by looking at summary data
• Complete the analysis with a 48-hour turnaround
Such a data scientist, who can save a company billions, is usually not hired, for the following reasons:
• Companies are looking for coders, not business solvers, when they hire a data guru, despite claiming the contrary
• A data scientist without Python on his resume is unlikely to ever get hired
• Hard work gets rewarded, smart work does not.
Randy Bartlett
Today’s information rush is exemplified by the great promise of overflowing observational data, hyper communications, and the approaching Internet of Things. The promotional hype initially comes from journals, self-glorifying books, and vendors, all with a certain perspective that is not informed by practice experience—publishers are unable to discern qualifications. This creates misinformation stampedes, with energized statistics deniers writing amplifying blogs, presentation decks, et al., which further mischaracterize and even adulterate statistics. The downstream echoes talk everyone into believing their own hyped fabrications. Two of the problems are that 1. selling good statistics practice can be less lucrative than cutting some serious corners; and 2. promoting services, workshops, data-analysis results, etc. is easier when not encumbered by competently wielding and accurately depicting statistics.
Jeff Leek
(October 9, 2014)
As data becomes cheaper and cheaper there are more people that want to be able to analyze and interpret that data. I see more and more that people are creating tools to accommodate folks who aren’t trained but who still want to look at data right now. While I admire the principle of this approach – we need to democratize access to data – I think it is the most dangerous way to solve the problem. … The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlations like the plot above, or issues with data quality, or missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is also getting better about reproducibility, but it is still a major problem for many interfaces.
Yanir Seroussi
People like simple explanations for complex phenomena. If you work as a data scientist, or if you are planning to become/hire one, you’ve probably seen storytelling listed as one of the key skills that data scientists should have. Unlike “real” scientists that work in academia and have to explain their results mostly to peers who can handle technical complexities, data scientists in industry have to deal with non-technical stakeholders who want to understand how the models work. However, these stakeholders rarely have the time or patience to understand how things truly work. What they want is a simple hand-wavy explanation to make them feel as if they understand the matter – they want a story, not a technical report (an aside: don’t feel too smug, there is a lot of knowledge out there and in matters that fall outside of our main interests we are all non-technical stakeholders who get fed simple stories).
Ray Major
(October 8, 2014)
Descriptive Analytics: insight into the past

These use data aggregation and data mining techniques to provide insight into the past and answer: “What has happened?”
Use descriptive analytics when you need to understand, at an aggregate level, what is going on in your company, and when you want to summarize and describe different aspects of your business.
Predictive Analytics: understanding the future

These use statistical models and forecasting techniques to understand the future and answer: “What could happen?”
Use predictive analytics any time you need to know something about the future, or to fill in information that you do not have.
Prescriptive Analytics: advise on possible outcomes

These use optimization and simulation algorithms to advise on possible outcomes and answer: “What should we do?”
Use prescriptive analytics any time you need to provide users with advice on what action to take.
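The three levels above can be sketched in a few lines of Python. This is a toy illustration, not a real analytics pipeline; the sales figures and the decision rule are invented for the example:

```python
# Toy monthly sales figures (illustrative numbers).
sales = [100, 110, 120, 130]

# Descriptive: summarize what has happened.
average = sum(sales) / len(sales)

# Predictive: project what could happen (naive linear-trend forecast).
step = sales[-1] - sales[-2]
forecast = sales[-1] + step

# Prescriptive: advise what to do, given a simple made-up rule.
action = "increase stock" if forecast > average else "hold stock"

print(average, forecast, action)  # 115.0 140 increase stock
```

Each level builds on the one before it: the forecast uses the historical data the descriptive step summarized, and the recommendation uses the forecast.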
datumengineering Most organisations work in silos on their data, and in the absence of an effective communication channel between the Datawarehouse and Analytics teams, the whole effort of effective analysis goes awry. 80% of organisations divide the effort of Datawarehouse, Advanced Analytics & Statistical Analysis into different teams, and these teams not only address different business problems but are also aligned with different architects. In my opinion this could be the main reason that kills the flavor of Data Science. Interestingly, during one of my assignments in the field of retail data analysis, I observed that the datawarehouse team had developed only to the maturity level of summarization and aggregation. I realized that this datawarehouse or data store world would end after delivering a bunch of reports and some dashboards. Eventually, they would be more interested in archiving the data thereafter. That’s the bottleneck!
William Vorhies
(August 19, 2014)
Two decades ago the folks who prepared our reports, graphs, and visualizations were ‘data analysts’ who knew how to extract data from relational data warehouses and run it through reporting and visualization tools like Crystal Reports. Ten years ago, predictive models were built by ‘predictive modelers’ who understood both the extraction and preparation of the data as well as the specialized predictive analytic tools like SAS and SPSS that allowed them to prepare predictive models. In the last few years, Gartner now declares that we need ‘data scientists’ who have all the above skills but also understand the complexities of the new NoSQL databases like Hadoop and can marry data from many sources and types together to produce useful and profitable predictive models. The requirement for broader and deeper skills is real and must factor into any business decision to build in-house capacity, as well as vetting potential consultants.
Oren Etzioni
(January 5, 2017)
I think what’s missing in the AI conversation is a dose of realism. We have on the one extreme people like Kurzweil, who are fantastically optimistic but don’t really have data to back up their wildly optimistic predictions. On the other hand, we have people who are very afraid, like Nick Bostrom, who’s a philosopher from Oxford, or Elon Musk, who needs no introduction, and say AI is like summoning the demon, which is really religious imagery. But again, neither party has the data to base their conclusions on; it’s wild extrapolations, it’s metaphor (like AI is a demon), it’s philosophical argumentation. I think we need to have a more measured approach, where we measure AI’s performance, where we understand that superhuman success on a narrow task like Go doesn’t translate to even human performance level on a broad range of tasks, the kind that people do. I like saying my six-year-old is a lot smarter than AlphaGo—he can cross the street, more or less.
Strategy& There is no general rule dictating how organizations should navigate the stages of big data maturity. They must each decide for themselves, based on their own situation – the competitive environment they are operating in, their business model, and their existing internal capabilities. In less-advanced sectors, with executives still grappling with existing data, making intelligent use of what they already possess may have a substantial impact on decision making.
The main priorities for executives are to:
• develop a clear (big) data strategy;
• prove the value of data in pilot schemes;
• identify the owner for “big data” in the organization and formally establish a “Chief Data Scientist” position (where applicable);
• recruit/train talent to ask the right questions and technical personnel to provide the systems and tools to allow data scientists to answer those questions;
• position big data as an integral element of the operating model; and establish a data-driven decision culture and launch a communication campaign around it.
Bill Franks
(January 9, 2015)
Ensemble methods have been around for some time. Recently, however, their popularity has been increasing even further. In my opinion, this isn’t just because ensemble methods work, but also is the result of a few other trends coming together.
1. The capture and standardization of data is much more mature today, which makes it easier to get the data necessary for an analysis together. This, in turn, allows more time for analysis.
2. A broad array of complex algorithms is now available and accessible. It is entirely possible for an organization to have many strong algorithms at its fingertips.
3. The amount of processing power available at a cheap cost means that we no longer have to be stingy with our work. We can afford to execute many iterations of multiple algorithms today without costs spinning out of control.
4. Software tools are starting to build in ensemble capabilities, which makes it very easy to request and execute an ensemble process. Your organization should consider taking advantage of the “wisdom of the crowd” phenomenon and move into ensemble methods in 2015.
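The “wisdom of the crowd” idea behind ensembles can be shown with the simplest possible combiner, majority voting over several classifiers. A minimal sketch (the models and labels are hypothetical):

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        # Most common label among the models' votes for this example.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical models classifying the same four examples:
model_a = ["spam", "ham", "spam", "ham"]
model_b = ["spam", "spam", "spam", "ham"]
model_c = ["ham",  "ham", "spam", "spam"]

print(majority_vote([model_a, model_b, model_c]))
# ['spam', 'ham', 'spam', 'ham']
```

Note how the ensemble overrules each individual model’s mistakes (model_b on example 2, model_c on examples 1 and 4), which is exactly why ensembles tend to outperform their components when the components err in different places.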
Sean McClure
(November 3, 2014)
We are currently witnessing a land rush of investment in Big Data architectures promising companies that they can turn their data into gold using the latest in distributed computing and advanced analytical methods. Although there is indeed much potential in applying machine learning and statistical analysis to large datasets, many companies are hardly sitting on the kind of data that will allow them to compete using hundreds of machines chugging through terabytes of data.
But that’s okay. There is a massive benefit to just getting an organization to understand what data they do have and how they can deploy intelligent models on this data to disrupt their current approaches to doing business. This does not require the latest in parallel computing or bursting into new map-reduce paradigms just to derive insight. It does not require huge data warehouses or terabytes of unstructured data. What it does require is good science on good data. This is where organizations need to start: becoming ‘data aware’ and building an organizational culture that understands data as a real asset.
Mayank Mehta
(August 27, 2015)
To put it simply, there is too much friction. In any given workflow, you have to go through several levels to get to what you really need. For instance, say you’re part of the customer service team: You use Salesforce to get the information you need to best serve customers. But depending on the information, you have to go across half-a-dozen windows in search for the right sales pitch, product information, or other collateral. You are 15 steps into a workflow before you get to the real starting point. This wastes time, money, and reduces quality of service. This is in sharp contrast to what you have come to expect using consumer products. Think peer-to-peer payment option solutions like Square that make payments as simple as the tap of a button – eliminating dozens of process steps that you would usually go through. This simple, bare-bones approach has changed industries across the board, be it transportation (Uber), insurance (15 minutes can save you…), accounting (TurboTax), retail (Amazon Same-Day), and so on. Enterprises that provide this personalized, contextual experience will thrive and those that don’t will falter.
Debleena Roy
(30. July 2015)
The Last Mile of Analytics
1. Succeeding in Analytics and getting to the Last Mile of embedding Analytics in decision-making is not just about dealing with data. The journey from Data to Decisions requires one to look at how Analytics is operationalized in the business. And the missing piece that actually can drive or enable Analytics is not data but Culture. Can culture be enabled by Analytics?
2. To make Analytics actually a part of the Company culture, it’s not enough to have a set of people providing Analytics solutions. Analytics needs to be embedded in technology accelerators that can directly enable decisions at the point of action. This makes Analytics accountable for real business decisions rather than just providing more data.
3. Making Analytics work across the business requires collaboration and the right choice of Engagement Model which would vary based on the maturity of the organization and its decision-making, not its data needs. The business models could range from Products, Services, and Managed Solutions to Analytic Marketplaces. Unless the right business model is chosen, Analytics will remain a discrete project.
Alvine Boaye Belle
Technical debt is a metaphor used to convey the idea that doing things in a ‘quick and dirty’ way when designing and constructing software leads to a situation where one incurs more and more deferred future expenses. Similarly to financial debt, technical debt requires payment of interest in the form of the additional development effort that could have been avoided if the quick and dirty design choices had not been made. Technical debt applies to all the aspects of software development, spanning from initial requirements analysis to deployment and software evolution. Technical debt is becoming very popular from scientific and industrial perspectives. In particular, there is an increase in the number of related papers over the years. There is also an increase in the number of related tools and in their adoption in the industry, especially since technical debt is very pricey and therefore needs to be managed. However, techniques to estimate technical debt are inadequate and insufficient, since they mostly focus on requirements, code, and tests, disregarding key artifacts such as the software architecture and the technologies used by the software at hand.
Mark van Rijmenam
(31 Dec. 2014)
Pattern Analytics can be defined as a discipline of Big Data that enables business leaders to understand how different variables of the business interact and are linked with each other. Variables can be of any kind and within any data source, structured as well as unstructured. Such patterns can indicate opportunities for innovation or threats of disruption for your business and therefore require action. Finding patterns within the data and sifting them out is difficult. Machine learning can contribute by helping us humans find patterns that are relevant, but too difficult for us to see. This enables organizations to find patterns they can act on. Business leaders can learn from these patterns and use them in their decision-making process. Business leaders therefore should rely less on their gut feeling and years of experience, and more on the data. Pattern Analytics does not require predefined models; the algorithms will do the work for you and find whatever is relevant in a combination of large sets of data. The key with pattern analytics is automatically revealing intelligence that is hidden in the data, and these insights will help you grow your business.
Michael Walker
(January 25, 2015)
Data veracity is sometimes thought of as uncertain or imprecise data, yet may be more precisely defined as false or inaccurate data. The data may be intentionally, negligently or mistakenly falsified. Data veracity may be distinguished from data quality, usually defined as reliability and application efficiency of data, and sometimes used to describe incomplete, uncertain or imprecise data. The unfortunate reality is that for most data analytic projects about one half or more of the time is spent on “data preparation” processes (e.g., removing duplicates, fixing partial entries, eliminating null/blank entries, concatenating data, collapsing columns or splitting columns, aggregating results into buckets…etc.). I suggest this is a “data quality” issue, in contrast to false or inaccurate data, which is a “data veracity” issue. Data veracity is a serious issue that supersedes data quality issues: if the data is objectively false then any analytical results are meaningless and unreliable regardless of any data quality issues. Moreover, data falsity creates an illusion of reality that may cause bad decisions and fraud – sometimes with civil liability or even criminal consequences.
Alice Zheng
There’s structure in it, but it’s kind of a different form. … It’s spit out by machines and programs. There’s structure, but that structure is difficult to understand for humans. … So, you can’t just throw all of it into an algorithm and expect the algorithm to be able to make sense of it. You really have to process the features, do a lot of pre-processing, and first do things like extract out the frequent sequences, maybe, or figure out what’s the right way to represent IP addresses, for instance. Maybe you don’t want to represent latency by the actual latency number, which could have a very skewed distribution, with lots and lots of large numbers. You might want to assign them into bins or something. There are a lot of things that you need to do to get the data into a format that’s friendly to the model, and then you want to choose the right model. Maybe after you choose the model, you realize this model really is suitable for numeric data and not categorical data. Then you need to go back to the feature engineering part and figure out the best way to represent the data. … I hesitate to say anything critical because half of my friends are in machine learning, which is all about algorithms. I think we already have enough algorithms. It’s not that we don’t need more and better algorithms. I think a much, much bigger challenge is data itself, features, and feature engineering.
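Zheng’s example of binning a skewed latency feature rather than feeding the model raw values can be sketched in a few lines. The bin thresholds here are invented for illustration; in practice they would come from the observed distribution (for example, quantiles):

```python
def bin_latency(latency_ms: float) -> str:
    """Map a raw latency value (milliseconds) to a coarse categorical bin.

    Thresholds are illustrative, not derived from real data.
    """
    if latency_ms < 50:
        return "fast"
    elif latency_ms < 250:
        return "normal"
    elif latency_ms < 1000:
        return "slow"
    return "very_slow"

# A heavily skewed sample: mostly small values, one huge outlier.
raw = [12, 80, 400, 7000]
print([bin_latency(x) for x in raw])
# ['fast', 'normal', 'slow', 'very_slow']
```

After binning, the 7000 ms outlier no longer dominates the feature’s scale; the model sees four roughly comparable categories instead of a long-tailed number.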
Foster Provost & Tom Fawcett
In Web 1.0, businesses busied themselves with getting the basic internet technologies in place so that they could establish a web presence, build electronic commerce capability, and improve operating efficiency. We can think of ourselves as being in the era of Big Data 1.0, with firms engaged in building capabilities to process large data. These primarily support their current operations – for example, to make themselves more efficient. With Web 1.0, once firms had incorporated basic technologies thoroughly (and in the process had driven down prices) they started to look further. They began to ask what the web could do for them, and how it could improve upon what they’d always done. This ushered in the era of Web 2.0, in which new systems and companies started to exploit the interactive nature of the web. The changes brought on by this shift in thinking are extensive and pervasive; the most obvious are the incorporation of social-networking components and the rise of the ‘‘voice’’ of the individual consumer (and citizen). Similarly, we should expect a Big Data 2.0 phase to follow Big Data 1.0. Once firms have become capable of processing massive data in a flexible fashion, they should begin asking: What can I now do that I couldn’t do before, or do better than I could do before? This is likely to usher in the golden era of data science. The principles and techniques of data science will be applied far more broadly and far more deeply than they are today.
Alice Zheng
It’s not enough to tell someone, ‘This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.’ As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. … It really comes back to this big divide, this bottleneck, between the domain expert and the machine learning expert. I saw that as the most challenging problem facing us when we try to really make machine learning widely applied in the world. I saw both machine learning experts and domain experts as being difficult to scale up. There’s only a few of each kind of expert produced every year. I thought, how can I scale up machine learning expertise? I thought the best thing that I could do is to build software that doesn’t take a machine learning expert to use, so that the domain experts can use them to build their own applications. That’s what prompted me to do research in automating machine learning while at MSR [Microsoft Research].
Michael Jordan
Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering — uncertainty and complexity — and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity — a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism — examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages — in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.
Justin Washtell
(November 16, 2014)
1. Never assume that your data is correct. By all means know that it is, but don’t assume it. Don’t trust the client on this count – to do so might be to do them a disservice.

2. Regardless of how noisy and inconsistent the bulk of your data may be, make sure that you have a small sample of “sanity-check” data that is exceptionally good, or at least whose limitations are very well understood. If the test data is solid, any problems with the training data – even if rife – may prove inconsequential. Without solid test data, you will never know.

3. Do not force (even by mild assumption) the use of sophisticated algorithms and complex models if the data does not support them. Sometimes much simpler is much better. The problem of overfitting (building unnecessarily complex models which serve only to reproduce idiosyncrasies of the training data) is well documented, but the extent of this problem is still capable of causing surprise! Let the algorithms – in collaboration with the data – speak for themselves.

4. It follows that the algorithm and parameters that provide the best solution (when very rigorously cross-validated) can actually provide an indication of the quality of your data or the true complexity of the process that it embodies. If a thorough comparison of all the available algorithms suggests Nearest Centroid, Naive Bayes, or some Decision Stump, it is a good indication that the dominant signal in your data is a very simple one.

5. In situations like this, machine learning algorithms can actually be used to clean the source data, if there’s a business case for it. Again, a small, super-high-quality test set is essential in order to validate the efficacy of this cleaning process.
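Points 2–4 above can be illustrated with a small sketch (all data here is synthetic): a simple linear model and a deliberately over-complex polynomial are both fit on training data only, then scored on a held-out "sanity-check" set whose behaviour we fully understand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the true signal is a simple straight line plus noise.
x = rng.uniform(0, 1, 60)
y = 2 * x + rng.normal(0, 0.3, 60)

# Hold out a small test set whose limitations are well understood.
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def holdout_mse(degree):
    # Fit a polynomial of the given complexity on the training data only,
    # then score it on the held-out set.
    coefs = np.polyfit(x_tr, y_tr, degree)
    return float(np.mean((np.polyval(coefs, x_te) - y_te) ** 2))

simple = holdout_mse(1)     # matches the true process
complex_ = holdout_mse(15)  # far more flexible than the data supports
# On data like this the simple model usually generalizes at least as well,
# even though the degree-15 fit hugs the training points more closely.
```

If the thorough comparison lands on the degree-1 model, that is itself evidence that the dominant signal is a simple one, which is exactly the diagnostic use of validation described in point 4.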
Mike Barlow
Today, you are much less likely to face a scenario in which you cannot query data and get a response back in a brief period of time. Analytical processes that used to require months, days, or hours have been reduced to minutes, seconds, and fractions of seconds. But shorter processing times have led to higher expectations. Two years ago, many data analysts thought that generating a result from a query in less than 40 minutes was nothing short of miraculous. Today, they expect to see results in under a minute. That’s practically the speed of thought – you think of a query, you get a result, and you begin your experiment. “It’s about moving with greater speed toward previously unknown questions, defining new insights, and reducing the time between when an event happens somewhere in the world and someone responds or reacts to that event,” says Erickson. A rapidly emerging universe of newer technologies has dramatically reduced data processing cycle time, making it possible to explore and experiment with data in ways that would not have been practical or even possible a few years ago. Despite the availability of new tools and systems for handling massive amounts of data at incredible speeds, however, the real promise of advanced data analytics lies beyond the realm of pure technology. “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse,” says Michael Minelli, co-author of Big Data, Big Analytics. “It’s about the ability to make better decisions and take meaningful actions at the right time. It’s about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing on a checkout line, or placing an ad on a website while someone is reading a specific article.
It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place.” For some, real-time big data analytics (RTBDA) is a ticket to improved sales, higher profits and lower marketing costs. To others, it signals the dawn of a new era in which machines begin to think and respond more like humans.
Istvan Hajnal
(February 23, 2015)
There are a few trends in the Big Data and Data Science world that can be of interest to market researchers:
• Visualization. There is a lot of interest in the Big Data and Data Science world for everything that has to do with Visualization. I’ll admit that sometimes it is Visualize to Impress rather than to Inform, but when it comes to informing clearly, communicating in a simple and understandable way, storytelling, and so on, we market researchers have a head start.
• Natural Language Processing. One of the 4 V’s of Big Data stands for Variety. Very often this refers to unstructured data, which sometimes refers to free text. Big Data and Data Science folks, for instance, start to analyze text that is entered in the free fields of production systems. This problem is not dissimilar to what we do when we analyse open questions. Again market research has an opportunity to play a role here. By the way, it goes beyond sentiment analysis. Techniques that I’ve seen successfully used in the Big Data / Data Science world are topic generation and document classification. Think about analysing customer complaints, for instance.
• Deep Learning. Deep learning risks becoming the next fad, largely because of the name Deep. But deep here does not refer to profound, but rather to the fact that you have multiple hidden layers in a neural network. And a neural network is basically a logistic regression (OK, I simplify a bit here). So absolutely no magic here, but absolutely great results. Deep learning is a machine learning technique that tries to model high-level abstractions by using so-called learning representations of data, where data is transformed to a representation of that data that is easier to use with other Machine Learning techniques. A typical example is a picture that consists of pixels. These pixels can be represented by more abstract elements such as edges, shapes, and so on. These edges and shapes can in turn be further represented by simple objects, and so on. In the end, this example leads to systems that are able to describe pictures in broad terms that are nonetheless useful for practical purposes, especially when processing by humans is not an option. How can this be applied in Market Research? Already today (shallow) neural networks are used in Market Research. One research company I know uses neural networks to classify products sold in stores in broad buckets such as petfood, clothing, and so on, based on the free field descriptions that come with the barcode data that the stores deliver.
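The "multiple hidden layers on top of a logistic regression" simplification can be sketched as a forward pass; the weights below are random and purely illustrative, standing in for what training would learn:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# A hypothetical 4-pixel "image" as the raw input representation.
x = np.array([0.2, 0.9, 0.1, 0.8])

# Two hidden layers re-represent the input (pixels -> "edges" -> "shapes");
# in a real network these weights are learned, not random.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
h1 = sigmoid(W1 @ x + b1)   # first learned representation
h2 = sigmoid(W2 @ h1 + b2)  # more abstract representation

# The output layer is exactly a logistic regression, only applied to the
# final learned representation rather than to the raw pixels.
w_out, b_out = rng.normal(size=2), 0.0
p = sigmoid(w_out @ h2 + b_out)  # probability of the positive class
```

Strip out the two hidden layers and what remains is the familiar logistic regression on raw inputs, which is the point of the simplification above.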
Alistair Croll, Benjamin Yoskovitz
What makes a good metric?

Here are some rules of thumb for what makes a good metric: a number that will drive the changes you’re looking for.

A good metric is comparative.

Being able to compare a metric to other time periods, groups of users, or competitors helps you understand which way things are moving. “Increased conversion from last week” is more meaningful than “2% conversion”.

A good metric is understandable.

If people can’t remember it and discuss it, it’s much harder to turn a change in the data into a change in the culture.

A good metric is a ratio or a rate.

Accountants and financial analysts have several ratios they look at to understand, at a glance, the fundamental health of a company. You need some, too.

There are several reasons ratios tend to be the best metrics:

1 Ratios are easier to act on. Think about driving a car. Distance travelled is informational. But speed (distance per hour) is something you can act on, because it tells you about your current state, and whether you need to go faster or slower to get to your destination on time.

2 Ratios are inherently comparative. If you compare a daily metric to the same metric over a month, you’ll see whether you’re looking at a sudden spike or a long-term trend. In a car, speed is one metric, but speed right now over average speed this hour shows you a lot about whether you’re accelerating or slowing down.

3 Ratios are also good for comparing factors that are somehow opposed, or for which there’s an inherent tension. In a car, this might be distance covered divided by traffic tickets. The faster you drive, the more distance you cover, but the more tickets you get. This ratio might suggest whether or not you should be breaking the speed limit.

A good metric changes the way you behave.

This is by far the most important criterion for a metric: what will you do differently based on changes in the metric?

1 “Accounting” metrics like daily sales revenue, when entered into your spreadsheet, need to make your predictions more accurate. These metrics form the basis of Lean Startup’s innovation accounting, showing you how close you are to an ideal model and whether your actual results are converging on your business plan.

2 “Experimental” metrics, like the results of a test, help you to optimize the product, pricing, or market. Changes in these metrics will significantly change your behavior. Agree on what that change will be before you collect the data: if the pink website generates more revenue than the alternative, you’re going pink; if more than half your respondents say they won’t pay for a feature, don’t build it; if your curated MVP doesn’t increase order size by 30%, try something else. Drawing a line in the sand is a great way to enforce a disciplined approach. A good metric changes the way you behave precisely because it’s aligned to your goals of keeping users, encouraging word of mouth, acquiring customers efficiently, or generating revenue.

If you want to choose the right metrics, you need to keep five things in mind:

1 Qualitative versus quantitative metrics

Qualitative metrics are unstructured, anecdotal, revealing, and hard to aggregate; quantitative metrics involve numbers and statistics, and provide hard numbers but less insight.

2 Vanity versus actionable metrics

Vanity metrics might make you feel good, but they don’t change how you act. Actionable metrics change your behavior by helping you pick a course of action.

3 Exploratory versus reporting metrics

Exploratory metrics are speculative and try to find unknown insights to give you the upper hand, while reporting metrics keep you abreast of normal, managerial, day-to-day operations.

4 Leading versus lagging metrics

Leading metrics give you a predictive understanding of the future; lagging metrics explain the past. Leading metrics are better because you still have time to act on them: the horse hasn’t left the barn yet.

5 Correlated versus causal metrics

If two metrics change together, they’re correlated, but if one metric causes another metric to change, they’re causal. If you find a causal relationship between something you want (like revenue) and something you can control (like which ad you show), then you can change the future.

Analysts look at specific metrics that drive the business, called key performance indicators (KPIs). Every industry has KPIs: if you’re a restaurant owner, it’s the number of covers (tables) in a night; if you’re an investor, it’s the return on an investment; if you’re a media website, it’s ad clicks; and so on.
John Mount
(April 19, 2013)
1. The nature of statistics
Statistics is the original computing with data. It is the field that deals with data with the most portability (it isn’t dependent on one type of physical model) and rigor. Statistics can be a pessimal field: statisticians are the masters of anticipating what can go wrong with experiments and what fallacies can be drawn from naive uses of data. Statistics has enough techniques to solve just about any problem, but it also has an inherent conservatism to it.
I often say the best source of good statistical work is bad experiments. If all experiments were well conducted, we wouldn’t need a lot of statistics. However, we live in the real world; most experiments have significant shortcomings and statistics is incredibly valuable.
Another aspect of statistics is it is the only field that really emphasizes the risks of small data. There are many other potential data problems statistics describes well (like Simpson’s paradox), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets. This is actually very important: datasets that are expensive to produce (such as drug trials) are necessarily small.
It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to GIGO). And in some cases big data is promoted as valuable only because it is the cheapest to produce. Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because it is a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross-section of search engine users as to their actual intent).
If your business is directly producing truly valuable data (not just producing useful proxy data) you likely have small data issues. If you have any hint of a small data issue, you want to consult with a good statistician.
2. The nature of machine learning
In some sense machine learning rushes where statisticians fear to tread. Machine learning does have some concept of small data issues (such as knowing about over-fitting), but it is an essentially optimistic field.
The goal of machine learning is to create a predictive model that is indistinguishable from a correct model. This is an operational attitude that tends to offend statisticians who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).
My opinion is that the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example: Bennett, K. P., & Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265-1281). Good machine learning papers use good optimization techniques and bad machine learning papers (most of them, in fact) use bad, out-of-date, ad-hoc optimization techniques.
3. The nature of data mining
Data mining is a term that was quite hyped and now somewhat derided. One of the reasons more people use the term “data science” nowadays is they are loath to say “data mining” (though in my opinion the two activities have different goals).
The goal of data mining is to find relations in data, not to necessarily make predictions or come up with explanations. Data mining is often what I call “an x’s only enterprise” (meaning you have many driver or “independent” variables but no pre-ordained outcome or “dependent” variables) and some of the typical goals are clustering, outlier detection and characterization.
There is a sense that when it was called exploratory statistics it was considered boring, but when it was called data mining it was considered sexy. Actual exploratory statistics (as defined by Tukey) is exciting and always an important “get your hands into the data” step of any predictive analytics project.
4. The nature of informatics
Informatics and in particular bioinformatics are very hot terms. A lot of good data scientists (a term I will explain later) come from the bioinformatics field.
Once we separate out the portions of bioinformatics that are in fact statistics and the ones that are in fact biology, we are left with data infrastructure and matching algorithms. We have the creation and management of data stores and databases, and the design of efficient matching and query algorithms. This isn’t meant to be a left-handed compliment: algorithms are a first love of mine, and some of the matching algorithms bioinformaticians use (like online suffix trees) are quite brilliant.
5. The nature of big data
Big data is a white-hot topic. The thing to remember is: it is just the infrastructure (MapReduce, Hadoop, noSQL and so on). It is the platform you perform modeling (or usually just report generation) on top of.
6. The nature of predictive analytics
Wikipedia defines predictive analytics as the variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. It is a set of goals and techniques emphasizing making models. It is very close to what is also meant by data science.
I don’t tend to use the term predictive analytics because I come from a probability, simulation, algorithms and machine learning background and not from an analytics background. To my ear analytics is more associated with visualization, reporting and summarization than with modeling. I also try to use the term modeling over prediction (when I remember) as prediction often in non-technical English implies something like forecasting into the future (which is but one modeling task).
7. The nature of data science
Wikipedia defines data science as a field that incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing, with the goal of extracting meaning from data and creating data products.
Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production.
8. Conclusion
Machine learning and statistics may be the stars, but data science is the whole show.
Kalyanaraman K
Data analysis, one of the main requirements for research, has transformed into ‘Data Science’, which is considered one of the most important concepts in the current internet-enabled scenario. Maybe it is for a different purpose, where the requirement for manpower in data analysis related work is huge. Business decisions started moving towards data-aided decisions, and the availability of data and information infrastructure has created a situation where statistics is termed the sexiest job of the new century (Davenport, T.H. and Patil, D.J., Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review). An attempt is made to provide a concise account of the evolution of the concept ‘Data Science’ over the last few years.
With an expanding scenario in computing facilities as well as research efforts in various disciplines, the role of Statistics in data analysis, whether for experiment-based data, primary sample data or secondary data, gained enormous importance. The inclusion of ‘Management’ as a separate discipline of study started creating professionals in management, and data analysis found a new place as more and more business decisions were taken depending upon statistical evidence. With better computing facilities, data analysis got enlarged into Management Information Systems (MIS), which resulted in the emergence of Decision Support Systems (DSS). The role of statistical methods in DSS is primary in nature. These concepts were converted into another concept called Business Intelligence (BI), where analytical solutions leading to knowledge of the existing state of business systems were the main attraction. When the knowledge so obtained was used in deciding future courses of action in a business system, a broader concept was designed, i.e., Business Analytics (BA), which included BI. Currently, BA is an important area where professional Statisticians are also in demand.
BA needed tools that included Mathematics, Statistics, Commerce, Economics, Data Mining, Data Visualisation and others, and large business operations started outsourcing their requirements. Every day the requirement for manpower in Data Analysis increases, justifying the content of the article by Davenport and Patil referred to earlier. During the last few years data storage developed manifold with better Data Warehousing methods. This, in turn, resulted in a situation where large quantities of data got stored and methods were required to handle them. This gave rise to the term ‘Big Data’. The potential is enormous, while trained manpower is scarce for obvious reasons.
This situation was further supported by more powerful computer infrastructure, both for computing and communication, which led computers to deliver solutions without human intervention, resulting in the concept of Machine Learning as part of data analysis. This is an area where Algorithms and high-end Probability Models, including Bayesian Principles, are used in various ways.
A separate discipline emerged, called Data Science, that includes Mathematics, Statistics, Probability, Business Analytics, Predictive Analytics, Data Acquisition, Data Warehousing, Data Communications, Programming facilities and Machine Learning, among others. Concise accounts of these concepts are provided. Initially, these efforts started with John W. Tukey publishing the paper ‘The Future of Data Analysis’ in 1962. With more additions to stored-program concepts in electronic computers, Data Analysis started to take on a different dimension. Tukey was the one who coined the term ‘bit’, which is used by Shannon in ‘A Mathematical Theory of Communication’. With the publication of ‘Exploratory Data Analysis’ by J. W. Tukey in 1977, statistical methods started moving from academic requirements to the exploration of data organized under Data Warehousing principles. This may be taken as the starting point in the evolution of Data Analysis into Data Science. With the publication of ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics’ by William S. Cleveland, the new discipline was placed in the computer framework, and the relevant methods, like data mining, were used in exploratory data analysis. Currently, this seems to be one of the attractions for both academics and users.

Early Stages in Statistical Data Analysis:

The use of statistics for data analysis seems to have its origin in the work of four British statisticians: Francis Galton, Karl Pearson, William Sealy Gosset and Ronald Fisher. Sir Francis Galton, a half-cousin of Charles Darwin, was the exponent of ‘eugenics’, a term he coined for the study of the betterment of the human race by selective breeding. He made some important contributions to Mathematics and Statistics. He developed the ideas of correlation and regression, taking the clue from a geologist named Baver. His partial work on these concepts was put on a mathematical footing by his junior Karl Pearson, who is considered one of those responsible for Mathematical Statistics. Next in line was W. S. Gosset, who had expertise in Mathematics and Chemistry. The Guinness Brewery in Ireland sought his help to bring more consistency to the quality of the beer they produced. However, brewing is a time-consuming process, and Gosset had to settle for small samples to arrive at decisions on quality. Thus, he developed small-sample techniques in statistics and published his findings under the pen name ‘Student’, as his employers were hostile to such publications. Then came Ronald Fisher, who studied Biology and Genetics apart from Mathematics, and who had to find employment as he was relatively poor. Thus, Rothamsted Experimental Station gained his expertise, and he developed the principles of experimental design and one of the most used statistical tools, the Analysis of Variance. He went on to publish his major contribution, ‘Statistical Methods for Research Workers’, in 1925. Thus, the initial phase in the development of Statistics had data analysis as its core element.

Impact of Computers on Statistics:

Formal education in Statistics in the early days required the student to be trained in the practical application of statistical methods. Initially paper and pencil were the only computing facility, perhaps with the assistance of Clarke’s Tables. Subsequently, manual calculators, electrical calculators, the slide rule, nomography and other tools were taught as part of training in Statistics. Electronic calculators came only during the early seventies. With the development of the ‘Stored Program Concept’, computers started to wield influence on the use of statistical methods, mainly for academic research in India. One needed to know some programming language to make the computer solve statistical applications. This situation continued until the emergence of personal computers and statistical software. Now a host of programming environments, along with a large collection of software, has made statistical applications very user friendly. The current expectation for programming environments for Data Analysis under a Data Science principle is the use of Python and R.

The transition:

With computer infrastructure expanding and with network principles making communication easier, data analysis started to help business decisions with evidence gained from data about their past behaviour. Tukey’s publications ‘The Future of Data Analysis’ and ‘Exploratory Data Analysis’ provided the direction for non-statistical professionals in business to go forward in data-aided decision-making efforts. ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics’ by William S. Cleveland provided the framework for putting the decision-making process on the computer network, surpassing all existing principles like MIS, DSS and others. The addition of the concept called ‘Big Data’ has transformed the professional side of data handling, whether by mathematicians, statisticians, professional managers or computer experts, into an exciting career.
Some of these principles are expanded below to provide insight to education managers into what is required and what is not available. Features of Business Analytics, Big Data and Data Science are explained to locate the place for personnel trained in both the theoretical and practical aspects of Mathematics, Statistics and Computer Science.

Business Analytics:

Definition: Business Analytics refers to the methodology employed by an organization to enhance its business and make optimized decisions by the use of statistical techniques i.e. collecting data, assembling and analyzing it to better their products, services, marketing etc.
Terms which are associated with Business Analytics are MIS, ERP, DSS, BI, Data Mining, Data Warehouse and others. The term “business analytics” is loosely used by some to describe a set of different procedures, including data mining and other analytic methods. However, the field includes more concepts and procedures than these.
Business Analytics can be considered in three different perspectives; historical data based analytics, experiment based analytics and real-time analytics.
Historical data based analytics use accrued data in an organized form to identify patterns in important concepts related to the problem, like customer behavior or employee satisfaction in marketing problems, and try to relate them to the functions required, like customer selection, loyalty and service.

Experiment based analytics relate to the use of data analytics to measure the overall impact of possible interventions to boost revenue in a continuous experimental framework. For example, to understand the drivers of financial performance, experimental situations are created to understand the impact of different types of interventions in the existing process. As an example, Capital One can be observed conducting more than 30,000 experiments in a year with different interest rates, incentives and other variables, in a directed effort to identify the variables that are responsible for maximizing the number of good customers. Such methods are also called the ‘Enterprise Approach’.
Real-time analytics is concerned with high-priority requirements in business operations. It is a continuous activity that requires the capability to consume a continuous flow of data and to take appropriate decisions to optimize operations. The following examples illustrate the issue.
* A logistics operator may need to monitor transport network availability in relation to the volume of operations across the country. With a real-time view into business events, operators are better informed and can make alternate plans to keep deliveries on time and customers satisfied.
* A manufacturer may be required to track the availability of power in relation to electricity consumption at its plants, using tools that provide notifications. With real-time visibility into power availability and demand, it can ensure a cost-effective production schedule.
BA activities include:
* Exploring data to find new patterns and relationships (data mining)
* Explaining why a certain result occurred (statistical analysis, quantitative analysis)
* Experimenting to test previous decisions (multivariate studies)
* Forecasting future results (predictive modeling, predictive analytics)
Answers to the questions arising out of these activities are provided using tools based on principles and concepts of Reporting, Monitoring, Alerting, Dashboards, Scorecards, OLAP and ad hoc queries, supported by Statistical Analysis, Data Mining, Predictive Modeling, Multivariate methods and others.
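The “forecasting future results” activity listed above can be sketched in a few lines. The following is a minimal, illustrative example, with invented figures: fit a least-squares trend line to hypothetical monthly sales and extrapolate one month ahead.

```python
# Minimal illustration of predictive analytics: fit a least-squares
# trend line to hypothetical monthly sales and forecast the next month.
# All numbers are invented for the example.

def fit_trend(y):
    """Return (intercept, slope) of the least-squares line y = a + b*t."""
    n = len(y)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    b = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
         / sum((ti - t_bar) ** 2 for ti in t))
    a = y_bar - b * t_bar
    return a, b

sales = [100, 104, 109, 113, 118, 122]   # six months of hypothetical sales
a, b = fit_trend(sales)
forecast = a + b * len(sales)            # one-step-ahead forecast
print(round(forecast, 1))
```

Real forecasting tools add seasonality, uncertainty bands and model selection, but the core of “predicting future results from past data” is exactly this kind of fitted model applied one step beyond the observed data.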

Big Data:

Initially, academic research used to suffer from insufficient data, whether due to the cost of experiments or other difficulties in acquiring data. The problem now seems to be reversed, with enormous data availability along with cheap storage and user-friendly data analysis tools. Conventional data mining tools have had to be reinvented to accommodate the enormity of the data, and the technical issues of computing with such large quantities of data have given rise to new software. Hadoop, with concepts like MapReduce, has come to assist statistical computing; RapidMiner is another facility for mining enormous data. In short, Big Data is characterized by three Vs: Volume, Velocity and Variety. As an example, one can look at the kind of data added to a telecom system every time a mobile call is made.
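The MapReduce concept mentioned above can be illustrated with a toy sketch of the programming model (not Hadoop itself): map each record to key-value pairs, group the pairs by key, then reduce each group. The classic example is counting words across documents.

```python
# A toy illustration of the MapReduce idea behind Hadoop: map each
# record to (key, value) pairs, shuffle (group) by key, then reduce
# each group. Here: word counting across two tiny "documents".

from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each group's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data is data"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real Hadoop cluster the map and reduce phases run in parallel on many machines and the shuffle moves data across the network; the logic, however, is exactly this.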
Data Science emerged as a discipline to cover all the requirements of BA and Big Data in an automated environment, with facilities that include Data Acquisition, Data Warehousing, Data Communication, Mathematics, Statistics, Machine Learning and others. The role of Statistics and Mathematics needs no elaboration, except to note that high-end probability and statistical techniques such as Bayesian Decision Processes, Markov Decision Processes and the like are used within various algorithms in which Mathematics plays a primary role. A brief description of what is aimed at is provided here.

Machine Learning:

This is a branch of Artificial Intelligence concerned with the design of systems that can learn from data. Arthur Samuel (1959) defined machine learning as ‘a field of study that gives computers the ability to learn without being explicitly programmed’. In this activity, Data Mining is mainly used to obtain knowledge from data and to use such knowledge for prediction. To achieve this goal, machine learning algorithms are used. Depending upon the requirement, such algorithms may be classified as Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, Transduction, Reinforcement Learning, Learning to Learn, Developmental Learning and the like. These are implemented using Computational Learning Theory, a branch of Computer Science. There are many similarities between machine learning theory and statistical inference. The machine learning algorithms classified above may use any of the available approaches in an appropriate manner; Decision Trees, Association Rules, Artificial Neural Networks, Inductive Logic Programming, Cluster Analysis, Bayesian Networks, and similarity- and dissimilarity-based methods are some of them. It is not difficult to see the complexity of the mathematical and statistical methods involved. Unless a person has a complete understanding of the theory, such activities cannot be put into practice.
Learning is all about generalizing regularities in observed data to as-yet-unobserved data. Good generalization depends upon how well one balances prior information with information from the data; one can observe something ‘Bayesian’ here. The required computational activity involves good classification of data. The principles of Nearest Neighbor classification and Naïve Bayes classification find a dominant place in such generalizations. These principles are typically used with discrete data; however, methods based on the Logistic Regression classifier and its relative, the Perceptron, are also used in classification.
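The Nearest Neighbors principle mentioned above can be illustrated with a minimal sketch, using invented data points: a new point is labeled with the class of its closest training example.

```python
# A minimal nearest-neighbor classifier: label a new point by the
# class of its closest training example. Data points are illustrative.

import math

def nearest_neighbor(train, point):
    """train: list of ((x, y), label); return the label of the closest example."""
    closest = min(train, key=lambda ex: math.dist(ex[0], point))
    return closest[1]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(nearest_neighbor(train, (1.1, 0.9)))  # near the "A" cluster
print(nearest_neighbor(train, (5.2, 5.1)))  # near the "B" cluster
```

Practical versions vote over the k nearest neighbors rather than just one, and scale the distance computation with spatial index structures, but the generalization principle is the same.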
Kernel methods are what is used at the implementation level. (The ‘kernel’ here is not the operating-system kernel that manages input/output requests for the CPU; in machine learning, a kernel is a function that measures the similarity between pairs of data points, allowing linear algorithms to operate implicitly in high-dimensional feature spaces.) Algorithms for statistical methods need to be defined appropriately in terms of kernels, leading to Kernel Regression, Kernel Clustering, Kernel PCA, Kernel Discriminant Functions, Kernel Canonical Correlation Analysis and others.
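As a sketch of kernel regression (here the Nadaraya-Watson estimator, with invented data): the prediction at a point is a weighted average of the observed responses, with weights given by a Gaussian kernel of the distance to each observed input.

```python
# Nadaraya-Watson kernel regression: predict at x as a kernel-weighted
# average of the observed y values. The bandwidth controls smoothing.

import math

def kernel_regression(xs, ys, x, bandwidth=1.0):
    # Gaussian kernel weight for each observed x.
    weights = [math.exp(-((x - xi) / bandwidth) ** 2 / 2) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Noisy samples of an (assumed) underlying linear trend, roughly y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]
print(round(kernel_regression(xs, ys, 2.0, bandwidth=0.5), 2))
```

A small bandwidth follows the data closely; a large one smooths toward the global mean. The same kernel-weighting idea, applied to inner products rather than distances, underlies Kernel PCA and the other kernelized methods listed above.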

Categories of work related to Data Science:

By way of summarizing the requirements that characterize data scientists, a classification is provided by Vincent Granville, a forerunner in this field. Categories of data scientists include:
1. ‘Statisticians’, who may develop new theories related to Big Data, who are experts in Data Modeling, Sampling, Experimental Design, Clustering, Data Reduction, Prediction and others.
2. ‘Mathematicians’ who are experts in astronomy, geometry, operations research, optimization and the like as well as Algorithms.
3. ‘Data Engineers’, who are proficient with Hadoop, databases, file-system architecture, data flows and others.
4. ‘Machine Learning Experts’ who can handle complex computer systems/algorithms.
5. ‘Professional Managers / Business experts’ who are good at ROI optimization and related tools such as Dashboards, Design of Performance Metrics.
6. ‘Software Engineers’, who produce code for computer implementation.
7. ‘Visualization Experts’, who have the insight to present the knowledge generated in visual form.
8. ‘Spatial Data Experts’, who generate graphs and graph databases.
Educational managers in the fields of Mathematics, Statistics, Computer Science, Visual Media and Management should sit together and organize training programs to create employable data scientists; they are in great demand.
Vincent Granville
Data Science

First, let’s start by describing data science, the new discipline.

Job titles include data scientist, chief scientist, senior analyst, director of analytics and many more. It covers all industries and fields, but especially digital analytics, search technology, marketing, fraud detection, astronomy, energy, healthcare, social networks, finance, forensics, security (NSA), mobile, telecommunications, and weather forecasting.

Projects include taxonomy creation (text mining, big data), clustering applied to big data sets, recommendation engines, simulations, rule systems for statistical scoring engines, root cause analysis, automated bidding, forensics, exo-planet detection, and early detection of terrorist activity or pandemics. An important component of data science is automation, machine-to-machine communication, and algorithms running non-stop in production mode (sometimes in real time), for instance to detect fraud, predict the weather or predict home prices for each home (Zillow).

An example of a data science project is the creation of the fastest-growing data science Twitter profile, for computational marketing. It leverages big data, and is part of a viral marketing / growth hacking strategy that also includes automated, high-quality, relevant, syndicated content generation (in short, digital publishing version 3.0).

Unlike most other analytic professionals, data scientists are assumed to have great business acumen and domain expertise – one of the reasons why they tend to succeed as entrepreneurs. There are many types of data scientists, as data science is a broad discipline. Many senior data scientists master their art/craftsmanship and possess the whole spectrum of skills and knowledge; they really are the unicorns that recruiters can’t find. Hiring managers and uninformed executives favor narrow technical skills over combined deep, broad and specialized business domain expertise – a byproduct of the current education system, which favors discipline silos, while true data science is a silo destructor. Unicorn data scientists (a misnomer, because they are not rare – some are famous VCs) usually work as consultants or as executives. Junior data scientists tend to be more specialized in one aspect of data science, possess more hot technical skills (Hadoop, Pig, Cassandra) and will have no problem finding a job if they received appropriate training and/or have work experience with companies such as Facebook, Google, eBay, Apple, Intel, Twitter, Amazon, Zillow, etc. Data science projects for potential candidates can be found here.

Data science overlaps with

Computer science: computational complexity, Internet topology and graph theory, distributed architectures such as Hadoop, data plumbing (optimization of data flows and in-memory analytics), data compression, computer programming (Python, Perl, R) and the processing of sensor and streaming data (to design self-driving cars)

Statistics: design of experiments including multivariate testing, cross-validation, stochastic processes, sampling, model-free confidence intervals, but not p-values nor obscure tests of hypotheses that are subject to the curse of big data

Machine learning and data mining: data science indeed fully encompasses these two domains.

Operations research: data science encompasses most of operations research, as well as any techniques aimed at optimizing decisions based on analyzing data.

Business intelligence: every BI aspect of designing/creating/identifying great metrics and KPI’s, creating database schemas (be it NoSQL or not), dashboard design and visuals, and data-driven strategies to optimize decisions and ROI, is data science.

Comparison with other analytic disciplines

Machine learning: Very popular computer science discipline, data-intensive, part of data science and closely related to data mining. Machine learning is about designing algorithms (like data mining), but the emphasis is on prototyping algorithms for production mode, and on designing automated systems (bidding algorithms, ad targeting algorithms) that automatically update themselves, constantly train/retrain/update training sets/cross-validate, and refine or discover new rules (fraud detection) on a daily basis. Python is now a popular language for ML development. Core algorithms include clustering and supervised classification.
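The clustering core algorithm named above can be sketched minimally, here as one-dimensional k-means on invented data: alternately assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points.

```python
# Minimal 1-D k-means: alternate between assigning points to the
# nearest centroid and recomputing each centroid as its cluster mean.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its points.
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]   # two obvious groups
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # converges to [1.5, 9.5]
```

Production systems run the same loop on high-dimensional vectors, with smarter initialization (k-means++) and convergence checks, but the two-step iteration is unchanged.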

Data mining: This discipline is about designing algorithms to extract insights from rather large and potentially unstructured data (text mining), sometimes called nugget discovery – for instance, unearthing a massive botnet after looking at 50 million rows of data. Techniques include pattern recognition, feature selection, clustering and supervised classification, and encompass a few statistical techniques (though without the p-values or confidence intervals attached to most statistical methods being used). Instead, the emphasis is on robust, data-driven, scalable techniques, without much interest in discovering causes or in interpretability. Data mining thus has some intersection with statistics, and it is a subset of data science. Data mining is applied computer engineering rather than a mathematical science. Data miners use open source software and products such as RapidMiner.

Predictive modeling: Not a discipline per se. Predictive modeling projects occur in all industries, across all disciplines. Predictive modeling applications aim at predicting the future based on past data, usually but not always by statistical modeling. Predictions often come with confidence intervals. The roots of predictive modeling are in statistical science.
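A minimal sketch of why predictions “often come with confidence intervals”, with invented numbers and assuming roughly normal errors: a naive model predicts the historical mean, and the spread of the observations yields an approximate 95% interval around that prediction.

```python
# Sketch: a naive forecast (the historical mean) plus an approximate
# 95% interval from the sample spread, assuming roughly normal errors.
# Numbers are invented for illustration.

import statistics

history = [20, 22, 19, 21, 23, 20, 21]
prediction = statistics.mean(history)
spread = statistics.stdev(history)          # sample standard deviation
interval = (prediction - 1.96 * spread, prediction + 1.96 * spread)
print(round(prediction, 2), [round(v, 2) for v in interval])
```

Real predictive models widen the interval to account for parameter uncertainty as well (and use a t multiplier for small samples), but the principle is the same: report not just a point forecast but a plausible range around it.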

Statistics. Currently, statistics is mostly about surveys (typically performed with SPSS software), theoretical academic research, bank and insurance analytics (marketing mix optimization, cross-selling, fraud detection, usually with SAS and R), statistical programming, social sciences, global warming research (and space weather modeling), economic research, clinical trials (pharmaceutical industry), medical statistics, epidemiology, biostatistics and government statistics. Agencies hiring statisticians include the Census Bureau, IRS, CDC, BLS, SEC, and EPA (environmental/spatial statistics). Jobs requiring a security clearance are well paid and relatively secure, but the well-paid jobs in the pharmaceutical industry (the golden goose for statisticians) are threatened by a number of factors – outsourcing, company mergers, and pressure to make healthcare affordable. Because of the big influence of the conservative, risk-averse pharmaceutical industry, statistics has become a narrow field, not adapting to new data and not innovating, losing ground to data science, industrial statistics, operations research, data mining and machine learning – where the same clustering, cross-validation and statistical training techniques are used, albeit in a more automated way and on bigger data. Many professionals who were called statisticians 10 years ago have seen their job title changed to data scientist or analyst in the last few years.

Industrial statistics. Statistics frequently performed by non-statisticians (engineers with good statistical training), working on engineering projects such as yield optimization or load balancing (system analysts). They use very applied statistics, and their framework is closer to six sigma, quality control and operations research, than to traditional statistics. Also found in oil and manufacturing industries. Techniques used include time series, ANOVA, experimental design, survival analysis, signal processing (filtering, noise removal, deconvolution), spatial models, risk and reliability models.

Mathematical optimization. Solves business optimization problems with techniques such as the simplex algorithm, Fourier transforms (signal processing), differential equations, and software such as Matlab. These professionals are found in big companies such as IBM, in research labs, at the NSA and in the finance industry (which sometimes recruits physics or engineering graduates). They sometimes solve the exact same problems as statisticians do, using the exact same techniques, though under different names. Mathematicians use least-squares optimization for interpolation or extrapolation; statisticians use linear regression for prediction and model fitting. The two concepts are identical and rely on the exact same mathematical machinery: they are just two names describing the same thing. Mathematical optimization is, however, closer to operations research than to statistics, and the choice of hiring a mathematician rather than another practitioner (data scientist) is often dictated by historical reasons, especially in organizations such as the NSA or IBM.
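The equivalence claimed above can be checked directly on invented data: solving the least-squares normal equations for y = a + b*x recovers exactly the textbook regression slope and intercept.

```python
# Minimizing the sum of squared errors for y = a + b*x yields the
# normal equations; their closed-form solution is the textbook
# regression slope and intercept. Data are invented and noise-free.

def least_squares(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    a = (sy - b * sx) / n                           # intercept
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x
a, b = least_squares(xs, ys)
print(a, b)
```

Whether one calls this “least-squares optimization” or “linear regression”, the computation, and hence the answer, is identical.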

Actuarial sciences. Just a subset of statistics focusing on insurance (car, health, etc.) using survival models: predicting when you will die, and what your health expenditures will be based on your health status (smoker, gender, previous diseases), to determine your insurance premiums. It also predicts extreme floods and weather events to determine premiums. These latter models have recently proven notoriously erroneous and have resulted in far bigger payouts than expected. For some reason, this is a very vibrant, secretive community of statisticians who no longer call themselves statisticians (the job title is actuary). They have seen their average salary increase nicely over time: access to the profession is restricted and regulated just as for lawyers, for no reason other than protectionism, to boost salaries and reduce the number of qualified applicants for job openings. Actuarial science is indeed data science (a sub-domain).
HPC. High performance computing; not a discipline per se, but it should be of concern to data scientists, big data practitioners, computer scientists and mathematicians, as it can redefine the computing paradigms in these fields. If quantum computers ever become successful, they will totally change the way algorithms are designed and implemented. HPC should not be confused with Hadoop and Map-Reduce: HPC is hardware-related, Hadoop software-related (though heavily reliant on Internet bandwidth and on server configuration and proximity).

Operations research. Abbreviated as OR. OR separated from statistics a while back (some 20 years ago), but the two are like twin brothers, and their respective organizations (INFORMS and ASA) partner together. OR is about decision science and optimizing traditional business projects: inventory management, supply chain, pricing. It heavily uses Markov chain models, Monte Carlo simulations, queuing theory and graph theory, and software such as AIMS, Matlab or Informatica. Big, traditional old companies use OR; new and small ones (start-ups) use data science to handle pricing, inventory management or supply chain problems. Many operations research analysts are becoming data scientists, as there are far more innovation and growth prospects in data science than in OR; also, OR problems can be solved by data science. OR has a significant overlap with six sigma (see below), also solves econometric problems, and has many practitioners/applications in the army and defense sectors.

Six sigma. It’s more a way of thinking (a business philosophy, if not a cult) than a discipline, and was heavily promoted by Motorola and GE a few decades ago. It is used for quality control and to optimize engineering processes (see the entry on industrial statistics in this article) by large, traditional companies. They have a LinkedIn group with 270,000 members, twice as large as any other analytic LinkedIn group, including our data science group. Their motto is simple: focus your efforts on the 20% of your time that yields 80% of the value. Applied, simple statistics are used (simple stuff works most of the time, I agree), and the idea is to eliminate sources of variance in business processes, to make them more predictable and improve quality. Many people consider six sigma to be old stuff that will disappear. Perhaps, but the fundamental concepts are solid and will remain; they are also fundamental concepts for all data scientists. You could say that six sigma is a much simpler, if not simplistic, version of operations research (see the entry above), where statistical modeling is kept to a minimum. The risk: when non-qualified people use non-robust, black-box statistical tools to solve problems, the result can be disastrous. In some ways, six sigma is a discipline better suited to business analysts (see the business intelligence entry below) than to serious statisticians.

Quant. Quants are just data scientists working on Wall Street on problems such as high-frequency trading or stock market arbitrage. They use C++ and Matlab, come from prestigious universities, and earn big bucks – but lose their job right away when ROI goes south too quickly. They can also be employed in energy trading. Many who were fired during the Great Recession now work on problems such as click arbitrage, ad optimization and keyword bidding. Quants have backgrounds in statistics (a few of them), mathematical optimization and industrial statistics.

Artificial intelligence. It’s coming back. The intersection with data science is pattern recognition (image analysis) and the design of automated (some would say intelligent) systems to perform various tasks, in machine-to-machine communication mode, such as identifying the right keywords (and right bid) on Google AdWords (pay-per-click campaigns involving millions of keywords per day). I also consider smart search (creating a search engine returning the results that you expect and being much broader than Google) one of the greatest problems in data science, arguably also an AI and machine learning problem.

Computer science. Data science has some overlap with computer science: Hadoop and Map-Reduce implementations, algorithmic and computational complexity to design fast, scalable algorithms, data plumbing, and problems such as Internet topology mapping, random number generation, encryption, data compression, and steganography (though these problems overlap with statistical science and mathematical optimization as well).

Econometrics. Why it became separated from statistics is unclear; so many branches have disconnected themselves from statistics as they became less generic and started developing their own ad hoc tools. In short, though, econometrics is heavily statistical in nature, using time series models such as auto-regressive processes. It also overlaps with operations research (itself overlapping with statistics!) and mathematical optimization (the simplex algorithm). Econometricians like ROC and efficiency curves (as do six sigma practitioners; see the corresponding entry in this article). Many do not have a strong statistical background, and Excel is their main or only tool.
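An auto-regressive process of the kind econometricians use can be sketched in a few lines (illustrative only): simulate an AR(1) series and recover its coefficient by regressing each value on its predecessor.

```python
# Simulate an AR(1) series x[t] = phi * x[t-1] + noise, then estimate
# phi by least-squares regression of x[t] on x[t-1] (no intercept).

import random

random.seed(0)                    # reproducible illustration
phi = 0.8                         # true auto-regressive coefficient
x = [0.0]
for _ in range(2000):
    x.append(phi * x[-1] + random.gauss(0, 1))

lagged, current = x[:-1], x[1:]
phi_hat = (sum(l * c for l, c in zip(lagged, current))
           / sum(l * l for l in lagged))
print(round(phi_hat, 2))          # should land close to 0.8
```

This is the simplest member of the ARIMA family; econometric practice adds intercepts, higher lag orders and diagnostic tests, but the estimation idea (regress the series on its own past) is the same.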

Data engineering. Performed by software engineers (developers) or architects (designers) in large organizations (and sometimes by data scientists in tiny companies), this is the applied part of computer science (see the entry in this article): powering systems that allow all sorts of data to be easily processed in-memory or near-memory and to flow nicely to (and between) end-users, including heavy data consumers such as data scientists. A sub-domain currently under attack is data warehousing, as this term is associated with static, siloed, conventional databases, data architectures and data flows, threatened by the rise of NoSQL, NewSQL and graph databases. Transforming these old architectures into new ones (only when needed), or making them compatible with new ones, is a lucrative business.

Business intelligence. Abbreviated as BI. Focuses on dashboard creation, metric selection, producing and scheduling data reports (statistical summaries) sent by email or delivered/presented to executives, competitive intelligence (analyzing third-party data), as well as involvement in database schema design (working with data architects) to collect useful, actionable business data efficiently. The typical job title is business analyst, but some are more involved with marketing, product or finance (forecasting sales and revenue). They typically have an MBA degree. Some have learned advanced statistics such as time series, but most use (and need) only basic stats and light analytics, relying on IT to maintain databases and harvest data. They use tools such as Excel (including cubes and pivot tables, but not advanced analytics), Brio (Oracle browser client), Birt, MicroStrategy or Business Objects (as end-users running queries), though some of these tools are increasingly equipped with better analytic capabilities. Unless they learn how to code, they are competing with polyvalent data scientists who excel in decision science, insight extraction and presentation (visualization), KPI design, business consulting, and ROI/yield/business/process optimization. BI and market research (but not competitive intelligence) are currently experiencing a decline, while AI is experiencing a comeback. This could be cyclical. Part of the decline is due to not adapting to new types of data (e.g. unstructured text) that require engineering or data science techniques to process and extract value from.

Data analysis. This is the new term for business statistics since at least 1995, and it covers a large spectrum of applications including fraud detection, advertising mix modeling, attribution modeling, sales forecasts, cross-selling optimization (retail), user segmentation, churn analysis, computing the lifetime value of a customer and the cost of acquisition, and so on. Except in big companies, data analyst is a junior role; these practitioners have much narrower knowledge and experience than data scientists, and they lack (and don’t need) business vision. They are detail-oriented and report to managers such as data scientists or directors of analytics. In big companies, someone with a job title such as data analyst III might be very senior, yet they are usually specialized and lack the broad knowledge gained by data scientists working in a variety of companies, large and small.

Business analytics. Same as data analysis, but restricted to business problems only. Tends to have a bit more of a financial, marketing or ROI flavor. Popular job titles include data analyst and data scientist, but not business analyst (see the business intelligence entry above, a different domain).

Finally, there are more specialized analytic disciplines that recently emerged: health analytics, computational chemistry and bioinformatics (genome research), for instance.
Jenna Dutcher
(September 3rd, 2014)
What is Big Data?

John Akred (Founder and CTO, Silicon Valley Data Science)

“Big Data” refers to a combination of an approach to informing decision making with analytical insight derived from data, and a set of enabling technologies by which that insight can be derived economically from sometimes very large, diverse sources of data. Advances in sensing technologies, the digitization of commerce and communications, and the advent and growth of social media are a few of the trends which have created the opportunity to use large-scale, fine-grained data to understand systems, behavior and commerce, while innovation in technology makes it economically viable to use that information to inform decisions and improve outcomes.

Philip Ashlock (Chief Architect, Data.gov)

While the use of the term is quite nebulous and is often co-opted for other purposes, I’ve understood “big data” to be about analysis for data that’s really messy or where you don’t know the right questions or queries to make – analysis that can help you find patterns, anomalies, or new structures amidst otherwise chaotic or complex data points. Usually this revolves around datasets with a byte size that seems fairly large relative to our frame of reference using files on a desktop PC (e.g., larger than a terabyte) and many of the tools around big data are to help deal with a large volume of data, but to me the most important concepts of big data don’t actually have much to do with it being “big” in this sense (especially since that’s such a relative term these days). In fact, they can often be applied to smaller datasets as well. Natural language processing and Lucene-based search engines are good examples of big data techniques and tools that are often used with relatively small amounts of data.

Jon Bruner (Editor-at-Large, O’Reilly Media)

Big Data is the result of collecting information at its most granular level – it’s what you get when you instrument a system and keep all of the data that your instrumentation is able to gather.

Reid Bryant (Data Scientist, Brooks Bell)

As computational efficiency continues to increase, “big data” will be less about the actual size of a particular dataset and more about the specific expertise needed to process it. With that in mind, “big data” will ultimately describe any dataset large enough to necessitate high-level programming skill and statistically defensible methodologies in order to transform the data asset into something of value.

Mike Cavaretta (Data Scientist and Manager, Ford Motor Company)

You cannot give me too much data. I see big data as storytelling – whether it is through information graphics or other visual aids that explain it in a way that allows others to understand across sectors. I always push for the full scope of the data over averages and aggregations – and I like to go to the raw data because of the possibilities of things you can do with it.

Drew Conway (Head of Data, Project Florida)

Big Data, which started as a technological innovation in distributed computing, is now a cultural movement by which we continue to discover how humanity interacts with the world – and each other – at large-scale.

Rohan Deuskar (CEO and Co-Founder, Stylitics)

Big data refers to the approach to data of “collect now, sort out later”…meaning you capture and store data on a very large volume of actions and transactions of different types, on a continuous basis, in order to make sense of it later. The low cost of storage and better methods of analysis mean that you generally don’t need to have a specific purpose for the data in mind before you collect it.

Amy Escobar (Data Scientist, 2U, Inc)

[Big Data is] an opportunity to gain a more complex understanding of the relationships between different factors and to uncover previously undetected patterns in data by leveraging advances in the technical aspects of collecting, storing, and retrieving data along with innovative ideas and techniques for manipulating and analyzing data.

Josh Ferguson (Chief Technology Officer, Mode Analytics)

Big data is the broad name given to challenges and opportunities we have as data about every aspect of our lives becomes available. It’s not just about data though; it also includes the people, processes, and analysis that turn data into meaning.

John Foreman (Chief Data Scientist, MailChimp)

I prefer a flexible but functional definition of big data. Big data is when your business wants to use data to solve a problem, answer a question, produce a product, etc., but the standard, simple methods (maybe it’s SQL, maybe it’s k-means, maybe it’s a single server with a cron job) break down on the size of the data set, causing time, effort, creativity, and money to be spent crafting a solution to the problem that leverages the data without simply sampling or tossing out records.

The main consideration here, then, is to weigh the cost of using “all the data” in this complex (and potentially brittle) solution versus the benefits gained over using a smaller data set in a cheaper, faster, more stable way.

Daniel Gillick (Senior Research Scientist, Google)

Historically, most decisions – political, military, business, and personal – have been made by brains which have unpredictable logic and operate on subjective experiential evidence. “Big data” represents a cultural shift in which more and more decisions are made by algorithms with transparent logic, operating on documented immutable evidence. I think “Big” refers more to the pervasive nature of this change than to any particular amount of data.

Annette Greiner (Lecturer, UC Berkeley School of Information)

Big Data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another. Scientific computing is accustomed to pushing the envelope, constantly developing techniques to address relentless growth in dataset size, but many other disciplines are now just discovering the value – and hence the challenges – of working with data at the unwieldy end of the scale.

Seth Grimes (Principal Consultant, Alta Plana Corporation)

Big data has taken a beating in recent years, the accusation being that marketers and analysts have stretched and squeezed the term to cover a multitude of disparate problems, technologies, and products. Yet the core of big data remains what it has been for over a decade, framed by Doug Laney’s 2001 three Vs, Volume, Velocity, and Variety, and indicating data challenges sufficient to justify non-routine computing resources and processing techniques.

Joel Gurin (Author of “Open Data Now”)

Big Data describes datasets that are so large, complex, or rapidly changing that they push the very limits of our analytical capability. It’s a subjective term: What seems “big” today may seem modest in a few years when our analytic capacity has improved. While Big Data can be about anything, the most important kinds of Big Data – and perhaps the only ones worth the effort – are those that can have a big impact through what they tell us about society, public health, the economy, scientific research, or any number of other large-scale subjects.

Quentin Hardy (Deputy Tech Editor, The New York Times)

What’s “big” in big data isn’t necessarily the size of the databases, it’s the big number of data sources we have, as digital sensors and behavior trackers migrate across the world. As we triangulate information in more ways, we will discover hitherto unknown patterns in nature and society – and pattern-making is the wellspring of new art, science, and commerce.

Harlan Harris (Director, Data Science at Education Advisory Board)

To me, “Big Data” is the situation where an organization can (arguably) say that they have access to what they need to reconstruct, understand, and model the part of the world that they care about. Using their Big Data, then, they can (try to) predict future states of the world, optimize their processes, and otherwise be more effective and rational in their activities.

Jessica Kirkpatrick (Director of Data Science, InstaEDU)

Big Data refers to using complex datasets to drive focus, direction, and decision making within a company or organization. This is done by deriving actionable insights from the analysis of your organization’s data.

David Leonhardt (Editor, The Upshot, The New York Times)

Big Data is nothing more than a tool for capturing reality – just as newspaper reporting, photography and long-form journalism are. But it’s an exciting tool, because it holds the potential of capturing reality in some clearer and more accurate ways than we have been able to do in the past.

Hilary Mason (Founder, Fast Forward Labs)

Big Data is just the ability to gather information and query it in such a way that we are able to learn things about the world that were previously inaccessible to us.

Deirdre Mulligan (Associate Professor, UC Berkeley School of Information)

Big data: Endless possibilities or cradle-to-grave shackles, depending upon the political, ethical, and legal choices we make.

Sharmila Mulligan (CEO and Founder, ClearStory Data)

[Big Data means] harnessing more sources of diverse data where ‘data variety’ and ‘data velocity’ are the key opportunities. (Each source represents ‘a signal’ on what is happening in the business.) The opportunity is to harness data variety, automate ‘harmonization’ of data sources, to deliver fast-updating insights consumable by the line-of-business users.

Sean Patrick Murphy (Consulting Data Scientist and Co-Founder of a stealth startup)

While “big data” is often large in size relative to the available tool set, “big” actually refers to being important. Scientists and engineers have long known that data is valuable, but now the rest of the world, including those in control of purse strings, understand the value that can be created from data.

Prakash Nanduri (Co-Founder, CEO and President, Paxata, Inc)

Everything we know spits out data today – not just the devices we use for computing. We now get digital exhaust from our garage door openers to our coffee pots, and everything in between. At the same time, we have become a generation of people who demand instantaneous access to information – from what the weather is like in a country thousands of miles away to which store has better deals on toaster ovens. Big data is at the intersection of collecting, organizing, storing and turning all of that raw data into truly meaningful information.

Chris Neumann (CEO, DataHero)

At Aster Data, we originally used the term Big Data in our marketing to refer to analytical MPP databases like ours and to differentiate them from traditional data warehouse software. While both were capable of storing a “big” volume of data (which, in 2008, we defined as 10 TB or greater), “Big Data” systems were capable of performing complex analytics on top of that data – something that legacy data warehouse software could not do. Thus, our original definition was a system that (1) was capable of storing 10 TB of data or more and (2) was capable of executing advanced workloads, such as behavioral analytics or market basket analysis, on those large volumes of data. As time went on, diversity of data started to become more prevalent in these systems (particularly the need to mix structured and unstructured data), which led to more widespread adoption of the “3 Vs” (volume, velocity, and variety) as a definition for Big Data, which continues to this day.

Cathy O’Neil (Program Director, the Lede Program at Columbia University)

“Big Data” is more than one thing, but an important aspect is its use as a rhetorical device, something that can be used to deceive or mislead or overhype. It is thus vitally important that people who deploy big data models consider not just technical issues but the ethical issues as well.

Brad Peters (Chief Product Officer, Chairman at Birst)

In my view, Big Data is data that requires novel processing techniques to handle. Typically, big data requires massive parallelism in some fashion (storage and/or compute) to deal with volume and processing variety.

Gregory Piatetsky-Shapiro (President and Editor, KDnuggets.com)

The best definition I saw is, “Data is big when data size becomes part of the problem.” However, this refers to the size only. Now the buzzword “Big Data” refers to the new data-driven paradigm of business, science and technology, where the huge data size and scope enables better and new services, products, and platforms. #BigData also generates a lot of hype and will probably be replaced by a new buzzword, like “Internet of Things,” but “big data”-enabled services companies, like Google, Facebook, Amazon, location services, personalized/precision medicine, and many more will remain and prosper.

Jake Porway (Founder and Executive Director, DataKind)

As our lives have moved from the physical to the digital world, our everyday tools, like smartphones and ubiquitous Internet, create vast amounts of data. One of the best interpretations of the “big” in “big data” is expansive – whether you are a Fortune 500 company that just released an app generating a torrent of user data about every click and every activity of every user, or a nonprofit that just launched a cellphone-based app to find the closest homeless shelters and is now spewing forth information about every search and every click, we all have data. Dealing with this so-called Big Data requires a massive shift in technologies for storing, processing, and managing data – but it also presents a tremendous opportunity for the social sector to gather and analyze information faster to address some of our world’s most pressing challenges.

Kyle Rush (Head of Optimization, Optimizely)

There is certainly a colorful variety of definitions for the term big data out there. To me it means working with data at a large scale and velocity.

AnnaLee Saxenian (Dean, UC Berkeley School of Information)

I’m not fond of the phrase “Big Data” because it focuses on the volume of data, obscuring the far-reaching changes that are making data essential to individuals and organizations in today’s world. But if I have to define it, I’d say that “big data” is data that can’t be processed using standard databases because it is too big, too fast-moving or too complex for traditional data processing tools.

Josh Schwartz (Chief Data Scientist, Chartbeat)

The rising accessibility of platforms for the storage and analysis of large amounts of data (and the falling price per TB of doing so) has made it possible for a wide variety of organizations to store nearly all data in their purview – every log line, customer interaction, and event – unaggregated and for a significant period of time. The associated ethos of “store everything now and ask questions later” to me more than anything else characterizes how the world of computational systems looks under the lens of modern “big data” systems.

Peter Skomoroch (Entrepreneur, ex Principal Data Scientist, LinkedIn)

Big Data originally described the practice in the consumer Internet industry of applying algorithms to increasingly large amounts of disparate data to solve problems that had suboptimal solutions with smaller datasets. Many features and signals can only be observed by collecting massive amounts of data (for example, the relationships across an entire social network), and would not be detected using smaller samples. Processing large datasets in this manner was often difficult, time consuming, and error prone before the advent of technologies like MapReduce and Hadoop, which ushered in a wave of related tools and applications now collectively called Big Data technologies.

Anna Smith (Analytics Engineer, Rent the Runway)

Big Data is when data grows to the point that the technology supporting the data has to change. It also encompasses a variety of topics relating to how disparate data can be combined, processed into insights, and/or reworked into smart products.

Ryan Swanstrom (Data Science Blogger, Data Science 101)

Big Data used to mean data that a single machine was unable to handle. Now big data has become a buzzword to mean anything related to data analytics or visualization.

Shashi Upadhyay (CEO and Founder, Lattice Engines)

Big data is an umbrella term that means a lot of different things, but to me, it means the possibility of doing extraordinary things using modern machine learning techniques on digital data. Whether it is predicting illness, the weather, the spread of infectious diseases or what you will buy next, it offers a world of possibilities for improving people’s lives.

Mark van Rijmenam (CEO/Founder, BigData-Startups)

Big Data is not all about volume; it is more about combining different data sets and analysing them in real time to get insights for your organisation. Therefore, the right definition of Big Data should in fact be: Mixed Data.

Hal Varian (Chief Economist, Google)

Big data means data that cannot fit easily into a standard relational database.

Timothy Weaver (CIO, Del Monte Foods)

I’m happy to repeat the definition I’ve heard used and think appropriately defines the over[all] subject. I believe it’s Forrester’s definition of Volume, Velocity, Variety, and Variability. A lot of different data coming fast and in different structures.

Steven Weber (Professor, UC Berkeley School of Information and Department of Political Science)

For me, the technological definitions (like ‘too big to fit in an Excel spreadsheet’ or ‘too big to hold in memory’) are important, but aren’t really the main point. Big data for me is data at a scale and scope that changes in some fundamental way (not just at the margins) the range of solutions that can be considered when people and organizations face a complex problem. Different solutions, not just ‘more, better.’

John Myles White (Twitter)

The term Big Data is really only useful if it describes a quantity of data that’s so large that traditional approaches to data analysis are doomed to failure. That can mean that you’re doing complex analytics on data that’s too large to fit into memory or it can mean that you’re dealing with a data storage system that doesn’t offer the full functionality of a standard relational database. What’s essential is that your old way of doing things doesn’t apply anymore and can’t just ‘be scaled out.’

Brian Wilt (Senior Data Scientist, Jawbone)

The joke is that Big Data is data that breaks Excel, but we try not to be snooty about whether you measure your data in MBs or PBs. Data is more about your team and the results they can get.

Raymond Yee, Ph.D. (Software Developer, unglue.it)

Big data enchants us with the promise of new insights. Let’s not forget the knowledge hidden in the small data right before us.

6 thoughts on “Quotes”

  1. Dear Michael!
    I really liked your quotes. You can see my work at http://happydatascientist.blogspot.com/2015/04/data-scientists-at-work-thoughts-that.html. You can also see 2 videos there; they are originally from a KDnuggets post 😉
    My best regards

    • Hello Andy, thank you very much for your hint. I had a look at your list and found 40 which were not in my list right now. My list now contains >700 from which I publish one a day. So at least another 2 Years …. There are some typos in your list, e.g “better plac”. You might have a look. Thank you very much, Michael

      • Hello Michael !
        Thanks a lot for your attention to my humble work. I have fixed typo “better place” and hope for best. How did you find videos for #1, #2 interviews quotes ?
        I hope you enjoy it too :)) I saw your web site and found it very useful for me.
        So thanks again for your attention.


  3. hatemgkotb said:

    This is simply AMAZING!

