In this paper, we substantiate our premise that statistics is one of the most important disciplines providing tools and methods to find structure in data and to give deeper insight into it, and the most important discipline for analyzing and quantifying uncertainty. We give an overview of different proposed structures of Data Science and address the impact of statistics on steps such as data acquisition and enrichment, data exploration, data analysis and modeling, validation, and representation and reporting. We also point out fallacies that arise when statistical reasoning is neglected.
Recently I’ve been very into the idea of time-based heatmaps as an easy way of understanding relative aggregates by date and time. I think it’s important to view behavior over time, as the numbers often look very different across intervals such as year, month, day of the week, or hour. For example, US consumers’ shopping patterns typically ramp up during the October-December months, with a drop-off in January. Another example could be coffee shop orders by time of day: intuitively, we expect coffee sales to peak during the morning and early afternoon hours. So that covers why I like time-period analysis, but why do I like heatmaps? Heatmaps pack a dense amount of information into a compact grid, which lets the reader identify relative patterns very quickly. For example, looking at the raw data set below (left) does not have the same impact as looking at the equivalent heatmap (right).
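The aggregation behind such a heatmap is a simple pivot. As a minimal sketch in Python with pandas (the timestamps and column names here are made up for illustration), counting events into a day-of-week by hour grid produces the data the heatmap colours:

```python
import pandas as pd

# Illustrative event timestamps (e.g., coffee shop orders)
orders = pd.DataFrame({
    "ts": pd.to_datetime([
        "2018-04-02 08:15", "2018-04-02 08:40", "2018-04-02 13:05",
        "2018-04-03 09:10", "2018-04-03 08:55", "2018-04-07 11:30",
    ])
})

# Aggregate counts into a day-of-week x hour grid: the data behind the heatmap
grid = (orders
        .assign(dow=orders["ts"].dt.day_name(), hour=orders["ts"].dt.hour)
        .pivot_table(index="dow", columns="hour", values="ts",
                     aggfunc="count", fill_value=0))
print(grid)
```

Feeding this grid into any heatmap renderer (seaborn's heatmap, ggplot2's geom_tile, or even spreadsheet conditional formatting) then maps the counts to colour.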
With the explosion of “Big Data” over the last few years, the need for people who know how to build and manage data pipelines has grown. Unfortunately, supply has not kept up with demand, and there seems to be a shortage of engineers focused on the ingestion and management of data at scale. Part of the problem is the lack of education focused on this growing field. Currently, there seems to be no official curriculum or certification to become a Data Engineer or Data Architect (that we know of). While that’s bad news for companies that need qualified engineers, it’s great news for you! Why? Because of basic supply and demand: data engineers and data architects are cleaning up. In fact, we did a little research and found that the average salary for a data engineer is around $95.5k, while salaries for data architects average around $112k nationally. The main path to this strategic, coveted position (and salary) means cutting your teeth as a data engineer and working your way up, or making a lateral job move.
Datafication, according to Mayer-Schoenberger and Cukier, is the transformation of social action into online quantified data, thus allowing for real-time tracking and predictive analysis. Simply put, it is about taking a previously invisible process or activity and turning it into data that can be monitored, tracked, analysed and optimised. The latest technologies have enabled many new ways of ‘datifying’ our daily and basic activities. In summary, datafication is a technological trend that turns many aspects of our lives into computerized data and transforms organizations into data-driven enterprises by converting this information into new forms of value. Datafication refers to the fact that the daily interactions of living things can be rendered into a data format and put to social use.
To build a forest plot, the forestplot package is often used in R. However, I find ggplot2 to have more advantages for making forest plots, such as enabling the inclusion of several variables with many categories in a lattice form. You can also use any scale of your choice, such as a log scale. In this post, I will show how to plot risk ratios and their confidence intervals for several conditions.
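The numbers behind each row of a forest plot are a point estimate and an interval. As a language-neutral sketch in Python (the 2x2 counts are made up for illustration), a risk ratio and its 95% Wald confidence interval are conventionally computed on the log scale:

```python
import math

def risk_ratio_ci(a, n1, b, n2, z=1.96):
    """Risk ratio of exposed (a/n1) vs unexposed (b/n2) event rates,
    with a Wald 95% CI computed on the log scale."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)  # standard error of log(RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Made-up counts for one condition: 30/200 exposed vs 20/220 unexposed
rr, lo, hi = risk_ratio_ci(30, 200, 20, 220)
print(f"RR={rr:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Each (rr, lo, hi) triple becomes one point with horizontal whiskers in the plot; in ggplot2 terms, roughly a geom_point plus geom_errorbarh per condition, with the x axis on a log scale.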
Data Scientists should focus on developing product sense in order to move fast and systematically, create models that are relevant, and know when to stop.
No other means of data description is more comprehensive than descriptive statistics, and with ever-increasing volumes of data and the need for low-latency decision making, its relevance will only continue to grow.
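As a small illustration of why (a Python sketch with made-up numbers): even three summary statistics expose structure, and disagreements, that a raw list hides:

```python
import statistics

# A small made-up sample of daily order counts
sample = [12, 15, 14, 10, 18, 16, 14, 11, 95]  # note the outlier

mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)

# The mean is pulled toward the outlier; the median is robust to it
print(f"mean={mean:.1f}, median={median}, stdev={stdev:.1f}")
```

The gap between mean and median is itself descriptive: it signals skew or contamination before any model is fit.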
April 8, 2018: Generating Text From An R DataFrame using PyTracery, Pandas and Reticulate. In a couple of recent posts (Textualisation With Tracery and Database Reporting 2.0 and More Tinkering With PyTracery) I’ve started exploring various ways of using the pytracery port of the tracery story-generation tool to generate a variety of texts from Python pandas data frames. For my F1DataJunkie tinkerings I’ve been using R + SQL as the base languages, with some hardcoded Rdata2text constructions for rendering text from R dataframes (example). Whilst there is a basic port of tracery to R, I want to make use of the various extensions I’ve been doodling with in pytracery, so it seemed like a good opportunity to start exploring the R reticulate package. It was a bit of a faff getting things to work the first time, so here are some notes on what I had to consider to get a trivial demo working in my RStudio/Rmd/knitr environment.
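The core tracery idea, recursively expanding #symbol# placeholders from a rule dictionary, is simple enough to sketch in a few lines of plain Python; this is a toy re-implementation for illustration only, not the pytracery API, and the grammar rules are invented:

```python
import random
import re

def flatten(rules, symbol="origin", rng=random.Random(0)):
    """Recursively expand #symbol# placeholders from a tracery-style grammar."""
    text = rng.choice(rules[symbol])
    # Replace each #name# with a (recursively flattened) expansion of that rule
    return re.sub(r"#(\w+)#", lambda m: flatten(rules, m.group(1), rng), text)

# A made-up grammar in the spirit of the F1 reporting examples
rules = {
    "origin": ["#driver# finished the race in P#pos#."],
    "driver": ["Hamilton", "Vettel", "Raikkonen"],
    "pos": ["1", "2", "3"],
}
print(flatten(rules))
```

In the reticulate workflow, the R dataframe supplies the values that get spliced into rules like these, one generated sentence per row.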
Many R package authors (including myself) lump a collection of small, useful functions into some type of utils.R file and usually do not export the functions since they are (generally) designed to work on package internals rather than expose their functionality via the exported package API. Just like Batman’s utility belt, which can be customized for any mission, any set of utilities in a given R package will also likely be different from those in other packages.
We recently had an awesome opportunity to work with a great client that asked Business Science to build an open source anomaly detection algorithm suited to their needs. The business goal was to accurately detect anomalies in various marketing data, consisting of website actions and marketing feedback spanning thousands of time series across multiple customers and web sources. Enter anomalize: a tidy anomaly detection algorithm that’s time-based (built on top of tibbletime) and scalable from one to many time series! We are really excited to present this open source R package for others to benefit from. In this post, we’ll give an overview of what anomalize does and how it works.
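anomalize's approach is tidy time-series decomposition followed by outlier detection on the remainder. A heavily simplified sketch of that general idea in Python, using a rolling median as the trend estimate and IQR fences on the remainder; this is an assumption-laden toy, not the package's real implementation:

```python
import statistics

def detect_anomalies(series, window=5, k=3.0):
    """Flag points whose remainder (value minus rolling median) falls
    outside k * IQR fences. A toy analogue of decomposition-based detection."""
    n = len(series)
    half = window // 2
    # Rolling median stands in for the trend/seasonal component
    trend = [statistics.median(series[max(0, i - half):min(n, i + half + 1)])
             for i in range(n)]
    remainder = [x - t for x, t in zip(series, trend)]
    qs = statistics.quantiles(remainder, n=4)
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    return [not (lo_fence <= r <= hi_fence) for r in remainder]

values = [10, 11, 10, 12, 11, 50, 10, 11, 12, 10]  # one obvious spike
flags = detect_anomalies(values)
print(flags)
```

The real package swaps in proper STL/Twitter-style decomposition and tuned fence methods, and vectorizes this across many grouped time series at once.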
In this exercise, we will continue to build on the model from our previous exercise here, specifically to address the errors the model may generate, including rounding and truncation errors. Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
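A quick illustration of the distinction the exercise targets (a Python sketch; the values are chosen to show the effect): truncation always chops toward zero, rounding goes to the nearest representable value, and floating-point storage can make the two coincide in surprising ways:

```python
import math

x = 2.675

truncated = math.trunc(x * 100) / 100   # chop the digits: 2.67
rounded = round(x, 2)                   # nearest value, but 2.675 is stored
                                        # as 2.6749999..., so this is also 2.67
print(truncated, rounded)

# Accumulated error: truncation drifts in one direction across many values
vals = [0.019] * 100
trunc_sum = sum(math.trunc(v * 100) / 100 for v in vals)
print(round(sum(vals), 2), trunc_sum)   # true total ~1.9, truncated total ~1.0
```

This one-directional drift is why truncation errors in a model can compound far faster than rounding errors of the same magnitude.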
One of the nicest things about working with R is that with very little effort you can customize and automate activities to produce the output you want, just the way you want it. Contrast that with more monolithic packages, which may allow a bit of scripting, but for the most part the price of a GUI, or of an all-in-one bundle, is that you lose the ability to have things just your way. Since everything in R is pretty much a function already, you may as well invest a little time and energy in making functions… your way, and to exactly your tastes and needs. This post is not meant to be an exhaustive or complete treatment of writing a function; for that you probably want a book, or at least a chapter like the one Hadley has in Advanced R. This post will focus on a single, very practical and hopefully useful example.
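The same spirit carries over to any language with first-class functions. A hypothetical Python sketch (the function and its format are invented) of baking "just the way you want it" output into one reusable function:

```python
def report_summary(values, label="series", digits=1):
    """Format a one-line summary exactly the way I want it, every time."""
    n = len(values)
    mean = sum(values) / n
    lo, hi = min(values), max(values)
    return f"{label}: n={n}, mean={mean:.{digits}f}, range=[{lo}, {hi}]"

print(report_summary([3, 1, 4, 1, 5], label="digits"))
```

Once the formatting decisions live in one place, every report that calls the function inherits your tastes automatically, which is the whole point of the post.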