How should I organize a larger data science team?

… BTW, here are some pain points for us:
• Maintaining data pipelines
• Getting training algorithms to scale
• Optimizing our AWS bill
• Training new data scientists
• Training engineers on related teams
• Managing requirements from business users and other engineering teams
• Making sure that the data scientists are innovators and not just order takers

Prediction Interval, the wider sister of Confidence Interval

In this post, I will illustrate the use of prediction intervals for the comparison of measurement methods. In the example, a new spectral method for measuring whole blood hemoglobin is compared with a reference method. But first, let’s start by discussing the large difference between a confidence interval and a prediction interval.
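
To make the difference concrete, here is a minimal sketch for the simplest case, a sample mean; it is not the method-comparison analysis from the post. The confidence interval covers the unknown mean, while the prediction interval covers a single future observation and is therefore wider. The function name, the sample values (made-up differences between two methods) and the hard-coded critical value are assumptions for illustration only.

```python
import math
import statistics


def mean_intervals(sample, t_crit):
    """Return (confidence interval, prediction interval) around the sample mean.

    t_crit is the two-sided critical value of Student's t for the desired
    level with len(sample) - 1 degrees of freedom (supplied by the caller).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    ci_half = t_crit * sd / math.sqrt(n)          # uncertainty of the mean
    pi_half = t_crit * sd * math.sqrt(1 + 1 / n)  # uncertainty of one new value
    return (mean - ci_half, mean + ci_half), (mean - pi_half, mean + pi_half)


# Made-up illustrative differences (new method minus reference method).
differences = [0.3, -0.1, 0.4, 0.0, 0.2, -0.3, 0.1, 0.5, -0.2, 0.1]

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262

ci, pi = mean_intervals(differences, T_CRIT_95_DF9)
print("95% confidence interval for the mean:", ci)
print("95% prediction interval for a new value:", pi)
```

For a sample of size n, the prediction interval in this setting is wider than the confidence interval by a factor of sqrt(n + 1), so the gap grows rather than shrinks as more data are collected.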

Interpreting machine learning models with the lime package for R

Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it’s difficult to answer the question, ‘why that label, and not one of the others?’ One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it’s impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, in the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al, and implemented in the R package lime by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).
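
The local idea can be sketched in a few lines of Python. This is a simplified illustration of a locally weighted linear surrogate, not the API or the exact algorithm of the lime package: it perturbs a point, weights the perturbed samples by proximity, and fits a weighted linear model whose slopes indicate local influence. The black-box function, the kernel, and all names and parameter values are assumptions made for the example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def local_slopes(predict, point, n_samples=2000, scale=0.1,
                 kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear model around `point`; return its slopes."""
    rng = random.Random(seed)
    d = len(point)
    # Accumulate the weighted normal equations (X'WX) beta = X'Wy.
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0, scale) for _ in range(d)]   # perturbation
        x = [p + dx for p, dx in zip(point, delta)]
        dist2 = sum(dx * dx for dx in delta)
        w = math.exp(-dist2 / kernel_width ** 2)          # closer = heavier
        row = [1.0] + delta                               # intercept + offsets
        y = predict(x)
        for i in range(d + 1):
            xtwy[i] += w * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += w * row[i] * row[j]
    return solve(xtwx, xtwy)[1:]                          # drop the intercept


def black_box(x):
    """Hypothetical non-linear model standing in for a real classifier score."""
    return x[0] ** 2 + 0.5 * x[1]


slopes_far = local_slopes(black_box, [2.0, 0.0])
slopes_origin = local_slopes(black_box, [0.0, 0.0])
print("local slopes near (2, 0):", slopes_far)
print("local slopes near (0, 0):", slopes_origin)
```

Near (2, 0) the first variable dominates the local explanation, while near (0, 0) the second one does, which is the kind of point-specific answer a single global summary cannot give.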

Build httr Functions Automagically from Manual Browser Requests with the middlechild Package

You can catch a bit of the @rOpenSci 2018 Unconference experience at home with this short-ish ‘splainer video on how to use the new middlechild package (https://…/middlechild ) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.

bounceR 0.1.2: Automated Feature Selection

As promised, we kept on working on our bounceR package. For one, we changed the interface: users no longer have to choose a number of tuning parameters that, thanks to my somewhat cryptic documentation, sound more complicated than they are. Inspired by a feature that lets the user set the time he or she wants to wait, instead of a number of cryptic tuning parameters, we added a similar function.

How Much Money Should Machines Earn?

I think that a very good way to start with R is to build an interactive visualization of some open data, because you will train many important skills of a data scientist: loading, cleaning, transforming and combining data, and producing a suitable visualization. Making it interactive will also give you an idea of the power of R, because you will realise that you can indirectly handle other programming languages such as JavaScript. That's precisely what I've done today.

Spreadsheet Data Manipulation in R

This repository provides some simple code with working examples to perform spreadsheet manipulation in R. More functions will be added!

Root Cause Analysis in IT Infrastructures using Ontologies and Abduction in Markov Logic Networks

Information systems play a crucial role in most of today’s business operations. High availability and reliability of services and hardware, and, in the case of outages, short response times are essential. Thus, a high amount of tool support and automation in risk management is desirable to decrease downtime. We propose a new approach for calculating the root cause for an observed failure in an IT infrastructure. Our approach is based on abduction in Markov Logic Networks. Abduction aims to find an explanation for a given observation in the light of some background knowledge. In failure diagnosis, the explanation corresponds to the root cause, the observation to the failure of a component, and the background knowledge to the dependency graph extended by potential risks. We apply a method to extend a Markov Logic Network in order to conduct abductive reasoning, which is not naturally supported in this formalism. Our approach exhibits a high amount of reusability and facilitates modeling by using ontologies as background knowledge. This enables users without specific knowledge of a concrete infrastructure to gain viable insights in the case of an incident. We implemented the method in a tool and illustrate its suitability for root cause analysis by applying it to a sample scenario and testing its scalability on randomly generated infrastructures.

Data Ethics Framework

Making better use of data offers huge benefits, in helping us provide the best possible services to the people we serve. However, all new opportunities present new challenges. The pace of technology is changing so fast that we need to make sure we are constantly adapting our codes and standards. Those of us in the public sector need to lead the way. As we set out to develop our National Data Strategy, getting the ethics right, particularly in the delivery of public services, is critical. To do this, it is essential that we agree collective standards and ethical frameworks. Ethics and innovation are not mutually exclusive. Thinking carefully about how we use our data can help us be better at innovating when we use it. Our new Data Ethics Framework sets out clear principles for how data should be used in the public sector. It will help us maximise the value of data whilst also setting the highest standards for transparency and accountability when building or buying new data technology. We have come a long way since we published the first version of the Data Science Ethical Framework. This new version focuses on the need for technology, policy and operational specialists to work together, so we can make the most of expertise from across disciplines. We want to work with others to develop transparent standards for using new technology in the public sector, promoting innovation in a safe and ethical way. This framework will build the confidence in public sector data use needed to underpin a strong digital economy. I am looking forward to working with all of you to put it into practice.

Confirmatory Factor Analysis: How To Measure Something We Cannot Observe or Measure Directly

In science we often want to measure an underlying characteristic that cannot be observed or measured directly. Such a characteristic is hypothesized to exist in order to explain variables, such as behavior, that can be observed. The measurable variables are called manifest variables; the unmeasurable ones are called latent variables.
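
A small simulation may help to see how manifest and latent variables relate. The sketch below does not fit a confirmatory factor model; it only generates data from a one-factor measurement model and checks that the correlations between the manifest variables match what the model implies (the product of their loadings). The loadings, sample size and names are assumptions chosen for illustration.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(1)
loadings = [0.9, 0.8, 0.7]   # hypothetical standardized factor loadings
n_obs = 5000

# The latent variable is never observed directly in a real study.
latent = [rng.gauss(0, 1) for _ in range(n_obs)]

# Each manifest variable = loading * latent + unique noise, scaled to unit variance.
manifest = [
    [lam * xi + math.sqrt(1 - lam ** 2) * rng.gauss(0, 1) for xi in latent]
    for lam in loadings
]

pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
for i, j in pairs:
    observed = correlation(manifest[i], manifest[j])
    implied = loadings[i] * loadings[j]
    print(f"x{i + 1}, x{j + 1}: observed {observed:.3f}, implied {implied:.3f}")
```

Confirmatory factor analysis works in the opposite direction: it starts from the observed correlations and asks whether a hypothesized set of loadings reproduces them adequately.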

Object-Oriented Bayesian Networks for Condition Monitoring, Root Cause Analysis and Decision Support on Operation of Complex Continuous Processes: Methodology & Applications

The increasing complexity of large-scale industrial processes and the struggle for cost reduction and higher profitability imply that automated systems for process diagnosis in plant operation and maintenance are required. We have developed a methodology to address this issue and have designed a prototype system on which this methodology has been applied. The methodology integrates decision-theoretic troubleshooting with risk assessment for industrial process control. It is applied to a pulp operation and screening process. The process is modeled using object-oriented Bayesian networks (OOBNs). Most abnormalities are derived from the context-adaptive signal classifications with their enable and action events. The system performs reasoning under uncertainty based on multisensor information extraction and presents to users corrective actions, with explanations of the root causes. It records users' actions with associated cases, and the OOBN models are prepared to perform sequential learning to increase their performance in diagnostics and advice. The system allows modeling during the design phase of a new plant, can provide guidance as early as the start-up phase, and allows adaptation to process changes during plant operation.

Step Forward Feature Selection: A Practical Example in Python

When it comes to disciplined approaches to feature selection, wrapper methods are those which marry the feature selection process to the type of model being built, evaluating feature subsets to compare the model's performance across them, and subsequently selecting the best-performing subset.
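
As a rough sketch of the wrapper idea, the code below implements a greedy step forward search using only the standard library. It is not the code from the linked post: the scoring model (a 1-nearest-neighbour classifier evaluated with leave-one-out accuracy), the synthetic data and all function names are assumptions chosen to keep the example self-contained.

```python
import random


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, (xi, yi) in enumerate(zip(rows, labels)):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, rows[j])),
        )
        hits += labels[nearest] == yi
    return hits / len(rows)


def step_forward_select(X, y, k, score=loo_accuracy):
    """Greedily add, one at a time, the feature that most improves the score."""
    selected, history = [], []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < k:
        def subset_score(feature):
            cols = selected + [feature]
            return score([[row[c] for c in cols] for row in X], y)

        best = max(remaining, key=subset_score)
        history.append((best, subset_score(best)))
        selected.append(best)
        remaining.remove(best)
    return selected, history


# Synthetic data: columns 0 and 2 carry the class signal, 1 and 3 are noise.
rng = random.Random(42)
y = [i % 2 for i in range(60)]
X = [
    [3 * label + rng.gauss(0, 1), rng.gauss(0, 1),
     -3 * label + rng.gauss(0, 1), rng.gauss(0, 1)]
    for label in y
]

selected, history = step_forward_select(X, y, k=2)
for feature, acc in history:
    print(f"added feature {feature}, leave-one-out accuracy {acc:.2f}")
```

Because every candidate subset is scored by actually training and evaluating the model, the search is tied to that model type, which is what distinguishes a wrapper method from a filter that ranks features independently of any model.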