NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

We present new data and semantic parsing methods for the problem of mapping English sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to perform operations such as file manipulation, search, and application-specific scripting by simply stating their goals in English. We take a first step in this domain by providing a new dataset of challenging but commonly used Bash commands and expert-written English descriptions, along with baseline methods to establish performance levels on this task.

Google – Dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms. It aims to fill the need for a small, easily grokked codebase in which users can freely experiment with wild ideas (speculative research).
Our design principles are:
• Easy experimentation: Make it easy for new users to run benchmark experiments.
• Flexible development: Make it easy for new users to try out research ideas.
• Compact and reliable: Provide implementations for a few, battle-tested algorithms.
• Reproducible: Facilitate reproducibility in results.

Breaking Through the Cost Barrier to Deep Learning

Remember when we used to say 'data is the new oil'? Not anymore: now training data is the new oil. Training data is proving to be the single greatest impediment to the wide adoption and creation of deep learning models. We'll discuss current best practices, but more importantly, new breakthroughs in fully automated image labeling that are proving superior even to hand labeling.

Call for Code asks developers worldwide to collaborate on solutions to save lives

Disasters hit unexpectedly and cause life-threatening issues across the world. Large groups of people are left without water, electricity, or other basic systems that sustain life. In an effort to help the communities of the world be better prepared to handle these tough situations, David Clark Cause launched Call for Code along with IBM as the founding partner. Call for Code is a worldwide, multi-year initiative that challenges developers to solve pressing problems with sustainable software solutions. The Call for Code Challenge in 2018 is a competition that rewards participants who come up with the applications that make the greatest impact.

Top Six Considerations for a Streaming Analytics Platform

1. One-stop shop for big data processing
2. Visual low code development
3. Application lifecycle management
4. Unified support for ‘real-time’, ‘near real-time’, and batch processing
5. Advanced analytics and machine learning capabilities to predict outcomes and act in real-time
6. Technology agnostic and open source enabled
The streaming analytics platform for the modern real-time enterprise combines these essential elements. It is future-proof and integrates smoothly into the enterprise ecosystem without imposing lock-in or creating new data silos. It enables users at every step to navigate the evolving and complex big data landscape and to move steadily toward faster, predictive data processing.

Visualization of Tumor Response – Spider Plots

A collection of commonly used and newly developed methods for visualizing outcomes in oncology studies includes Kaplan-Meier curves, forest plots, funnel plots, violin plots, waterfall plots, spider plots, swimmer plots, heatmaps, circos plots, transit map diagrams, and network analysis diagrams (reviewed here). Previous articles in this blog introduced forest plots, violin plots, and waterfall plots, and provided R code for generating them. Continuing the series, the current article introduces spider plots for the visualization of tumor response and shows how to generate them in R.
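Although the article itself generates its figures in R, the core construction of a spider plot can be sketched in any plotting library: one line per patient, tracing percent change in tumor size from baseline across study visits. The patient IDs, visit weeks, and measurements below are invented for illustration; the -30% and +20% reference lines echo RECIST-style thresholds for partial response and progressive disease.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# Hypothetical tumor-response data: % change in tumor size from
# baseline at successive visits (weeks) for each patient.
patients = {
    "PT-01": ([0, 8, 16, 24], [0, -12, -30, -45]),
    "PT-02": ([0, 8, 16],     [0,   5,  22]),
    "PT-03": ([0, 8, 16, 24], [0,  -8, -15, -10]),
}

fig, ax = plt.subplots()
for pid, (weeks, change) in patients.items():
    ax.plot(weeks, change, marker="o", label=pid)

# Reference lines used in RECIST-style response plots.
ax.axhline(-30, linestyle="--", color="grey")  # partial response
ax.axhline(20, linestyle="--", color="grey")   # progressive disease
ax.axhline(0, color="black", linewidth=0.8)
ax.set_xlabel("Weeks on study")
ax.set_ylabel("% change in tumor size from baseline")
ax.set_title("Spider plot of tumor response")
ax.legend()
fig.savefig("spider_plot.png")
```

Each trajectory starts at 0% by construction, which is what gives the plot its characteristic "spider" fan shape.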

Doing Enterprise-Level Data Science with a Skeleton Crew

'Data science is a team sport.' Since as early as 2013, this axiom has been repeated to articulate that there is no unicorn data scientist, no single person who can do it all. Many companies have followed this wisdom, fielding massive data science operations. But more often than not, a big data science team isn't an option. You're a couple of techy people trying to make waves in a bigger organization, or a small shop that depends on analytics as part of the product. My company falls into the second camp: at Adlumin, we're a small team facing a big enterprise-level problem: cybersecurity. We use anomaly detection to monitor user behavior, looking for malicious activity on a network. Because we need to catch intrusions quickly, we perform streaming analytics on the information we receive. Two things allow us to succeed: building on the cloud and thoroughly testing our analytics. The cloud isn't exclusively for small teams, but it helps us compete with and exceed bigger competitors. Testing is our failsafe: by implementing useful tests on our analytics, we can be confident that our models will perform when they're released. Below are three principles that I've distilled into part of a playbook for doing data science at scale with a small team.
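The "thoroughly testing our analytics" principle can be made concrete with a behavioral test: assert that a detector flags an unambiguous outlier and accepts an unambiguous inlier before release. The `IsolationForest` model and the synthetic data below are illustrative assumptions, not Adlumin's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def train_detector(normal_data):
    """Fit an anomaly detector on baseline 'normal' behavior."""
    model = IsolationForest(random_state=0)
    model.fit(normal_data)
    return model

def test_flags_obvious_outlier():
    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, size=(500, 2))
    model = train_detector(normal)
    # A point far outside the training distribution should be
    # labeled -1 (anomaly); a typical point should be labeled 1.
    assert model.predict([[10.0, 10.0]])[0] == -1
    assert model.predict([[0.0, 0.0]])[0] == 1
```

Tests like this check behavior rather than exact scores, so they stay valid when the underlying model is retrained or swapped out.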

More Flexible Ordinal Outcome Models

In the previous post (https://…n-ratio-logit-models-for-ordinal-outcomes ), we showed alternatives to the commonly used cumulative logit model under the proportional odds assumption, also known as the proportional odds model, for ordinal outcomes. A potential drawback of the proportional odds model is its lack of flexibility: the proportional odds assumption is restrictive, and its violation can lead to model mis-specification. As a result, cumulative logit models with more flexible assumptions are called for.
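As a point of reference, the proportional odds (cumulative logit) model can be fit in Python with statsmodels' `OrderedModel`. The simulated data below are purely illustrative; note that a single slope for `x` is shared across all cumulative splits of the outcome, which is exactly the restriction the post discusses.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulate an ordinal outcome with three levels driven by one covariate x
# through a latent variable with logistic noise.
rng = np.random.default_rng(42)
n = 300
x = rng.normal(size=n)
latent = 1.5 * x + rng.logistic(size=n)
y = pd.Series(pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
                     labels=["low", "mid", "high"], ordered=True))

# Cumulative logit under the proportional odds assumption:
# one slope for x, plus (levels - 1) threshold parameters.
model = OrderedModel(y, x[:, None], distr="logit")
res = model.fit(method="bfgs", disp=False)
print(res.params)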

Towards Preventing Overfitting: Regularization

In machine learning, you must have come across the term overfitting. Overfitting is a phenomenon where a model fits the training data too well but fails to perform well on the test data. Performing sufficiently well on unseen test data is, in a sense, the ultimate goal in machine learning.
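A minimal sketch of how regularization counters overfitting, using a toy polynomial regression in scikit-learn; the dataset, polynomial degree, and `alpha` below are arbitrary choices for illustration. The L2 penalty in `Ridge` shrinks the coefficients relative to the unregularized least-squares fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy 1-D regression: a high-degree polynomial overfits unless regularized.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

plain = make_pipeline(PolynomialFeatures(15, include_bias=False),
                      LinearRegression())
ridge = make_pipeline(PolynomialFeatures(15, include_bias=False),
                      Ridge(alpha=1.0))  # L2 regularization

plain.fit(X_tr, y_tr)
ridge.fit(X_tr, y_tr)

print("unregularized test MSE:", mean_squared_error(y_te, plain.predict(X_te)))
print("ridge (L2) test MSE:   ", mean_squared_error(y_te, ridge.predict(X_te)))
```

The ridge solution can never have a larger coefficient norm than the unregularized fit on the same features, which is the mechanism that tames the wild oscillations of a high-degree polynomial.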

Deploying scikit-learn Models at Scale

Scikit-learn is great for putting together a quick model to test out your dataset. But what if you want to run it against incoming live data? Find out how to serve your scikit-learn model in an auto-scaling, serverless environment!
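The serialize-then-serve pattern behind most scikit-learn deployments can be sketched as follows. The model choice, file name, and `predict` handler are hypothetical stand-ins for whatever the serving platform expects, not the API of any particular service.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a quick model and serialize it -- this artifact is what a
# serving environment (e.g. a serverless function) loads at cold start.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")

# In the serving function: load once at startup, reuse across requests.
_model = joblib.load("model.joblib")

def predict(instances):
    """Hypothetical request handler: rows of features in, labels out."""
    return _model.predict(instances).tolist()

print(predict([[5.1, 3.5, 1.4, 0.2]]))
```

Loading the model once outside the handler is what makes auto-scaling cheap: each warm instance pays the deserialization cost a single time.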

Constructing a Data Analysis

This week Hilary Parker and I started our 'Book Club' on Not So Standard Deviations, where we will be discussing Nigel Cross's book Design Thinking: Understanding How Designers Think and Work. We will be talking about how the work of designers parallels the work of data scientists and how many of the principles developed in design port over so well to data analysis. While data visualization has always taken cues from design, I think much broader aspects of data analysis could benefit from the work studying design. At any rate, I think this is a topic that should be discussed more amongst statisticians and data analysts.

One of the first revelations I've had recently is realizing that data analyses are not naturally occurring phenomena. You will not run into a data analysis while walking in the woods. Data analyses must be created and constructed by people. One way to think about a data analysis is to think of it as a product to be designed. Data analysis is not a theoretical exercise. The goal is not to reveal something new about the world or to discover truth (although knowledge and truth may be important by-products). The goal of data analysis is to produce something useful: useful to the scientist, useful to the product manager, useful to the business executive, or useful to the policy-maker. In that sense, data analysis is a fairly down-to-earth activity.

Producing a useful product requires careful consideration of who will be using it. Good data analysis can be useful to just about anyone. The fact that many different kinds of people make use of data analysis is not exactly news, but what is new is the tremendous availability of data in general. If we consider a data analysis as something to be designed, this provides a rough road map for how to proceed.