# If you did not already know

We investigate a projection free method, namely conditional gradient sliding on batched, stochastic and finite-sum non-convex problem. CGS is a smart combination of Nesterov’s accelerated gradient method and Frank-Wolfe (FW) method, and outperforms FW in the convex setting by saving gradient computations. However, the study of CGS in the non-convex setting is limited. In this paper, we propose the non-convex conditional gradient sliding (NCGS) which surpasses the non-convex Frank-Wolfe method in batched, stochastic and finite-sum setting. …

Innovation Management
Innovation management is the management of innovation processes. It refers both to product and organizational innovation. Innovation management includes a set of tools that allow managers and engineers to cooperate with a common understanding of processes and goals. Innovation management allows the organization to respond to external or internal opportunities, and use its creativity to introduce new ideas, processes or products. It is not relegated to R&D; it involves workers at every level in contributing creatively to a company’s product development, manufacturing and marketing. …

Central Network (CentralNet)
This paper proposes a novel multimodal fusion approach, aiming to produce best possible decisions by integrating information coming from multiple media. While most of the past multimodal approaches either work by projecting the features of different modalities into the same space, or by coordinating the representations of each modality through the use of constraints, our approach borrows from both visions. More specifically, assuming each modality can be processed by a separated deep convolutional network, allowing to take decisions independently from each modality, we introduce a central network linking the modality specific networks. This central network not only provides a common feature embedding but also regularizes the modality specific networks through the use of multi-task learning. The proposed approach is validated on 4 different computer vision tasks on which it consistently improves the accuracy of existing multimodal fusion approaches. …

# R Packages worth a look

Derive Polygenic Risk Score Based on Emprical Bayes Theory (EBPRS)
EB-PRS is a novel method that leverages information for effect sizes across all the markers to improve the prediction accuracy. No parameter tuning is …

A ‘sparklyr’ Extension for Nested Data (sparklyr.nested)
A ‘sparklyr’ extension adding the capability to work easily with nested data.

A Two-Sample Test for the Equality of Distributions for High-Dimensional Data (TwoSampleTest.HD)
For high-dimensional data whose main feature is a large number, p, of variables but a small sample size, the null hypothesis that the marginal distribu …

# Whats new on arXiv

Often, a community becomes alarmed when high rates of cancer are noticed, and residents suspect that the cancer cases could be caused by a known source of hazard. In response, the CDC recommends that departments of health perform a standardized incidence ratio (SIR) analysis to determine whether the observed cancer incidence is higher than expected. This approach has several limitations that are well documented in the literature. In this paper we propose a novel causal inference approach to cancer cluster investigations, rooted in the potential outcomes framework. Assuming that a source of hazard representing a potential cause of increased cancer rates in the community is identified a priori, we introduce a new estimand called the causal SIR (cSIR). The cSIR is a ratio defined as the expected cancer incidence in the exposed population divided by the expected cancer incidence under the (counterfactual) scenario of no exposure. To estimate the cSIR we need to overcome two main challenges: 1) identify unexposed populations that are as similar as possible to the exposed one to inform estimation under the counterfactual scenario of no exposure, and 2) make inference on cancer incidence in these unexposed populations using publicly available data that are often available at a much higher level of spatial aggregation than what is desired. We overcome the first challenge by relying on matching. We overcome the second challenge by developing a Bayesian hierarchical model that borrows information from other sources to impute cancer incidence at the desired finer level of spatial aggregation. We apply our proposed approach to determine whether trichloroethylene vapor exposure has caused increased cancer incidence in Endicott, NY.
This paper attempts to address the issues of machine learning in its current implementation. It is known that machine learning algorithms require a significant amount of data for training purposes, whereas recent developments in deep learning have increased this requirement dramatically. The performance of an algorithm depends on the quality of data and hence, algorithms are as good as the data they are trained on. Supervised learning is developed based on human learning processes by analysing named (i.e. annotated) objects, scenes and actions. Whether training on large quantities of data (i.e. big data) is the right or the wrong approach, is debatable. The fact is, that training algorithms the same way we learn ourselves, comes with limitations. This paper discusses the issues around applying a human-like approach to train algorithms and the implications of this approach when using limited data. Several current studies involving non-data-driven algorithms and natural examples are also discussed and certain alternative approaches are suggested.
Consider a data publishing setting for a dataset composed of non-private features correlated with a set of private features not necessarily present in the dataset. The objective of the publisher is to maximize the amount of information about the non-private features in a revealed dataset (utility), while keeping the information leaked about the private attributes bounded (privacy). Here, both privacy and utility are measured using information leakage measures that arise in adversarial settings. We consider a practical setting wherein the publisher uses an estimate of the joint distribution of the features to design the privacy mechanism. For any privacy mechanism, we provide probabilistic upper bounds for the discrepancy between the privacy guarantees for the empirical and true distributions, and similarly for utility. These bounds follow from our main technical results regarding the Lipschitz continuity of the considered information leakage measures. We also establish the statistical consistency of the notion of optimal privacy mechanism. Finally, we introduce and study the family of uniform privacy mechanisms, mechanisms designed upon an estimate of the joint distribution which are capable of providing privacy to a whole neighborhood of the estimated distribution, and thereby, guaranteeing privacy for the true distribution with high probability.
Constrained sequential pattern mining aims at identifying frequent patterns on a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency.
In unsupervised learning, dimensionality reduction is an important tool for data exploration and visualization. Because these aims are typically open-ended, it can be useful to frame the problem as looking for patterns that are enriched in one dataset relative to another. These pairs of datasets occur commonly, for instance a population of interest vs. control or signal vs. signal free recordings.However, there are few methods that work on sets of data as opposed to data points or sequences. Here, we present a probabilistic model for dimensionality reduction to discover signal that is enriched in the target dataset relative to the background dataset. The data in these sets do not need to be paired or grouped beyond set membership. By using a probabilistic model where some structure is shared amongst the two datasets and some is unique to the target dataset, we are able to recover interesting structure in the latent space of the target dataset. The method also has the advantages of a probabilistic model, namely that it allows for the incorporation of prior information, handles missing data, and can be generalized to different distributional assumptions. We describe several possible variations of the model and demonstrate the application of the technique to de-noising, feature selection, and subgroup discovery settings.
Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it may not be robust in some situations. Recently, Newton methods have been investigated as an alternative optimization technique, but nearly all existing studies consider only fully-connected feedforward neural networks. They do not investigate other types of networks such as Convolutional Neural Networks (CNN), which are more commonly used in deep-learning applications. One reason is that Newton methods for CNN involve complicated operations, and so far no works have conducted a thorough investigation. In this work, we give details of all building blocks including function, gradient, and Jacobian evaluation, and Gauss-Newton matrix-vector products. These basic components are very important because with them further developments of Newton methods for CNN become possible. We show that an efficient MATLAB implementation can be done in just several hundred lines of code and demonstrate that the Newton method gives competitive test accuracy.
Timely prediction of clinically critical events in Intensive Care Unit (ICU) is important for improving care and survival rate. Most of the existing approaches are based on the application of various classification methods on explicitly extracted statistical features from vital signals. In this work, we propose to eliminate the high cost of engineering hand-crafted features from multivariate time-series of physiologic signals by learning their representation with a sequence-to-sequence auto-encoder. We then propose to hash the learned representations to enable signal similarity assessment for the prediction of critical events. We apply this methodological framework to predict Acute Hypotensive Episodes (AHE) on a large and diverse dataset of vital signal recordings. Experiments demonstrate the ability of the presented framework in accurately predicting an upcoming AHE.
In Business Intelligence, accurate predictive modeling is the key for providing adaptive decisions. We studied predictive modeling problems in this research which was motivated by real-world cases that Microsoft data scientists encountered while dealing with e-commerce transaction fraud control decisions using transaction streaming data in an uncertain probabilistic decision environment. The values of most online transactions related features can return instantly, while the true fraud labels only return after a stochastic delay. Using partially mature data directly for predictive modeling in an uncertain probabilistic decision environment would lead to significant inaccuracy on risk decision-making. To improve accurate estimation of the probabilistic prediction environment, which leads to more accurate predictive modeling, two frameworks, Current Environment Inference (CEI) and Future Environment Inference (FEI), are proposed. These frameworks generated decision environment related features using long-term fully mature and short-term partially mature data, and the values of those features were estimated using varies of learning methods, including linear regression, random forest, gradient boosted tree, artificial neural network, and recurrent neural network. Performance tests were conducted using some e-commerce transaction data from Microsoft. Testing results suggested that proposed frameworks significantly improved the accuracy of decision environment estimation.
This paper examines the possibility of, and the possible advantages to learning the filters of convolutional neural networks (CNNs) for image analysis in the wavelet domain. We are stimulated by both Mallat’s scattering transform and the idea of filtering in the Fourier domain. It is important to explore new spaces in which to learn, as these may provide inherent advantages that are not available in the pixel space. However, the scattering transform is limited by its inability to learn in between scattering orders, and any Fourier domain filtering is limited by the large number of filter parameters needed to get localized filters. Instead we consider filtering in the wavelet domain with learnable filters. The wavelet space allows us to have local, smooth filters with far fewer parameters, and learnability can give us flexibility. We present a novel layer which takes CNN activations into the wavelet space, learns parameters and returns to the pixel space. This allows it to be easily dropped in to any neural network without affecting the structure. As part of this work, we show how to pass gradients through a multirate system and give preliminary results.
Deep neural networks have shown superior performance in many regimes to remember familiar patterns with large amounts of data. However, the standard supervised deep learning paradigm is still limited when facing the need to learn new concepts efficiently from scarce data. In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning. The training procedure, imitating the concept formation course of human, learns how to distinguish samples from different classes and aggregate samples of the same kind. In order to better utilize the advantages originated from the human behavior, we propose a sequential process, during which the network should decide how to remember each sample at every step. In this sequential process, a stable and interactive memory serves as an important module. We validate our model in some typical one-shot learning tasks and also an exploratory outlier detection problem. In all the experiments, our model gets highly competitive to reach or outperform those strong baselines.
Deep domain adaptation has recently undergone a big success. Compared with shallow domain adaptation, deep domain adaptation has shown higher predictive performance and stronger capacity to tackle structural data (e.g., image and sequential data). The underlying idea of deep domain adaptation is to bridge the gap between source and target domains in a joint feature space so that a supervised classifier trained on labeled source data can be nicely transferred to the target domain. This idea is certainly appealing and motivating, but under the theoretical perspective, none of the theory has been developed to support this. In this paper, we have developed a rigorous theory to explain why we can bridge the relevant gap in an intermediate joint space. Under the light of our proposed theory, it turns out that there is a strong connection between deep domain adaptation and Wasserstein (WS) distance. More specifically, our theory revolves the following points: i) first, we propose a context wherein we can perfectly perform a transfer learning and ii) second, we further prove that by means of bridging the relevant gap and minimizing some reconstruction errors we are minimizing a WS distance between the push forward source distribution and the target distribution via a transport that maps from the source to target domains.
In machine learning, a nonparametric forecasting algorithm for time series data has been proposed, called the kernel spectral hidden Markov model (KSHMM). In this paper, we propose a technique for short-term wind-speed prediction based on KSHMM. We numerically compared the performance of our KSHMM-based forecasting technique to other techniques with machine learning, using wind-speed data offered by the National Renewable Energy Laboratory. Our results demonstrate that, compared to these methods, the proposed technique offers comparable or better performance.
Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result given a fixed time bound to support interactive visualizations better. However, classical AQP approaches suffer from various problems that limit the applicability to support the ad-hoc exploration of a new data set: (1) Classical AQP approaches that perform online sampling can support ad-hoc exploration queries but yield low quality if executed over rare subpopulations. (2) Classical AQP approaches that rely on offline sampling can use some form of biased sampling to mitigate these problems but require a priori knowledge of the workload, which is often not realistic if users want to explore a new database. In this paper, we present a new approach to AQP called Model-based Approximate Query Processing that leverages generative models learned over the complete database to answer SQL queries at interactive speeds. Different from classical AQP approaches, generative models allow us to compute responses to ad-hoc queries and deliver high-quality estimates also over rare subpopulations at the same time. In our experiments with real and synthetic data sets, we show that Model-based AQP can in many scenarios return more accurate results in a shorter runtime. Furthermore, we think that our techniques of using generative models presented in this paper can not only be used for AQP in databases but also has applications for other database problems including Query Optimization as well as Data Cleaning.
Different layers of deep convolutional neural networks(CNN) can encode different-level information. High-layer features always contain more semantic information, and low-layer features contain more detail information. However, low-layer features suffer from the background clutter and semantic ambiguity. During visual recognition, the feature combination of the low-layer and high-level features plays an important role in context modulation. Directly combining the high-layer and low-layer features, the background clutter and semantic ambiguity may be caused due to the introduction of detailed information.In this paper, we propose a general network architecture to concatenate CNN features of different layers in a simple and effective way, called Selective Feature Connection Mechanism (SFCM). Low level features are selectively linked to high-level features with an feature selector which is generated by high-level features. The proposed connection mechanism can effectively overcome the above-mentioned drawbacks. We demonstrate the effectiveness, superiority, and universal applicability of this method on many challenging computer vision tasks, such as image classification, scene text detection, and image-to-image translation.
A contextual care protocol is used by a medical practitioner for patient healthcare, given the context or situation that the specified patient is in. This paper proposes a method to build an automated self-adapting protocol which can help make relevant, early decisions for effective healthcare delivery. The hybrid model leverages neural networks and decision trees. The neural network estimates the chances of each disease and each tree in the decision trees represents care protocol for a disease. These trees are subject to change in case of aberrations found by the diagnosticians. These corrections or prediction errors are clustered into similar groups for scalability and review by the experts. The corrections as suggested by the experts are incorporated into the model.
The Age of Information (AoI) metric has recently received much attention as a measure of freshness of information in a network. In this paper, we study the average AoI in a generic setting with correlated sensor information, which can be related to multiple Internet of Things (IoT) scenarios. The system consists of physical sources with discrete-time updates that are observed by a set of sensors. Each source update may be observed by multiple sensors, and hence the sensor observations are correlated. We devise a model that is simple, but still capable to capture the main tradeoffs. We propose two sensor scheduling policies that minimize the AoI of the sources; one that requires the system parameters to be known a priori, and one based on contextual bandits in which the parameters are unknown and need to be learned. We show that both policies are able to exploit the sensor correlation to reduce the AoI, which result in a large reduction in AoI compared to the use of schedules that are random or based on round-robin.
Deep learning adoption in the financial services industry has been limited due to a lack of model interpretability. However, several techniques have been proposed to explain predictions made by a neural network. We provide an initial investigation into these techniques for the assessment of credit risk with neural networks.
Language models, being at the heart of many NLP problems, are always of great interest to researchers. Neural language models come with the advantage of distributed representations and long range contexts. With its particular dynamics that allow the cycling of information within the network, `Recurrent neural network’ (RNN) becomes an ideal paradigm for neural language modeling. Long Short-Term Memory (LSTM) architecture solves the inadequacies of the standard RNN in modeling long-range contexts. In spite of a plethora of RNN variants, possibility to add multiple memory cells in LSTM nodes was seldom explored. Here we propose a multi-cell node architecture for LSTMs and study its applicability for neural language modeling. The proposed multi-cell LSTM language models outperform the state-of-the-art results on well-known Penn Treebank (PTB) setup.
In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.

# If you did not already know

Halide
Halide is a computer programming language designed for writing digital image processing code that takes advantage of memory locality, vectorized computation and multi-core CPUs and GPUs. Halide is implemented as an internal domain-specific language (DSL) in C++. The main innovation Halide brings is the separation of the algorithm being implemented from its execution schedule, i.e. code specifying the loop nesting, parallelization, loop unrolling and vector instruction. These two are usually interleaved together and experimenting with changing the schedule requires the programmer to rewrite large portions of the algorithm with every change. With Halide, changing the schedule does not require any changes to the algorithm and this allows the programmer to experiment with scheduling and finding the most efficient one.
DNN Dataflow Choice Is Overrated

Learning Active Learning (LAL)
In this paper, we suggest a novel data-driven approach to active learning: Learning Active Learning (LAL). The key idea behind LAL is to train a regressor that predicts the expected error reduction for a potential sample in a particular learning state. By treating the query selection procedure as a regression problem we are not restricted to dealing with existing AL heuristics; instead, we learn strategies based on experience from previous active learning experiments. We show that LAL can be learnt from a simple artificial 2D dataset and yields strategies that work well on real data from a wide range of domains. Moreover, if some domain-specific samples are available to bootstrap active learning, the LAL strategy can be tailored for a particular problem. …

Dimensional Clustering
This paper introduces a new clustering technique, called {\em dimensional clustering}, which clusters each data point by its latent {\em pointwise dimension}, which is a measure of the dimensionality of the data set local to that point. Pointwise dimension is invariant under a broad class of transformations. As a result, dimensional clustering can be usefully applied to a wide range of datasets. Concretely, we present a statistical model which estimates the pointwise dimension of a dataset around the points in that dataset using the distance of each point from its $n^{\text{th}}$ nearest neighbor. We demonstrate the applicability of our technique to the analysis of dynamical systems, images, and complex human movements. …

# Document worth reading: “Multi-Agent Reinforcement Learning: A Report on Challenges and Approaches”

Reinforcement Learning (RL) is a learning paradigm concerned with learning to control a system so as to maximize an objective over the long term. This approach to learning has received immense interest in recent times and success manifests itself in the form of human-level performance on games like \textit{Go}. While RL is emerging as a practical component in real-life systems, most successes have been in Single Agent domains. This report will instead specifically focus on challenges that are unique to Multi-Agent Systems interacting in mixed cooperative and competitive environments. The report concludes with advances in the paradigm of training Multi-Agent Systems called \textit{Decentralized Actor, Centralized Critic}, based on an extension of MDPs called \textit{Decentralized Partially Observable MDP}s, which has seen a renewed interest lately. Multi-Agent Reinforcement Learning: A Report on Challenges and Approaches

# Book Memo: “Reinforcement Learning for Optimal Feedback Control”

 A Lyapunov-Based Approach Reinforcement Learning for Optimal Feedback Control develops model-based and data-driven reinforcement learning methods for solving optimal control problems in nonlinear deterministic dynamical systems. In order to achieve learning under uncertainty, data-driven methods for identifying system models in real-time are also developed. The book illustrates the advantages gained from the use of a model and the use of previous experience in the form of recorded data through simulations and experiments. The book’s focus on deterministic systems allows for an in-depth Lyapunov-based analysis of the performance of the methods described during the learning phase and during execution. To yield an approximate optimal controller, the authors focus on theories and methods that fall under the umbrella of actor-critic methods for machine learning. They concentrate on establishing stability during the learning phase and the execution phase, and adaptive model-based and data-driven reinforcement learning, to assist readers in the learning process, which typically relies on instantaneous input-output measurements.

# Distilled News

Today’s trend of Artificial Intelligence (AI) and the increased level of Automation in manufacturing allow firms to flexibly connect assets and improve productivity through data-driven insights that has not been possible before. As more automation is used in manufacturing, the speed of responses required in dealing with maintenance issues is going to get faster and automated decisions as to what’s the best option from an economic standpoint are getting more complex.
I published implementation of Faster R-CNN with MXNet C++ Frontend. You can use this implementation as comprehensive example of using MXNet C++ Frontend, it has custom data loader for MS Coco dataset, implements custom target proposal layer as a part of the project without modification MXNet library, contains code for errors checking (Missed in current C++ API), have Eigen and NDArray integration samples. Feel free to leave comments and proposes.
I´m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R´s documentation.
I love the formattable package, but I always struggle to remember its syntax. A quick Google search reveals that I’m not alone in this struggle. This post is intended as a reminder for myself of how the package works – and hopefully you’ll find it useful too!
RDQA is a R package for Qualitative Data Analysis, a free (free as freedom) qualitative analysis software application (BSD license). It works on Windows, Linux/FreeBSD and Mac OSX platforms. RQDA is an easy to use tool to assist in the analysis of textual data. At the moment it only supports plain text formatted data. All the information is stored in a SQLite database via the R package of RSQLite. The GUI is based on RGtk2, via the aid of gWidgetsRGtk2. It includes a number of standard Computer-Aided Qualitative Data Analysis features. In addition it seamlessly integrates with R, which means that a) statistical analysis on the coding is possible, and b) functions for data manipulation and analysis can be easily extended by writing R functions. To some extent, RQDA and R make an integrated platform for both quantitative and qualitative data analysis.
Delivery time prediction has long been a part of city logistics, but refining accuracy has recently become very important for services such as Deliveroo, Foodpanda and Uber Eats which deliver food on-demand. These services and similar ones must receive an order and have it delivered within ~30 minutes to appease their users. In these situations +/- 5 minutes can make a big difference so it’s very important for customer satisfaction that the initial prediction is highly accurate and that any delays are communicated effectively.
Assume that we have a some data that looks like above, and red arrow shows the first principle component, while the blue arrow shows the second principle component. How would normalization as well as standardization change those two? Also can we perform some kind of normalization respect to the variance we have? (I will make more post as I go along. )
Computers are fast: they store and manipulate data using electronic signals that travel across their silicon internals at hundreds of thousands of miles per hour. For comparison, the fastest signals in the human nervous system travel at about 250mph, which is about 3 million times slower, and those speeds are only possible for unconscious signals – signal speeds observed for conscious thought and calculation are typically orders of magnitude slower still. Basically, we’re never going to be able to out calculate a computer.
How would you like to speed up your language modeling (LM) tasks by 1000%, with nearly no drop in accuracy? A recent paper from Grave et al. (2017), called ‘Efficient softmax approximation for GPUs’, shows how you can gain a massive speedup in one of the most time-consuming aspects of language-modeling, the computation-heavy softmax step, through their ‘adaptive softmax’. The giant speedup from using the adaptive softmax comes with only minimal costs in accuracy, so anyone who is doing language modeling should definitely consider using it. Here in Part 1 of this blog post, I’ll fully explain the adaptive softmax, then in Part 2 I’ll walk you step by step through a Pytorch implementation (with an accompanying Jupyter notebook), which uses Pytorch’s built-in AdaptiveLogSoftmaxWithLoss function.
In Part 1 of this blog post, I explained how the adaptive softmax works, and how it can speed up your language model by up to 1000%. Here in Part 2, I’ll walk you step by step through a Pytorch implementation (with an accompanying Jupyter notebook), which uses Pytorch’s built-in AdaptiveLogSoftmaxWithLoss function. For preprocessing you will need fastai (see https://docs.fast.ai ), a deep learning library that runs on top of Pytorch that simplifies training neural networks. [For those who want to learn state-of-the-art deep learning techniques, I highly recommend Jeremy Howard’s fast.ai course, which is available online for free: https://…/]. I decided to use the Wikitext-2 dataset, which is a relatively small dataset that contains ~2 million tokens and ~33 thousand words in the vocabulary. Once the data has been downloaded and properly formatted in csv files, fastai makes it easy to quickly tokenize, numericalize, and create a data-loader that will be used for training. I also downloaded GloVe pre-trained word vectors that are used for the models’ word embeddings. Finally, I created a modeler class that handles training.

# Magister Dixit

“1. Think carefully about which projects you take on.
2. Use as much data as you can from as many places as possible.
3. Don’t just use internal customer data.
4. Have a clear sampling strategy.
5. Always use a holdout sample.
6. Spend time on ‘throwaway’ modelling.
8. Make sure your insights are meaningful to other people.
9. Use your model in the real world.”
Rachel Clinton ( January 7, 2015 )