reticulate: R interface to Python

We are pleased to announce the reticulate package, a comprehensive set of tools for interoperability between Python and R. The package includes facilities for:
• Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.
• Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).
• Flexible binding to different versions of Python including virtual environments and Conda environments.
Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability. If you are an R developer who uses Python for some of your work, or a member of a data science team that uses both languages, reticulate can dramatically streamline your workflow!

Introduction to k-Nearest Neighbors: Simplified (with implementation in Python)

In my four years in analytics, more than 80% of the models I have built have been classification models and just 15-20% regression models. These ratios can be more or less generalized throughout the industry. The reason for this bias towards classification models is that most analytical problems involve making a decision: for instance, will a customer attrite or not, should we target customer X for digital campaigns, does a customer have high potential or not, and so on. These analyses are more insightful and link directly to an implementation roadmap. In this article, we will talk about another widely used classification technique called K-nearest neighbors (KNN). Our focus will be primarily on how the algorithm works and how the input parameter affects the output/prediction.
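As a rough illustration of the idea (not the article's own implementation), here is a minimal KNN classifier in plain Python. The function name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for this sketch. It shows how the input parameter k can change the prediction for the same query point.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: three "A" points, two distant "B" points,
# and one stray "B" point sitting close to the "A" cluster.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (3.0, 3.0), (3.2, 2.9), (1.5, 1.5)]
labels = ["A", "A", "A", "B", "B", "B"]

query = (1.4, 1.4)
pred_k1 = knn_predict(points, labels, query, k=1)  # follows the single stray neighbour
pred_k3 = knn_predict(points, labels, query, k=3)  # the surrounding "A" points outvote it
print(pred_k1, pred_k3)
```

With k=1 the query takes the label of the single closest point, so one stray observation decides the outcome; with k=3 the vote is smoothed by the neighbouring cluster. Larger values of k generally trade sensitivity to noise for coarser decision boundaries.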

GDPR and the Paradox of Interpretability

GDPR carries many new data and privacy requirements, including a “right to explanation”. On the surface this appears to be similar to US rules for regulated industries. We examine why this is actually a penalty and not a benefit for the individual, and offer some insight into the actual wording of the GDPR, which also offers some relief.

How VW Predicts Churn with GPU-Accelerated Machine Learning and Visual Analytics

MapD is a founding and active member of GOAi (the GPU Open Analytics Initiative). One of the primary goals of GOAi is to enable end-to-end analytics on GPUs. The reason for this is that while each technology in the process leverages GPUs beautifully on its own, moving data off the GPU to reach the next system in the process can add significant latency. So, keeping the data in a GPU buffer through exploration, extraction, preprocessing, model training, validation, and prediction makes the process much faster and simpler. MapD and Anaconda, another GOAi founding member, are involved in the development of Pythonic clients such as pymapd (an interface to MapD’s SQL engine supporting DBAPI 2.0) and pygdf (a Python interface to access and manipulate the GPU Dataframe), along with our core platform modules: the MapD Core SQL engine and MapD Immerse, our visual analytics tool.

Data scientists and engineers: Advice I would give my younger self

Four Insight alumnae share stories about how they got their start and what data science and data engineering mean to them.

Comparing Deep Learning Frameworks: A Rosetta Stone Approach

We believe deep-learning frameworks are like languages: sure, many people speak English, but each language serves its own purpose. We have created common code for several different network structures and executed it across many different frameworks. Our idea was to create a Rosetta Stone of deep-learning frameworks: assuming you know one framework well, it should help you leverage any other. Situations may arise where a paper publishes code in another framework or the whole pipeline is in another language. Instead of writing a model from scratch in your favourite framework, it may be easier to just use the “foreign” language. We want to extend our gratitude to the CNTK, PyTorch, Chainer, Caffe2 and Knet teams, and everyone else from the open-source community who contributed to the repo over the past few months.

MMdnn – Model Management Deep Neural Network

MMdnn is a set of tools to help users interoperate among different deep learning frameworks, for example through model conversion and visualization. It converts models between Caffe, Keras, MXNet, TensorFlow, CNTK, PyTorch and CoreML.
It is a comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The ‘MM’ in MMdnn stands for model management and ‘dnn’ is an acronym for deep neural network.
Basically, it converts DNN models trained in one framework into others. The major features include:
• Model File Converter: converting DNN models between frameworks
• Model Code Snippet Generator: generating training or inference code snippets for frameworks
• Model Visualization: visualizing DNN architecture and parameters for frameworks
• Model compatibility testing (ongoing)
This project is designed and developed by Microsoft Research (MSR). We also encourage researchers and students to leverage this project to analyze DNN models, and we welcome any new ideas to extend this project.

Steps to Perform Survival Analysis in R

With so many tools and techniques for predictive modelling, why do we need another field known as survival analysis? One of the most popular branches of statistics, survival analysis is a way of making predictions at various points in time. That is to say, while other prediction models predict whether an event will occur, survival analysis predicts whether the event will occur at a specified time. Thus, it requires a time component and, correspondingly, predicts the time when an event will happen. This helps one understand the expected duration until events occur and provides much more useful information. Natural areas of application include the biological sciences, where one can predict the time for bacteria or other cellular organisms to multiply to a particular size, or the expected time of decay of atoms. Other interesting applications include predicting the expected time when a machine will break down and maintenance will be required.
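To make the time component concrete, here is a minimal sketch of the Kaplan-Meier estimator, one standard nonparametric way to estimate the probability of surviving past each observed event time. The article itself works in R and may use different methods; this plain-Python version, the function name `kaplan_meier`, and the machine-breakdown data are all assumptions made up for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event, or until the subject was last seen
    observed:  True if the event happened, False if the subject was censored
    """
    # Distinct times at which at least one event was actually observed
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events occurring exactly at time t
        events = sum(1 for d, seen in zip(durations, observed) if seen and d == t)
        # Multiply by the conditional probability of surviving past t
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up data: months until a machine broke down.
# False means the machine was still running when observation ended (censored).
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(f"month {time}: estimated survival {prob:.2f}")
```

Censored machines never count as failures, but they do count in the at-risk group up to the time they were last seen. That handling of incomplete observations is what distinguishes survival analysis from an ordinary classifier that only predicts whether an event occurs.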

Causal Inference & Big Data Summer Institute: June 25-28, 2018

The Causal Inference and Big Data Summer Institute is a four-day, intensive learning experience. Each day offers didactic lectures by experts in the field and discussion of real examples. The causal inference days include some live demonstrations of data analysis. Participants who bring laptops will have the opportunity to implement the methods during the computer lab sessions. Prerequisites include familiarity with traditional data analysis methods (such as regression models) and the programming language R. The institute is aimed at practitioners in industry, researchers, and students who are interested in learning about these statistical methods and how to implement them in practice. While the first two days focus on causal inference and the second two days focus on big data, there is cross-over. For example, day 2 of causal inference includes machine learning methods in causal inference.

Machine Learning with Text in PySpark – Part 1

We usually work with structured data in our machine learning applications. However, unstructured text data can also have vital content for machine learning models. In this blog post, we will see how to use PySpark to build machine learning models with unstructured text data.

Exploring DeepFakes

In December 2017, a user named “DeepFakes” posted realistic-looking explicit videos of famous celebrities on Reddit. He generated these fake videos using deep learning, the latest in AI, to insert celebrities’ faces into adult movies. In the following weeks, the internet exploded with articles about the dangers of face-swapping technology: harassing innocents, propagating fake news, and hurting the credibility of video evidence forever. It’s true that bad actors will use this technology for harm; but given that the genie is out of the bottle, shouldn’t we pause to consider what else DeepFakes could be used for? In this post, I explore the capabilities of this tech, describe how it works, and discuss potential applications.

The accuracy, fairness, and limits of predicting recidivism

Algorithms for predicting recidivism are commonly used to assess a criminal defendant’s likelihood of committing a crime. These predictions are used in pretrial, parole, and sentencing decisions. Proponents of these systems argue that big data and advanced machine learning make these analyses more accurate and less biased than humans. We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. We further show that a simple linear predictor provided with only two features is nearly equivalent to COMPAS with its 137 features.

Principles of Guided Analytics

Systems that automate the data science cycle have been gaining a lot of attention recently. As with smart home assistant systems, however, automating data science for business users only works for well-defined tasks. We do not expect home assistants to have truly deep conversations about changing topics. In fact, the most successful systems heavily restrict the types of possible interactions and cannot deal with vaguely defined topics. Real data science problems are similarly vaguely defined: only an interactive exchange between the business analysts and the data analysts can guide the analysis in a new, useful direction, potentially sparking interesting new insights and further sharpening the analysis. Therefore, as soon as we leave the realm of completely automatable data science sandboxes, the challenge lies in allowing data scientists to build interactive systems that assist the business analyst in her quest to find new insights in data and predict future outcomes. At KNIME we call this “Guided Analytics”. We explicitly do not aim to replace the driver (or totally automate the process) but instead offer assistance and carefully gather feedback whenever needed throughout the analysis process. To make this successful, the data scientist needs to be able to easily create powerful analytical applications that allow interaction with the business user whenever their expertise and feedback are needed.