Microsoft Visual Studio Code Supports R Syntax

The core concepts will make you more productive when writing and navigating your code. It covers the basics of file and folder management, searching across your codebase, using the integrated Git tooling, launching a debugging session, leveraging the Command Palette to access context-specific actions, and customizing the environment.


Transform your media into high-value assets. Visual recognition technology that helps you get more intelligence out of your data.

Time-to-Insight Versus Time-to-Action

Time-to-Action can indeed be real time. Time-to-Insight, the original discovery and testing of the pattern, is faster now but is unlikely ever to be real time.

The Data Science Industry: Who Does What (Infographic)

DataCamp took a look at this avalanche of data science job postings in an attempt to unravel these cool-sounding and playful job titles into a comparison of different data science-related careers. We summarized the results in our latest infographic “The Data Science Industry: Who Does What”.

R Packages: A Healthcare Application

Building off my last post, I want to use the same healthcare data to demonstrate the use of R packages. Packages in R are stored in libraries and are often pre-installed, but reaching the next level of skill requires knowing when to use new packages and what they contain. With that, let’s get to our example.

Distributed Machine Learning Toolkit

Distributed machine learning has become more important than ever in this big data era. In recent years especially, practice has shown that bigger models tend to achieve better accuracy in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models, because the task usually requires a large amount of computational resources. To enable the efficient training of big models on just a modest cluster, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient and flexible.

How to Perform T-test in R

• One-Sample T-Tests
• Paired-Samples T-Tests
• Independent-Samples T-Tests
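
In R, the built-in t.test function handles all three cases. As a rough illustration of what each test computes, here is a minimal Python sketch that uses only the standard library. The function names (one_sample_t, paired_t, welch_t) are made up for this example, and the independent-samples version assumes unequal variances (the Welch form). It returns only the t statistic and the degrees of freedom, not a p-value.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error, n - 1


def paired_t(before, after):
    """Paired test: a one-sample test on the pairwise differences."""
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, 0.0)


def welch_t(x, y):
    """Independent samples, unequal variances (Welch) assumed."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x) / nx, statistics.variance(y) / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df
```

The paired test reduces to a one-sample test on the differences, which is why paired_t simply delegates to one_sample_t.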

Deep Learning in a Single File for Smart Devices

Deep learning (DL) systems are complex and often have quite a few dependencies. It is often painful to port a DL library to different platforms, especially smart devices. There is one fun way to solve this problem: providing a light interface and putting all the required code into a single file with minimal dependencies. In this tutorial we will give details on how to do the amalgamation. In addition, we will show a demo that runs image object recognition on mobile devices.

Ufora – Scalable Python for Data Science

Ufora is a compiled, automatically parallel subset of Python for data science and numerical computing. Any code you run with Ufora will work unmodified in Python. But with Ufora, it can run hundreds or thousands of times faster, and can operate on datasets many times larger than the RAM of a single machine.

You can apply conditional formatting, the visual styling of a DataFrame depending on the data within, by using the DataFrame.style property. This is a property that returns a Styler object, which has useful methods for formatting and displaying DataFrames. The styling is accomplished using CSS. You write functions that take DataFrames or Series, and return like-indexed DataFrames or Series with CSS ‘attribute: value’ pairs for the values. You can build up your styles incrementally using method chains, before rendering.
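
A minimal sketch of this pattern might look as follows. It assumes pandas is installed (the Styler also needs jinja2); the helper names highlight_max and build_styler and the sample data are made up for illustration.

```python
import pandas as pd


def highlight_max(column: pd.Series) -> pd.Series:
    """Return a like-indexed Series of CSS 'attribute: value' strings."""
    is_max = column == column.max()
    css = ["background-color: yellow" if flag else "" for flag in is_max]
    return pd.Series(css, index=column.index)


def build_styler(frame: pd.DataFrame):
    """Chain Styler methods: highlight column maxima, then set number format."""
    return frame.style.apply(highlight_max, axis=0).format("{:.1f}")


# Hypothetical sample data, not taken from the original post.
scores = pd.DataFrame({"a": [1.0, 3.5, 2.0], "b": [4.0, 0.5, 9.25]})
```

Calling build_styler(scores) in a notebook displays the styled table. Nothing is rendered until the Styler is displayed or exported, which is what allows styles to be built up step by step.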

How to discover stolen data using Hadoop and Big data?

We discuss recent data breaches and present an approach that uses Hadoop and data fingerprint matching techniques to discover stolen data.

Understanding Convolutional Neural Networks for NLP

When we hear about Convolutional Neural Networks (CNNs), we typically think of Computer Vision. CNNs were responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today, from Facebook’s automated photo tagging to self-driving cars. More recently we’ve also started to apply CNNs to problems in Natural Language Processing and gotten some interesting results. In this post I’ll try to summarize what CNNs are, and how they’re used in NLP. The intuitions behind CNNs are somewhat easier to understand for the Computer Vision use case, so I’ll start there, and then slowly move towards NLP.

Bootstrapping standard errors for difference-in-differences estimation with R

I’m currently working on a paper (with my colleague Vincent Vergnat, who is also a PhD candidate at BETA) in which I want to estimate the causal impact of the birth of a child on hourly and daily wages as well as yearly worked hours. For this we are using non-parametric difference-in-differences (henceforth DiD) and thus have to bootstrap the standard errors. In this post, I show how this is possible using the function boot. For this we are going to replicate the example from Wooldridge’s Econometric Analysis of Cross Section and Panel Data, more specifically the example on page 415. You can download the data for R here. The question we are going to try to answer is: by how much does the price of housing decrease due to the presence of an incinerator in the neighborhood?
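
The post itself works in R with boot. Purely to illustrate the idea, here is a small Python sketch of a two-group, two-period DiD estimate with a non-parametric bootstrap of its standard error. Everything in it is an assumption made for the example: the data are synthetic (not the Wooldridge housing data), the function names are invented, and the estimator is a plain difference of cell means rather than the author's specification.

```python
import random
import statistics


def did_estimate(rows):
    """rows: iterable of (treated, post, outcome) tuples with 0/1 flags."""
    def cell_mean(treated, post):
        return statistics.mean(
            y for t, p, y in rows if t == treated and p == post
        )
    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))


def bootstrap_se(rows, n_boot=500, seed=0):
    """Resample observations with replacement and return the standard
    deviation of the re-estimated DiD effects."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(rows, k=len(rows))
        try:
            estimates.append(did_estimate(resample))
        except statistics.StatisticsError:
            continue  # a group/period cell came up empty; skip this draw
    return statistics.stdev(estimates)


def make_synthetic(n_per_cell=50, effect=-10.0, seed=1):
    """Synthetic outcomes with a known treatment effect (illustration only)."""
    rng = random.Random(seed)
    rows = []
    for treated in (0, 1):
        for post in (0, 1):
            base = 100 + 5 * treated + 8 * post + effect * treated * post
            rows.extend(
                (treated, post, base + rng.gauss(0, 4)) for _ in range(n_per_cell)
            )
    return rows
```

With rows = make_synthetic(), did_estimate(rows) should land near the built-in effect of -10, and bootstrap_se(rows) gives a standard error for that estimate.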