A Step Towards Reproducible Data Science: Docker for Data Science Workflows

This article aims to provide the perfect starting point to nudge you toward using Docker for your data science workflows! I will cover two useful aspects of Docker: running tools without installing them on your system, and creating your own data science environment.
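A minimal Dockerfile for a data science environment along the lines the article describes might look like this (the base image tag and package list are illustrative assumptions, not the article's exact setup):

```dockerfile
# Hypothetical minimal data science image
FROM python:3.10-slim

# Install common data science packages into the image, not the host
RUN pip install --no-cache-dir numpy pandas scikit-learn jupyter

WORKDIR /work
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```

Building with `docker build -t ds-env .` and running with `docker run -p 8888:8888 -v "$PWD":/work ds-env` gives a reproducible environment that leaves the host system untouched.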

Programming Languages for Data Science and ML – With Source Code Illustrations

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, Python, R, TensorFlow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more.

The Gaussian Correlation Inequality in One Picture

Yet another of these One Picture tutorials, in much the same style as our Type I versus Type II Errors in One Picture.

Concatenate TensorFlow Tensors Along A Given Dimension

Concatenate TensorFlow tensors along a given dimension using TensorFlow's concat operation, then check the shape of the concatenated tensor using TensorFlow's shape functionality.
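A minimal sketch of the idea, assuming TensorFlow 2's eager execution:

```python
import tensorflow as tf

# Two rank-2 tensors with the same shape
a = tf.constant([[1, 2], [3, 4]])
b = tf.constant([[5, 6], [7, 8]])

# Concatenate along dimension 0 (rows) and dimension 1 (columns)
rows = tf.concat([a, b], axis=0)
cols = tf.concat([a, b], axis=1)

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
```

Note that all dimensions except the one named by `axis` must match, and the concatenated dimension is the sum of the inputs' sizes along that axis.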

How Can We Trust Machine Learning and AI?

If you want to build trust in machine learning, try treating it like a human and asking it the same types of questions.

IBM Introduces New Software to Ease Adoption of AI, Machine Learning and Deep Learning

IBM announced new software to deliver faster time to insight for high performance data analytics (HPDA) workloads, such as Spark, TensorFlow and Caffe, for AI, Machine Learning and Deep Learning. Based on the same software, which will be deployed for the Department of Energy’s CORAL Supercomputer Project at both Oak Ridge and Lawrence Livermore, IBM will enable new solutions for any enterprise running HPDA workloads. New to this launch is Deep Learning Impact (DLI), a set of software tools to help users develop AI models with leading open source deep learning frameworks, like TensorFlow and Caffe. The DLI tools are complementary to the PowerAI deep learning enterprise software distribution already available from IBM. Also new are web access and simplified user interfaces for IBM Spectrum LSF Suites, combining a powerful workload management platform with the flexibility of remote access. Finally, the latest version of IBM Spectrum Scale software adds unified file, object and HDFS support, so workloads can move from where data is stored to where it is analyzed.

How (& Why) Data Scientists and Data Engineers Should Share a Platform

Sharing one platform has some obvious benefits for Data Science and Data Engineering teams, but technical, language and process differences often make this difficult. Learn how one company implemented a single cloud platform for R, Python and other workloads – and some of the unexpected benefits they discovered along the way.

Generative Adversarial Networks – Part II

In Part I of this series, the original GAN paper was presented. Although it was clever and gave state-of-the-art results at the time, much has been improved upon since. In this post I’ll talk about the contributions of the Deep Convolutional GAN (DCGAN) paper.

Top 10 Videos on Deep Learning in Python

1. Overview: Deep Learning Frameworks compared (96K views) – 5 minutes
2. Playlist: TensorFlow tutorial by Sentdex (114K views) – 4.5 hours
3. Individual tutorial: TensorFlow tutorial 02: Convolutional Neural Network (69.7K views) – 36 minutes
4. Overview: How to predict stock prices easily (210K views) – 9 minutes
5. Tutorial: Introduction to Deep Learning with Python and the Theano library (201K views) – 52 minutes
6. Playlist: PyTorch Zero to All (3K views) – 2 hours 15 minutes
7. Individual tutorial: TensorFlow tutorial (43.9K views) – 49 minutes
8. Playlist: Deep Learning with Python (1.8K views) – 83 minutes
9. Playlist: Deep Learning with Keras – Python (30.3K views) – 85 minutes
10. Free online course: Deep Learning by Andrew Ng (full course) (28K views) – 4-week course

Stop Doing Fragile Research

If you develop methods for data analysis, you might only be conducting gentle tests of your method on idealized data. This leads to “fragile research,” which breaks when released into the wild. Here, I share 3 ways to make your methods robust.

Why is R Slow? Some Explanations and an MKL/OpenBLAS Setup to Try to Fix This

Many users tell me that R is slow. With old R releases that was largely true, since those versions used R's own numerical libraries instead of optimized ones.
But numerical libraries do not tell the complete story. In many cases slow code execution can be attributed to inefficient code, and more precisely to not following one or more of these good practices:
• Using the byte-code compiler
• Vectorizing operations
• Using simple data structures (e.g., matrices instead of data frames for large numerical computations)
• Re-using results
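The vectorization point applies beyond R; the same contrast can be sketched in Python with NumPy, where a vectorized call dispatches to optimized numerical libraries just as vectorized R code does:

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Element-wise loop: one interpreted operation per element
def loop_square_sum(x):
    total = 0.0
    for v in x:
        total += v * v
    return total

# Vectorized: a single call into compiled, optimized numerical code
# (the same idea as vectorizing R code to lean on BLAS)
def vectorized_square_sum(x):
    return float(np.dot(x, x))

assert loop_square_sum(x) == vectorized_square_sum(x)  # same result, far faster
```

Timing the two versions (for example with `timeit`) typically shows the vectorized form running orders of magnitude faster, even though both compute the same value.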

How to combine point and boxplots in timeline charts with ggplot2 facets

In a recent project, I was looking to plot data from different variables along the same time axis. The difficulty was that I wanted some of these variables as point plots and others as box plots. Because I work with the tidyverse, I wanted to produce these plots with ggplot2. Faceting was the obvious first step, but it took me quite a while to figure out how best to combine facets with point plots (where I have one value per time point) and box plots (where I have multiple values per time point).

Teaching to Machines: What Does Learning in Machine Learning Entail?

Machine Learning (ML) is now a de facto skill for every quantitative job, and almost every industry has embraced it, even though the fundamentals of the field are not new at all. However, what does it mean to teach a machine? Unfortunately, even for moderately technical people from different backgrounds, the answer to this question is not apparent at first. This sounds like a conceptual and jargon issue, but it lies at the heart of the success of supervised learning algorithms.

R pool Package

The pool package makes it easier for Shiny developers to connect to databases. Up until now, there wasn’t a clearly good way to do this. As a Shiny app author, if you connect to a database globally (outside of the server function), your connection won’t be robust because all sessions would share that connection (which could leave most users hanging when one of them is using it, or even all of them if the connection breaks). But if you try to connect each time that you need to make a query (e.g. for every reactive you have), your app becomes a lot slower, as it can take in the order of seconds to establish a new connection. The pool package solves this problem by taking care of when to connect and disconnect, allowing you to write performant code that automatically reconnects to the database only when needed. So, if you are a Shiny app author who needs to connect and interact with databases inside your apps, keep reading because this package was created to make your life easier.
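The trade-off the package resolves – one shared connection versus reconnecting per query – is the classic connection-pooling pattern. A tiny, language-agnostic sketch of the idea (this is a hypothetical illustration in Python, not the pool package's API):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal pool: hand out open connections and take them back,
    instead of reconnecting for every query."""

    def __init__(self, factory, size=1):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(factory())  # pay the connection cost once, up front

    def acquire(self):
        return self._q.get()        # blocks if all connections are in use

    def release(self, conn):
        self._q.put(conn)           # return the connection for reuse

pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=1)
c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()
assert c1 is c2  # the same connection was reused, not recreated
```

Each session borrows a live connection for the duration of a query and returns it afterwards, so no session monopolizes a global connection and no query pays the multi-second cost of connecting from scratch.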

Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest

Customer churn, also known as customer attrition, occurs when customers or subscribers stop doing business with a company or service. It is also referred to as loss of clients or customers. One industry in which churn rates are particularly useful is telecommunications, because most customers have multiple options to choose from within a geographic location. Similar to predicting employee turnover, we are going to predict customer churn using a telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. This time, we will do all of the above in R. Let’s get started!
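The article works in R, but the logistic-regression step translates directly to other stacks. A sketch of the same idea in Python with scikit-learn, on synthetic data whose feature names (monthly charges, tenure) are merely illustrative of a telecom dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic customers: churn is more likely with high charges, short tenure
monthly_charges = rng.uniform(20, 120, n)
tenure = rng.uniform(0, 72, n)
logit = 0.05 * monthly_charges - 0.08 * tenure - 1.0
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([monthly_charges, tenure])
X_train, X_test, y_train, y_test = train_test_split(X, churn, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping `LogisticRegression` for `DecisionTreeClassifier` or `RandomForestClassifier` covers the article's other two models with the same `fit`/`score` interface.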

New Poll: Which Data Science / Machine Learning Methods and Tools Did You Use?

Please vote in the new KDnuggets poll, which examines the methods and tools used for real-world applications and projects.

Automated Feature Engineering for Time Series Data

We introduce a general framework for developing time series models, generating features and preprocessing the data, and exploring the potential to automate this process in order to apply advanced machine learning algorithms to almost any time series problem.
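A common first step in such a framework is generating lag and rolling-window features automatically. A minimal sketch with pandas (the toy series and window sizes are illustrative, not from the article):

```python
import pandas as pd

# Toy daily series; a real framework would take any time-indexed data
ts = pd.DataFrame(
    {"y": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]},
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Generate lag features programmatically rather than by hand
for lag in (1, 2):
    ts[f"lag_{lag}"] = ts["y"].shift(lag)

# Rolling mean of the previous 3 values (shift(1) avoids leaking the target)
ts["rolling_mean_3"] = ts["y"].shift(1).rolling(3).mean()

# Keep only rows where every feature is defined
features = ts.dropna()
```

The resulting frame can be fed to any standard supervised learner, which is what lets generic machine learning algorithms attack almost any time series problem.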