Cliff Notes for Managing the Data Science Function

An increasing number of larger companies have truly embraced advanced analytics and deploy fairly large numbers of data scientists. Many of these same companies are the ones beginning to ask about using AI. Here are some observations and tips on the problems and opportunities associated with managing a larger data science function.

Google Cloud for Data Science: Beginner’s Guide

In this tutorial, you will learn how to:
• Create an instance on Google Compute Engine (GCE),
• Install a data science environment based on the Anaconda Python distribution, and
• Run Jupyter notebooks accessible online.
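The three steps above can be sketched from the command line; the instance name, zone, machine type, and Anaconda installer version below are placeholder assumptions, not values from the tutorial.

```shell
# 1. Create a GCE instance (name, zone, and machine type are hypothetical)
gcloud compute instances create my-ds-instance \
    --zone=us-central1-a --machine-type=n1-standard-4

# 2. SSH in and install Anaconda (installer version/URL may differ;
#    check anaconda.com for the current one)
gcloud compute ssh my-ds-instance
wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh
bash Anaconda3-2018.12-Linux-x86_64.sh

# 3. Run Jupyter so it is reachable from outside the VM
#    (you also need a firewall rule opening the chosen port)
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
```

Note that these commands require the `gcloud` SDK and an authenticated Google Cloud account; the tutorial walks through the same steps in the Cloud Console UI as well.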

5 Fantastic Practical Machine Learning Resources

1. Machine Learning Tutorial for Beginners
2. Python Machine Learning (2nd Ed.) Code Repository
3. Machine Learning From Scratch
4. Deep Learning – The Straight Dope
5. Practical Deep Learning For Coders, Part 1 (2018 edition)

The AI Show: Data Science Virtual Machine

The Data Science Virtual Machine was featured on a recent episode of the AI Show with Seth Juarez and Gopi Kumar. If you want a quick and easy way to spin up a virtual machine with all of the data science tools you’ll ever need — including R and RStudio — already installed and ready to go, this video explains what the Data Science Virtual Machine is used for and (at 21:00) how to launch one in the Azure portal.

How to Import a CSV to an R Notebook

Adding a file to your R notebook is a simple two-step process.

Introducing the Kernelheaping Package

It is not unusual to have interval censored data such as in income surveys due to anonymisation or simplification issues. However, a simple task like plotting a density may fail completely for rounded or interval censored data. When using class centers as input, the density estimate might get bumpy or spiky around the center points. This problem gets even worse for larger sample sizes with decreasing bandwidth. One might try to manually increase the bandwidth until one obtains a sufficiently smooth estimate. However, this will result in oversmoothing as we will see in our example. The Kernelheaping package, available on CRAN, delivers an algorithm to obtain nonparametric density estimates for interval censored data.
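To see why a plain kernel density estimate struggles here, consider a toy sketch (plain Python, not the Kernelheaping package itself, which is an R package): incomes rounded to class centers produce a spiky estimate at a small bandwidth, while a large bandwidth oversmooths.

```python
import math

def gaussian_kde(data, h):
    """Return a Gaussian kernel density estimate with bandwidth h."""
    n = len(data)
    def f(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2)
                   for xi in data) / (n * h * math.sqrt(2 * math.pi))
    return f

# Hypothetical incomes rounded to the nearest 1000 --
# only the class centers survive the rounding
centers = [1000] * 40 + [2000] * 35 + [3000] * 25

kde_small = gaussian_kde(centers, h=100)   # spikes at the centers
kde_large = gaussian_kde(centers, h=800)   # smooth, but oversmoothed

# With a small bandwidth the mass piles up on the class centers,
# so the estimate is far higher at 2000 than between centers at 1500
spike_ratio = kde_small(2000) / kde_small(1500)
```

Kernelheaping instead treats the true values as latent within their intervals and estimates the density iteratively, avoiding this bandwidth dilemma.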

Setting up a version controlled shiny-server

After you’ve installed shiny-server, the server serves all apps in ‘/etc/shiny-server’. I really don’t like to scp into the server to upload files; I’d rather work on my computer, put everything under version control, and push the changes to the server. That way I can easily revert my changes if something fails, and I don’t lose work. I’ve used Dean Attali’s excellent guide to setting up a server on DigitalOcean, but I’ve modified a small part. He uses GitHub to push his changes and then pulls from GitHub to his server. I was in a situation where I did not have access to GitHub and wanted a slightly simpler setup, so I created a bare repository on the server (which accepts my pushes and responds just as GitHub would, minus the pretty website features) and used a git hook. The files in a bare repo are not laid out the same way as the original files on my computer, so the hook has to deal with that.
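The bare-repo-plus-hook setup can be sketched as follows; all paths, names, and the branch name are hypothetical, not taken from the post.

```shell
# On the server: a bare repository that will receive pushes
git init --bare /home/me/apps/myapp.git

# /home/me/apps/myapp.git/hooks/post-receive (make it executable):
#!/bin/bash
# A bare repo holds no working files, so check the pushed commit
# out into the directory that shiny-server actually serves
GIT_WORK_TREE=/srv/shiny-server/myapp git checkout -f master

# On your computer: add the server as a remote and push as usual
git remote add deploy me@myserver:/home/me/apps/myapp.git
git push deploy master
```

The `post-receive` hook fires after every push, so deploying becomes a single `git push`, and reverting is just pushing an earlier commit.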

Data Science Strategies Webinar

If an organization operationalizes machine learning, it’s because it has built a data pipeline to efficiently collect, organize, and analyze data. Data science as a practice can’t offer dependable models without accessible data that a company knows and trusts. Hear Gartner and IBM talk about IBM’s strategy for these capabilities working in concert.
What can you do to succeed with your data science projects and empower your teams with the right capabilities? Whether you’re just getting started or are well underway, insights based on Gartner research are invaluable. And IBM offers a wealth of practical experience and progressive strategy to help you get the most value for your business from data science. View the video to learn about:
• Considerations for building and supporting a data science team
• Important capabilities for tools and platforms
• How to break down silos with effective collaboration
• Keys to success for data science projects
• How to get started

Is “Murder by Machine Learning” the New “Death by PowerPoint”?

Software doesn’t always end up being the productivity panacea that it promises to be. As its victims know all too well, “death by PowerPoint,” the poor use of the presentation software, sucks the life and energy out of far too many meetings. And audit after enterprise audit reveals spreadsheets rife with errors and macro miscalculations. Email and chat facilitate similar dysfunction; inbox overload demonstrably hurts managerial performance and morale. No surprises here — this is sadly a global reality that we’re all too familiar with. So what makes artificial intelligence/machine learning (AI/ML) champions confident that their technologies will be immune to comparably counterproductive outcomes? They shouldn’t be so sure. Digital empowerment all too frequently leads to organizational mismanagement and abuse. The enterprise history of personal productivity tools offers plenty of unhappy litanies of unintended consequences. For too many managers, the technology’s costs often rival its benefits.

Introducing the SAP HANA Data Management Suite

SAP Executive Board Member Bernd Leukert recently explored why companies today need a data management suite. I highly recommend you take a look at Bernd’s piece, because it sets the stage nicely for the latest update we’d like to unveil: the SAP HANA Data Management Suite.

Continuous integration for your private R projects with CircleCI

If you have ever developed or used an open-source R package, you’re likely familiar with continuous integration. By automating the process of testing each proposed change in the source code, you can reduce the risk of errors, avoid unnecessary overhead and increase the quality of the developed solution. For data scientists, Hadley has a good description of why it’s worth using in R. The most popular CI solution in the R world is TravisCI. Overall it works great, has built-in community support for R and is free for any open source project. CircleCI offers a great alternative with a free plan that includes private repositories. This is a perfect solution if you’re building a package that cannot be released publicly and you don’t have a paid Travis account. This post will quickly take you through setting up continuous integration for your private R project with CircleCI.
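A minimal `.circleci/config.yml` for an R package might look like the sketch below; the Docker image and check commands are common conventions, not taken from the post.

```yaml
version: 2
jobs:
  build:
    docker:
      - image: rocker/verse:latest   # R plus common build tooling
    steps:
      - checkout
      - run:
          name: Install package dependencies
          command: R -e 'devtools::install_deps(dependencies = TRUE)'
      - run:
          name: Build and check the package
          command: R CMD build . && R CMD check *tar.gz --no-manual
```

Once this file is committed and the repository is added as a project in CircleCI, every push triggers a fresh build-and-check run.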

Deep Feature Synthesis: How Automated Feature Engineering Works

The artificial intelligence market is fueled by the potential to use data to change the world. While many organizations have already successfully adapted to this paradigm, applying machine learning to new problems is still challenging. The single biggest technical hurdle that machine learning algorithms must overcome is their need for processed data in order to work — they can only make predictions from numeric data. This data is composed of relevant variables, known as “features.” If the calculated features don’t clearly expose the predictive signals, no amount of tuning can take a model to the next level. The process for extracting these numeric features is called “feature engineering.” Automating feature engineering optimizes the process of building and deploying accurate machine learning models by handling necessary but tedious tasks so data scientists can focus more on other important steps. Below are the basic concepts behind an automated feature engineering method called Deep Feature Synthesis (DFS), which generates many of the same features that a human data scientist would create.
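DFS builds features by stacking primitive functions across related tables: aggregations (e.g. SUM, MEAN, COUNT) applied along a one-to-many relationship, and transforms applied within a table. A toy sketch of the aggregation step in plain Python (hypothetical data and names, not the actual DFS implementation):

```python
# Toy sketch of DFS's aggregation step: apply a set of primitives
# to child rows grouped under each parent entity
# (here, customers -> transactions).

transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

primitives = {
    "SUM":   lambda xs: sum(xs),
    "MEAN":  lambda xs: sum(xs) / len(xs),
    "COUNT": lambda xs: len(xs),
}

def dfs_aggregate(child_rows, key, value, primitives):
    """For each parent id, apply every primitive to the child values,
    yielding features like SUM(transactions.amount)."""
    groups = {}
    for row in child_rows:
        groups.setdefault(row[key], []).append(row[value])
    return {
        parent: {f"{name}(transactions.{value})": fn(vals)
                 for name, fn in primitives.items()}
        for parent, vals in groups.items()
    }

features = dfs_aggregate(transactions, "customer_id", "amount", primitives)
# e.g. features[1]["SUM(transactions.amount)"] == 40.0
```

The "deep" in DFS comes from composing such steps: the output features of one relationship can themselves feed the primitives of the next, which is how many human-designed features are reproduced automatically.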