DevOps for Data Scientists: Taming the Unicorn

When most data scientists start working, they are equipped with all the neat math concepts they learned from school textbooks. Pretty soon, however, they realize that the majority of data science work involves getting data into the format the model needs. Beyond that, the model being developed is part of an application for the end user. The proper thing for a data scientist to do is keep their model code under version control in Git. VSTS would then pull the code from Git. The code would then be wrapped in a Docker image, which would be pushed to a Docker container registry. Once on the registry, it would be orchestrated using Kubernetes. Now, say all that to the average data scientist and their mind will completely shut down. Most data scientists know how to provide a static report or a CSV file with predictions. But how do we version control the model and add it to an app? How will people interact with our website based on the outcome? How will it scale!? All this involves confidence testing, checking that nothing falls below a set threshold, sign-off from different parties, and orchestration between different cloud servers (with all their ugly firewall rules). This is where some basic DevOps knowledge comes in handy.
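
To make the "nothing below a set threshold" check concrete, here is a minimal sketch of the kind of gate a build pipeline could run before packaging a model. It is not taken from the article: the function name `failed_checks`, the metric names, and the numbers are all hypothetical, and a real pipeline would read the metrics from an evaluation step.

```python
from typing import Dict, List


def failed_checks(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their minimum value.

    Hypothetical helper for illustration; not part of any CI tool's API.
    """
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        # A metric that was never reported counts as a failure too.
        if value is None or value < minimum:
            failures.append(name)
    return failures


# Illustrative numbers only (assumed, not from the article).
metrics = {"accuracy": 0.91, "recall": 0.78}
thresholds = {"accuracy": 0.90, "recall": 0.80}

failures = failed_checks(metrics, thresholds)
if failures:
    print("Quality gate failed for:", failures)
else:
    print("Quality gate passed")
```

With the sample numbers above, recall falls short of its minimum, so the gate reports a failure. In a pipeline, that result would typically be turned into a non-zero exit code so that the image is never built or pushed.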

Remote Data Science: How to Send R and Python Execution to SQL Server from Jupyter Notebooks

Did you know that you can execute R and Python code remotely in SQL Server from Jupyter Notebooks or any IDE? Machine Learning Services in SQL Server eliminates the need to move data around. Instead of transferring large and sensitive data over the network or losing accuracy by training ML models on sample CSV files, you can have your R/Python code execute within your database. You can work in Jupyter Notebooks, RStudio, PyCharm, VSCode, Visual Studio, wherever you want, and then send function execution to SQL Server, bringing intelligence to where your data lives.

How to lie with Data Science

Recently I read the book ‘How to Lie with Statistics’ by Darrell Huff. The book talks about how one can use statistics to lead people to the wrong conclusions. I found this an exciting topic, and I think it is very relevant to Data Science. This is why I want to make the ‘Data Science’ version of the examples shown in the book. Some of them are as in the book; others are examples of what I have seen happen in real-life Data Science. This post is not really about how to lie with Data Science. Instead, it's about how we may be fooled by not paying enough attention to details in different parts of the pipeline.

How to use Covariates to Improve your MaxDiff Model

Ready to improve the accuracy of your MaxDiff model? Today, I’ll explain why you’ll want to include covariates in your model and how to include them in your MaxDiff analyses using Hierarchical Bayes. I’ll walk you through an example that investigates the qualities voters want in a U.S. president.