From software engineering to Data Science: What resources helped me?

Almost two years ago, I decided to quit my job as a software engineer and start looking for a job in the machine learning field. Right after quitting, I wrote an article on my blog, Up to my new Tech challenges, and the journey started from there. In this article, I’m happy to share how I landed the job I dreamt of. Yeah, I got it! I’ve been working as a Data Scientist at Remerge, in Berlin, for one year. Let’s start!

Data Cleaning with R and the Tidyverse: Detecting Missing Values

Data cleaning is one of the most important aspects of data science. As a data scientist, you can expect to spend up to 80% of your time cleaning data. In a previous post I walked through a number of data cleaning tasks using Python and the Pandas library. That post got so much attention that I wanted to follow it up with an example in R. In this post you’ll learn how to detect missing values using the tidyr and dplyr packages from the Tidyverse. The Tidyverse is the best collection of R packages for data science, so you should become familiar with it.
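The R walk-through lives in the linked post, but the idea carries over directly from the earlier Pandas article. Here is a minimal Python sketch, using a toy data frame with made-up column names, covering the two usual cases: standard missing values (NaN) and non-standard markers such as 'na' or empty strings.

```python
import numpy as np
import pandas as pd

# Toy data with the kinds of gaps the post deals with:
# a true NaN, an empty string, and a sentinel string "na".
df = pd.DataFrame({
    "street": ["Putnam", "Lexington", "Berkeley", "Tremont"],
    "bedrooms": [3, np.nan, 2, 1],
    "owner_occupied": ["Y", "N", "na", ""],
})

# Standard missing values are detected directly.
print(df.isnull().sum())

# Non-standard markers must first be normalised to NaN.
df = df.replace({"na": np.nan, "": np.nan})
print(df["owner_occupied"].isnull().sum())  # 2
```

The same two-step pattern (detect what pandas already treats as missing, then map domain-specific markers to NaN) is what the tidyr/dplyr version expresses with `is.na()` and `na_if()`.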

The technologies that every analytics group needs to have

Four game-changing technologies can enable a world-class analytics group:
1. Collaboration technology
2. Agile management technology
3. Document and App sharing technology
4. Version control

10-Step guide to schedule your script using cloud services

Well… you can run a script locally, and many Windows users probably know the Task Scheduler: you set the trigger time and conditions, and the script runs according to your pre-defined criteria. However, the drawback is that you need to keep your computer powered on at the scheduled time (if not all the time). I ran into this problem when I wanted to run my tiny script to retrieve the weather forecast every morning; it is not practical to keep my laptop powered on just to run this simple task on a daily schedule. Please consider this a complementary guide to Dashboarding with Notebooks: Day 3 / Day 4. Full credit goes to Dr. Rachael and the Kaggle team for their efforts in the original content; they also published a live-stream recording here. This article is intended for someone who is not a hardcore programmer and is not familiar with the file system in Bash or with cloud services, so I have included plenty of screenshots and detailed explanations.
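Whatever runs the job, locally or in the cloud, the core of a daily trigger is just "how long until the next slot". A small sketch of that logic (the 07:00 trigger time and function name are illustrative, not from the guide):

```python
from datetime import datetime, timedelta

def seconds_until_next_run(now: datetime, hour: int = 7, minute: int = 0) -> float:
    """Seconds from `now` until the next daily trigger at hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed: schedule for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()

# At 06:30 the 07:00 job is half an hour away.
print(seconds_until_next_run(datetime(2019, 5, 1, 6, 30)))  # 1800.0
```

A scheduler like cron or Windows Task Scheduler encodes exactly this computation for you; the point of the guide is moving that trigger off your laptop and into a cloud service.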

Adaptive – and Cyclical Learning Rates using PyTorch

The learning rate (LR) is one of the key hyperparameters to tune in your neural net. SGD optimizers with adaptive learning rates have been popular for quite some time now: Adam, Adamax and their older brothers are often the de-facto standard. They take away the pain of having to search for and schedule your learning rate by hand (e.g. the decay rate).
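The cyclical alternative can be sketched in a few lines. This is the triangular policy from Leslie Smith's CLR paper in plain Python (the step size and LR bounds below are arbitrary example values):

```python
import math

def triangular_lr(iteration: int, step_size: int,
                  base_lr: float = 1e-4, max_lr: float = 1e-2) -> float:
    """Triangular cyclical learning rate.

    The LR ramps linearly from base_lr up to max_lr over `step_size`
    iterations, then back down, and the cycle repeats.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Start of a cycle sits at base_lr, the midpoint near max_lr.
print(triangular_lr(0, step_size=100))    # 0.0001
print(triangular_lr(100, step_size=100))  # ~0.01
```

Recent PyTorch versions ship this as `torch.optim.lr_scheduler.CyclicLR`, so in practice you would hand the schedule to the optimizer rather than compute it yourself.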

What do missing values hide behind them

Certainly, I’ll keep saying I feel lucky to have landed on the environmental open data portal of the province of Ontario, since I found there a very valuable source of information about air quality in a big city like Toronto and its surrounding area, with complete information available for free as downloadable CSV files. Excellent data to practice with, surely. But when I first opened a CSV file from this portal in a spreadsheet to take a first look at the information, I noticed some strange values: 9999 and -999. What the…? As stated in the header of the file, these mark invalid and missing data, respectively. Well, it’s time to face the truth about data analytics: there are no perfect datasets!
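Before any analysis, sentinel values like these have to be mapped to proper missing values so they cannot skew averages or plots. A minimal Pandas sketch, with a made-up station column and readings standing in for the real file:

```python
import numpy as np
import pandas as pd

# Toy version of the air-quality file: per the header,
# 9999 marks invalid readings and -999 marks missing ones.
readings = pd.DataFrame({
    "station": ["Toronto Downtown", "Toronto West", "Toronto North"],
    "o3_ppb": [33, 9999, -999],
})

# Map both sentinels to NaN so summary statistics ignore them.
readings["o3_ppb"] = readings["o3_ppb"].replace([9999, -999], np.nan)
print(readings["o3_ppb"].isna().sum())  # 2
```

Pandas can also do this at load time via `pd.read_csv(..., na_values=[9999, -999])`, which avoids the extra pass entirely.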

Deep Reinforcement Learning using Unity ml-agents

Last week I was running some experiments with two of my colleagues, Pedro Quintas and Pedro Caldeira, using Unity ml-agents, and I decided that this was a great moment to share our results with the community and show you how you can expand your knowledge of reinforcement learning. If you don’t know what Unity ml-agents is, let me give you a brief introduction: it is an ‘open-source Unity plugin that enables games and simulations to serve as environments for training intelligent agents’. In my opinion, it’s a great framework for starting to learn about deep learning and reinforcement learning, because you can actually see what’s happening instead of just watching numbers and letters scroll by in a terminal. Before I show our little project, let me show some of the scenarios that ship with the framework.

How to make GDPR and ONA work together?

GDPR and ONA complement each other: how ONA insights are used depends on the culture of the organization. Searching online for ONA (Organizational Network Analysis) gets you various definitions, including those with curly math symbols and graph theory. In a nutshell, however, ONA is about who communicates with whom in an organization. Although ONA is regarded as one of the latest buzzwords, it can be traced back at least to the 1980s. In 1985, George Barnett and colleagues wrote an article that addresses ONA at different levels of organizational hierarchies. Obviously, at that time and for a while after, the privacy aspects of ONA were of little or no concern. However, the situation changed, slowly at first and then much faster with the EU General Data Protection Regulation (GDPR). Since there have already been many articles and discussions about GDPR itself, we will not delve into it in further detail here.
It is possible and viable to provide ONA-related services that are GDPR-aware and follow best privacy practices. The expectation mismatch among end-users mostly stems from misunderstandings of ONA, internal culture, assumptions, and the negative vibes around everything happening in data privacy. The culture in organizations makes or breaks ONA insights, not the other way around. The after-ONA phase, and how organizations make use of ONA insights (and other decision-making information), should be of concern to employees. ONA gives employees the means to drive their own interests and culture, rather than hoping for the best and waiting to see what happens during whispered coffee breaks or offsite decision meetings.
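At its core, the "who communicates with whom" view is just a graph built from a message log, which is also where GDPR-aware practice starts: working with pseudonymised identifiers rather than names. A tiny illustrative sketch (the user IDs and log are invented):

```python
from collections import Counter
from itertools import chain

# Hypothetical pseudonymised message log: (sender, recipient) pairs.
messages = [
    ("u1", "u2"), ("u2", "u1"), ("u1", "u3"),
    ("u3", "u4"), ("u1", "u4"), ("u2", "u3"),
]

# Collapse directed messages into undirected communication ties,
# then count how many distinct contacts each person has.
ties = {frozenset(pair) for pair in messages}
degree = Counter(chain.from_iterable(ties))
print(degree["u1"])  # 3 distinct contacts
```

Real ONA tooling layers centrality measures and clustering on top of exactly this structure; the privacy question is about who sees the insights derived from it, not the graph mechanics themselves.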

Data science productionization: maintenance

This is the third part of a five-part series on data science productionization. I’ll update the following list with links as the posts become available:
1. What does it mean to ‘productionize’ data science?
2. Portability
3. Maintenance
4. Scale
5. Trust

A practical guide to collecting ML datasets

Data is at the core of any Machine Learning problem. All the strides being made using Machine Learning these days would not be possible if not for access to relevant data. Having said that, most Machine Learning enthusiasts these days focus on acquiring methodological knowledge, which is a good place to start. However, once you reach a certain level of comfort with methodologies, tackling only the problems for which a dataset is already available limits your potential. Luckily, we live in a time where an abundance of data is available on the web; all we need are the skills to identify and extract meaningful datasets. So let’s see what it takes to identify, scrape and build a good-quality Machine Learning dataset. The focus of this post is to explain how good-quality datasets can be constructed through real examples and code snippets. Throughout the article, I will refer to three high-quality datasets I collected, namely the Clothing Fit Dataset for Size Recommendation, the News Category Dataset, and the Sarcasm Detection dataset, to illustrate various points. To set the stage, I briefly explain below what each dataset is about. You can find their detailed descriptions on their linked Kaggle pages.
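The extraction step usually comes down to pulling structured fields out of HTML. A minimal, self-contained sketch using only the Python standard library (the `headline` class and the sample snippet are invented; a real crawl would fetch pages with urllib or requests and likely use a parser like BeautifulSoup):

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text of <h2 class="headline"> elements."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())

# A hard-coded snippet keeps the sketch self-contained.
page = ('<h2 class="headline">Sarcasm in headlines</h2><p>body</p>'
        '<h2 class="headline">Clothing fit data</h2>')
parser = HeadlineParser()
parser.feed(page)
print(parser.headlines)  # ['Sarcasm in headlines', 'Clothing fit data']
```

Looping this extractor over a list of article URLs and writing each record out as JSON is, in essence, how a dataset like the News Category Dataset gets assembled.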

Human Face Detection with R

Doing human face detection with computer vision is probably something you do once, unless you work for a police department, in the surveillance industry, or for the Chinese government. To reduce the time you spend on that small exercise, bnosac created a small R package (source code available at https://…/image ) which wraps the weights of a Single Shot Detector (SSD) convolutional neural network trained with the Caffe deep learning kit. That network can detect human faces in images. An example is shown below (tested on Windows and Linux).

RStudio Connect 1.7.2

RStudio Connect 1.7.2 is ready to download, and this release contains some long-awaited functionality that we are excited to share. Several authentication and user-management tooling improvements have been added, including the ability to change authentication providers on an existing server, new group support options, and the official introduction of SAML as a supported authentication provider (currently a beta feature*). But that’s not all… keep reading to learn about great additions to the RStudio Connect UI, updates to Python support, and a brand new Admin dashboard view for tracking scheduled content.

Predicting the ‘Future’ with Facebook’s Prophet

Forecasting is a technique that uses historical data to make informed estimates of the direction of future trends. It is an important and common data science task in organisations today: prior knowledge of an upcoming event can help a company tremendously in formulating its goals, policies and plans. However, producing high-quality, reliable forecasts comes with challenges of its own. Forecasting is a complex task for both humans and machines, and it typically requires very experienced time series analysts, who are in fact quite rare. Prophet is a tool built to address these issues and provide a practical approach to forecasting ‘at scale’. It automates the common features of business time series with simple, tunable methods, enabling analysts from a variety of backgrounds to make more forecasts than they could manually.
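To see what "extrapolating from history" means at its simplest, here is a least-squares trend fit in plain Python. This is deliberately not Prophet (which additionally models seasonality, holidays and changepoints); the toy series and function names are illustrative:

```python
def fit_trend(series):
    """Ordinary least-squares line through (t, y), with t = 0..n-1."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    return slope, y_mean - slope * t_mean

def forecast(series, steps):
    """Extend the fitted trend line `steps` periods past the data."""
    slope, intercept = fit_trend(series)
    n = len(series)
    return [intercept + slope * (n + h) for h in range(steps)]

history = [10, 12, 14, 16, 18]     # perfectly linear toy series
print(forecast(history, steps=2))  # [20.0, 22.0]
```

Prophet wraps a far richer model behind a similarly small surface: fit on a two-column dataframe of dates and values, then predict over a future date range.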

A Design Thinking Mindset for Data Science

Data science has received much recent attention in technical research and business strategy; however, there is an opportunity for increased research on, and improvements to, the data science research process itself. Through the research methods described in this paper, we believe there is potential for applying design thinking to the data science process in an effort to formalize and improve the research project process. Thus, this paper focuses on three core areas. The first is a background of the data science research process and an identification of common pitfalls data scientists face. The second is an explanation of how design thinking principles can be applied to data science. The third is a proposed new process for data science research projects based on the aforementioned findings. The paper concludes with an analysis of implications for both data science individuals and teams, and suggestions for future research to validate the proposed framework.
Core Guiding Principles
1. Empathy
2. Understanding through prototyping
3. Active and purposeful feedback
4. Diagrams over descriptions
5. Build on the ideas of others
6. Embrace creativity and the non-linear journey

Managing Data Science Workflows the Uber Way

Orchestrating workflows is one of the main challenges of machine learning solutions in the real world. A machine learning solution involves more than just picking the right model and productizing it: data ingestion, training, deployment and optimization are common steps in any machine learning workflow. Unfortunately, the technology stack for building and managing coordinated actions across all those steps hasn’t developed at the same pace as the frameworks and libraries for creating the models. Uber is one of the companies that have been innovating in this area: over the last few years, the Uber engineering team has regularly developed relevant building blocks for orchestrating and managing machine learning workflows at scale. The challenge of orchestrating machine learning workflows is often lost in the grand vision of machine learning solutions. It’s more exciting to identify the right machine learning technique for a problem than to think about orchestrating data flows or deployments. However, this oversight is the breaking point of many viable machine learning solutions that never find their way to an operational model. Initial attempts to address the problem involved adapting workflow management tools such as Apache Oozie, Apache Airflow, and Jenkins to machine learning workflows. That approach yielded some positive results but proved very limited, as machine learning workflows are fundamentally different from other applications. More recently, domain-specific solutions such as Cloudera’s Data Science Workbench (DSW) have come onto the scene to address this same challenge. While certainly a powerful stack, DSW hasn’t really been validated in large-scale scenarios. After experimenting with many of these alternatives, Uber decided to build its own workflow management framework optimized for machine learning workflows.
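Underneath every one of these tools, from Airflow to Uber's own framework, a workflow is a directed acyclic graph of steps executed in dependency order. A minimal sketch of that core idea, using Python's standard-library `graphlib` on a hypothetical five-step pipeline (not Uber's actual step names):

```python
from graphlib import TopologicalSorter

# A toy ML workflow: each step maps to the steps it depends on.
workflow = {
    "ingest": set(),
    "train": {"ingest"},
    "optimize": {"train"},
    "deploy": {"optimize"},
    "monitor": {"deploy"},
}

# A valid execution order respecting every dependency.
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['ingest', 'train', 'optimize', 'deploy', 'monitor']
```

What production workflow managers add on top of this ordering is the hard part: retries, scheduling, parallel execution of independent branches, and visibility into each step, which is exactly where generic tools strained under ML-specific workloads.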