Speech-to-Text Benchmark

This is a minimalist and extensible framework for benchmarking different speech-to-text engines. It has been developed and tested on Ubuntu 18.04 with Python 3.6.

Horrors of using Azure Kubernetes Service in production

Azure Kubernetes Service (AKS) was recently marked as generally available (GA). We decided to move our production workload to it last month. The following is an account of what it's really like to use it in production.

Programming Best Practices For Data Science

The data science life cycle is generally made up of the following components:
• data retrieval
• data cleaning
• data exploration and visualization
• statistical or predictive modeling
While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow.
Often, the entire data science life cycle ends up as an arbitrary mess, either of cells in a Jupyter Notebook or of code in a single script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.
But there’s a better way! In this post, I’ll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

Autoregressive Models in TensorFlow

This article investigates autoregressive models in TensorFlow, including autoregressive time series and predictions with the actual observations.
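
As a rough illustration of the distinction mentioned above, the sketch below fits a first-order autoregressive model and compares two ways of predicting: from the actual observations (one step ahead) and by feeding each prediction back in as the next input. It uses plain Python rather than TensorFlow, and the function names (`fit_ar1`, `one_step_predictions`, `rollout`) and the noise-free toy series are assumptions made for this example, not taken from the article.

```python
def fit_ar1(series):
    """Least-squares estimate of (c, phi) in x[t] = c + phi * x[t-1]."""
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


def one_step_predictions(series, c, phi):
    """Predict each x[t] from the actual observation x[t-1]."""
    return [c + phi * x for x in series[:-1]]


def rollout(start, c, phi, steps):
    """Predict several steps ahead, feeding each prediction back as input."""
    predictions = []
    x = start
    for _ in range(steps):
        x = c + phi * x
        predictions.append(x)
    return predictions


# Toy series (an assumption for this sketch): x[t] = 1.0 + 0.5 * x[t-1], no noise.
series = [0.0]
for _ in range(20):
    series.append(1.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
observed_based = one_step_predictions(series, c, phi)
free_running = rollout(series[0], c, phi, 3)
```

On real, noisy data the two kinds of prediction diverge: one-step predictions are corrected by each new observation, while a rollout accumulates its own errors.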

Temporal aggregations on time series data – Writing R functions to tidy meteorological data and getting some insights from it

In this post we're going to work with time series data, and write R functions to aggregate hourly and daily time series into monthly time series to catch a glimpse of their underlying patterns. For this analysis we're going to use public meteorological data recorded by the government of the Argentinian province of San Luis. Data about rainfall, temperature, humidity and, in some cases, wind is published on the REM website (Red de Estaciones Meteorológicas, http://www.clima.edu.ar ). There you can also download meteorological data (in .csv format) recorded by weather stations at different locations across San Luis.
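
The post itself works in R; purely as an illustration of the aggregation idea, here is a minimal Python sketch that averages timestamped readings by calendar month. The function name `monthly_means` and the sample readings are invented for this example and are not REM data.

```python
from collections import defaultdict
from datetime import datetime


def monthly_means(readings):
    """Average (timestamp, value) readings by (year, month)."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[(timestamp.year, timestamp.month)].append(value)
    # Sort the keys so the result is ordered chronologically.
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


# Made-up hourly temperature readings, for illustration only.
readings = [
    (datetime(2018, 1, 1, 0), 20.0),
    (datetime(2018, 1, 1, 1), 22.0),
    (datetime(2018, 2, 1, 0), 18.0),
    (datetime(2018, 2, 1, 1), 16.0),
]

result = monthly_means(readings)
```

A mean suits quantities like temperature or humidity; for rainfall a monthly sum is usually the more natural aggregate, so the reducing function would need to change accordingly.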

Do GPU-based Basic Linear Algebra Subprograms (BLAS) improve the performance of standard modeling techniques in R?

The speed or run-time of models in R can be a critical factor, especially considering the size and complexity of modern datasets. The number of data points as well as the number of features can easily be in the millions. Even relatively trivial modeling procedures can consume a lot of time, which is critical both for optimizing and for updating models. An easy way to speed up computations is to use an optimized BLAS (Basic Linear Algebra Subprograms). This has potential because R's default BLAS is well regarded for its stability and portability, but not necessarily for its speed. Alternative libraries include ATLAS and OpenBLAS, which we will use below. Multiple blog posts have shown that they can improve the performance of linear algebra operations in R, especially those of the infamous R-benchmark-25.R.

Seven Practical Ideas For Beginner Data Scientists

1. Acquire Domain Expertise
2. Capacity Building
3. Data Understanding
4. Building a Knowledge Repository (Democratizing Data)
5. Focus on Small Wins
6. Repeat After Me: ROI
7. Data Science Roadmap

MeetUp API Tutorial – Pulling Data & Writing it into a JSON

In this tutorial, you'll learn how to pull data directly from MeetUp's API using Python and write it into a JSON file.
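
The writing half of that workflow can be sketched with Python's standard `json` module. The request itself is left out, since the endpoint and authentication details are not given here; the `records` list is a hypothetical stand-in for a parsed API response, and the helper names `write_json` and `read_json` are assumptions for this example.

```python
import json
import os
import tempfile


def write_json(records, path):
    """Serialize records to a JSON file at path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2, ensure_ascii=False)


def read_json(path):
    """Load a JSON file back into Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical records standing in for a parsed API response.
records = [
    {"name": "Example Group A", "members": 120},
    {"name": "Example Group B", "members": 45},
]

# Write to a temporary directory and read the file back to check the round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "groups.json")
    write_json(records, path)
    loaded = read_json(path)
```

Passing `ensure_ascii=False` keeps non-ASCII characters readable in the output file, which can matter for group or venue names.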

Top 15 Scala libs for Data Science Github data

• Breeze
• Saddle
• ScalaLab
• Puck
• Epic
• Vegas
• Breeze-viz
• Smile
• Apache Spark MLlib & ML
• DeepLearning.scala
• Summingbird
• PredictionIO
• Akka
• Spray
• Slick