Almost six months ago, our team started the journey of replicating some of our data from on-premises MySQL machines to AWS. This included over a billion records stored across multiple tables. The new system had to be responsive enough to transfer any new incoming data from the MySQL database to AWS with minimal latency. Everything screamed for a streaming architecture. The solution was designed on the backbone of Kinesis streams, Lambda functions, and lots of lessons learned. We use Apache Kafka to capture the changelog from the MySQL tables and sink these records to AWS Kinesis. The Kinesis streams then trigger AWS Lambdas, which transform the data. These are our learnings from building a fully reactive serverless pipeline on AWS.
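The transform step described above can be sketched as a minimal Lambda handler. Kinesis delivers records base64-encoded, so the handler decodes each one before working with it; the changelog payload shape (an `"after"` row image) is an assumption for illustration, not the authors' actual schema.

```python
import base64
import json

def lambda_handler(event, context):
    """Transform changelog records arriving from a Kinesis stream.

    Kinesis events carry each record's payload base64-encoded under
    event["Records"][i]["kinesis"]["data"].
    """
    transformed = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Hypothetical transformation: keep only the after-image of the
        # row change; real pipelines would map/clean fields here.
        transformed.append(payload.get("after", payload))
    return transformed
```

In a real deployment this function would be wired to the stream as an event-source mapping, so it fires automatically as new changelog records arrive.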
Although machine learning (ML) can produce fantastic results, using it in practice is complex. Beyond the usual challenges in software development, machine learning developers face new challenges, including experiment management (tracking which parameters, code, and data went into a result); reproducibility (running the same code and environment later); model deployment into production; and governance (auditing models and data used throughout an organization). These workflow challenges around the ML lifecycle are often the top obstacle to using ML in production and scaling it up within an organization. To address these challenges, many companies are starting to build internal ML platforms that can automate some of these steps. In a typical ML platform, a dedicated engineering team builds a suite of algorithms and management tools that data scientists can invoke. For example, Uber and Facebook have built Michelangelo and FBLearner Flow to manage data preparation, model training, and deployment. However, even these internal platforms are limited: typical ML platforms only support a small set of algorithms or libraries with limited customization (whatever the engineering team builds), and are tied to each company's infrastructure.
Did you know that you can execute R and Python code remotely in SQL Server from Jupyter Notebooks or any IDE? Machine Learning Services in SQL Server eliminates the need to move data around. Instead of transferring large and sensitive data over the network or losing accuracy on ML training with sample csv files, you can have your R/Python code execute within your database. You can work in Jupyter Notebooks, RStudio, PyCharm, VSCode, Visual Studio, wherever you want, and then send function execution to SQL Server, bringing intelligence to where your data lives. This tutorial will show you an example of how you can send your Python code from Jupyter notebooks to execute within SQL Server. The same principles apply to R and any other IDE as well.
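Under the hood, SQL Server runs external R/Python code through the `sp_execute_external_script` stored procedure. As a rough sketch of the idea, the helper below (an illustrative function, not part of the tutorial) builds that T-SQL call from a Python snippet; the commented-out `pyodbc` lines and connection string are likewise assumptions showing how a client would ship it to the server.

```python
def wrap_python_for_sql(script: str) -> str:
    """Build the T-SQL that runs a Python snippet inside SQL Server
    via sp_execute_external_script (requires Machine Learning Services)."""
    escaped = script.replace("'", "''")  # double single quotes for T-SQL literals
    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'Python',\n"
        f"    @script = N'{escaped}';"
    )

# Sending it to the server would use any SQL driver, e.g. pyodbc
# (illustrative; server name and driver version will differ):
# import pyodbc
# conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
#                       "SERVER=myserver;Trusted_Connection=yes")
# conn.cursor().execute(wrap_python_for_sql("print('hello from SQL Server')"))
```

Because the script executes on the server, only the results travel back over the network, not the underlying data.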
All data needs to be cleaned before you can explore it and create models. Common sense, right? Cleaning data can be tedious, but I created a function that will help.
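As a minimal sketch of what such a helper might do (this is an illustrative function, not the author's actual one, and it assumes a pandas DataFrame), the common tedious steps are normalizing column names, dropping duplicates, stripping stray whitespace, and removing empty rows:

```python
import pandas as pd

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few routine cleaning steps and return a new DataFrame."""
    df = df.copy()
    # Normalize column names: trim, lowercase, snake_case
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Drop exact duplicate rows
    df = df.drop_duplicates()
    # Strip whitespace from string columns
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    # Drop rows where every value is missing
    df = df.dropna(how="all")
    return df
```

Bundling the steps into one function keeps the cleaning reproducible: the same transformations run the same way on every new extract of the data.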
In machine learning, we are often in the realm of ‘function approximation’. That is, we have a certain ground-truth (y) and associated variables (X), and our aim is to identify a function to wrap our variables in that does a good job of approximating the ground-truth. This exercise in function approximation is also known as ‘supervised learning’. ‘Unsupervised learning’, on the other hand, is a slightly different problem to tackle. Here, our data does not consist of a ground-truth. All we have is our variables. Let's elaborate on how this situation is different from supervised learning. Since we do not have a ground-truth, our task here is not to predict or approximate any outcome. Consequently, there is no loss/cost function providing feedback on how close or far our function's output is to the ground-truth. Isn't this perplexing? If there is no feedback on the ‘goodness’ of our output, then how do we know if our output is desirable, or complete hogwash? In this tutorial, we will look at what unsupervised learning actually is, and comprehensively understand and execute a common unsupervised learning task, clustering.
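To make the clustering idea concrete, here is a minimal sketch of k-means written from scratch (an illustrative implementation, not the tutorial's own code). Note there is no `y` anywhere: the algorithm groups points purely by proximity, alternating between assigning each point to its nearest centroid and recomputing centroids as cluster means.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means on a list of equal-length tuples. Returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    dims = len(points[0])
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(p[d] for p in cl) / len(cl) for d in range(dims)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters
```

The "feedback" here comes not from a ground-truth label but from an internal objective: the total squared distance of points to their centroids, which each iteration can only decrease or hold constant.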
Data science is the business of learning from data, which is traditionally the business of statistics. Data science, however, is often understood as a broader, task-driven and computationally-oriented version of statistics. Both the term data science and the broader idea it conveys have origins in statistics and are a reaction to a narrower view of data analysis. Expanding upon the views of a number of statisticians, this paper encourages a big-tent view of data analysis. We examine how evolving approaches to modern data analysis relate to the existing discipline of statistics (e.g. exploratory analysis, machine learning, reproducibility, computation, communication and the role of theory). Finally, we discuss what these trends mean for the future of statistics by highlighting promising directions for communication, education and research.