The Difference between Data Scientists, Data Engineers, Statisticians, and Software Engineers

Telling data scientists, data engineers, software engineers, and statisticians apart can be confusing. While all of these roles work with data in some way, there are real differences in the work they do and manage. The growth of data and its use across industry is plain to see. Over the last decade in general, and the last couple of years in particular, the roles tasked with crafting and managing data have become increasingly distinct. Data science is without doubt a rapidly growing field. Organizations, and even countries, across the globe have drastically expanded their data collection efforts. With the many complications involved in collecting and managing data, the field is now host to a wide array of jobs and job titles. Broad data science work has split into the more specific roles of data engineer, statistician, and software engineer. But beyond the difference in names, how many of us understand the diversity in the work they do?

Data Science with Python: Exploratory Analysis with Movie-Ratings and Fraud Detection with Credit-Card Transactions

The following problems are taken from the projects and assignments in the edX course Python for Data Science and the Coursera course Applied Machine Learning in Python (UMich).
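As plain context for what "exploratory analysis with movie ratings" typically looks like, here is a minimal pandas sketch. The data and column names are illustrative, not the schema used in the course assignments:

```python
import pandas as pd

# Hypothetical movie-ratings data; purely illustrative.
ratings = pd.DataFrame({
    "movie": ["A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 3.5, 4.0, 2.0],
})

# A typical first exploratory step: per-movie summary statistics.
summary = ratings.groupby("movie")["rating"].agg(["count", "mean"])
print(summary)
```

From a table like this one would go on to look at rating distributions, missing values, and relationships between features.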

How to apply Linear Regression in R

Machine Learning (ML) is a field of study that gives a machine the ability to understand data and to learn from it. ML is not only about building an analytics model; it is end-to-end modeling that broadly involves the following steps:
• Defining the problem statement
• Collecting data
• Exploring, cleaning, and transforming data
• Building the analytics model
• Creating a dashboard and deploying the model
Machine learning has two distinct fields of study: supervised learning and unsupervised learning. Supervised learning techniques predict a response from a set of input features. Unsupervised learning has no response variable; it explores the associations and interactions among the input features. In the following topic, I will discuss linear regression, which is an example of a supervised learning technique.
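The post goes on to demonstrate linear regression in R. As a language-neutral illustration of the same supervised idea, here is a minimal ordinary-least-squares sketch in Python; the toy data and variable names are my own, generated from a known line so the fit can be checked:

```python
import numpy as np

# Toy data from a known line (y = 2x + 1) plus small noise;
# purely illustrative, not the dataset from the original post.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Ordinary least squares: stack a column of ones so the
# model is y = b0 + b1 * x, then solve the linear system.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta
print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

The fitted intercept and slope land close to the true values 1 and 2, which is exactly the "learning a response from input features" step described above.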

How machine learning will accelerate data management systems

In this episode of the Data Show, I spoke with Tim Kraska, associate professor of computer science at MIT. To take advantage of big data, we need scalable, fast, and efficient data management systems. Database administrators and users often find themselves tasked with building index structures (“indexes” in database parlance), which are needed to speed up data access.
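Kraska's research explores replacing such hand-built structures with learned ones. As background only (this sketch is not from the interview), a conventional index can be pictured as a sorted key array plus binary search, turning a full scan into a logarithmic lookup:

```python
import bisect

# Toy illustration of why an index speeds up data access:
# a sorted list of keys lets us locate a record in O(log n)
# instead of scanning every row. Names are illustrative only.
rows = [(k, f"record-{k}") for k in range(0, 1000, 2)]  # sorted by key
keys = [k for k, _ in rows]  # the "index"

def lookup(key):
    i = bisect.bisect_left(keys, key)  # binary search over the keys
    if i < len(keys) and keys[i] == key:
        return rows[i][1]
    return None  # key not present

print(lookup(42))  # found via the index
print(lookup(43))  # odd keys are absent
```

A learned index replaces the search structure with a model that predicts a record's position from its key, which is the direction the interview discusses.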

Let it flow, let it flow, let it flow…

This is not the blog post I’d originally intended to write. But I’m glad – because this one is so much better. Some background. I’m one of the few Scotland-based members of AphA – the Association of Professional Healthcare Analysts. They’ve had a couple of events recently that I was keeping my eye on via Twitter, and it became apparent that a session demonstrating R had made some waves – in a good way. I’d been having a wee exchange with Neil Pettinger regarding R and took the opportunity to ask permission to use one of his Excel files, which featured a dot plot chart demonstrating patient flow. I wanted to show an alternative way of creating the plot using R.

Time Series Forecasting with Recurrent Neural Networks

In this section, we’ll review three advanced techniques for improving the performance and generalization power of recurrent neural networks. By the end of the section, you’ll know most of what there is to know about using recurrent networks with Keras. We’ll demonstrate all three concepts on a temperature-forecasting problem, where you have access to a time series of data points coming from sensors installed on the roof of a building, such as temperature, air pressure, and humidity, which you use to predict what the temperature will be 24 hours after the last data point. This is a fairly challenging problem that exemplifies many common difficulties encountered when working with time series.
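To make the "state carried across timesteps" idea concrete before the Keras examples, here is a minimal NumPy sketch of the recurrence inside a simple RNN cell. Shapes and names are illustrative; Keras layers such as GRU and LSTM implement more elaborate variants of this same loop:

```python
import numpy as np

# One forward pass of a simple RNN cell over a single sequence.
rng = np.random.default_rng(0)
timesteps, input_dim, units = 24, 3, 8  # e.g. 24 hourly sensor readings

W = rng.normal(scale=0.1, size=(input_dim, units))  # input weights
U = rng.normal(scale=0.1, size=(units, units))      # recurrent weights
b = np.zeros(units)                                 # bias

inputs = rng.normal(size=(timesteps, input_dim))    # one input sequence
state = np.zeros(units)                             # initial hidden state
for x_t in inputs:
    # The hidden state is fed back in at every step: this is
    # what lets the network remember earlier timesteps.
    state = np.tanh(x_t @ W + state @ U + b)

print(state.shape)  # final hidden state summarizing the sequence
```

For forecasting, the final state (or the sequence of states) would feed a dense output layer that predicts the temperature 24 hours ahead.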

A Data Science Lab for R

A data science lab is an environment for developing code and creating content. It should enhance the productivity of your data scientists and integrate with your existing systems. Your data science lab might live on your premises or in the cloud. It might be built with hardware, virtual machines, or containers. You may use it to support a single data scientist or hundreds of R developers. Here is one reference architecture of a data science lab based on server instances.

A clustering algorithm for multivariate data streams with correlated components

Common clustering algorithms require multiple scans of all the data to achieve convergence, which is prohibitive when large databases, with data arriving in streams, must be processed. Algorithms extending the popular K-means method to streaming data have appeared in the literature since 1998 (Bradley et al. in Scaling clustering algorithms to large databases. In: KDD. p. 9-15, 1998; O’Callaghan et al. in Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE international conference on data engineering. p. 685, 2001). These are based on memorizing and recursively updating a small number of summary statistics, but they either do not take into account the specific variability of the clusters, or assume that the random vectors being processed and grouped have uncorrelated components. Unfortunately, this is not the case in many practical situations. We propose a new algorithm for processing data streams whose data have correlated components and come from clusters with different covariance matrices. These covariance matrices are estimated via an optimal double shrinkage method, which provides positive definite estimates even in the presence of few data points, or of data whose components have small variance. Positive definiteness is needed to invert the matrices and compute the Mahalanobis distances that we use to assign data to clusters. We also estimate the total number of clusters from the data.
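The core assignment step the abstract describes can be sketched as follows. This is a toy illustration only: it uses a plain ridge-style shrinkage toward a scaled identity to keep each covariance estimate invertible, not the paper's optimal double shrinkage estimator, and the data and parameters are my own:

```python
import numpy as np

def shrunk_cov(X, alpha=0.1):
    # Blend the sample covariance with a scaled identity so the
    # estimate stays positive definite even with few points.
    S = np.cov(X, rowvar=False)
    target = np.eye(S.shape[0]) * np.trace(S) / S.shape[0]
    return (1 - alpha) * S + alpha * target

def assign(point, centers, covs):
    # Assign the point to the cluster with the smallest
    # squared Mahalanobis distance under that cluster's covariance.
    d = []
    for mu, Sigma in zip(centers, covs):
        diff = point - mu
        d.append(diff @ np.linalg.solve(Sigma, diff))
    return int(np.argmin(d))

# Two 2-D clusters with correlated components.
rng = np.random.default_rng(1)
A = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=50)
B = rng.multivariate_normal([5, 5], [[1, -0.5], [-0.5, 1]], size=50)
centers = [A.mean(axis=0), B.mean(axis=0)]
covs = [shrunk_cov(A), shrunk_cov(B)]

print(assign(np.array([0.2, 0.1]), centers, covs))  # → 0 (near cluster A)
```

Using a per-cluster covariance in the distance is what lets the method respect each cluster's own shape and correlation structure, rather than treating components as independent.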