It is often the case that in the development of a data science product the preliminary analysis and prototyping are done in R (thanks to its superior tools for visualization/exploration/fast modelling…) but when it’s time to deploy the models in a production environment one switches to Python. One of the main reasons for this is that Python often integrates better with the rest of the developer stack, as it’s a general-purpose language widely used for web development and so forth. That’s fair enough and R cannot compete with Python on this ground, even though this specific advantage of Python in production should only hold for companies where Python is used (e.g., in my company everything is done in .NET and C#, and no developer actually uses Python in production). My question is: are there any other limitations to R (apart from the above) that present a problem for its use at large scale in a production environment? And if so, could these limitations be addressed by developing new R packages with utilities that would ease the use of R in a production environment? As an example, I think @hadley’s strict package (https://…/strict71) would be quite helpful in a production environment. Opinions?
The new J.P. Morgan report provides a framework for machine learning and big data investing and shows how to trade Global Macro strategies using RavenPack Analytics.
At untapt, all of our models involve Natural Language Processing (NLP) in one way or another. Our algorithms consider the natural, written language of our users’ work experience and, based on real-world decisions that hiring managers have made, we can assign a probability that any given job applicant will be invited to interview for a given job opportunity.
There are many situations when running deep learning inferences on local devices is preferable for both individuals and companies: imagine traveling with no reliable internet connection available or dealing with privacy concerns and latency issues when transferring data to cloud-based services. Edge computing provides solutions to these problems by processing and analyzing data at the edge of the network. Take the “Ok Google” feature as an example: by training “Ok Google” with a user’s voice, that user’s mobile phone will be activated when capturing the keywords. This kind of small-footprint keyword-spotting (KWS) inference usually happens on-device so you don’t have to worry that the service providers are listening to you all the time. The cloud-based services will only be initiated after you make the commands. Similar concepts can be extended to applications on smart home appliances or other IoT devices, where we need hands-free voice control without internet.
Since its release in 2015 by the Google Brain team, TensorFlow has been a driving force in conversations centered on artificial intelligence, machine learning, and predictive analytics. With its flexible architecture, TensorFlow provides numerical computation capacity with incredible parallelism that is appealing to both small and large businesses. Because TensorFlow is built on stateful dataflow graphs that can run across multiple systems, it allows for parallel processing, so data can be leveraged in a meaningful way without requiring petabytes of it. To demonstrate how you can take advantage of TensorFlow without having huge silos of data on hand, I’ll explain how to use TensorFlow to build a linear regression model in this post.
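To make the idea of fitting a linear regression on a small dataset concrete, here is a minimal sketch in plain Python. It is not TensorFlow code and is not taken from the post: it only illustrates the batch gradient descent on mean squared error that such a model typically relies on. The function name `fit_linear_regression`, the toy data, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
def fit_linear_regression(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    w, b = 0.0, 0.0  # start from a zero slope and intercept
    for _ in range(epochs):
        # Residuals of the current fit.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values, not from the post).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_regression(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

With only five points the fitted slope and intercept should land very close to 2 and 1, which is the point of the paragraph above: a model of this kind does not need a large dataset to be useful.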
The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deployed into production environments, and directly into products themselves. However, taking your workload in R and deploying it at production capacity, and at scale, is no trivial matter. Because of R’s rich and robust package ecosystem, and the many versions of R, reproducing the environment of your local machine in a production setting can be challenging. Let alone ensuring your model’s reproducibility! This is why using containers is extremely important when it comes to operationalizing your R workloads. I’m happy to announce that the doAzureParallel package, powered by Azure Batch, now supports fully containerized deployments. With this migration, doAzureParallel will not only help you scale out your workloads, but will also do it in a completely containerized fashion, letting you bypass the complexities of dealing with inconsistent environments. Now that doAzureParallel runs on containers, we can ensure a consistent immutable runtime while handling custom R versions, environments, and packages.
Big data analytics is one of the emerging technologies as it promises to provide better insights from huge and heterogeneous data. Big data analytics involves selecting the suitable big data storage and computational framework augmented by scalable machine-learning algorithms. Despite the tremendous buzz around big data analytics and its advantages, an extensive literature survey focused on parallel data-intensive machine-learning algorithms for big data has not been conducted so far. The present paper provides a comprehensive overview of various machine-learning algorithms used in big data analytics. The present work is an attempt to identify the gaps in the work already performed by researchers, thus paving the way for further quality research in parallel scalable algorithms for big data.
In most time series data mining, alternate forms of data representation or data preprocessing are required because of the unique characteristics of time series, such as high dimension (the number of data points), presence of random noise, and nonlinear relationships among the data elements. Therefore, any data representation method aims to achieve substantial data reduction to a manageable size, while preserving important characteristics of the original data and remaining robust to random noise. Moreover, an appropriate choice of data representation method may result in more meaningful data mining. Many high-level representation methods for time series data are based on time domain approaches. These methods preprocess the original data in the time domain directly and are useful for understanding the behavior of data over time. Piecewise approximation, data representation by identification of important points, and symbolic representation are some of the main ideas of time domain approaches, and are widely used in various fields.
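As one concrete instance of the piecewise approximation idea mentioned above, the following sketch implements piecewise aggregate approximation (PAA), which replaces each segment of the series with its mean. This is an illustrative choice rather than the specific method of the paper; the function name `paa` and the sample series are assumptions.

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each of n_segments chunks."""
    n = len(series)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    means = []
    for i in range(n_segments):
        # Integer boundaries split the series into nearly equal, non-empty chunks.
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        chunk = series[start:end]
        means.append(sum(chunk) / len(chunk))
    return means


# Hypothetical series: eight points reduced to four segment means.
series = [1, 2, 3, 4, 10, 12, 11, 9]
reduced = paa(series, 4)
print(reduced)  # [1.5, 3.5, 11.0, 10.0]
```

The reduced series is half the original length yet still shows the jump between the first and second halves, which is the trade-off the paragraph describes: fewer points, with the main shape preserved and point-level noise averaged out.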