Data or Algorithms – Which is More Important?

Which is more important, the data or the algorithms? This chicken-and-egg question led me to realize that it's the data, and specifically the way we store and process it, that has dominated data science over the last 10 years. And it all leads back to Hadoop.

Some Deep Learning with Python, TensorFlow and Keras

The following problems are taken from assignments in the Coursera courses Introduction to Deep Learning (by the Higher School of Economics) and Neural Networks and Deep Learning (by Prof. Andrew Ng). The problem descriptions are taken directly from the assignments.

Transform anything into a vector

entity2vec: Using cooperative learning approaches to generate entity vectors

Interpreting Deep Neural Networks with SVCCA

Deep Neural Networks (DNNs) have driven unprecedented advances in areas such as vision, language understanding and speech recognition. But these successes also bring new challenges. In particular, contrary to many previous machine learning methods, DNNs can be susceptible to adversarial examples in classification, catastrophic forgetting of tasks in reinforcement learning, and mode collapse in generative modelling. In order to build better and more robust DNN-based systems, it is critically important to be able to interpret these models. Specifically, we would like a notion of representational similarity for DNNs: can we effectively determine when the representations learned by two neural networks are the same? In our paper, "SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability," we introduce a simple and scalable method to address these points. Two specific applications we look at are comparing the representations learned by different networks, and interpreting representations learned by hidden layers in DNNs. Furthermore, we are open sourcing the code so that the research community can experiment with this method.
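As a rough illustration of the idea (this is my own NumPy sketch, not the paper's released code), SVCCA combines two classical steps: an SVD on each layer's activation matrix to prune low-variance directions, followed by CCA on the reduced representations to find maximally correlated directions between them. The function name, shapes, and default `k` below are all my assumptions:

```python
import numpy as np

def svcca(acts1, acts2, k):
    """Sketch of SVCCA between two activation matrices.

    acts1, acts2 : arrays of shape (n_datapoints, n_neurons).
    Returns the canonical correlations of the top-k singular
    subspaces of each representation.
    """
    X = acts1 - acts1.mean(axis=0)
    Y = acts2 - acts2.mean(axis=0)
    # Step 1 (SV): keep the top-k singular directions of each representation,
    # discarding low-variance (likely noisy) directions.
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, sy, _ = np.linalg.svd(Y, full_matrices=False)
    Xk = Ux[:, :k] * sx[:k]
    Yk = Uy[:, :k] * sy[:k]
    # Step 2 (CCA): after QR-whitening, the canonical correlations are the
    # singular values of Qx^T Qy.
    Qx, _ = np.linalg.qr(Xk)
    Qy, _ = np.linalg.qr(Yk)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)
```

Two representations related by an invertible linear map get correlations of 1 (they carry the same information), while unrelated representations score much lower, which is the sense in which SVCCA measures representational similarity.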

Introduction To Neural Networks

Artificial Neural Networks are all the rage. One has to wonder if the catchy name played a role in the model's own marketing and adoption. I've seen business managers giddy to mention that their products use "Artificial Neural Networks" and "Deep Learning". Would they be so giddy to say their products use "Connected Circles Models" or "Fail and Be Penalized Machines"? But make no mistake: Artificial Neural Networks are the real deal, as evidenced by their success in a number of applications like image recognition, natural language processing, automated trading, and autonomous cars. As a professional data scientist who didn't fully understand them, I felt embarrassed, like a builder without a table saw. Consequently, I've done my homework and written this article to help others overcome the same hurdles and head-scratchers I did in my own (ongoing) learning process.
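The "fail and be penalized" framing is really just gradient descent on a loss function. As a minimal toy sketch (my own example, not from the article), here is a single artificial neuron, a weighted sum squashed through a sigmoid, learning the AND function by being penalized for each wrong prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: the logical AND of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

w, b = rng.standard_normal(2), 0.0  # random initial weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10_000):
    p = sigmoid(X @ w + b)        # forward pass: the neuron's predictions
    grad = p - y                  # "penalty": gradient of cross-entropy loss
    w -= 1.0 * X.T @ grad / len(y)  # nudge weights to reduce the penalty
    b -= 1.0 * grad.mean()
```

After training, `sigmoid(X @ w + b)` rounds to the correct AND outputs; the "connected circles" in larger networks are many of these units stacked and trained the same way.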

Natural Language Processing Library for Apache Spark – free to use

Introducing the Natural Language Processing Library for Apache Spark – and yes, you can actually use it for free! This post will give you a great overview of John Snow Labs' NLP Library for Apache Spark.

A Recipe for inferference: Start with Causal Inference. Add Interference. Mix Well with R.

In causal inference, interference occurs when the treatment of one subject affects the outcome of other subjects. Interference can distort research conclusions about causal effects when not accounted for properly. In the absence of interference, inverse probability weighted (IPW) estimators are commonly used to estimate causal effects from observational data. Recently, IPW estimators have been extended to handle interference. Tchetgen Tchetgen and VanderWeele (2012) proposed IPW methods to estimate direct and indirect (or spillover) effects that allow for interference between individuals within groups. In this paper, we present inferference, an R package that computes these IPW causal effect estimates when interference may be present within groups. We illustrate use of the package with examples from political science and infectious disease.
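The package itself is R, but the core IPW idea is easy to sketch in a few lines of simulation (Python here, my own toy example): weight each subject's outcome by the inverse probability of the treatment it actually received, which removes confounding when the propensity model is correct. The interference-aware extensions of Tchetgen Tchetgen and VanderWeele are more involved; this sketch covers only the classical no-interference case:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

x = rng.standard_normal(n)                 # confounder
p = 1.0 / (1.0 + np.exp(-x))               # true propensity score e(X) = P(A=1|X)
a = rng.binomial(1, p)                     # treatment assignment
y = 2.0 * a + x + rng.standard_normal(n)   # outcome; true causal effect is 2

# Naive difference in means is biased: treated subjects also have higher x.
naive = y[a == 1].mean() - y[a == 0].mean()

# IPW estimator: reweighting by 1/e(X) and 1/(1-e(X)) recovers the effect.
ate_ipw = np.mean(a * y / p) - np.mean((1 - a) * y / (1 - p))
```

In the simulation the naive estimate lands well above 2 while the IPW estimate is close to the true effect, which is exactly the distortion-and-correction the package automates (with interference handled on top).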

Enhancing Reproducibility and Collaboration via Management of R Package Cohorts

Science depends on collaboration, result reproduction, and the development of supporting software tools. Each of these requires careful management of software versions. We present a unified model for installing, managing, and publishing software contexts in R. It introduces the package manifest as a central data structure for representing version-specific, decentralized package cohorts. The manifest points to package sources on arbitrary hosts and in various forms, including tarballs and directories under version control. We provide a high-level interface for creating and switching between side-by-side package libraries derived from manifests. Finally, we extend package installation to support the retrieval of exact package versions as indicated by manifests, and to maintain provenance for installed packages. The provenance information enables the user to publish libraries or sessions as manifests, hence completing the loop between publication and deployment. We have implemented this model across three software packages, switchr, switchrGist and GRANBase, and have released the source code under the Artistic 2.0 license.

How to make Python easier for the R user: revoscalepy

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualization. However, there is a huge community of data scientists and analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for the two languages to coexist. Often this means that R coders develop a workflow in R but must then redesign and recode it in Python for their production systems. If the coder is lucky, this is easy: the R model can be exported as a serialized object and read into Python, using packages such as pmml. Unfortunately, it is often harder, because the production system may demand that the entire end-to-end workflow be built exclusively in Python. That can be tough, because some aspects of statistical model building are more intuitive in R than in Python. Python has many strengths: robust data structures such as dictionaries, compatibility with deep learning frameworks and Spark, and its versatility as a general-purpose language. However, many enterprise analytics scenarios call for classical statistics and machine learning, areas where Python's standard data science packages are less intuitive than R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To help bridge this gap, this post discusses a relatively new package developed by Microsoft: revoscalepy.

Rule Your Data with Tidy Validation Reports. Design

The story behind the design of the ruler package: dplyr-style exploration and validation of data-frame-like objects.

A Library of Parallel Algorithms

This is the top-level page for accessing code for a collection of parallel algorithms. The algorithms are implemented in the parallel programming language NESL and developed by the Scandal project. For each algorithm we give a brief description along with its complexity (in terms of asymptotic work and parallel depth). In many cases the NESL code is set up so you can run the algorithm using our FORMS-based interface. Feel free to change the data or the algorithms and submit the modified versions. Note that some of the algorithms have stated restrictions on the input (e.g., the input must be of even length).
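The work/depth style of analysis used for these algorithms is easy to illustrate outside NESL. Here is a sketch in Python (my choice of language, not the site's), using NumPy array operations to stand in for the simultaneous parallel steps of the classic recursive-doubling (Hillis–Steele) prefix sum:

```python
import numpy as np

def scan(a):
    """Inclusive prefix sum by recursive doubling.

    Each pass through the loop is one parallel step in which every
    element a[i] adds in a[i - shift] simultaneously, so the depth is
    O(log n) while the total work is O(n log n).  (The work-efficient
    tree-based variant achieves O(n) work at the same depth.)
    """
    a = np.asarray(a).copy()
    shift = 1
    while shift < len(a):
        # RHS is computed from the old values before assignment,
        # mimicking a synchronous parallel update.
        a[shift:] = a[shift:] + a[:-shift]
        shift *= 2
    return a
```

On a real parallel machine (or in NESL's apply-to-each construct) each loop iteration would be a single constant-time step across all processors, which is what the depth measure counts.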

Why you should forget ‘for-loop’ for data science code and embrace vectorization

Data science needs fast computation and transformation of data. NumPy objects in Python provide that advantage over regular programming constructs like the for-loop. How can we demonstrate it in a few easy lines of code?
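A minimal demonstration (my own sketch, not taken from the article): the same dot product written as a Python for-loop and as a single NumPy call. The vectorized form produces the same answer but pushes the element-wise loop down into compiled code:

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

def loop_dot(a, b):
    # One interpreted Python iteration per element.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def vec_dot(a, b):
    # The same computation; the loop runs inside NumPy's C code.
    return float(np.dot(a, b))

t0 = time.perf_counter(); r_loop = loop_dot(a, b); t_loop = time.perf_counter() - t0
t0 = time.perf_counter(); r_vec = vec_dot(a, b); t_vec = time.perf_counter() - t0
```

On a million elements the vectorized version is typically orders of magnitude faster, and the pattern generalizes to most element-wise transformations in data science pipelines.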

Some new time series packages

This week I have finished preliminary versions of two new R packages for time series analysis. The first (tscompdata) contains several large collections of time series that have been used in forecasting competitions; the second (tsfeatures) is designed to compute features from univariate time series data. For now, both are only on GitHub. I will probably submit them to CRAN after they’ve been tested by a few more people.
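To make "features from univariate time series" concrete, here is a small hand-rolled sketch (in Python, for illustration only; it is not the tsfeatures API, which is R and computes a much richer feature set) of the kind of summaries such a package produces:

```python
import numpy as np

def ts_features(x):
    """Compute a few illustrative features of a univariate time series."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    # Lag-1 autocorrelation: how strongly each point predicts the next.
    acf1 = (xc[:-1] * xc[1:]).sum() / (xc ** 2).sum()
    # Strength of linear trend: R^2 of regressing the series on time.
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    trend_r2 = 1.0 - resid.var() / x.var()
    return {"mean": x.mean(), "acf1": acf1, "trend_r2": trend_r2}
```

A strongly trending series scores near 1 on both features, while white noise scores near 0 on both; feature vectors like these let you compare, cluster, or model large collections of series (such as those in tscompdata) in a common space.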

Gender Diversity Analysis of Data Science Industry using Kaggle Survey Dataset in R

Kaggle recently released the dataset from an industry-wide survey it conducted with 16K respondents. This article aims to understand how gender diversity plays out in data science practice. Disclaimer: I understand that this dataset is not the output of a randomized experiment, so it cannot be representative of all data science practitioners, and it is subject to selection bias. Let us proceed with this disclaimer in mind.