Everything you know about word2vec is wrong

The original word2vec C implementation does not do what’s explained above, and is drastically different. Most serious users of word embeddings generated by word2vec do one of the following:
1. They invoke the original C implementation directly.
2. They invoke the gensim implementation, which is transliterated from the C source to the extent that the variable names are the same.
Indeed, the gensim implementation is the only one I know of that is faithful to the C implementation.


The Rise of DataOps (from the ashes of Data Governance)

Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive orders-of-magnitude improvements.


Snorkel – A Weak Supervision System

Today’s powerful models like DNNs produce state-of-the-art results on many tasks, and they are easier to spin up than ever before (using state-of-the-art pre-trained models like ULMFiT and BERT). So, instead of spending the bulk of our time carefully engineering features for our models, we can now feed raw data – images, text, etc. – into these systems and they can learn their own features. There is a hidden cost to this success, though: these models require massive labeled training sets. For most real-world tasks, such training sets either do not exist or are rather small, and creating them can be expensive, slow, or even impractical (due to privacy concerns, for instance). The problem is made worse when domain expertise is required to label data. Tasks also change over time, while hand-labeled training data is static and doesn’t adapt. A team at Stanford developed a set of approaches, broadly termed ‘weak supervision’, to address this data-labeling bottleneck. The idea is to programmatically label millions of data points.
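To make the idea concrete, here is a minimal sketch in R (Snorkel itself is a Python library, and it aggregates votes with a generative label model rather than the simple majority vote used here; the task and labeling functions below are invented for illustration):

# Several noisy labeling functions vote on each data point; votes are
# aggregated into a single programmatic label. Hypothetical task: label
# support tickets as URGENT (1) or NOT (0); NA means the function abstains.
lf_keyword  <- function(text) if (grepl("outage|down|crash", text)) 1 else NA
lf_greeting <- function(text) if (grepl("thanks|great job", text)) 0 else NA
lf_caps     <- function(text) if (text == toupper(text)) 1 else NA

label_point <- function(text, lfs) {
  votes <- unlist(lapply(lfs, function(lf) lf(text)))
  votes <- votes[!is.na(votes)]          # drop abstentions
  if (length(votes) == 0) return(NA)     # no labeling function fired
  as.integer(mean(votes) >= 0.5)         # simple majority vote
}

tickets <- c("SERVER DOWN", "thanks, great job on the fix", "app crash on login")
sapply(tickets, label_point, lfs = list(lf_keyword, lf_greeting, lf_caps))

Writing a handful of such heuristics is far cheaper than hand-labeling millions of points, which is the bottleneck the Stanford team set out to remove.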


Word2Vec – Negative Sampling made easy

This is my second post on Word2Vec. The previous article covered the probabilistic model behind embeddings and how to use the resulting vector representations appropriately; you can find it here. In this part, we will review and implement skip-gram with negative sampling (SGNS), which is a more efficient algorithm for finding word vectors.
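For reference, one standard formulation of the SGNS objective (from the literature, not quoted from the post): for a center word c, an observed context word o, and k negative samples drawn from a noise distribution P_n(w), the model maximizes

\log \sigma\left(u_o^{\top} v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-u_{w_i}^{\top} v_c\right) \right]

where \sigma is the logistic sigmoid, v_c is the center-word (‘input’) vector, and u_w are the context (‘output’) vectors. Instead of normalizing over the whole vocabulary as in the softmax, the model only has to tell the true context word apart from k sampled noise words.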


Translating math into code with examples in Java, Racket, Haskell and Python

Discrete mathematical structures form the foundation of computer science. These structures are so universal that most research papers in the theory of computation, programming languages and formal methods present concepts in terms of discrete mathematics rather than code. The underlying assumption is that the reader will know how to translate these structures into a faithful implementation as a working program. A lack of material explaining this translation frustrates outsiders. What deepens that frustration is that each language paradigm encodes discrete structures in a distinct way. Many of the encodings are immutable, purely functional data structures (even in imperative languages), a topic unfortunately omitted from many computer science curricula. Many standard libraries provide only mutable versions of these data structures, which quickly leads to pitfalls.
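As a minimal sketch of the kind of translation the article teaches (in R, for consistency with the other examples in this digest, rather than the article’s four languages), here is the finite set encoded immutably – every operation returns a new value and never mutates its arguments, which R’s copy-on-modify semantics give us for free:

set_empty  <- function() c()
set_insert <- function(s, x) if (x %in% s) s else c(s, x)
set_member <- function(s, x) x %in% s
set_union  <- function(a, b) Reduce(set_insert, b, a)

a <- set_insert(set_insert(set_empty(), 1), 2)  # {1, 2}
b <- set_insert(set_empty(), 2)                 # {2}
set_union(a, b)                                 # 1 2; 'a' and 'b' are unchanged
set_member(a, 3)                                # FALSE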


10 Reads for Data Scientists Getting Started with Business Models

If you’re getting started with data science, you’re probably focusing your attention mostly on stats and coding. There’s nothing wrong with this; in fact, it’s the right move – these are essential skills that you need to develop early on in your journey. That said, the biggest knowledge gap that I’ve encountered during my data science journey doesn’t deal with either of these areas. Instead, upon starting my first full-time role as a data scientist, I realized, to my surprise, that I didn’t really understand business. I suspect that this is a common theme. If you studied a technical field in college or picked things up using online courses, it’s unlikely that you ever had any reason to dive deep into business concepts like models, strategy, or key metrics. On top of this, I didn’t really come across data science interviews that stress-tested this type of understanding. Plenty of them tried to get a sense of product intuition, but I found that it rarely went beyond that. The fact is that business understanding isn’t taught or evangelized in the data science community to the extent that it’s used in practice. The goal of this post is to help bridge this gap by sharing some of the resources that I found most helpful as I got up to speed on how businesses work from the inside out.

That does it for the list. I know all of the above links really helped me out, and I hope you take the time to explore them. As you might have noticed, not all of them tie into the day-to-day life of a data scientist – that’s intentional. I said this in my last post, and I’ll say it again: data scientists are thinkers. We do our best work when we understand the systems that surround us. This understanding is what sets us up for the cool stuff: exploratory analysis, machine learning, and data visualization. Lay the foundation first and reap the benefits later. That’s what it’s all about.


AI Ethicists: Moral Grounding or Public Relations Trick?

The Wall Street Journal wrote in a March 1, 2019 article that the ‘Need for AI Ethicists Becomes Clearer as Companies Admit Tech’s Flaws’. I’m all for ethics being applied to an uncharted technological domain that could have tremendous consequences. But what’s being described sounds more like ‘AI business risk mitigation’ than ‘AI ethics’ to me. The start of the article points out the difference (carriage return and italics added for clarity): ‘The call for artificial intelligence ethics specialists is growing louder as technology leaders publicly acknowledge that their products may be flawed and harmful to employment, privacy and human rights. Software giants Microsoft Corp. and Salesforce.com Inc. have already hired ethicists to vet data-sorting AI algorithms for racial bias, gender bias and other unintended consequences that could result in a public relations fiasco or a legal headache.’ So the public call for AI ethics is growing louder because AI may be violating human rights, and the response is to uncover areas where AI can cause PR or legal problems. I sense a disconnect.


Introducing TensorNetwork, an Open Source Library for Efficient Tensor Calculations

Many of the world’s toughest scientific challenges, like developing high-temperature superconductors and understanding the true nature of space and time, involve dealing with the complexity of quantum systems. What makes these challenges difficult is that the number of quantum states in these systems is exponentially large, making brute-force computation infeasible. To deal with this, data structures called tensor networks are used. Tensor networks let one focus on the quantum states that are most relevant for real-world problems – the states of low energy, say – while ignoring other states that aren’t relevant. Tensor networks are also increasingly finding applications in machine learning (ML). However, there remain difficulties that prevent their widespread use in the ML community: 1) a production-level tensor network library for accelerated hardware has not been available to run tensor network algorithms at scale, and 2) most of the tensor network literature is geared toward physics applications and creates the false impression that expertise in quantum mechanics is required to understand the algorithms. To address these issues, we are releasing TensorNetwork, a brand-new open source library to improve the efficiency of tensor calculations, developed in collaboration with the Perimeter Institute for Theoretical Physics and X. TensorNetwork uses TensorFlow as a backend and is optimized for GPU processing, which can enable speedups of up to 100x compared to running on a CPU. We introduce TensorNetwork in a series of papers: the first presents the new library and its API, and provides an overview of tensor networks for a non-physics audience; the second focuses on a particular use case in physics, demonstrating the speedup one gets using GPUs.
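TensorNetwork itself is a Python library, but the core operation it accelerates – contracting tensors over shared indices – can be sketched in a few lines of base R (a toy illustration of the operation, not the library’s API):

# Contract A[i, j, k] with B[k, l] over the shared index k.
set.seed(1)
A <- array(rnorm(2 * 3 * 4), dim = c(2, 3, 4))   # a rank-3 tensor
B <- matrix(rnorm(4 * 5), nrow = 4)              # a rank-2 tensor (matrix)

# Flatten (i, j) into one index (column-major, so this is just a reshape),
# contract over k with an ordinary matrix product, then reshape back.
A_flat <- matrix(A, nrow = 2 * 3, ncol = 4)
C <- array(A_flat %*% B, dim = c(2, 3, 5))       # C[i,j,l] = sum_k A[i,j,k] B[k,l]

# Spot-check one entry against the definition of the contraction.
all.equal(C[1, 2, 3], sum(A[1, 2, ] * B[, 3]))   # TRUE

A tensor network is a graph of many such contractions, and the order in which they are performed is what determines the cost – hence the value of a library that handles this at scale on GPUs.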


Randomized Matrix Decompositions Using R

Matrix decompositions are fundamental tools in applied mathematics, statistical computing, and machine learning. In particular, low-rank matrix decompositions are vital, and widely used for data analysis, dimensionality reduction, and data compression. Massive datasets, however, pose a computational challenge for traditional algorithms, placing significant constraints on both memory and processing power. Recently, the powerful concept of randomness has been introduced as a strategy to ease the computational load. The essential idea of probabilistic algorithms is to employ some amount of randomness in order to derive a smaller matrix from a high-dimensional data matrix. The smaller matrix is then used to compute the desired low-rank approximation. Such algorithms have been shown to be computationally efficient for approximating matrices with low-rank structure. We present the R package rsvd and provide a tutorial introduction to randomized matrix decompositions. Specifically, randomized routines for the singular value decomposition, (robust) principal component analysis, interpolative decomposition, and CUR decomposition are discussed. Several examples demonstrate the routines, and show the computational advantage over other methods implemented in R.
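A quick taste of the randomized SVD routine (a sketch using the package’s rsvd() call; the test matrix below is made up for illustration):

# install.packages("rsvd")
library(rsvd)

set.seed(42)
# A tall 2000 x 500 matrix with exact rank 50.
A <- matrix(rnorm(2000 * 50), nrow = 2000) %*% matrix(rnorm(50 * 500), ncol = 500)

out <- rsvd(A, k = 10)                 # randomized SVD with target rank 10
out$d                                  # approximate top-10 singular values
max(abs(out$d - svd(A)$d[1:10]))       # compare against the deterministic SVD

The target rank k controls the size of the smaller sketch matrix, which is where the memory and compute savings over a full svd() come from.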


Neuromorphic Chips and the Future of Your Cell Phone

The ability to train large-scale CNNs directly on your cell phone, without sending the data on a round trip to the cloud, is the key to next-gen AI applications like real-time computer vision and safe self-driving cars. The problem is that our current GPU AI chips won’t get us there. But neuromorphic chips look like they will.


7 Simple Tricks to Handle Complex Machine Learning Issues

1. Eliminating sample size effects.
2. Sample size determination and simple, model-free confidence intervals.
3. Determining the number of clusters in unsupervised clustering (one common approach is sketched after this list).
4. Fixing issues in regression models when the assumptions are violated.
5. Performing joins on poor quality data.
6. Scale invariant techniques.
7. Blending data sets with incompatible data, adding consistency to your metrics.
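The article’s own tricks are not reproduced here, but as a point of reference for item 3, here is one standard approach – the elbow heuristic with k-means (not necessarily the method the article recommends):

# Run k-means over a range of k and look for the bend ('elbow') in the
# total within-cluster sum of squares.
set.seed(123)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))   # 3 true clusters

wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
# The curve flattens sharply after k = 3, the true number of clusters.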


Linear algebra in R

In this article, you learn how to do linear algebra in R. In particular, I will discuss how to create a matrix, element-wise operations, basic matrix operations, how to combine matrices, computing means and sums, and advanced matrix operations.
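A few of the covered operations, in base R:

A <- matrix(1:6, nrow = 2)          # create a 2 x 3 matrix (column-major fill)
B <- matrix(7:12, nrow = 2)

A + B; A * B                        # element-wise addition and multiplication
A %*% t(B)                          # true matrix multiplication (2 x 2 result)
rbind(A, B); cbind(A, B)            # combine matrices by rows / by columns
rowMeans(A); colSums(A)             # means and sums over rows / columns

S <- matrix(c(2, 1, 1, 3), nrow = 2)
solve(S)                            # matrix inverse
det(S); diag(S)                     # determinant and diagonal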


Visualizations for Algorithmic Trading in R

Demand for visualizations for algorithmic trading is rising in the financial sector. In R there are a lot of great packages for getting data, building visualizations, and modeling strategies for algorithmic trading. In this article, you learn how to perform visualization and modeling for algorithmic trading in R.
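One such package is quantmod (my pick for illustration; the article may cover others). It pulls price data and draws the standard trading charts in a few lines:

# install.packages("quantmod")
library(quantmod)

aapl <- getSymbols("AAPL", src = "yahoo", auto.assign = FALSE)  # OHLCV data
chartSeries(aapl, subset = "2019", theme = chartTheme("white")) # candlestick chart
addSMA(n = 50)                 # overlay a 50-day simple moving average
addRSI(n = 14)                 # add a relative strength index panel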


Data Scientists Are Thinkers

Data scientists serve a very technical purpose, but one that is vastly different from that of other individual contributors. Unlike engineers, designers, and project managers, data scientists are exploration-first rather than execution-first. This isn’t a surprise when you consider the origin of data science: a quick look over the early history of the field shows that things got started with academics researching the possibilities of computational statistics. This researcher-like mindset is still embedded in our DNA. We are constantly surrounded by data that represents the business, product, and customers at scale. This allows us to see things from a 30,000-foot view, where other roles spend most of their time at ground level, working on execution. It’s important that we realize this fact, and more important that we make the most of it.


Email Automation for Google Trends

The email report will also include important search terms that are ‘rising’ or near a ‘breakout point’. This can be really useful as the breakout keywords indicate users across the globe have recently started paying great attention to these search terms. So you could create original content to attract people using these phrases in their Google searches. This is an amazing way to boost traffic/ sales leads, etc. by taking advantage of new trends. Note, that the results are relative, not absolute search volume terms. But it is still quite useful, as you could narrow down your keywords list, and login to Adsense/ keyword planner to get the exact search volume. Export the list from this script to a .csv and simply upload in Adsense. It also gives you a clear indication of how search volumes compare across similar terms – classic SEO ! We will perform all our data pull requests and manipulations in R. The best part is that you do not need any API keys or logins for this tutorial!
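The post’s own code isn’t reproduced here, but gtrendsR is a natural fit for this workflow (an assumption on my part; it needs no API key, and its related-queries table flags exactly the ‘rising’ and ‘breakout’ terms described above):

# install.packages("gtrendsR")
library(gtrendsR)

res <- gtrends(keyword = "data science", time = "today 3-m")
rq  <- res$related_queries

# The table marks each query as 'top' or 'rising'; breakout terms appear
# in the rising subset. All scores are relative, not absolute volumes.
rising <- subset(rq, related_queries == "rising")
head(rising)
write.csv(rising, "rising_terms.csv", row.names = FALSE)  # ready for upload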


AzureR and AzureKeyVault

Just a couple of announcements regarding my family of packages for working with Azure from R. First, the packages have moved from the cloudyr org on GitHub to the Azure org, thus making them ‘official’. A (rather spartan) homepage is here, containing links to the individual repos: https://…/AzureR The cloudyr repos will remain, but going forward they’ll be mirrors of the main repos in Azure. Please submit issues and PRs to the Azure repos. Second, the AzureKeyVault package is now available, on GitHub and on CRAN! This is an interface to the Key Vault service, which allows secure storage of cryptographic keys, certificates, and generic secrets. Both Resource Manager and client interfaces are provided. The package allows you to carry out operations such as encryption and decryption, certificate signing, and managing storage account keys.
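A short sketch of the client interface (based on my reading of the package’s documentation; the vault URL is a placeholder, and this assumes you are already authenticated to Azure):

library(AzureKeyVault)
vault <- key_vault("https://mykeyvault.vault.azure.net")

vault$secrets$create("dbpassword", "hunter2")   # store a generic secret
pw <- vault$secrets$get("dbpassword")

vault$keys$create("signingkey")                 # create a cryptographic key
key <- vault$keys$get("signingkey")             # use for signing/encryption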


Using cosine similarity to find matching documents: a tutorial using Seneca’s letters to his friend Lucilius

Lately I’ve been interested in trying to cluster documents, and to find similar documents based on their contents. In this blog post, I will use Seneca’s Moral letters to Lucilius and compute the pairwise cosine similarity of his 124 letters. Computing the cosine similarity between two vectors returns how similar these vectors are: a cosine similarity of 1 means that the angle between the two vectors is 0, and thus both vectors have the same direction. Seneca’s Moral letters to Lucilius deal mostly with philosophical topics, as Seneca was, among many other things, a philosopher of the stoic school. The stoic school of philosophy is quite interesting, but it has unfortunately been misunderstood, especially in modern times. There is now renewed interest in this school; see Modern Stoicism.
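The core computation is tiny in base R (the toy document-term counts below are invented, not taken from Seneca’s letters):

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy document-term matrix (rows = documents, columns = term counts):
dtm <- rbind(doc1 = c(virtue = 2, death = 1, wealth = 0),
             doc2 = c(virtue = 1, death = 2, wealth = 0),
             doc3 = c(virtue = 0, death = 0, wealth = 3))

cosine_sim(dtm["doc1", ], dtm["doc2", ])   # 0.8: similar vocabulary
cosine_sim(dtm["doc1", ], dtm["doc3", ])   # 0: no shared terms, orthogonal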


Introducing DeclareDesign, a Platform for Research Design

Research design consists of a set of choices about what research procedures to use: for example, how many subjects to interview, which questions to ask them, and what to do in the analysis phase with the data that results from these choices. We do not have good tools for assessing whether the chosen procedures are good ones. DeclareDesign is an R package for learning about, implementing, and communicating research procedures, from data collection to data analysis. Today, most data scientists take raw data as the starting point for analysis, and consider how best to import, tidy, transform, visualize, model, and communicate their results. Yet often during this process, we learn that our data can only provide limited answers to our questions. Perhaps the sample size is too small, we do not have enough observations from men and women to estimate gender differences, or we did not collect data on units’ geographic location, so we cannot merge in administrative data. We rely on rules of thumb or expert opinion to navigate the myriad research design choices we face. We select analysis strategies tailored to the data that we have before us rather than the data that could have arisen. In doing so, we risk introducing biases. Our practices can often lead to bad inferences and bad science.
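A sketch of the declaration style, adapted from the package’s documented two-arm trial example (helper names have shifted across package versions, so treat this as illustrative):

library(DeclareDesign)

design <-
  declare_model(N = 100, U = rnorm(N),
                potential_outcomes(Y ~ 0.25 * Z + U)) +   # the world, as assumed
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +            # the question asked
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +    # the data strategy
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")               # the answer strategy

# Simulate the design many times to diagnose bias, power, coverage, etc.
diagnose_design(design)

Because the whole design is declared up front, you can assess the procedures against the data that could have arisen, not just the data you happened to collect.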