How to unit test machine learning code

Over the past year, I’ve spent most of my working time doing deep learning research and internships. A lot of that year was spent making very big mistakes that taught me not just about ML, but about how to engineer these systems correctly and soundly. One of the main principles I learned during my time at Google Brain was that unit tests can make or break your algorithm and can save you weeks of debugging and training time. However, there doesn’t seem to be a solid tutorial online on how to actually write unit tests for neural network code. Even places like OpenAI only found bugs by staring at every line of their code and trying to reason about why it might cause a bug. Clearly, most of us don’t have that kind of time or self-hatred, so hopefully this tutorial can help you get started testing your systems sanely!
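To make this concrete, here is a minimal sketch of one of the most broadly useful tests: checking that a single optimization step actually changes every trainable parameter. The framework (PyTorch), the toy model, and the test name below are illustrative assumptions rather than the setup from the post itself; the same idea carries over to any framework.

```python
import torch
import torch.nn as nn

def test_all_parameters_update():
    """One gradient step should change every trainable parameter."""
    # Hypothetical toy model and data; any model/loss pair is tested the same way.
    model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    inputs, targets = torch.randn(16, 4), torch.randn(16, 1)

    before = [p.detach().clone() for p in model.parameters()]

    loss = nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    for prev, param in zip(before, model.parameters()):
        # A parameter that never moves usually means a frozen layer, a detached
        # graph, or a variable that was never handed to the optimizer.
        assert not torch.equal(prev, param), "a parameter was never updated"

if __name__ == "__main__":
    test_all_parameters_update()
    print("ok: every parameter changed after one training step")
```

The same pattern extends to output-shape checks and loss-goes-down checks, and a runner such as pytest will pick up the test function automatically.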


Natural Language Understanding (NLU) in Enterprise

“Communication In Focus” (CIF) is an NLU technology based on a novel approach that does not require terms to be defined in advance. Rather, the design uses Context Discriminants to digest new subjects based on a prior understanding of the base language. A Context Discriminant (CD) reduces complex documents into snippets of words from semantic neighbors, consisting of context and points of view on subjects. Higher-order derivatives are obtained by applying a CD to the result produced by the prior CD. This approach enables us to refine the contexts of related subjects across distant semantic neighbors and to discover higher-order dependent subjects that depict entity relationships between subjects.


Artificial Intelligence is not “Fake” Intelligence

The word “artificial” may not be the right term to use to describe “Artificial Intelligence,” because artificial intelligence is anything but fake, false, phony, or a sham. Maybe a better term is “Augmented Human Intelligence,” or some phrase that highlights both the importance of augmenting human intelligence and the need to alleviate fears that AI means humans become “meat popsicles” (quick, name that Bruce Willis movie reference!). And while I don’t expect this name change to stick (if it does, please give me some credit), I’m using this blog as an excuse to introduce some marvelous new training materials on artificial intelligence and machine learning.


Understanding Data Roles

With the rise of Big Data has come an accompanying explosion in roles that in some way involve data. Most people who are in any way involved with enterprise technology are at least familiar with them by name, but sometimes it’s helpful to look at them through a comprehensive lens that shows how they all fit together. To understand how data roles mesh, think of them in terms of two pools: one responsible for making data ready for use, and another that puts that data to use. The latter includes the tightly woven roles of Data Analyst and Data Scientist, and the former includes roles such as Database Administrator, Data Architect, and Data Governance Manager.


Data Warehouse and Data Lake Analytics Collaboration

So data warehousing may not be cool anymore, you say? It’s yesterday’s technology (or 1990s technology, if you’re as old as me) that served yesterday’s business needs. And while it’s true that recent big data and data science technologies, architectures, and methodologies seem to have pushed data warehousing to the back burner, it is entirely false that the data warehouse and Business Intelligence have no critical role in digitally transformed organizations. Maybe the best way to understand today’s role of the data warehouse is with a bit of history. And please excuse us if we take a bit of liberty with that history (since we were there for most of it!).


Eager Execution: An imperative, define-by-run interface to TensorFlow

Today, we introduce eager execution for TensorFlow. Eager execution is an imperative, define-by-run interface where operations are executed immediately as they are called from Python. This makes it easier to get started with TensorFlow, and can make research and development more intuitive.
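As a quick illustration, here is a minimal sketch assuming a TensorFlow 1.x release that exposes tf.enable_eager_execution (in TensorFlow 2.x, eager execution is on by default and no call is needed).

```python
import tensorflow as tf

tf.enable_eager_execution()  # TensorFlow 1.x API; TF 2.x runs eagerly by default

# Operations execute immediately and return concrete values:
# no graph construction step and no Session.run().
x = tf.constant([[2.0, 3.0]])
w = tf.constant([[1.0], [4.0]])
print(tf.matmul(x, w))  # a concrete tensor holding [[14.]]
```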


Intel’s New Processors: A Machine-learning Perspective

Machine learning consists of diverse and fast-evolving algorithms, using several main compute building blocks and requiring high memory bandwidth. Intel Xeon Phi offers enhanced performance for dense linear algebra, most notably for deep learning training. Intel has optimized the deep-learning software stack for both Intel Xeon and Intel Xeon Phi, including enabling efficient scale-out across multiple nodes.


6 Books Every Data Scientist Should Keep Nearby

1. Machine Learning Yearning, by Andrew Ng
2. Hadoop: The Definitive Guide, by Tom White
3. Predictive Analytics, by Eric Siegel
4. Storytelling With Data, by Cole Nussbaumer Knaflic
5. Inflection Point, by Scott Stawski
6. An Introduction to Statistical Learning With Applications in R, by Gareth James et al.


Developing a successful data governance strategy

Data governance has become increasingly critical as more organizations rely on data to make better decisions, optimize operations, create new products and services, and improve profitability. Upcoming data protection regulations such as the EU’s new GDPR will require organizations to take a forward-looking approach in order to comply. Additionally, regulated industries, such as health care and finance, spend a tremendous amount of money complying with regulations that are constantly changing. Developing a successful data governance strategy requires careful planning, the right people, and the appropriate tools and technologies. It is necessary to implement the required policies and procedures across all of an organization’s data in order to ensure that everyone acts in accordance with the regulatory framework.


Survey of Kagglers finds Python, R to be preferred tools

Competitive predictive modeling site Kaggle conducted a survey of participants in its prediction competitions, and the 16,000 responses provide some insights about that user community. (Whether those trends generalize to the wider community of all data scientists is unclear, however.) One question of interest asked what tools Kagglers use at work. Python is the most commonly used tool within this community, and R is second. (Respondents could select more than one tool.)


ider: Intrinsic Dimension Estimation with R

In many data analyses, the dimensionality of the observed data is high while its intrinsic dimension remains quite low. Estimating the intrinsic dimension of an observed dataset is an essential preliminary step for dimensionality reduction, manifold learning, and visualization. This paper introduces an R package, named ider, that implements eight intrinsic dimension estimation methods, including a recently proposed method based on a second-order expansion of the probability mass function and a generalized linear model. The usage of each function in the package is explained with datasets generated using a data-generating function that is also included in the package.
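The package itself is written in R, but the core idea is easy to see in a short sketch of one classic estimator from this literature, the Levina-Bickel maximum likelihood estimator. The Python re-implementation below is purely illustrative; it is not the ider code and does not show the new second-order method the paper proposes.

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel maximum-likelihood estimate of intrinsic dimension."""
    # Pairwise Euclidean distances, sorted so column j holds the j-th neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    knn = d[:, 1:k + 1]                                 # drop the zero self-distance
    # Per-point estimate: inverse mean of log(T_k / T_j) for j = 1..k-1.
    per_point = 1.0 / np.log(knn[:, -1:] / knn[:, :-1]).mean(axis=1)
    return per_point.mean()

# Toy data: a 2-D plane embedded in 10-D space, so the estimate should be near 2.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]        # random orthonormal embedding
print(round(mle_intrinsic_dim(latent @ basis.T), 2))
```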


Getting Started with Machine Learning in one hour!

I was planning the agenda for my one-hour talk. After a lot of contemplation and thought, conveying the learning paths, setting up the environment, and explaining the important machine learning concepts finally made it onto the agenda. I had initially considered various ways this talk could be done, including a hands-on Python session with linear regression, explaining linear regression in detail, or just sharing the learning journey I have gone through over the past 18 months. But I wanted to leave the audience with lots of new information and questions to work on, to create curiosity and interest in them, and I think I managed that to a decent level. Basically, the goal was to get them started with machine learning. That’s how this guide ended up being called Getting Started with Machine Learning in one hour. The notes for the talk made a great introductory learning path, but they were structured only to help me deliver the talk, so I turned them into a machine learning getting-started guide, and here it is. I’m very happy with the way this ended up taking shape and I’m excited to share it! There are two main approaches to learning machine learning: the theoretical approach and the applied approach. I’ve written about them in an earlier blog post.


What Is Conjoint Analysis?

Say you’re developing a new product. One thing you’ll want to know is how important various features of that type of product or service are to consumers. We often try to get at this by asking respondents directly in focus groups or quantitative surveys, but this can mislead us because many people have difficulty answering such questions; in surveys, for example, many will claim that just about everything about a product is important. Instead, conjoint analysis forces respondents to make trade-offs. It enables researchers to decompose a product, real or hypothetical, into its constituent parts and estimate the relative importance of each of these parts. (“Utility” is frequently used in conjoint parlance to mean importance.) These components can be reassembled in many combinations to form real or hypothetical products, and “what if” simulations can be run that pit these products against each other. By modifying a product’s mix of features, such as raising or lowering price or adding or deleting a feature, we can see which products capture the highest preference share. Since tastes can vary considerably among consumers, the utilities can also be used in segmentation analysis to identify groups of people whose preferences differ from those in other segments.

Conjoint analysis has been used in marketing research since the 1970s, sparked by the influential 1974 paper “On the Design of Choice Experiments Involving Multifactor Alternatives” by eminent Wharton professor Paul Green in the Journal of Consumer Research. “Conjoint measurement” was a term used interchangeably with “conjoint analysis” for many years, and the technique is now typically known just as “conjoint.”
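As a toy illustration of the decomposition, here is a minimal ratings-based conjoint sketch. The attributes, the “true” part-worths, and the respondent ratings are all simulated for illustration; real studies use purpose-built experimental designs and estimation methods (for example, choice-based conjoint with hierarchical Bayes) rather than a plain least-squares fit.

```python
import numpy as np

# Simulated study: eight product profiles defined by three dummy-coded
# attributes (hypothetical names), each receiving a simulated rating.
rng = np.random.default_rng(0)
profiles = np.array([
    # price_high, brand_B, premium_feature
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
    [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
])
true_part_worths = np.array([-2.0, 0.5, 1.5])          # used only to fake the ratings
ratings = 5.0 + profiles @ true_part_worths + rng.normal(0, 0.3, size=len(profiles))

# Decompose the ratings into part-worth utilities with ordinary least squares.
X = np.column_stack([np.ones(len(profiles)), profiles])
coefs, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print("estimated part-worths (price_high, brand_B, premium):", coefs[1:].round(2))

# "What if" simulation: total utility of two hypothetical products.
product_a = np.array([1, 0, 1])   # high price, brand A, premium feature
product_b = np.array([0, 1, 0])   # low price, brand B, no premium feature
for name, prod in [("A", product_a), ("B", product_b)]:
    print(f"product {name} utility:", round(coefs[0] + prod @ coefs[1:], 2))
```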