Reducing the Need for Labeled Data in Generative Adversarial Networks

Generative adversarial networks (GANs) are a powerful class of deep generative models.The main idea behind GANs is to train two neural networks: the generator, which learns how to synthesise data (such as an image), and the discriminator, which learns how to distinguish real data from the ones synthesised by the generator. This approach has been successfully used for high-fidelity natural image synthesis, improving learned image compression, data augmentation, and more.

Deep Compression: Optimization Techniques for Inference & Efficiency

As technology cozies up to the physical limits of Moore’s law, computation is increasingly limited by heat dissipation rather than the number of transistors that can be packed onto a given area of silicon. Modern chips already routinely idle whole sections of their area forming what’s referred to as ‘dark silicon,’ referring to design constraining the proportion of the chip that can be powered on for extended periods of time before dangerously surpassing thermal design limits. An important metric for any machine learning accelerator, including tried and true general purpose graphics processing units (GPUs), is therefore the training or inference performance of the device at about 250 W of power consumption. It’s this drive for efficiency that prompted Google to develop their own tensor processing unit (TPU) in-house for increasing the efficiency of model inference, as well as training in the v2 unit onward, in their data centers. The last few years have seen a plethora of deep learning hardware startups that seek to produce energy-efficient deep learning chips that specialize in the limited mix of operations required for currently fashionable deep learning architectures. If deep learning mostly consists of matrix multiplies and convolutions, why not build a chip that does just that, trading general computing capabilities for specialized efficiency? But there’s more to this tradeoff then meets the eye.

Proposals for model vulnerability and security

Like many others, I’ve known for some time that machine learning models themselves could pose security risks. A recent flourish of posts and papers has outlined the broader topic, listed attack vectors and vulnerabilities, started to propose defensive solutions, and provided the necessary framework for this post. The objective here is to brainstorm on potential security vulnerabilities and defenses in the context of popular, traditional predictive modeling systems, such as linear and tree-based models trained on static data sets. While I’m no security expert, I have been following the areas of machine learning debugging, explanations, fairness, interpretability, and privacy very closely, and I think many of these techniques can be applied to attack and defend predictive modeling systems. In hopes of furthering discussions between actual security experts and practitioners in the applied machine learning community (like me), this post will put forward several plausible attack vectors for a typical machine learning system at a typical organization, propose tentative defensive solutions, and discuss a few general concerns and potential best practices.

Model Based Reinforcement Learning

If you have ever played a real-time strategy game (RTS) such as Age of Empires or others, you surely know that you start by an almost black screen. The first thing you do is to send units in every direction to scout the terrain and discover the enemy location, as well as the strategic locations and resources such as mines, forests etc…

“Microsoft Malware Prediction” and its 9 million machines

Kaggle recently finished the competition ‘Microsoft Malware’ which asks you to predict if a computer is infected. The competition states
‘The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.’
As a cybersecurity practitioner, this was a fantastic opportunity with a substantial amount of data to try and ask questions like:
• ‘Why are computers infected?’
• ‘What are the biggest contributors?’
• ‘How is patching going?’
Unfortunately, the competition answers more questions around machine learning rather than malware prevention or patching best practices.

Reinforcement learning: Temporal-Difference, SARSA, Q-Learning & Expected SARSA on python

TD, SARSA, Q-Learning & Expected SARSA along with their python implementation and comparison. Temporal Difference learning is the most important reinforcement learning concept. It’s further derivatives like DQN and double DQN (I may discuss them later in another post) have achieved groundbreaking results renowned in the field of AI. Google’s alpha go used DQN algorithm along with CNNs to defeat the go world champion. You are now equipped with the theoretical and practical knowledge of basic TD, go out and explore!

Complete Guide to Machine Learning Project Structuring for Managers

Structure your machine learning projects to create tangible results and increase the team’s efficiency. AI is now on everyone’s mind. Established companies are disrupting themselves and making a painfully slow shift towards becoming data-driven organizations, while startups need to implement clear and efficient data strategies to be relevant. Although the necessity of having a data strategy is now widely accepted across small and large companies, a common challenge remains: how to structure and manage a machine learning project? This article offers a framework to help you manage a machine learning project. Of course, you will have to adapt it to your company’s specific needs, but it will get you in the right direction.

Bias Variance Trade Off

Deep Learning is a highly empirical domain which majorly focusses on fine-tuning the various parameters. The choice of these parameters defines the accuracy of the model. So, it becomes important to choose such parameters wisely. Choosing the parameters based on intuition might not work every time which can degrade the performance of the model drastically. Also, a considerable amount of time and effort will go in vain while selecting the values of parameters based on intuition. So, the methodological approach should be followed to select the optimal values of parameters to maximize the accuracy of the model.

Training for Context: The Importance of Word2Vec

We’ve all experienced the great data rush as companies push to use analytics to drive business decisions. After all, the proliferation of data and its intelligent analysis can change entire company trajectories. But to make the quintillions of data created each day truly useful, as well as all that has come before, it must be understandable to an artificial intelligence (AI) system. Dealing with numbers is one thing, but human language is something else entirely. Making texts, comments, forum posts, tweets, web pages, blog posts, reviews and books digestible to a system that cannot draw on context is a whole other matter. There are many techniques that address this problem, but they usually generate very large data tables. Word2Vec, however, is a technique that offers a compact representation of a word and its context. Let’s take a look.

How to Do A/B Testing: A Checklist You’ll Want to Bookmark

A /B testing is a part of Conversion Rate Optimization (CRO). Turning visitors into customers or towards the desired action is a challenge that most companies face. CRO is about using analytics and user insights to improve the performance of your website.

Should Data Science Become A Universal Skill?

On April 2, 2019 at the Galvanize Campus in San Francisco, California, Data.World will host an Afternoon of Data to raise questions and brainstorm insights surrounding the very prominent concept of data literacy in today’s society. With the event two weeks out, its speakers, all prominent figures in the data space, weighed in on some widespread issues that surface as data literacy grows across industries.

Predictive Analysis in R

Predictive analysis is heavily used today to gain insights on a level that are not possible to detect with human eyes. And R is an extremely powerful and easy tool to implement the same. In this piece, we will explore how we can predict the status of breast cancer using predictive modeling in less than 30 lines of code.

Machine Learning Algorithms Every Data Scientist Should Know

There are a huge number of ML algorithms out there. Trying to classify them leads to the distinction being made in types of the training procedure, applications, the latest advances, and some of the standard algorithms used by ML scientists in their daily work. There is a lot to cover, and we shall proceed as given in the following listing:
1. Statistical Algorithms
2. Classification
3. Regression
4. Clustering
5. Dimensionality Reduction
6. Ensemble Algorithms
7. Deep Learning
8. Reinforcement Learning
9. AutoML (Bonus)

Evaluating Keras neural network performance using Yellowbrick visualizations

When working with neural networks, having a library of advanced visualizations you can use to dig into specific properties of your model is essential to your ability to iterate on your model builds quickly and effectively. In this post we saw how we can leverage yellowbrick with keras to build some of these kinds of graphs. Hopefully having read this post, you’re now ready to replace one or two hacky matplotlib code gists lying around with well-maintained, well-architected visualization recipes from this nifty new library. Interested in learning more about the Python data visualization ecosystem? I recommend watching Jake Vanderplas’s highly entertaining PyCon 2017 talk ‘The Python Visualization Landscape’. Interested in learning more about Yellowbrick? In additional to the model evaluation plots showcased here, the library also provides plots for evaluating unsupervised clustering, modeling text, and modeling data features. Take a look. Unfortunately there are many more advanced model evaluation plots in the library that didn’t appear here because they aren’t as easily emulated. But, watch this space?-?I suspect using using yellowbrick visualizations with keras models is going to get easier in the future!


Yellowbrick is a suite of visual diagnostic tools called ‘Visualizers’ that extend the scikit-learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines scikit-learn with matplotlib in the best tradition of the scikit-learn documentation, but to produce visualizations for your machine learning workflow! For complete documentation on the Yellowbrick API, a gallery of available visualizers, the contributor’s guide, tutorials and teaching resources, frequently asked questions, and more, please visit our documentation at

BERT in Keras with Tensorflow hub

At Strong Analytics, many of our projects involve using deep learning for natural language processing. In one recent project we worked to encourage kids to explore freely online while making sure they stayed safe from cyberbullying and online abuse, while another involved predicting deductible expenses from calendar and email events. A key component of any NLP project is the ability to rapidly test and iterate using techniques. Keras offers a very quick way to prototype state-of-the-art deep learning models, and is therefore an important tool we use in our work. In a previous post, we demonstrated how to integrate ELMo embeddings as a custom Keras layer to simplify model prototyping using Tensorflow hub. BERT, a language model introduced by Google, uses transformers and pre-training to achieve state-of-the-art on many language tasks. It has recently been added to Tensorflow hub, which simplifies integration in Keras models.

Data science productionization: portability

This is the second part of a five-part series on data science productionization. I’ll update the following list with links as the posts become available:
1. What does it mean to ‘productionize’ data science?
2. Portability
3. Maintenance
4. Scale
5. Trust