How to Write a Jupyter Notebook Extension

Jupyter Notebook Extensions are simple add-ons that can significantly improve your productivity in the notebook environment. They automate tedious tasks such as formatting code, or add features like creating a table of contents. While there are numerous existing extensions, we can also write our own extension to extend the functionality of Jupyter. In this article, we’ll see how to write a Jupyter Notebook extension that adds a default cell to the top of each new notebook, which is useful when there are libraries you find yourself importing into every notebook. If you want the background on extensions, check out this article. The complete code for this extension is available on GitHub.
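The extension described in the article is written in JavaScript for the classic notebook front end; as a rough Python illustration of the same end result (a default import cell at the top of a notebook), here is a minimal sketch using nbformat to prepend such a cell to an existing notebook file. The file name and cell contents are hypothetical, not taken from the article.

```python
# A minimal sketch (not the article's JavaScript extension): prepend a default
# import cell to an existing notebook with nbformat. "example.ipynb" and the
# cell contents are hypothetical.
import nbformat

DEFAULT_CELL = "import numpy as np\nimport pandas as pd\n\n%matplotlib inline"

nb = nbformat.read("example.ipynb", as_version=4)             # load the notebook
nb.cells.insert(0, nbformat.v4.new_code_cell(DEFAULT_CELL))   # add a cell at the top
nbformat.write(nb, "example.ipynb")                           # save it back
```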


A short guide to using Docker for your data science environment

One of the most time-consuming parts of starting work on a new system, starting a new job, or simply sharing your work is dealing with the variation in available tools (or the lack thereof) caused by differences in hardware, software, and security policies. Containerization has risen in recent years as a ready-to-use way to work around platform differences across a variety of applications, from single-user development environments (try out JetBrains tools with Docker) to highly distributed production environments (e.g., Google Cloud, Docker). In this article we’ll go through Docker as our container service and the steps needed to get started with building your custom development platform for data science. Personally, I believe that the ability to build upon others’ systems is one of the biggest advantages of using Docker. You can get started by simply cloning an already existing environment if it satisfies all your needs, or you can build on top of it to make it even more useful for your specific use case using your own Dockerfile!
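The article itself works with the Docker CLI and Dockerfiles; purely as a hedged sketch of the "clone an existing environment" idea, here is how one might pull and run an existing data science image programmatically with the docker Python SDK. The base image name is just an example.

```python
# A small sketch with the docker Python SDK (the article uses the Docker CLI and
# a Dockerfile instead); "jupyter/scipy-notebook" is only an example base image.
import docker

client = docker.from_env()

# Start from an existing data science environment instead of building from scratch.
client.images.pull("jupyter/scipy-notebook")

# Run it, exposing the notebook server port on the host.
container = client.containers.run(
    "jupyter/scipy-notebook",
    ports={"8888/tcp": 8888},
    detach=True,
)
print(container.logs().decode()[:500])
```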


Data network effects for an artificial intelligence startup

Shifting attention from product and data collection to network and data sharing. The artificial intelligence (AI) ecosystem is maturing, and it is becoming increasingly difficult to impress customers, investors, and potential acquirers by just attaching an .ai domain to whatever you are doing. The significance of building a business model that is defensible in the long run therefore becomes obvious. In this post, I explore how an AI startup may unlock various data network effects. I explain why it’s important to go beyond the conventional definition of data network effects as a way to collect data from clients for the sake of improving your model/product.


Shortcoming of Under-sampling Algorithms: CCMUT and E-CCMUT

I have applied the Cluster Centroid based Majority Under-sampling Technique (CCMUT) to the Adult Census dataset and demonstrated a model performance improvement over the state-of-the-art model in ‘A Statistical Approach to Adult Census Income Level Prediction’ [1]. But there is a latent shortcoming of the improved model developed using CCMUT.
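For readers who want to experiment with cluster-centroid-style under-sampling, a related off-the-shelf baseline is available in imbalanced-learn. The sketch below uses that library on a toy imbalanced dataset; it is not the author's CCMUT or E-CCMUT implementation, only a point of comparison.

```python
# Not the author's CCMUT: imbalanced-learn's ClusterCentroids under-sampler,
# shown only as a related baseline on a toy imbalanced dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Toy dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

cc = ClusterCentroids(random_state=0)
X_res, y_res = cc.fit_resample(X, y)   # majority class reduced toward cluster centroids
print("after: ", Counter(y_res))
```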


Ontology and Data Science

If you are new to the word ontology, don’t worry: I’m going to give a primer on what it is, and then explain why it matters for the data world. I’ll be explicit about the difference between philosophical ontology and the ontology related to information and data in computer science.


Random Forest

Random forests are widely applied to both data science competitions and practical problems. They are often accurate, require neither feature scaling nor categorical feature encoding, and need little parameter tuning. They can also be more interpretable than other complex models such as neural networks.
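As a generic illustration of these claims (not code from the article), the scikit-learn sketch below fits a random forest with near-default hyperparameters, no feature scaling, and inspects the built-in feature importances.

```python
# A generic illustration (not from the article): a random forest with near-default
# hyperparameters, no feature scaling, and built-in feature importances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)  # little tuning needed
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("largest feature importance:", clf.feature_importances_.max())
```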


Building an End-To-End Data Science Project

It is often said that the majority of a data scientist’s work is not the actual analysis and modeling but rather the data wrangling and cleaning. As a result, full-cycle data science projects that involve these stages are more valuable, since they prove the author’s ability to work independently with real data, as opposed to a pre-cleaned dataset handed to them.


Timing Grouped Mean Calculation in R

These timings are of the small-task, large-number-of-repetitions breed that Matt Dowle writes against, so at first they wouldn’t seem that decisive.


Automatic Dashboard visualizations with Time series visualizations in R

In this article, you will learn how to build automatic dashboard visualizations with time series visualizations in R. First you need to install the rmarkdown package into your R library. Assuming that you have installed rmarkdown, you next create a new rmarkdown script in R.


TensorFlow Filesystem – Access Tensors Differently

TensorFlow is great. Really, I mean it. The problem is that it’s only great up to a point. Sometimes you want to do very simple things, but TensorFlow gives you a hard time. The motivation I had for writing TFFS (TensorFlow File System) can be shared by anyone who has used TensorFlow, including you.


Review: ResNeXt – 1st Runner Up of ILSVRC 2016 (Image Classification)

In this story, ResNeXt, by UC San Diego and Facebook AI Research (FAIR), is reviewed. The model name, ResNeXt, contains ‘Next’: it refers to the next dimension on top of ResNet, called the ‘cardinality’ dimension. ResNeXt was the 1st runner-up of the ILSVRC 2016 classification task.
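In practice the cardinality dimension is commonly realized with grouped convolutions. The PyTorch sketch below shows a simplified ResNeXt-style bottleneck block; the channel widths and the omission of downsampling are illustrative choices, not the paper's exact configuration.

```python
# A simplified sketch of a ResNeXt-style bottleneck block; widths and the lack of
# downsampling are illustrative, not the paper's exact configuration. "cardinality"
# is the number of parallel paths, realized here by a grouped 3x3 convolution.
import torch
import torch.nn as nn


class ResNeXtBlock(nn.Module):
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            # Grouped convolution: "cardinality" parallel paths in a single op.
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.net(x))  # residual connection


x = torch.randn(1, 256, 56, 56)
print(ResNeXtBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```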


A gentle journey from linear regression to neural networks

Deep learning is a very trendy term. The main reason is that related techniques have recently shown an incredible ability to produce really good, if not state-of-the-art, results on various kinds of problems, ranging from image recognition to text translation. Faced with this growing technology, made possible by the increase in available data and computing power, it can sometimes be difficult for the uninitiated to really know ‘what happens behind the scenes’. What is deep learning? How do neural networks work?


Edge-to-Cloud Data Fabric

If data is not already the lifeblood of your business, it will soon be a critical competitive imperative. Several major impediments keep most organizations from taking full advantage of their data, but new technology now makes it possible to create a modern global data fabric that can radically modernize an organization’s data management strategy while unlocking business value to drive transformation of the business in a more compelling way. MapR has a unique vision and technology to create such a data fabric while also operationalizing the data for business impact.


Visual Studio IntelliCode – Preview

This extension provides AI-assisted IntelliSense by showing recommended completion items for your code context at the top of the completions list. When it comes to overloads, rather than making you cycle through the alphabetical list of members, IntelliCode presents the most relevant one first. In the example shown above, you can see that the predicted APIs that IntelliCode elevates appear in a new section at the top of the list, with members prefixed by a star icon. Similarly, a member’s signature or overloads shown in the IntelliSense tool-tip will have additional text marked by a small star icon and wording to explain the recommended status. This visual experience for members in the list and the tool-tip is not intended as final – it is there to provide visual differentiation for feedback purposes only. Contextual recommendations are based on practices developed in thousands of high-quality open-source projects on GitHub, each with high star ratings. This means you get context-aware code completions, tool-tips, and signature help rather than alphabetical or most-recently-used lists. By predicting the most likely member in the list based on your coding context, AI-assisted IntelliSense saves you from having to hunt through the list yourself.


Physics-guided Neural Networks (PGNNs)

Physics-based models are at the heart of today’s technology and science. In recent years, data-driven models have started to provide an alternative approach and have outperformed physics-driven models in many tasks. Even so, they are data hungry, their inferences can be hard to explain, and generalization remains a challenge. Combining data and physics could reveal the best of both worlds.
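One common way of combining the two, in the spirit of physics-guided neural networks, is a hybrid loss that adds a physics-based inconsistency penalty to the usual data loss. The PyTorch sketch below uses a purely illustrative constraint (predictions should not decrease in the first input feature); it is not the constraint or architecture from the PGNN paper.

```python
# A hedged sketch of a physics-guided loss: data loss plus a penalty for violating
# a physical constraint. The monotonicity constraint below is purely illustrative,
# not taken from the PGNN paper.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
mse = nn.MSELoss()
lam = 0.1  # weight of the physics penalty


def physics_penalty(x, y_pred):
    # Illustrative constraint: predictions should not decrease as the first
    # feature increases; penalize violations between consecutively sorted samples.
    order = torch.argsort(x[:, 0])
    diffs = y_pred[order][1:] - y_pred[order][:-1]
    return torch.relu(-diffs).mean()


x = torch.randn(64, 3)
y = torch.randn(64, 1)
y_pred = model(x)
loss = mse(y_pred, y) + lam * physics_penalty(x, y_pred)
loss.backward()
```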


Simpson’s Paradox and Interpreting Data

Edward Hugh Simpson, a statistician and former cryptanalyst at Bletchley Park, described the statistical phenomenon that takes his name in a technical paper in 1951. Simpson’s paradox highlights one of my favourite things about data: the need for good intuition regarding the real world and how most data is a finite-dimensional representation of a much larger, much more complex domain. The art of data science is seeing beyond the data, using and developing methods and tools to get an idea of what that hidden reality looks like. Simpson’s paradox showcases the importance of skepticism and of interpreting data with respect to the real world, and also the dangers of oversimplifying a more complex truth by trying to see the whole story from a single data viewpoint.
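A tiny pandas example makes the phenomenon concrete: a treatment can win within every subgroup yet lose once the groups are pooled. The counts below are of the classic kidney-stone-style illustration, not data from the article.

```python
# A tiny illustration of Simpson's paradox (classic kidney-stone-style counts, not
# data from the article): treatment A wins in every subgroup, yet B looks better
# when the subgroups are pooled.
import pandas as pd

df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "group":     ["small", "large", "small", "large"],
    "success":   [81, 192, 234, 55],
    "total":     [87, 263, 270, 80],
})

per_group = df.assign(rate=df.success / df.total)
print(per_group)  # A has the higher rate in both groups

overall = df.groupby("treatment")[["success", "total"]].sum()
print(overall.success / overall.total)  # yet B has the higher overall rate
```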


Observing is not intervening

In causal inference we are interested in measuring the effect that a variable A, say a treatment for some particular disease, has on some other variable B, say the probability of recovery, often from observational data. This means that we are interested in measuring the difference in the probability of recovery between the cases A = treated and A = untreated. In data science and machine learning we are used to working with conditional probabilities, which may seem useful for this purpose. However, we will see with a simple example that conditioning is not enough.
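A small made-up simulation (not the article's example) shows the gap: when a confounder such as disease severity influences both who gets treated and who recovers, the observed P(recovery | treated) differs from the interventional P(recovery | do(treated)) obtained by assigning treatment regardless of severity.

```python
# A made-up simulation of "conditioning is not intervening": a confounder Z
# (say, disease severity) influences both treatment assignment and recovery.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

z = rng.random(n) < 0.5                          # confounder: severe case or not
a_obs = rng.random(n) < np.where(z, 0.9, 0.1)    # severe cases are usually treated
b_obs = rng.random(n) < (0.5 + 0.2 * a_obs - 0.4 * z)  # treatment helps, severity hurts

# Conditional probability estimated from the observational data.
print("P(recovery | treated)     =", b_obs[a_obs].mean())   # ~0.34

# Interventional regime: force treatment for everyone, do(A = treated).
b_do = rng.random(n) < (0.5 + 0.2 - 0.4 * z)
print("P(recovery | do(treated)) =", b_do.mean())            # ~0.50
```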


Probability and Statistics explained in the context of deep learning

This article is intended for beginners in deep learning who wish to gain knowledge about probability and statistics, and also as a reference for practitioners. In my previous article, I wrote about the concepts of linear algebra for deep learning in a top-down approach (link for the article); if you are not comfortable with linear algebra, please read that first. The same top-down approach is used here: the use cases are described first, followed by the concepts.


Deep Learning Model Training Loop

Several months ago I started exploring PyTorch, a fantastic and easy-to-use deep learning framework. In the previous post, I described how to implement a simple recommendation system using the MovieLens dataset. This time I would like to focus on a topic essential to any machine learning pipeline: the training loop. The PyTorch framework provides you with all the fundamental tools to build a machine learning model. It gives you CUDA-driven tensor computations, optimizers, neural network layers, and so on. However, to train a model, you need to assemble all these things into a data processing pipeline.
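For orientation, here is a minimal sketch of the kind of training loop the post is about, with toy data standing in for a real dataset such as MovieLens; it is not the author's exact code.

```python
# A minimal training-loop sketch (not the post's exact code): model, loss,
# optimizer, and the epoch/batch loop that ties them together.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data standing in for a real dataset such as MovieLens.
X, y = torch.randn(512, 10), torch.randn(512, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    epoch_loss = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()            # reset gradients from the previous step
        loss = criterion(model(xb), yb)  # data loss for this batch
        loss.backward()                  # backpropagate
        optimizer.step()                 # update parameters
        epoch_loss += loss.item() * xb.size(0)
    print(f"epoch {epoch}: loss {epoch_loss / len(loader.dataset):.4f}")
```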


Failure Pressure Prediction Using Machine Learning

In this post, the failure pressure of a pipeline containing a defect will be predicted based solely on burst test results and machine learning models. For this purpose, various machine learning models will be fitted to the test data in R using the caret package, and in the process their accuracy will be compared in order to identify the best performing one(s).


What’s New in Deep Learning Research

Knowledge generalization represents the biggest challenge in modern artificial intelligence (AI) systems, and one that contrasts sharply with human cognitive skills. While humans exhibit a native ability to reuse knowledge and experiences across different domains, AI agents struggle to generalize knowledge acquired in training environments to test environments. It is very common for AI agents that master ten different tasks in a given training environment to fail at the eleventh task in a test environment. Deep reinforcement learning (DRL) is one of the AI disciplines most vulnerable to the generalization challenge, as agents need to learn by interacting with relatively unknown environments. Recently, researchers from OpenAI released the first version of CoinRun, a new training environment that can quantify generalization in DRL agents. The principles behind CoinRun were captured in a research paper published a few days ago.


Anzo – Data Fabric for Automated Data Management and Analytics

A Semantic Layer for the Enterprise, enabling connected data access and analytics on demand. Until now, no technology has been able to deliver a semantic layer at enterprise scale – with security, governance, and accessibility. Finally, there is a production technology with the breadth of functionality and performance to make it not only possible – but practical – for organizations to truly treat enterprise data as an asset. Use Anzo to take the first step towards a fully connected Enterprise Data Fabric.