How to Design a Successful Data Lake

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. The Data Lake, a next-generation data storage and management solution, was developed to meet the ever-evolving needs of increasingly savvy users. This white paper explores the challenges with the enterprise data warehouse and other existing data management and analytics solutions. It describes the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model. It also covers the characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.


Finding and fixing software bugs automatically with SapFix and Sapienz

Debugging code is drudgery. But SapFix, a new AI hybrid tool created by Facebook engineers, can significantly reduce the amount of time engineers spend on debugging, while also speeding up the process of rolling out new software. SapFix can automatically generate fixes for specific bugs, and then propose them to engineers for approval and deployment to production. SapFix has been used to accelerate the process of shipping robust, stable code updates to millions of devices using the Facebook Android app – the first such use of AI-powered testing and debugging tools in production at this scale.

We intend to share SapFix with the engineering community, as it is the next step in the evolution of automated debugging, with the potential to boost the production and stability of new code for a wide range of companies and research organizations. SapFix is designed to operate as an independent tool, able to run either with or without Sapienz, Facebook’s intelligent automated software testing tool, which was announced at F8 and has already been deployed to production.

In its current, proof-of-concept state, SapFix is focused on fixing bugs found by Sapienz before they reach production. The process starts with Sapienz, along with Facebook’s Infer static analysis tool, helping localize the point in the code to patch. Once Sapienz and Infer pinpoint a specific portion of code associated with a crash, they pass that information to SapFix, which automatically picks from a few strategies to generate a patch.
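
The article describes a pipeline rather than a public API, so the following is only a minimal, hypothetical sketch of that kind of automated-repair loop (crash report → fault localization → candidate patches → re-test → proposal to a human reviewer). None of the names below come from SapFix, Sapienz or Infer; they are placeholders for illustration.

    # Hypothetical sketch of an automated-repair loop in the spirit of the
    # Sapienz -> Infer -> SapFix pipeline described above. All names are
    # illustrative placeholders, not the real Facebook tooling.
    from dataclasses import dataclass
    from typing import Callable, List, Optional

    @dataclass
    class CrashReport:
        file: str          # source file implicated by testing / static analysis
        line: int          # suspected faulty line (fault localization output)
        stack_trace: str

    def candidate_patches(report: CrashReport) -> List[str]:
        """Generate candidate fixes from a few simple strategies,
        e.g. templated edits or mutations of the suspicious statement."""
        return [
            f"guard line {report.line} of {report.file} with a null check",
            f"revert the most recent change touching {report.file}:{report.line}",
        ]

    def propose_fix(report: CrashReport,
                    passes_tests: Callable[[str], bool]) -> Optional[str]:
        """Return the first candidate patch that makes the failing tests pass;
        a human engineer would still review and approve it."""
        for patch in candidate_patches(report):
            if passes_tests(patch):
                return patch
        return None

    # Example usage with a stubbed test oracle:
    report = CrashReport("feed/Story.java", 42, "NullPointerException ...")
    print(propose_fix(report, passes_tests=lambda patch: "null check" in patch))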


List of research papers with links to the source code, updated weekly.

(This is a continuous work in progress, updated regularly. More of PWC is coming soon.)


erd

This utility takes a plain text description of entities, their attributes and the relationships between entities, and produces a visual diagram modeling the description. The visualization is produced with GraphViz’s dot. There are limited options for specifying color and font information. erd can also output graphs in a variety of formats, including but not limited to: pdf, svg, eps, png, jpg, plain text and dot.
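
As a rough illustration of the input format (treat this as an approximate sketch; the erd README documents the exact syntax), entities are declared in square brackets with their attributes listed underneath, and relationships are declared with cardinality markers:

    # Entities: attributes are listed under the [Entity] header;
    # '*' marks a primary key, '+' a foreign key.
    [Person]
    *name
    height
    weight
    +birth_place_id

    [Place]
    *id
    city
    country

    # Relationship: many Persons are each born in exactly one Place.
    Person *--1 Place

Saving this as, say, example.er and running it through the erd command-line tool then renders the diagram in whichever output format you pick (pdf, png, dot, and so on).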


Anatomy of an AI system

An anatomical case study of the Amazon Echo as an artificial intelligence system made of human labor.


Visualizing Bias in Machine Learning Models

Google has published a new tool called the What-If Tool that allows users to visualize possible bias in machine learning models without knowing how to code. With the What-If Tool, users can manually edit their data and a model’s parameters to visualize how changing them would affect the model’s predictions. This allows users to identify when a model’s data or parameters introduce bias into the model, which can have undesirable effects such as unfairly discriminating against certain populations.


Visual Analytics becoming crucial like never before!?

Think of all the self-service things you use in a day: petrol stations, ATMs, online apps for stock trading, banking, and shopping. The idea behind all of this:
1. Convenience to customers
2. Reduced costs for providers
3. Freedom of choice about what I want, when I want it, without involving others in my minute-to-minute decisions.
We love to see and track health stats on Fitbit, Apple Watch, and Garmin. The same goes for self-service interactive data exploration and easily consumable analytics. People at all levels in a variety of industries now want access to the source of truth without the interventions and limitations of IT. Visual Analytics comes to the rescue when self-service is the goal, and comes in handy when there is a need to
• deliver more with less effort
• find simple answers to sophisticated questions.


Designing Data-oriented Processes

An important part of my job – which I don’t recall ever learning in school – is designing data-oriented processes. I can program in a number of programming languages, and writing a computer program certainly involves designing a process. Many people have the experience of waiting in a line-up – perhaps at an airport or a cashier. At some point, somebody designed that line premised on how people would be processed. A web designer might create an online registration system for students who wish to enroll in courses. So it isn’t unusual for people to construct elaborate processes. The main difference as it relates to my job is that my processes create persistent assets – such as databases – normally intended to enable analysis for business purposes.


Interpretation of the AUC

The AUC (area under the ROC curve), or concordance statistic c, is the most commonly used measure of the diagnostic accuracy of quantitative tests. It is a discrimination measure that tells us how well we can classify patients into two groups: those with and those without the outcome of interest. Since the measure is based on ranks, it is not sensitive to systematic errors in the calibration of the quantitative tests.
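
Because the AUC depends only on how the test values rank the two groups, any monotone recalibration of the test (for example, doubling every value) leaves it unchanged. A small sketch of the rank-based (Mann-Whitney) computation with made-up scores:

    # AUC as the probability that a randomly chosen patient with the outcome
    # scores higher than a randomly chosen patient without it (ties count 1/2).
    from itertools import product

    def auc(scores_pos, scores_neg):
        pairs = list(product(scores_pos, scores_neg))
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
        return wins / len(pairs)

    with_outcome    = [0.9, 0.8, 0.55, 0.4]   # test values, outcome present
    without_outcome = [0.7, 0.5, 0.3, 0.2]    # test values, outcome absent

    print(auc(with_outcome, without_outcome))              # 0.8125
    # A calibration change (here, doubling every value) keeps the ranks,
    # and therefore the AUC, exactly the same:
    print(auc([2 * s for s in with_outcome],
              [2 * s for s in without_outcome]))           # still 0.8125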


Winner Interview | Particle Tracking Challenge first runner-up, Pei-Lien Chou

What does it take to get almost to the top? Meet Pei-Lien Chou, the worthy runner-up in our recent TrackML Particle Tracking Challenge. We invited him to tell us how he placed so well in this challenge. In this contest, Kagglers were challenged to build an algorithm that would quickly reconstruct particle tracks from 3D points left in the silicon detectors. This was part one of a two-phase challenge. In the accuracy phase, which ran from May to August 13, 2018, we focused on the highest score, irrespective of evaluation time. The second phase is an official NIPS competition (Montreal, December 2018) focused on the balance between accuracy and algorithm speed.


AI Safety

AI Safety is the collective term for the ethics and practices we should follow to avoid accidents in machine learning systems: unintended and harmful behavior that may emerge from the poor design of real-world AI systems.


Building Deep Learning Solutions in the Real World: Debugging and Interpretability

The implementation of large-scale deep learning solutions in the real world is a road full of challenges. Many of those challenges stem from the fact that the tools and techniques we use during the lifecycle of a typical software application don’t apply in the deep learning space. As a result, data scientists and engineers are constantly trying to re-imagine solutions to problems that have been solved in traditional software development for decades. One of those areas that we often ignore, and that can become a nightmare for data science teams, is debugging.


Hands-on TensorFlow Data Validation

Google just released the final piece of its end-to-end big data platform: TFDV! One of the big pains in data science is handling data quality issues, i.e. data validation. Let’s see how Google delivers on this first version and how useful this new library is.


TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).
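
A minimal sketch of the typical workflow (compute statistics, infer a schema, validate new data against it), assuming the data arrives as pandas DataFrames; see the TFDV documentation for the full API:

    # Sketch of the basic TFDV workflow: statistics -> schema -> anomaly check.
    import pandas as pd
    import tensorflow_data_validation as tfdv

    train_df = pd.DataFrame({"age": [25, 32, 47], "country": ["DE", "FR", "DE"]})
    serving_df = pd.DataFrame({"age": [29, None], "country": ["ES", "DE"]})

    # 1. Compute descriptive statistics over the training data.
    train_stats = tfdv.generate_statistics_from_dataframe(train_df)

    # 2. Infer a schema (types, domains, presence constraints) from the stats.
    schema = tfdv.infer_schema(train_stats)

    # 3. Validate new data against the schema and inspect any anomalies
    #    (e.g. the country value "ES" that was never seen in training).
    serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
    anomalies = tfdv.validate_statistics(serving_stats, schema=schema)
    tfdv.display_anomalies(anomalies)   # in a notebook; print(anomalies) otherwise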


Decrypt Generative Artificial Intelligence and GANs

Today’s topic is a very exciting aspect of AI called generative artificial intelligence. In a few words, generative AI refers to algorithms that make it possible for machines to use things like text, audio files and images to create/generate content. In a previous post, I talked about Variational Autoencoders and how they are used to generate new images. I mentioned that they are part of a bigger set of models called generative models and that I would talk more about them in a later article. So here we are.

As I briefly explained in that post, there are two types of models: discriminative and generative. The first are the most common models, such as convolutional or recurrent neural networks, which are used to distinguish/discriminate patterns in data in order to categorize them into classes. Applications such as image recognition, skin-cancer diagnosis, and Ethereum price prediction all fall into the category of discriminative models. The latter are able to generate new patterns in data; as a result, they can produce new images, new text, new music.

To put it in a strict mathematical form, discriminative models try to estimate the posterior probability p(y|x), which is the probability of an output sample (e.g. the class of a handwritten digit) given an input sample (an image of a handwritten digit). On the other hand, generative models estimate the joint probability p(x,y), which is the probability of the input sample and the output sample occurring together. In essence, a generative model tries to model the distribution of each class, not just the boundary between them.
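
To make the distinction concrete, here is a small illustrative sketch (not from the article) contrasting a discriminative classifier, which models p(y|x) directly, with a generative one, which models p(x|y) and p(y) and classifies via Bayes’ rule. The toy data and the particular model choices are arbitrary.

    # Discriminative vs. generative classification on a toy 2-class problem.
    # LogisticRegression models p(y|x) directly; GaussianNB models p(x|y)p(y)
    # and obtains p(y|x) via Bayes' rule.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),      # class 0 samples
                   rng.normal(3, 1, (100, 2))])     # class 1 samples
    y = np.array([0] * 100 + [1] * 100)

    discriminative = LogisticRegression().fit(X, y)   # learns the class boundary
    generative = GaussianNB().fit(X, y)               # learns each class's distribution

    x_new = np.array([[1.5, 1.5]])
    print(discriminative.predict_proba(x_new))  # estimate of p(y|x)
    print(generative.predict_proba(x_new))      # p(y|x) derived from p(x|y)p(y)

    # Because the generative model keeps p(x|y), it can also produce new samples
    # for a class, e.g. from the Gaussian fitted to class 1
    # (the attribute is named `sigma_` instead of `var_` in older scikit-learn):
    mean_1, var_1 = generative.theta_[1], generative.var_[1]
    print(rng.normal(mean_1, np.sqrt(var_1)))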


An Approach to Choosing the Number of Components in a Principal Component Analysis (PCA)

In this article we’ll discover a simple way to choose the number of components in a Principal Component Analysis (PCA). This technique is widely used to reduce the number of dimensions in a data set, so that only the components that contribute the most are used for tasks such as classification in Machine Learning.
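
One common way to make that choice (the article’s exact method may differ) is to keep the smallest number of components whose cumulative explained variance exceeds a threshold, for example 95%:

    # Choose the number of PCA components via a cumulative explained-variance
    # threshold. The 0.95 cutoff and the digits dataset are arbitrary choices.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X = load_digits().data                      # 64-dimensional inputs
    pca = PCA().fit(X)                          # fit with all components

    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumulative, 0.95) + 1)
    print(n_components, "components explain >= 95% of the variance")

    # Equivalent shortcut: a float n_components is treated as a variance target.
    X_reduced = PCA(n_components=0.95).fit_transform(X)
    print(X_reduced.shape)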


Why you need to care about interpretable machine learning

Machine Learning (ML) models are making their way into real-world applications. We all hear news about ML systems for credit scoring, health care, and crime prediction. We can easily foresee a social scoring system powered by ML. Thanks to the fast pace of ML research and the great results obtained in controlled experiments, more and more people now seem open to the possibility of having statistical models rule important parts of our lives. Yet, most such systems are seen as black boxes: obscure number-crunching machines producing a simple Yes/No answer; at most, the answer is followed by an unsatisfactory ‘percentage of confidence’. This is often an obstacle to the full adoption of such systems for critical decisions. In health care or lending, a specialist does not examine an enormous database to find complex relationships. A specialist applies prior education and domain knowledge to make the best decision for a given problem. Very likely, an assessment grounded in data analysis, perhaps even with the help of an automated tool. But, ultimately, a decision backed by plausible and explainable reasoning. That’s why rejected loans and medical treatments come with stated reasons. Moreover, such explanations often help us tell a good specialist from a bad one.


The Mythos of Model Interpretability

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.