How to Safeguard Your “Data Lake” for Better Decision Making

It is absolutely critical for organizations that deal with large amounts of data to carefully protect their information repositories, just as natural lakes often come with rules and regulations to make sure those who swim are safe. The reasons to do so are overwhelming, as 40 percent of people surveyed recently by PwC report having made a decision based on data. But, if decisions are being made with flawed, inconsistent data, it makes the whole effort of collecting and storing it in the first place irrelevant. It is critical that IT departments set boundaries and give clear directives on how all data is used when the rest of their organization begins ‘swimming’ – or using this data to make decisions – in the data lake. To do so, they should consider setting the following rules:
• Create swimmable areas
• Clear the water deliberately
• Make sure there is a lifeguard on duty
• Always keep a current map

DataRobot Announces Automated Time Series Solution

DataRobot automation platform constructs and evaluates hundreds to thousands of different time series models and scores their performance – taking into account all the different temporal conditions to determine real world accuracy.

Complete tutorial on Text Classification using Conditional Random Fields Model (in Python)

Complete tutorial on Text Classification using Conditional Random Fields Model (in Python)

How to Overcome the Curse of Dimensionality

Dimensionality reduction is an important technique to overcome the curse of dimensionality in data science and machine learning. As the number of predictors in the dataset, or dimensions or features, increase, it becomes computationally more expensive (ie. increased storage space, longer computation time) and exponentially more difficult to produce accurate predictions in classification or regression models. Moreover, it is hard to wrap our head around visualizing the data points in more than 3 dimensions.

Using genetic data to strengthen causal inference in observational research

Causal inference is essential across the biomedical, behavioural and social sciences.By progressing from confounded statistical associations to evidence of causal relationships, causal inference can reveal complex pathways underlying traits and diseases and help to prioritize targets for intervention. Recent progress in genetic epidemiology – including statistical innovation, massive genotyped data sets and novel computational tools for deep data mining – has fostered the intense development of methods exploiting genetic data and relatedness to strengthen causal inference in observational research. In this Review, we describe how such genetically informed methods differ in their rationale, applicability and inherent limitations and outline how they should be integrated in the future to offer a rich causal inference toolbox.

What is the difference between RDF and OWL?

• Triples & URIs
Subject – Predicate – Object
These describe a single fact. Generally URI’s are used for the subject and predicate. The object is either another URI or a literal such as a number or string. Literals can have a type (which is also a URI), and they can also have a language. Yes, this means triples can have up to 5 bits of data!
For example a triple might describe the fact that Charles is Harrys father.
<http://…/harry> <http://…/1.0#hasFather> <http://…/charles> .
Triples are database normalization taken to a logical extreme. They have the advantage that you can load triples from many sources into one database with no reconfiguration.

• RDF and RDFS
The next layer is RDF – The Resource Description Framework. RDF defines some extra structure to triples. The most important thing RDF defines is a predicate called “rdf:type”. This is used to say that things are of certain types. Everyone uses rdf:type which makes it very useful.
RDFS (RDF Schema) defines some classes which represent the concept of subjects, objects, predicates etc. This means you can start making statements about classes of thing, and types of relationship. At the most simple level you can state things like http://…/1.0#hasFather is a relationship between a person and a person. It also allows you to describe in human readable text the meaning of a relationship or a class. This is a schema. It tells you legal uses of various classes and relationships. It is also used to indicate that a class or property is a sub-type of a more general type. For example “HumanParent” is a subclass of “Person”. “Loves” is a sub-class of “Knows”.

• RDF Serialisations
RDF can be exported in a number of file formats. The most common is RDF+XML but this has some weaknesses.
N3 is a non-XML format which is easier to read, and there’s some subsets (Turtle and N-Triples) which are stricter.
It’s important to know that RDF is a way of working with triples, NOT the file formats.

XSD is a namespace mostly used to describe property types, like dates, integers and so forth. It’s generally seen in RDF data identifying the specific type of a literal. It’s also used in XML schemas, which is a slightly different kettle of fish.

OWL adds semantics to the schema. It allows you to specify far more about the properties and classes. It is also expressed in triples. For example, it can indicate that “If A isMarriedTo B” then this implies “B isMarriedTo A”. Or that if “C isAncestorOf D” and “D isAncestorOf E” then “C isAncestorOf B”. Another useful thing owl adds is the ability to say two things are the same, this is very helpful for joining up data expressed in different schemas. You can say that relationship “sired” in one schema is owl:sameAs “fathered” in some other schema. You can also use it to say two things are the same, such as the “Elvis Presley” on wikipedia is the same one on the BBC. This is very exciting as it means you can start joining up data from multiple sites (this is “Linked Data”).

Four Quadrants of the Enterprise AI business case

In Progressive Orders of Complexity (and Opportunity) the Four Quadrants for the Enterprise AI Business Case Are:
• Experiment Driven: Machine Learning and Deep Learning
• Data Driven: Enterprise Platforms and Data
• Scale Driven: AI Pipeline and Scalability
• Talent Driven: AI Disruption and Stagnation

Using Uncertainty to Interpret your Model

As deep neural networks (DNN) become more powerful, their complexity increases. This complexity introduces new challenges, including model interpretability. Interpretability is crucial in order to build models that are more robust and resistant to adversarial attacks. Moreover, designing a model for a new, not well researched domain is challenging and being able to interpret what the model is doing can help us in the process.

Comparison of the Most Useful Text Processing APIs

Nowadays, text processing is developing rapidly, and several big companies provide their products which help to deal successfully with diverse text processing tasks. In case you need to do some text processing there are 2 options available. The first one is to develop the entire system on your own from scratch. This way proves to be very time and resource consuming. On the other hand, you can use the already accessible solutions developed by well-known companies. This option is usually faster and simpler. No specific knowledge or experience in the natural language processing is required. It would suffice to understand the fundamentals of the text processing. At the same time, if you need something exclusive, it is better to implement own solution rather than to apply one of the above mentioned.
Working with text processing, the data analyst faces the following tasks:
• Keyphrase extraction;
• Sentiment analysis;
• Text analysis;
• Entity recognition;
• Translation;
• Language detection;
• Topic modeling.
There are several high-level APIs which may be used to perform these tasks. Among them:
• Amazon Comprehend;
• IBM Watson Natural Language Understanding;
• Microsoft Azure (Text analytics API);
• Google Cloud Natural Language;
• Microsoft Azure (Linguistic Analysis API) – beta;
• Google Translate API;
• IBM Watson Translator;
• Amazon Translate;
• Microsoft Azure Translator Text API.

Collections in R: Review and Proposal

R is a powerful tool for data processing, visualization, and modeling. However, R is slower than other languages used for similar purposes, such as Python. One reason for this is that R lacks base support for collections, abstract data types that store, manipulate, and return data (e.g., sets, maps, stacks). An exciting recent trend in the R extension ecosystem is the development of collection packages, packages that provide classes that implement common collections. At least 12 collection packages are available across the two major R extension repositories, the Comprehensive R Archive Network (CRAN) and Bioconductor. In this article, we compare collection packages in terms of their features, design philosophy, ease of use, and performance on benchmark tests. We demonstrate that, when used well, the data structures provided by collection packages are in many cases significantly faster than the data structures provided by base R. We also highlight current deficiencies among R collection packages and propose avenues of possible improvement. This article provides useful recommendations to R programmers seeking to speed up their programs and aims to inform the development of future collection-oriented software for R.

News from the Bioconductor Project

The Bioconductor project provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor 3.7 was released on 1 May, 2018. It is compatible with R 3.5.1 and consists of 1560 software packages, 342 experiment data packages, and 919 up-to-date annotation packages. The release announcement includes descriptions of 98 new software packages and updated NEWS files for many additional packages.

Changes in R

From version 3.5.0 to version 3.5.0 patched

The Blacker the Box

There has been a lot of discussion in the data science community about the use of black-box models, and there is lots of really fascinating ongoing research into methods, algorithms, and tools to help data scientists better introspect their models. While those discussions and that research are important, in this post I discuss the macro-framework I use for evaluating how black the box can be for a prediction product. In this post I do not get into the weeds of complexity penalization algorithms or even how to weigh the tech debt associated with additional complexity. Instead, I want to take a step back and discuss how I think about ‘prediction’ problems at a more macro level, and how I value accuracy and complexity for different types of problems. The thesis of this post is: The faster the feedback on prediction accuracy, the blacker the box can be. The slower the feedback, the more your models should be explicit and formal.

Datasets From Images

This tutorial that will demonstrate how you can make datasets in CSV format from images and use them for Data Science, on your Laptop.

Unsupervised learning demystified

Unsupervised learning may sound like a fancy way to say ‘let the kids learn on their own not to touch the hot oven’ but it´s actually a pattern-finding technique for mining inspiration from your data. It has nothing to do with machines running around without adult supervision, forming their own opinions about things. Let´s demystify!