An Access Control Model for Linked Data

Linked Open Data refers to a set of best practices for publishing and interlinking structured data on the Web in order to create a global, interconnected data space called the Web of Data. To ensure the resources featured in a dataset are richly described and, at the same time, protected against malicious users, we need to specify the conditions under which a dataset is accessible. Being able to specify access terms should also encourage data providers to publish their data. We introduce a lightweight vocabulary, called the Social Semantic SPARQL Security for Access Control Ontology (S4AC), allowing the definition of fine-grained access control policies formalized in SPARQL and enforced when querying Linked Data. In particular, we define an access control model that provides users with the means to define policies restricting access to specific RDF data, based on social tags and contextual information.
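
As a rough illustration of the enforcement idea (not the S4AC implementation itself; the namespace, graph data, and policy below are invented), an access condition can be expressed as a SPARQL ASK query and evaluated against a context graph before the protected data is queried:

# Minimal sketch: evaluating an access condition expressed as a SPARQL ASK
# query with rdflib. Everything here (namespace, triples, policy) is a
# made-up example, not the S4AC vocabulary.
from rdflib import Graph

# Hypothetical context graph describing the requesting user.
context = Graph()
context.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:memberOf ex:projectTeam .
""", format="turtle")

# Hypothetical access condition: the requester must belong to ex:projectTeam.
ACCESS_CONDITION = """
PREFIX ex: <http://example.org/>
ASK { ex:alice ex:memberOf ex:projectTeam . }
"""

def access_granted(ctx: Graph, condition: str) -> bool:
    # An ASK query returns a single boolean; True means the policy is satisfied.
    return bool(ctx.query(condition).askAnswer)

if access_granted(context, ACCESS_CONDITION):
    print("Access granted: the protected named graph may be queried.")
else:
    print("Access denied.")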


Statistical Significance Explained

As the dean at a major university, you receive a concerning report showing that your students get an average of 6.80 hours of sleep per night, compared to the national college average of 7.02 hours. The student body president is worried about the health of students and points to this study as proof that homework must be reduced. The university president, on the other hand, dismisses the study as nonsense: ‘Back in my day we got four hours of sleep a night and considered ourselves lucky.’ You have to decide whether this is a serious issue. Fortunately, you’re well-versed in statistics and finally see a chance to put your education to use!
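
To preview the kind of test the article builds up to, here is a sketch of a one-sample z-test for this question. Only the two means come from the report above; the sample size and standard deviation are invented for illustration.

# Sketch: is a sample mean of 6.80 hours significantly below the national
# average of 7.02? Sample size and standard deviation are hypothetical.
from scipy.stats import norm

sample_mean = 6.80      # university students (from the report)
population_mean = 7.02  # national college average (from the report)
sample_std = 0.84       # hypothetical standard deviation of the sample
n = 200                 # hypothetical number of students surveyed

z = (sample_mean - population_mean) / (sample_std / n ** 0.5)
p_value = norm.cdf(z)   # one-sided: chance of a sample mean this low if the true mean were 7.02

print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")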


Advanced Topics in Deep Convolutional Neural Networks

Throughout this article, I will discuss some of the more complex aspects of convolutional neural networks and how they relate to specific tasks such as object detection and facial recognition.


Understanding Tensor Processing Units

In 2017, Google announced the Tensor Processing Unit (TPU), a custom application-specific integrated circuit (ASIC) built specifically for machine learning. A year later, TPUs were moved to the cloud and opened up for commercial use. Following the line of CPUs and GPUs, TPUs are Google’s custom-developed chips intended to accelerate machine learning workloads. They are designed specifically for Google’s TensorFlow framework, a symbolic math library used for neural networks, and TensorFlow is known to be no easy nut to crack. To build a better understanding of what a tensor is, how a TPU is structured, and how it works, we’ll give a brief and simple overview of the technology.
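
For context, here is a minimal sketch of how a TensorFlow program might target a Cloud TPU, assuming a TPU runtime is attached (as on Colab); the model is a throwaway placeholder.

# Sketch: connect TensorFlow to an attached Cloud TPU and build a Keras
# model under the TPU distribution strategy (assumes a TPU runtime).
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # "" lets the runtime auto-detect the TPU
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Any Keras model built in this scope is replicated across the TPU cores.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")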


Software Engineering for Machine Learning: A Case Study

Recent advances in machine learning have stimulated widespread interest within the Information Technology sector in integrating AI capabilities into software and services. This goal has forced organizations to evolve their development processes. We report on a study that we conducted by observing software teams at Microsoft as they develop AI-based applications. We consider a nine-stage workflow process informed by prior experiences developing AI applications (e.g., search and NLP) and data science tools (e.g., application diagnostics and bug reporting). We found that various Microsoft teams have integrated this workflow into preexisting, well-evolved, Agile-like software engineering processes, providing insights about several essential engineering challenges that organizations may face in creating large-scale AI solutions for the marketplace. We collected some best practices from Microsoft teams to address these challenges. In addition, we have identified three aspects of the AI domain that make it fundamentally different from prior software application domains: 1) discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than other types of software engineering, 2) model customization and model reuse require very different skills than are typically found in software teams, and 3) AI components are more difficult to handle as distinct modules than traditional software components – models may be ‘entangled’ in complex ways and experience non-monotonic error behavior. We believe that the lessons learned by Microsoft teams will be valuable to other organizations.


Text-based Editing of Talking-head Video

Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.


Training AI to Predict Myers-Briggs Personality Types From Texts

When I was in college, I used to watch Gilmore Girls, an American comedy-drama series created by Amy Sherman-Palladino, starring Lauren Graham and Alexis Bledel. I watched all 7 seasons, plus the sequel released in 2016. I loved the mix of characters with different personalities, and I was a big fan of Rory Gilmore and Logan Huntzberger! So, I found a great dataset to play with, the Myers-Briggs 16 Personality Types, which I can use to train a model that detects different personality types in texts and, even better, test it on Gilmore Girls characters to predict their personalities!


Multi-Class Text Classification Using PySpark, MLlib and Doc2Vec

How to use Apache Spark MLlib with PySpark for NLP problems and how to simulate Doc2Vec in Spark MLlib
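
Spark MLlib has no built-in Doc2Vec, so one common workaround, sketched below with illustrative column names and parameters, is to use its Word2Vec transformer: applied to a tokenized document, it averages the word vectors, which yields one fixed-length vector per document.

# Sketch: approximating Doc2Vec in Spark MLlib. Word2VecModel.transform
# averages the vectors of the words in each document, producing a single
# document vector that can feed a downstream classifier.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.appName("doc2vec-sketch").getOrCreate()

df = spark.createDataFrame(
    [("sports", "the team won the match"),
     ("tech", "the new phone has a fast chip")],
    ["label", "text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(df)

word2vec = Word2Vec(vectorSize=50, minCount=1, inputCol="words", outputCol="features")
model = word2vec.fit(words)
docs = model.transform(words)  # one averaged vector per document
docs.select("label", "features").show(truncate=False)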


Why The New Era of Big Data Requires Innovative Privacy Initiatives

Privacy and data collection go together like peanut butter and jelly, but in the world of big data, it’s becoming increasingly difficult to work with anonymous data without crossing a privacy line. So what’s a business supposed to do? In his talk Privacy and Machine Learning: Peanut Butter and Jelly, Steve Touw offers some insights into how organizations can honor privacy without rendering data useless.


Siri’s Gone M.I.A.

But Apple still has some A.I. tricks up its sleeve


Ten more random useful things in R you may not know about

I was surprised by the positive reaction to my article a couple of months ago, which itemized ten random things in R that people might not know about. I had a feeling that R has developed as a language to such a degree that many of us are now using it in completely different ways. This means that there are likely to be numerous tricks, packages, functions, etc. that each of us uses, but that others are completely unaware of and would find useful if they knew about them. As Mike Kearney also pointed out, none of my list of ten had anything to do with stats, which just shows how far R has come in recent years. To be honest, I struggled to keep it to ten last time, so here are ten more things about R that help make my work easier and which you might find useful. Do drop a note here or on Twitter if any of these are helpful to your current work, or if you have further suggestions for things which others should know about.


Industrial strength Natural Language Processing

I have spent much of my career as a graduate student researcher, and now as a Data Scientist in industry. One thing I have come to realize is that the vast majority of solutions proposed both in academic research papers and in the workplace are just not meant to ship – they simply don’t scale!
And when I say scale, I mean:
• Handling real-world use cases
• Handling large amounts of data
• Ease of deployment in a production environment
Some of these approaches either work on extremely narrow use cases, or have a tough time generating results in a timely manner.
More often than not, the problem lies in the approach that was used, yet when things go wrong we tend to declare the problem ‘unsolvable’. Remember, there will almost always be more than one way to solve a Natural Language Processing (NLP) or Data Science problem. Optimizing your choices will increase your chance of success in deploying your models to production.
Over the past decade I have shipped solutions that serve real users. From this experience, I now follow a set of best practices that maximizes my chance of success every time I start a new NLP project.
In this article, I will share some of these with you. I swear by these principles, and I hope they come in handy for you as well.


Measuring Performance: The Confusion Matrix

Confusion matrices are calculated using the predictions of a model on a data set. By looking at a confusion matrix, you can gain a better understanding of the strengths and weaknesses of your model, and you can better compare two alternative models to understand which one is better for your application. Traditionally, a confusion matrix is calculated using a model’s predictions on a held-out test set.
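
As a quick, self-contained illustration (the labels and predictions below are made up), scikit-learn computes the matrix directly from a model’s test-set predictions:

# Sketch: computing a confusion matrix on a held-out test set with
# scikit-learn. y_true and y_pred are made-up placeholder labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels from the test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels predicted by the model

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))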


Measuring Performance: AUC (AUROC)

The area under the receiver operating characteristic curve (AUROC) is a performance metric that you can use to evaluate classification models.
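
For a concrete, hedged example (the labels and scores below are made up), scikit-learn computes AUROC from the true labels and the model’s predicted scores:

# Sketch: computing AUROC with scikit-learn. y_true are made-up binary
# labels; y_score are the model's predicted probabilities for class 1.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(roc_auc_score(y_true, y_score))  # 1.0 is a perfect ranking, 0.5 is random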


Word Embedding (Part I)

Intuition and (some) maths to understand the Skip-gram model end-to-end.
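
For orientation, here is a minimal Skip-gram sketch using gensim; the toy corpus and hyperparameters are arbitrary and only for illustration.

# Sketch: training a Skip-gram model (sg=1) on a toy corpus with gensim.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(sentences, sg=1, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the learned vector space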


Word Embedding (Part II)

The original problem in NLP (Natural Language Processing) is the encoding of a word/sentence into a format that a computer can process. Representing words in a vector space allows NLP models to learn word meaning. In our previous post, we saw the Skip-gram model, which captures the meaning of words given their local context. Let’s remember that by context we mean a fixed window of n words surrounding a target word. In this post, we are going to study the GloVe model (Global Vectors), which was created to look at both the local context and the global statistics of words in order to embed them. The main idea behind the GloVe model is to focus on the co-occurrence probabilities (equation 0 below) of words within a corpus of texts to embed them in meaningful vectors. In other words, we are going to look at how often a word j appears in the context of a word i across our whole corpus of texts.
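
For reference, the standard co-occurrence probability from the GloVe paper, which is presumably what ‘equation 0’ refers to, can be written as

P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i} = \frac{X_{ij}}{\sum_k X_{ik}},

where X_{ij} counts how often word j appears in the context window of word i, and X_i = \sum_k X_{ik} is the total number of context words observed around word i.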


PyViz: Simplifying the Data Visualisation process in Python.

An overview of the PyViz ecosystem, which aims to make data visualization in Python easier to use, easier to learn, and more powerful.
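
As a small taste of the ecosystem (assuming pandas and hvPlot are installed; the data frame is made up), hvPlot adds an .hvplot accessor to pandas objects that returns interactive HoloViews/Bokeh plots:

# Sketch: a one-line interactive plot via hvPlot, one entry point into PyViz.
import pandas as pd
import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor on pandas objects)

df = pd.DataFrame({"x": list(range(10)), "y": [v ** 2 for v in range(10)]})

plot = df.hvplot.scatter(x="x", y="y")  # an interactive, Bokeh-backed scatter plot
plot  # in a notebook, the returned HoloViews object renders itself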


Top 10 Statistics Mistakes Made by Data Scientists

A data scientist is a ‘person who is better at statistics than any software engineer and better at software engineering than any statistician’. In Top 10 Coding Mistakes Made by Data Scientists we discussed how statisticians can become better coders. Here we discuss how coders can become better statisticians. Detailed output and code for each of the examples is available on GitHub and in an interactive notebook. The code uses the workflow management library d6tflow, and data is shared with the dataset management library d6tpipe.