Metadata are associated with most of the information we produce in our daily interactions and communication in the digital world. Yet, surprisingly, metadata are often still categorized as non-sensitive. Indeed, in the past, researchers and practitioners have mainly focused on the problem of identifying a user from the content of a message. In this paper, we use Twitter as a case study to quantify the uniqueness of the association between metadata and user identity and to understand the effectiveness of potential obfuscation strategies. More specifically, we analyze atomic fields in the metadata and systematically combine them in an effort to classify new tweets as belonging to an account, using different machine learning algorithms of increasing complexity. We demonstrate that, through the application of a supervised learning algorithm, we are able to identify any user in a group of 10,000 with approximately 96.7% accuracy. Moreover, if we broaden the scope of our search and consider the 10 most likely candidates, we increase the accuracy of the model to 99.22%. We also found that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%. These results have strong implications for the design of metadata obfuscation strategies, for example for data set release, not only for Twitter but, more generally, for most social media platforms.
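The top-10 evaluation described above can be sketched in a few lines. This is only an illustration of the metric, not the paper's method: the data here are synthetic, and a generic random forest stands in for the paper's classifiers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_users, n_per_user, n_features = 50, 20, 8

# Synthetic "metadata" features: each user gets a characteristic mean vector,
# and each tweet is a noisy sample around it.
centers = rng.normal(size=(n_users, n_features))
X = np.repeat(centers, n_per_user, axis=0) \
    + 0.5 * rng.normal(size=(n_users * n_per_user, n_features))
y = np.repeat(np.arange(n_users), n_per_user)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Top-1 accuracy: the single most likely candidate is the true user.
top1_acc = clf.score(X, y)

# Top-10 accuracy: the true user is among the 10 highest-probability candidates.
proba = clf.predict_proba(X)
top10 = np.argsort(proba, axis=1)[:, -10:]
top10_acc = np.mean([y[i] in top10[i] for i in range(len(y))])
```

As in the paper, broadening the candidate set from 1 to 10 can only raise the measured accuracy. (Evaluating on the training data here is purely to keep the sketch short.)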
'Human in the loop' software development will be a big part of the future. Machine learning is poised to change the nature of software development in fundamental ways, perhaps for the first time since the invention of FORTRAN and LISP. It presents the first real challenge to our decades-old paradigms for programming. What will these changes mean for the millions of people who are now practicing software development? Will we see job losses and layoffs, or will we see programming evolve into something different – perhaps even something more focused on satisfying users? We've built software more or less the same way since the 1970s. We've had high-level languages, low-level languages, scripting languages, and tools for building and testing software, but what those tools let us do hasn't changed much. Our languages and tools are much better than they were 50 years ago, but they're essentially the same. We still have editors. They're fancier: they have color highlighting, name completion, and they can sometimes help with tasks like refactoring, but they're still the descendants of emacs and vi. Object orientation represents a different programming style, rather than anything fundamentally new – and, of course, functional programming goes all the way back to the 50s (except we didn't know it was called that). Can we do better?
Accessing the internal components of digital images using Python packages makes it more convenient to understand their properties and nature.
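A minimal sketch of what inspecting those internals looks like, using NumPy. In practice the array would come from a package like Pillow (e.g. `np.asarray(PIL.Image.open(...))`); here a synthetic RGB array stands in so the example is self-contained.

```python
import numpy as np

# Synthetic 4x4 RGB image standing in for one loaded from disk.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[..., 0] = 255                    # fill the red channel

# The "internals": dimensions, channels, pixel type, per-channel statistics.
height, width, channels = img.shape
pixel_type = img.dtype               # uint8: values 0..255 per channel
red_mean = img[..., 0].mean()
green_mean = img[..., 1].mean()
```

Once an image is just an array, every NumPy operation (slicing channels, thresholding, histograms) applies directly to it.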
Advanced analytics is the application of mathematical and statistical modeling techniques to describe what is happening in the world. Companies that can apply these models in an automated fashion, without human intervention, can generate significant economic value. This creates the potential for fast-paced, large-scale decision making with a high degree of precision and accuracy. But the influx of innovation at the executive level can be a boon for technologists wanting to try out new and leading-edge technology. The challenge becomes how to do it efficiently and effectively without making dramatic mistakes along the way.
Coding is fun, especially when your 'weapon of choice' is Python! So, I would like to take you through this Python Matplotlib tutorial. In this tutorial, I will be talking about the various types of plots in Matplotlib.
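As a warm-up, here is the simplest kind of Matplotlib plot: a labeled line chart. The filename is just an example; the non-interactive Agg backend is selected so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")                # render without needing a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")   # one line on the axes
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("sine.png")              # example output file
```

The same `fig, ax` pattern carries over to every other plot type (bar, scatter, histogram, and so on).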
For my Insight Data Engineering project, I built an Elasticsearch plugin to simplify the implementation of large-scale K-Nearest Neighbors (KNN) in online applications.
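The plugin's internals aren't shown here, but the computation such a system performs, brute-force K-Nearest Neighbors over a set of indexed vectors, can be sketched in a few lines of NumPy:

```python
import numpy as np

def knn(query, vectors, k):
    """Return indices of the k nearest vectors to `query` (Euclidean distance)."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 16))    # 1000 indexed 16-dimensional vectors
neighbors = knn(corpus[0], corpus, k=5) # the query's own index comes first
```

At large scale this linear scan is exactly what a plugin would replace with an approximate index, since computing every distance per query does not stay cheap as the corpus grows.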
XGBoost is one of the most popular machine learning algorithms these days, regardless of the type of prediction task at hand: regression or classification.
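A minimal classification sketch. Since XGBoost itself may not be installed, scikit-learn's `GradientBoostingClassifier` is used here as a stand-in for the same family of models (gradient-boosted trees); XGBoost's `XGBClassifier` exposes a near-identical `fit`/`predict` API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Gradient-boosted trees: the model family XGBoost implements.
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)        # held-out accuracy
```

For regression, the analogous swap is `GradientBoostingRegressor` (or `XGBRegressor`); the training loop is unchanged.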
Learn about LIME, how it works, and the potential pitfalls that come with using it.
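A toy sketch of LIME's core idea, not the `lime` library itself: sample perturbations around the instance being explained, weight them by proximity, and fit a weighted linear model whose coefficients serve as local feature importances. The black-box function below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for any opaque model: nonlinear in both features.
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def lime_like_explanation(x, n_samples=500, width=0.5):
    """Explain black_box near x with a locally weighted linear fit."""
    Z = x + width * rng.normal(size=(n_samples, x.size))   # perturbations
    y = black_box(Z)
    # Proximity kernel: closer samples get more weight.
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * width ** 2))
    A = np.hstack([Z, np.ones((n_samples, 1))])            # add intercept
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
    return coef[:-1]                                       # drop intercept

x0 = np.array([0.0, 1.0])
importances = lime_like_explanation(x0)
```

Near `x0`, the second feature's local slope (about 2, from the squared term) should dominate the first's (about 1, from the sine). The sketch also exposes a known pitfall: the explanation depends on the sampling width and kernel choice.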
Machine learning / deep learning models can be used in different ways to do predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams or KSQL). You could, e.g., use the TensorFlow for Java API. This gives the best latency and independence from external services. Several examples can be found in my GitHub project: Model Inference within Kafka Streams Microservices using TensorFlow…. However, direct deployment of models is not always a feasible approach. Sometimes it makes sense, or is necessary, to deploy a model in a separate serving infrastructure like TensorFlow Serving for TensorFlow models. Model inference is then done via RPC / request-response communication. Organisational or technical reasons might force this approach. Or you might want to leverage the built-in features for managing and versioning different models in the model server.
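The request/response path against TensorFlow Serving can be sketched with its REST predict endpoint. The host, port, model name, and input row below are placeholders for a real deployment; the actual network call is left commented out so the sketch is self-contained.

```python
import json
from urllib import request

# TensorFlow Serving REST endpoint: /v1/models/<model_name>:predict
# (localhost:8501 and "my_model" are placeholder values).
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 3.0]]}   # one input row, illustrative shape

req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)            # returns {"predictions": [...]}
# preds = json.loads(response.read())["predictions"]
```

Each prediction is a round trip over the network, which is the latency trade-off against embedding the model directly in the stream processor.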
Though CNNs have mostly been used for computer vision tasks, nothing stops them from being used in NLP applications. One such application for which CNNs have been used effectively is sentence classification, where a given sentence must be assigned to a class. We will use a question database in which each question is labeled by what it is about. For example, the question 'Who was Abraham Lincoln?' will have the label Person. We will use the CNN introduced in Yoon Kim's paper, Convolutional Neural Networks for Sentence Classification, to understand the value of CNNs for NLP tasks.
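The core operation of this architecture, sliding filters over word embeddings and then max-over-time pooling, can be sketched in plain NumPy. The dimensions and random values below are illustrative only; filter width 3 is one of the widths used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim = 7, 16           # sentence length, embedding size (illustrative)
n_filters, width = 4, 3            # 4 filters spanning 3 consecutive words

sentence = rng.normal(size=(seq_len, emb_dim))       # embedded sentence
filters = rng.normal(size=(n_filters, width, emb_dim))

# Convolution: each filter scores every window of `width` consecutive words.
feature_maps = np.array([
    [np.sum(sentence[i:i + width] * f) for i in range(seq_len - width + 1)]
    for f in filters
])                                 # shape: (n_filters, seq_len - width + 1)

# Max-over-time pooling: one scalar per filter, regardless of sentence length.
pooled = feature_maps.max(axis=1)  # shape: (n_filters,)
```

In the full model, `pooled` vectors from several filter widths are concatenated and fed to a softmax layer over the question classes (Person, Location, and so on).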