German Policymakers Should Not Force Platforms to Allow Anonymity

A Berlin state court has ruled that Facebook’s policy of requiring users to disclose their real names violates privacy law. The court said the relevant clause in Facebook’s terms of service is a “pre-formulated statement” that does not establish genuine consent and is therefore illegal. The case was brought by the Verbraucherzentrale Bundesverband (VZBV), or Federation of Consumer Organizations, which argues that Facebook must accept anonymous users. Hamburg’s data protection authority made a similar claim in 2015, which a Hamburg court later rejected.

Forcing Facebook to abandon its real names policy in Germany would create a backdoor that anyone anywhere in the world could exploit, including online trolls, extremists, racists, and bigots, thereby hurting consumers and the many other sites and services that depend on social media platforms. There are good reasons why users would choose a platform where everyone uses their real name. To be sure, anonymity encourages people to speak their minds, protects whistleblowers, and lets writers publish without their identities influencing readers. But anonymity can get in the way of civilized discourse when it emboldens people to harass, threaten, and incite violence without fear of any legal, professional, or social repercussions. The real names policy is one reason some services, such as dating apps and news sites, require their users to log in with a Facebook account. For example, one study found that users were significantly more likely to post vitriolic comments on news articles when they could post anonymously than when they had to use a Facebook profile. If Facebook allowed anonymous users, incivility on these services would rise.


Deep learning in the enterprise

Deep learning is a class of machine learning (ML) algorithms inspired by the human brain. Also called neural networks, these algorithms are especially good at detecting patterns across both noisy data and data that was once completely opaque to machines. While the technical details of neural nets may thrill mathematics and computer science Ph.D.s, the technology’s real significance has a much broader appeal. It represents one more step toward truly self-learning machines. Not surprisingly, this new wave of algorithms has captured attention with applications that range from machine translation to self-driving cars. Enterprises—and not just web-scale digital giants—have begun to use it to solve a wide variety of problems. Early adopters are demonstrating high-impact business outcomes in fraud detection, manufacturing performance optimization, preventative maintenance, and recommendation engines. It’s becoming clear that these new machine-intelligence-powered initiatives have the potential to redefine industries and establish new winners and losers in the next five years.


DALEX: how would you explain this prediction?

Last week I wrote about the single-variable explainers implemented in the DALEX package. They are useful for plotting the relation between a model’s output and a single variable. But sometimes we are more interested in a single model prediction. If our model predicts a possible drug response for a patient, we really need to know which factors drive the model’s prediction for that particular patient. For linear models this is relatively easy, as the structure of the model is additive. In 2017 we developed the breakDown package for lm/glm models.
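
What such a single-prediction explanation looks like is sketched below, assuming the breakDown package and its broken() function; a toy linear model on the built-in mtcars data stands in here for a real drug-response model.

    # toy lm standing in for a drug-response model
    library(breakDown)

    model   <- lm(mpg ~ wt + hp + disp, data = mtcars)
    new_obs <- mtcars[1, ]                    # the single observation to explain

    # decompose this one prediction into additive per-variable contributions
    explanation <- broken(model, new_observation = new_obs)
    explanation
    plot(explanation)                         # waterfall-style plot of the contributions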


MLE in R

Whenever I learn and experiment with a new model, I like to start with its likelihood function in order to gain a better understanding of its statistical nature. That’s why I have used the SAS/NLMIXED procedure extensively; it gives me more flexibility. Today, I spent a couple of hours playing with the optim() function and its wrappers, e.g. mle() and mle2(), in case I ever need a replacement for my favorite NLMIXED for model estimation. Overall, I feel that optim() is more flexible. The named list required by mle() or mle2() for the initial parameter values is somewhat cumbersome without offering additional benefits. As shown in the benchmark below, optim() is also the most efficient.
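
As a minimal, hand-rolled illustration of the optim() approach (not the author’s benchmark code), here is a maximum likelihood fit of a normal mean and standard deviation by minimizing the negative log-likelihood:

    # simulate data with known parameters
    set.seed(1)
    y <- rnorm(1000, mean = 2, sd = 3)

    # negative log-likelihood; sd is optimized on the log scale to keep it positive
    negll <- function(par) {
      mu    <- par[1]
      sigma <- exp(par[2])
      -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
    }

    fit <- optim(par = c(0, 0), fn = negll, method = "BFGS", hessian = TRUE)
    c(mu = fit$par[1], sigma = exp(fit$par[2]))   # estimates close to 2 and 3
    sqrt(diag(solve(fit$hessian)))                # approximate SEs of (mu, log sigma)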


The New Neural Internet is Coming

Think of the typical, well-studied neural networks (such as image classifiers) as the left hemisphere of neural network technology. With this in mind, it is easy to understand what a Generative Adversarial Network is: it is a kind of right hemisphere, the one that is claimed to be responsible for creativity. Generative Adversarial Networks (GANs) are a first step toward neural networks that learn creativity. A typical GAN is a neural network trained to generate images on a certain topic, using an image dataset and some random noise as a seed. Until now, images created by GANs have been of low quality and limited in resolution. Recent advances by NVIDIA show that generating photorealistic images at high resolution is within reach, and the company has published the technology itself in open access.
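
To make the moving parts concrete, below is a minimal sketch of a GAN’s generator/discriminator setup in R, assuming the keras package with a TensorFlow backend is installed; the layer sizes and the 28x28 image shape are illustrative choices, not NVIDIA’s method.

    library(keras)

    latent_dim <- 100   # size of the random noise seed

    # generator: noise vector -> 28x28 grayscale image
    generator <- keras_model_sequential() %>%
      layer_dense(units = 128, activation = "relu", input_shape = c(latent_dim)) %>%
      layer_dense(units = 28 * 28, activation = "sigmoid") %>%
      layer_reshape(target_shape = c(28, 28, 1))

    # discriminator: image -> probability that the image is real
    discriminator <- keras_model_sequential() %>%
      layer_flatten(input_shape = c(28, 28, 1)) %>%
      layer_dense(units = 128, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")
    discriminator %>% compile(optimizer = "adam", loss = "binary_crossentropy")

    # combined model used to train the generator; discriminator is frozen here
    freeze_weights(discriminator)
    gan_input <- layer_input(shape = c(latent_dim))
    gan <- keras_model(gan_input, gan_input %>% generator() %>% discriminator())
    gan %>% compile(optimizer = "adam", loss = "binary_crossentropy")

    # one adversarial step, given a batch of 64 real images x_real scaled to [0, 1]:
    # noise  <- matrix(rnorm(64 * latent_dim), nrow = 64)
    # x_fake <- predict(generator, noise)
    # train_on_batch(discriminator, x_real, rep(1, 64))  # real images labeled 1
    # train_on_batch(discriminator, x_fake, rep(0, 64))  # generated images labeled 0
    # train_on_batch(gan, noise, rep(1, 64))             # generator tries to fool it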


Understanding Travel Times with Uber Data

Uber has made anonymized trip data generated from over 2 billion trips in nine urban areas freely available for bulk download to help researchers, urban planners, and policymakers better understand transportation issues. The data includes average travel times within a city between census tracts (or their equivalents in cities outside the United States), broken down by time of day and day of the week. Uber has made data available for Bogotá, Boston, Cincinnati, Johannesburg and Pretoria, Manila, Paris, San Francisco, Sydney, and Washington, D.C.
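
As a quick, hedged sketch of how one of these files might be explored in R: the file name and the column names used below (sourceid, dstid, hod for hour of day, mean_travel_time) are assumptions about the download format, so check them against the actual headers.

    # hypothetical file name and column names; adjust to match the actual download
    tt <- read.csv("travel_times_by_hour.csv")

    # pick one origin-destination pair of census tracts
    one_pair <- subset(tt, sourceid == tt$sourceid[1] & dstid == tt$dstid[1])

    # average travel time by hour of day for that pair
    # (assumes mean_travel_time is reported in seconds)
    plot(one_pair$hod, one_pair$mean_travel_time / 60, type = "b",
         xlab = "hour of day", ylab = "mean travel time (minutes)")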


Topological Data Analysis (TDA)

Topological Data Analysis (TDA) allows you to interact with and represent structured and unstructured data through a topological network. A topological network provides a map of all the points in the data set in which nearby points are more similar than distant points; it clarifies the structure of the data set without you having to query it or perform algebraic analysis on only a subset of variables. In essence, one can discover the true meaning of the data by analyzing a compressed representation of the data set that retains its subtle features and keeps together data points that have a degree of similarity to each other.
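
To make the idea of a topological network concrete, here is a toy, Mapper-style construction using only base R and the igraph package; it is a simplified sketch of the general technique, not any particular implementation: slice the data along a filter function into overlapping intervals, cluster within each slice, and connect clusters that share points.

    library(igraph)

    # two noisy point clouds in 2-D stand in for a data set
    set.seed(42)
    X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
               matrix(rnorm(200, mean = 4), ncol = 2))

    filter  <- X[, 1]                                  # the "lens": first coordinate
    breaks  <- seq(min(filter), max(filter), length.out = 6)
    overlap <- diff(breaks)[1] * 0.3                   # 30% overlap between slices

    # cluster the points inside each overlapping slice; each cluster becomes a node
    nodes <- list()
    for (i in seq_len(length(breaks) - 1)) {
      idx <- which(filter >= breaks[i] - overlap & filter <= breaks[i + 1] + overlap)
      if (length(idx) < 2) next
      cl <- cutree(hclust(dist(X[idx, ])), k = 2)
      for (k in unique(cl)) nodes[[length(nodes) + 1]] <- idx[cl == k]
    }

    # connect nodes that share data points: this graph is the topological network
    edges <- integer(0)
    for (a in seq_along(nodes)) for (b in seq_along(nodes)) {
      if (a < b && length(intersect(nodes[[a]], nodes[[b]])) > 0) edges <- c(edges, a, b)
    }
    g <- make_graph(edges, n = length(nodes), directed = FALSE)
    plot(g, vertex.size = pmax(5, lengths(nodes) / 4))  # node size ~ cluster size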


Digging into the complexity of the data labeling challenge

The movie industry frequently depicts a future in which we live alongside embedded Autonomous Mobile Agents (AMAs) that seamlessly perceive, make decisions, and behave like humans. How soon, and to what degree, this “seamlessness” becomes reality depends largely on companies across all industries overcoming a key data challenge in employing artificial intelligence: how to obtain enough data, and make sense of it, to develop the models that will power these agents. Data acquisition is step one; making sense of that data means finding patterns in it, assigning a standard meaning to each pattern, and using it to derive insights and develop models. This is the core of data labeling. Across industries and companies this is a common, fundamental challenge, and you can see the symptoms in the investments: Intel with Intel Saffron, Nervana, Altera, Movidius, and Mobileye; Google with DeepMind Technologies, Moodstocks, Api.ai, and Halli Labs; Apple with Lattice Data, RealFace, and SensoMotoric Instruments, to name a few. In this post, I would like to demonstrate the complexity of the data labeling challenge through the frame of AMA development.


Installing Package Dependencies without external http(s) requests

Consider you have a server that is running behind a firewall and, for security reasons, cannot make external http(s) requests. Further, you have R running on this server and you need to install a set of packages. The simple approach of install.packages('<pkg-name>', repos = '<favorite cran mirror>') is not an option, since you have no access to the CRAN repository.
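
One way around this is sketched below using only base R (the package names are placeholders): resolve the full dependency tree and download the source tarballs on a machine that does have internet access, copy the files to the server, and install from the local files there.

    # on a machine WITH internet access
    cran <- "https://cloud.r-project.org"
    pkgs <- c("dplyr", "ggplot2")                    # placeholder: packages you need

    db   <- available.packages(repos = cran)
    deps <- unlist(tools::package_dependencies(pkgs, db = db, recursive = TRUE))
    # drop base packages, which ship with R and are not on CRAN
    all_pkgs <- setdiff(unique(c(pkgs, deps)),
                        rownames(installed.packages(priority = "base")))

    dir.create("pkg_cache", showWarnings = FALSE)
    download.packages(all_pkgs, destdir = "pkg_cache", repos = cran, type = "source")

    # copy pkg_cache/ to the firewalled server, then install there without http(s):
    # install.packages(list.files("pkg_cache", full.names = TRUE),
    #                  repos = NULL, type = "source")
    # note: with repos = NULL, R does not resolve installation order, so you may
    # need to install dependencies before the packages that require them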


How to set up a sparklyr cluster in 5 minutes

If you’ve ever wanted to play around with big data sets in a Spark cluster from R with the sparklyr package, but haven’t gotten started because setting up a Spark cluster is too hard, well … rest easy. You can get up and running in about 5 minutes using the guide SparklyR on Azure with AZTK, and you don’t even have to install anything yourself. I’ll summarize the steps below, but basically you’ll run a command-line utility to launch a cluster in Azure with everything you need already installed, and then connect to RStudio Server using your browser to analyze data with sparklyr.
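
Once the cluster is running and you are in RStudio Server, the sparklyr side looks roughly like the sketch below; the master URL is a placeholder for whatever your AZTK cluster reports, and the connection details may differ in your setup.

    library(sparklyr)
    library(dplyr)

    # connect to the running Spark cluster; the master URL below is a placeholder
    sc <- spark_connect(master = "spark://<cluster-master-host>:7077")

    # copy a small local data frame into Spark and query it with dplyr verbs
    mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
      collect()

    spark_disconnect(sc)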


R’s S3 generic-function object-oriented system

In Data Science, there are numerous instances where different techniques call for different tools. For me, this means hopping between R and Python on a weekly basis. I’ve been fortunate enough to have taken formal courses in both Python and R in the last few years, and just by circumstance I have chosen R as the primary language in my data science toolkit. This usually equates to a real mental struggle when jumping into a Jupyter Notebook and making trivial mistakes in the first 15 minutes, like the ones below.
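
Since the topic here is S3, a minimal illustration of R’s generic-function dispatch may help frame what follows (a generic sketch, not the author’s example): a describe() generic whose method is chosen by the class of its first argument.

    # a generic function dispatches on the class of its first argument
    describe <- function(x, ...) UseMethod("describe")

    # methods for two classes, plus a fallback default
    describe.data.frame <- function(x, ...) {
      cat("data.frame with", nrow(x), "rows and", ncol(x), "columns\n")
    }
    describe.lm <- function(x, ...) {
      cat("linear model with formula:", deparse(formula(x)), "\n")
    }
    describe.default <- function(x, ...) {
      cat("object of class", class(x)[1], "\n")
    }

    describe(mtcars)                 # dispatches to describe.data.frame
    describe(lm(mpg ~ wt, mtcars))   # dispatches to describe.lm
    describe(1:10)                   # falls back to describe.default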