Do you really need to implement Big Data technologies in your ecosystem?

For years now, ‘Big Data’ has been widespread and trendy. Big Data technologies began to fill the gap between traditional data technologies (RDBMS, file systems, …) and rapidly evolving data volumes and business needs. While implementing these technologies is a must for many large-scale organizations to ensure business continuity, many organizations aim to adopt them without really knowing whether they will improve their business. Before making your decision, there are several things you should take into consideration.


The current state of AutoML

I work for a startup in Stuttgart, Germany called Vialytics, where we use artificial intelligence to automatically detect different kinds of damage on streets so authorities can manage them automatically and react more quickly to fix them. It’s pretty cool; check it out here: vialytics.de. First of all, you are probably asking yourself what AutoML even is. The idea behind AutoML is to make machine learning available to non-experts. It tries to automate the process of finding the best algorithm to solve your particular problem. It could, for example, generate very complex convolutional neural networks to solve an image segmentation task. We thought we might be able to use AutoML to build a network that beats our current network on our huge dataset of road images, so I began looking into the different frameworks available for AutoML. Here is what I found out.
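To make the idea concrete, here is a minimal, hypothetical sketch of what an AutoML workflow can look like in Python, using the open-source TPOT library on a toy dataset. The dataset, parameters, and exported file name are illustrative assumptions, not what Vialytics or the article actually uses:

```python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Toy dataset for illustration only; a real image-segmentation task would
# need a deep-learning AutoML tool and far more compute than this sketch.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT automatically searches over scikit-learn pipelines and hyperparameters.
tpot = TPOTClassifier(generations=3, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline it found as plain Python code.
tpot.export("best_pipeline.py")
```

The appeal is that the search over models and hyperparameters happens automatically; the open question the article explores is whether such tools can compete with a hand-tuned network on a large, domain-specific dataset.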


Microsoft looks to ‘do for data sharing what open source did for code’

As Microsoft seeks to make data-sharing across companies easier and more pervasive, company officials have seen areas where roadblocks can occur. Prevalent among these is the lack of consistent, standardized data-sharing terms and licensing agreements. On July 23, the company took a first potential step toward remedying this gap. Microsoft is making publicly available today the first drafts of three proposed data-sharing agreements. It is looking for community feedback and input on them over the next few months. Each of the three is designed for particular data-sharing scenarios between companies — not individuals — and is covered by the Creative Commons license. Some of these agreements will be published on Microsoft’s GitHub code-sharing site. Microsoft officials said they believe these kinds of agreements could alleviate the need for companies to spend months or years negotiating and creating data-sharing governance agreements.


Data science team sizing and allocation

Centralized management of a team is not without challenges. The most prominent of these management challenges is that of sizing and allocation. In this document, I propose an easy-to-follow procedure to handle this challenge. Before presenting this approach, it’s important to share the assumptions we accept as true.


Three pitfalls to avoid in machine learning

As scientists from myriad fields rush to perform algorithmic analyses, Google’s Patrick Riley calls for clear standards in research and reporting. Machine learning is driving discovery across the sciences. Its powerful pattern-finding and prediction tools are helping researchers in all fields – from finding new ways to make molecules and spotting subtle signals in assays, to improving medical diagnoses and revealing fundamental particles. Yet machine-learning tools can also turn up fool’s gold – false positives, blind alleys and mistakes. Many of the algorithms are so complicated that it is impossible to inspect all the parameters or to reason about exactly how the inputs have been manipulated. As these algorithms begin to be applied ever more widely, the risks of misinterpretations, erroneous conclusions and wasted scientific effort will spiral.


Never start with a hypothesis

Setting up hypothesis testing is a ballroom dance; its steps are action-action-worlds-worlds. There’s a nice foxtrot rhythm to it. Unfortunately, most people bungle it by starting on the wrong foot. Here’s how to dance it right.


The R Graph Gallery

Welcome to the R graph gallery, a collection of charts made with the R programming language. Hundreds of charts are displayed in several sections, always with their reproducible code available. The gallery focuses on the tidyverse and ggplot2. Feel free to suggest a chart or report a bug; any feedback is highly welcome.


Hands On Bayesian Statistics with Python, PyMC3 & ArviZ

If you think Bayes’ theorem is counter-intuitive and that Bayesian statistics, which builds upon Bayes’ theorem, can be very hard to understand, I am with you. There are countless reasons why we should learn Bayesian statistics; in particular, Bayesian statistics is emerging as a powerful framework for expressing and understanding next-generation deep neural networks. I believe that for the things we have to learn before we can do them, we learn by doing them. And nothing in life is so hard that we can’t make it easier by the way we take it. So, this is my way of making it easier: rather than starting with too much theory or terminology, let’s focus on the mechanics of Bayesian analysis, in particular how to do Bayesian analysis and visualization with PyMC3 & ArviZ. Instead of memorizing endless terminology, we will code the solutions and visualize the results, and use the terminology and theory to explain the models along the way. PyMC3 is a Python library for probabilistic programming with a very simple and intuitive syntax. ArviZ is a Python library that works hand in hand with PyMC3 and can help us interpret and visualize posterior distributions.
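As a taste of those mechanics, here is a minimal, self-contained sketch of estimating a single probability with PyMC3 and inspecting the posterior with ArviZ. The coin-flip counts are made up for illustration and are not data from the article:

```python
import pymc3 as pm
import arviz as az

# Hypothetical data: 120 heads observed in 200 coin tosses.
heads, tosses = 120, 200

with pm.Model() as model:
    # Prior belief about the probability of heads.
    p = pm.Beta("p", alpha=1.0, beta=1.0)
    # Likelihood of the observed counts.
    obs = pm.Binomial("obs", n=tosses, p=p, observed=heads)
    # Draw posterior samples with MCMC.
    trace = pm.sample(2000, tune=1000)
    idata = az.from_pymc3(trace)

# Visualize and summarize the posterior distribution of p.
az.plot_posterior(idata, var_names=["p"])
print(az.summary(idata, var_names=["p"]))
```

The same pattern – define priors, define a likelihood over observed data, sample, then hand the trace to ArviZ – carries over to the richer models the article works through.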


How Karl Popper can make you as good a data scientist as George Soros

Karl Popper is best known for the view that science proceeds by ‘falsifiability’ – the idea that one cannot prove a hypothesis is true, or even have evidence of truth by induction (yikes!), but one can refute a hypothesis if it is false.


The League of Entropy Is Making Randomness Truly Random

Creating reliably random numbers isn’t as easy as you think, but a new alliance of organizations and individuals is decentralizing randomness for more equitable and trustworthy applications.


Inside Pluribus: Facebook’s New AI That Just Mastered the World’s Most Difficult Poker Game

Poker has remained one of the most challenging games to master in the fields of artificial intelligence (AI) and game theory. From game-theory creator John von Neumann writing about poker in his 1928 essay ‘Theory of Parlor Games’, to Edward Thorp’s masterful book ‘Beat the Dealer’, to the MIT Blackjack Team, poker strategies have been an obsession for mathematicians for decades. In recent years, AI has made some progress in poker environments with systems such as Libratus, which defeated human pros in two-player no-limit Hold’em in 2017. Last week, a team of AI researchers from Facebook, in collaboration with Carnegie Mellon University, achieved a major milestone in the conquest of poker by creating Pluribus, an AI agent that beat elite human professional players in the most popular and widely played poker format in the world: six-player no-limit Texas Hold’em poker.


Why You Need a Modern Infrastructure to Accelerate AI and ML Workloads

Recent years have seen a boom in the generation of data from a variety of sources: connected devices, IoT, analytics, healthcare, smartphones, and much more. In fact, as of 2016, 90% of all data ever created had been created in the previous two years. Gaining insights from all of this data presents a tremendous opportunity for organizations to further their businesses, expand more quickly into new markets, or advance research in healthcare or climate – just to name a few. However, the urgency of managing the sheer amount of data, coupled with the need to glean insights from it ever more quickly, is palpable. According to Gartner, organizations have been reporting unstructured data growth of over 50% year over year, while at the same time an Accenture survey found that 79% of enterprise executives agree that not extracting value and insight from this data will lead to extinction for their businesses. This data management problem is particularly acute in artificial intelligence (AI) and machine learning workloads, where there are both extreme compute requirements and the need to store massive amounts of data that will be analyzed in some form.


Data Science: Scientific Discipline or Business Process?

Simply put, data science is an attempt to understand given data using the scientific method. That’s why data science is a scientific discipline. You are free (and encouraged!) to apply data science to business use cases, just as you are encouraged to apply it to many other domains.


Data Science Made Easy: Interactive Data Visualization using Orange

An open-source machine learning and data visualization tool to speed up your data analysis without writing a single line of code! The topic for today is performing simple data visualization using an open-source software package called Orange. If you are looking for a way to visualize a dataset without code, Orange is the right choice for you!


Orange

Orange is a component-based data mining software. It includes a range of data visualization, exploration, preprocessing and modeling techniques. It can be used through a nice and intuitive user interface or, for more advanced users, as a module for the Python programming language.
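For the scripting route, here is a rough sketch of Orange used as a Python module, assuming Orange 3 is installed and using its bundled Iris dataset (the choice of dataset and learner is just an illustration):

```python
import Orange

# Load a sample dataset that ships with Orange.
data = Orange.data.Table("iris")

# Train a classifier through Orange's Python API.
learner = Orange.classification.LogisticRegressionLearner()
model = learner(data)

# Predict classes for the first few rows.
print(model(data[:5]))
```

The same workflows can also be assembled visually in the Orange canvas by wiring together data, visualization, and modeling widgets, with no code at all.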