15 Trending Data Science GitHub Repositories you cannot miss in 2017

GitHub is much more than the software versioning tool it was originally meant to be. People from many backgrounds, not just software engineers, now use it to share tools and libraries they have developed, or to share resources that might be helpful to the community. Following the best repos on GitHub can be an immense learning experience: you not only see the best open-source contributions, but also how their code was written and implemented. Being an avid data science enthusiast, I have curated a list of repositories that were particularly popular in 2017. Enjoy and keep learning!

Neural Networks Are Learning What to Remember and What to Forget

Deep learning is changing the way we use and think about machines. Current incarnations are better than humans at all kinds of tasks, from chess and Go to face recognition and object recognition. But many aspects of machine learning lag vastly behind human performance. In particular, humans have the extraordinary ability to constantly update their memories with the most important knowledge while overwriting information that is no longer useful. That’s an important skill. The world provides a never-ending source of data, much of which is irrelevant to the tricky business of survival, and most of which is impossible to store in a limited memory. So humans and other creatures have evolved ways to retain important skills while forgetting irrelevant ones.

Testing if Algorithms Learn to Cheat – DeepMind Has Simple Tests That Might Prevent Elon Musk’s AI Apocalypse

You don’t have to agree with Elon Musk’s apocalyptic fears of artificial intelligence to be concerned that, in the rush to apply the technology in the real world, some algorithms could inadvertently cause harm. This type of self-learning software powers Uber’s self-driving cars, helps Facebook identify people in social-media posts, and lets Amazon’s Alexa understand your questions. Now DeepMind, the London-based AI company owned by Alphabet Inc., has developed a simple test to check if these new algorithms are safe. Researchers put AI software into a series of simple, two-dimensional video games composed of blocks of pixels, like a chess board, called a gridworld. It assesses nine safety features, including whether AI systems can modify themselves and learn to cheat. AI algorithms that exhibit unsafe behavior in gridworld probably aren’t safe for the real world either, Jan Leike, DeepMind’s lead researcher on the project, said in a recent interview at the Neural Information Processing Systems (NIPS) conference, an annual gathering of experts in the field.

Machine Learning 101

This presentation was posted by Jason Mayes, senior creative engineer at Google, and was shared by many data scientists on social networks, so chances are you might have seen it already. Below are a few of the slides. The presentation provides a list of machine learning algorithms and applications in very simple words. It also explains the differences between AI, ML, and DL (deep learning).

Comparison of Deepnet & Neuralnet

In this article, I compare two R packages for modeling data with neural networks: neuralnet and deepnet. Through the comparison I highlight various challenges in finding good hyperparameter values, and show that the needed hyperparameters differ between the two packages, even with the same underlying algorithmic approach. neuralnet was developed by Stefan Fritsch and Frauke Guenther with contributors Marc Suling and Sebastian M. Mueller; deepnet was created by Xiao Rong. Both packages can be obtained from the R CRAN repository (see links at the end). I focus on a simple time series example with two predictors, and assess how well each package predicts future data after being trained on past data with a simple 5-neuron network. Note that most of what you read about deep learning with neural networks concerns “classification” problems (more on that later); nonetheless, such networks show promise for predicting continuous data, including time series.
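The article's experiments use R, but the setup it describes, a tiny 5-neuron network trained on past values of a two-predictor time series and asked to predict the future, can be sketched in Python for concreteness. This is a stand-in, not the article's code: scikit-learn's MLPRegressor replaces neuralnet/deepnet, and the synthetic data and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic two-predictor time series: the target is a noisy
# combination of both inputs (invented data, for illustration only).
rng = np.random.RandomState(0)
t = np.arange(200)
X = np.column_stack([np.sin(0.1 * t), np.cos(0.05 * t)])
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.randn(200)

# Train on the past, predict the future -- the evaluation scheme above.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# A single hidden layer of 5 neurons, mirroring the 5-neuron network
# in the article; solver and iteration budget are assumptions.
net = MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                   solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_train, y_train)
preds = net.predict(X_test)
mse = float(np.mean((preds - y_test) ** 2))
print(mse)
```

The out-of-sample mean squared error is the kind of headline number the article compares across the two packages and across hyperparameter choices.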

Create a Character-based Seq2Seq model using Python and TensorFlow

In this article, I will share my findings on creating a character-based Sequence-to-Sequence (Seq2Seq) model, along with some of the results I obtained. All of this is just a tiny part of my master’s thesis, and it took me quite a while to learn how to convert the theoretical concepts into practical models. I will also share the lessons I have learned.
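The article builds the model itself in TensorFlow; the character-level preprocessing that any Seq2Seq model needs can be sketched independently of the framework. The token names and the toy word pair below are illustrative assumptions, not the thesis's actual data; the one-step shift between decoder input and decoder target is the standard teacher-forcing arrangement.

```python
# Character-level preprocessing for a Seq2Seq model: map characters to
# integer ids and build decoder inputs/targets with a one-step shift.
PAD, SOS, EOS = "<pad>", "<s>", "</s>"

def build_vocab(texts):
    """Assign an id to every character seen, after the special tokens."""
    chars = sorted({c for t in texts for c in t})
    vocab = [PAD, SOS, EOS] + chars
    return {c: i for i, c in enumerate(vocab)}

def encode(text, vocab):
    return [vocab[c] for c in text]

def make_pair(source, target, vocab):
    """Encoder input, plus decoder input/target shifted by one step."""
    enc_input = encode(source, vocab)
    dec_input = [vocab[SOS]] + encode(target, vocab)   # starts with <s>
    dec_target = encode(target, vocab) + [vocab[EOS]]  # ends with </s>
    return enc_input, dec_input, dec_target

# Toy example pair (invented): "translate" hello -> hallo.
vocab = build_vocab(["hello", "hallo"])
enc, dec_in, dec_out = make_pair("hello", "hallo", vocab)
print(dec_in, dec_out)
```

At every decoder step the model sees the previous target character as input and is trained to emit the next one, which is exactly why the two sequences are the same string shifted by one position.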

Deep Learning for Disaster Recovery

With global climate change, devastating hurricanes are occurring with higher frequency. After a hurricane, roads are often flooded or washed out, making them treacherous for motorists. According to The Weather Channel, almost two of every three U.S. flash flood deaths from 1995–2010, excluding fatalities from Hurricane Katrina, occurred in vehicles. During my Insight A.I. Fellowship, I designed a system that detects flooded roads and created an interactive map app. Using state-of-the-art computer vision deep learning methods, the system automatically annotates flooded, washed-out, or otherwise severely damaged roads from satellite imagery.

How Computers Learn

This Vienna Gödel Lecture is a fascinating talk by Peter Norvig, Director of Research at Google, on the field of intelligent computers. Norvig draws on his long experience in AI and machine learning. Computer scientists have developed complex programming languages and systems that let us describe, step by step, how to solve problems such as keeping bank statements balanced. But there are other problems we cannot articulate how to solve: how do we recognize a person’s face, or translate a paragraph from German to English? We can’t describe how we do it, so we can’t easily program a computer to do it, but we can train a computer to learn how to do it. This talk explains how computers learn from examples, and what the promises and limitations of these techniques are. Previously, Norvig was head of Google’s core search algorithms group and of NASA Ames’s Computational Sciences Division, making him NASA’s most senior computer scientist. He received the NASA Exceptional Achievement Award in 2001. He has taught at the University of Southern California and the University of California at Berkeley, from which he received a Ph.D. in 1986 and the distinguished alumni award in 2006.

Choosing the right level of abstraction with TensorFlow

TensorFlow is a library that has revolutionized the way we approach machine learning problems. It was designed to build deep neural networks, train them, and evaluate and serve the solutions; the result of its popularity is a genuine democratization of AI. Like any library, it provides classes and functions designed to tackle the deep learning process, which introduces an interesting black-box dilemma. On one hand, it gives you ready-to-use solutions, often just one line of code; on the other, it hides most of the implementation details from the user. Fortunately, TensorFlow offers different levels of abstraction, letting the programmer decide how much control to keep. In this article, we’ll showcase TensorFlow’s abstraction levels by building and training a neural network for the canonical classification task of recognizing handwritten digits from the MNIST data set. This is an elementary computer vision problem: because the digits are represented as arrays of pixels (either 2D or flattened 1D), they can be fed as input to the neural network. The architectures tackling image recognition tasks are usually a combination of fully connected, convolutional, and pooling layers. Once set up, the model is trained, evaluated, and can be used to classify new examples.
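The data-shaping step the paragraph mentions can be seen without TensorFlow at all: a 28×28 MNIST-style digit must be flattened into a 784-element vector before it can feed a fully connected layer. The sketch below uses a random image and a hand-rolled numpy dense layer as a stand-in for what the high-level APIs wrap; the weights are untrained and purely illustrative.

```python
import numpy as np

# A fake MNIST-style digit: a 28x28 grid of pixel intensities in [0, 1]
# (random values standing in for a real image).
rng = np.random.RandomState(0)
image_2d = rng.rand(28, 28)

# Fully connected layers expect flat vectors, so the 2D grid becomes 1D.
x = image_2d.reshape(-1)          # shape (784,)

# One dense layer with 10 outputs, one score per digit class,
# written out in numpy to show the computation a framework hides.
W = rng.randn(784, 10) * 0.01
b = np.zeros(10)
logits = x @ W + b

# Softmax turns the raw scores into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(x.shape, probs.shape)
```

Every level of TensorFlow abstraction, from raw ops up to one-line estimators, ultimately performs this same flatten, multiply, and normalize pipeline under the hood.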

Using R: reshape2 to tidyr

Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is “long”, the other is “wide”. One form is “melted”, another “cast”. One form is “gathered”, the other “spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms. The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in tidyverse style (including the %>%s and all) to enjoy tidy data.
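The wide/long distinction is easiest to see on a tiny table. The article works in R with reshape2 and tidyr; the sketch below makes the same point with pandas as a language-neutral illustration (the subject/height/weight table is invented), where `melt` plays the role of reshape2's `melt` or tidyr's `gather`.

```python
import pandas as pd

# A "wide" table: one row per subject, one column per measurement.
wide = pd.DataFrame({
    "subject": ["a", "b"],
    "height": [170, 180],
    "weight": [65, 80],
})

# "Melting"/"gathering" makes it long: one row per subject-variable
# combination, which is the tidy form many analyses and plots expect.
long = wide.melt(id_vars="subject", var_name="variable", value_name="value")
print(long)
```

Going back the other way, `pivot` in pandas (or `dcast`/`spread` in R) "casts" the long table back to wide, which is the round trip the terminology in the paragraph describes.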

Transitioning to Data Science: How to become a data scientist, and how to create a data science team

It is difficult to define data science these days: every company claims to be doing data science, and everyone claims to be a data scientist. Practitioners are puzzled by their fuzzy job descriptions, and people who are trying to become data scientists are frustrated by the lack of standard definitions. In this conversation at the Toronto Machine Learning Summit 2017, we have tried to demystify data science and clarify what it means to be a data scientist.
Data science is applying scientific methods to understanding data in order to solve business problems. “A good data scientist in my mind is the person that takes the science part in data science very seriously; a person who is able to find problems and solve them using statistics, machine learning, and distributed computing,” said Amir Hajian, Director of Research at Thomson Reuters Labs. In other words, data scientists are people who can “reason through data using inferential reasoning, think in terms of probabilities, be scientific and systematic, and make data work at scale using software engineering best practices,” said Baiju Devani, Vice President of Analytics at Aviva. He added that it is important to recognize that “there is no deterministic path to the problems you’re solving or the solutions you find, so you have to be ok with fuzziness,” and that you need to “have that experimental mindset that allows you to work with vague problem and solution definitions.” In some sense, the best data scientists are people with “good statistical knowledge, programming and technical skills, and industry experience,” according to Lindsay Farber, Senior Data Scientist at MoneyKey.

DataFrames vs RDDs in Spark – Part 1

Spark is a fast, general engine for large-scale data processing: a cluster computing framework used for scalable and efficient analysis of big data. With Spark, we can use many machines that divide the tasks among themselves and perform fault-tolerant computations by distributing the data over a cluster. Among the many capabilities that made Spark famous is its ability to be used with various programming languages through APIs: we can write Spark operations in Java, Scala, Python, or R. Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. Spark’s components consist of Core Spark, Spark SQL, MLlib and ML for machine learning, and GraphX for graph analytics.

To help big data enthusiasts master Apache Spark, I have started writing tutorials; the first one is here and the second one is here. Over the next couple of weeks, I will write a blog post series on how to perform the same tasks using the Spark Resilient Distributed Dataset (RDD), DataFrames, and Spark SQL, and this is the first post. I am using pyspark, the Spark Python API that exposes the Spark programming model to Python.

The data can be downloaded from my GitHub repository. The data set is not large, but the same code works for large volumes as well, so we can practice with it to master the functionalities of Spark. For this tutorial, we will work with the SalesLTProduct.txt data. Let’s answer a couple of questions the RDD way, the DataFrame way, and with Spark SQL.
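The contrast the series explores, functional transformations over positional tuples (the RDD way) versus queries over named columns (the DataFrame/SQL way), can be mimicked in plain Python without a Spark cluster. The product rows below are invented, not the SalesLTProduct.txt data, and the snippet is only an analogy for the shape of the two APIs, not pyspark code.

```python
# Tiny invented product rows: (name, category, price).
rows = [
    ("Road Bike", "Bikes", 1200.0),
    ("Helmet", "Accessories", 35.0),
    ("Mountain Bike", "Bikes", 1500.0),
]

# "RDD way": chain map/filter over positional tuples, the caller must
# remember that field 1 is the category and field 2 is the price.
bike_prices = [r[2] for r in rows if r[1] == "Bikes"]
avg_rdd_style = sum(bike_prices) / len(bike_prices)

# "DataFrame way": the same query over records with named columns,
# so the intent ("average price of Bikes") is visible in the code.
records = [{"name": n, "category": c, "price": p} for n, c, p in rows]
prices = [r["price"] for r in records if r["category"] == "Bikes"]
avg_df_style = sum(prices) / len(prices)

print(avg_rdd_style, avg_df_style)
```

In real Spark the named-column form also lets the Catalyst optimizer plan the query, which is one of the main reasons the series compares the two APIs on the same tasks.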