Programming Exercises for the Analysis of Knowledge Graphs
This repository lets interested students and researchers perform hands-on analysis of knowledge graphs. It is developed primarily for the knowledge graph analysis lecture of the SDA Group at the University of Bonn, but the material is also useful to anyone else.
How to write your favorite R functions in Python
R or Python? This Python script mimics convenient R-style functions that make doing statistics quick and easy.
JSON Data in Python
In this tutorial, you’ll learn how to use JSON in Python.
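As a minimal sketch of what working with JSON in Python looks like, the standard-library `json` module can turn a dictionary into a JSON string and back. The `record` below is a made-up example, not data from the tutorial.

```python
import json

# Hypothetical record used only for illustration.
record = {"name": "Ada", "languages": ["Python", "R"], "active": True}

# Serialize the Python dict to a JSON string (keys sorted for stable output).
text = json.dumps(record, sort_keys=True)

# Parse the JSON string back into Python objects.
restored = json.loads(text)

print(text)
print(restored == record)
```

Note that Python's `True` becomes JSON's `true` on the way out and is converted back on the way in.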
Top 13 Python Deep Learning Libraries
1. TensorFlow (Contributors – 1700, Commits – 42256, Stars – 112591)
2. PyTorch (Contributors – 806, Commits – 14022, Stars – 20243)
3. Apache MXNet (Contributors – 628, Commits – 8723, Stars – 15447)
4. Theano (Contributors – 329, Commits – 28033, Stars – 8536)
5. Caffe (Contributors – 270, Commits – 4152, Stars – 25927)
6. fast.ai (Contributors – 226, Commits – 2237, Stars – 8872)
7. CNTK (Contributors – 189, Commits – 15979, Stars – 15281)
8. TFLearn (Contributors – 118, Commits – 599, Stars – 8632)
9. Lasagne (Contributors – 64, Commits – 1157, Stars – 3534)
10. nolearn (Contributors – 14, Commits – 389, Stars – 909)
11. Elephas (Contributors – 13, Commits – 249, Stars – 1046)
12. spark-deep-learning (Contributors – 12, Commits – 83, Stars – 1131)
13. Distributed Keras (Contributors – 5, Commits – 1125, Stars – 523)
Data Representation for Natural Language Processing Tasks
In NLP we must find a way to represent our data (a series of texts) to our systems (e.g. a text classifier). As Yoav Goldberg asks, ‘How can we encode such categorical data in a way which is amenable for use by a statistical classifier?’ Enter the word vector.
Visualize the Business Value of your Predictive Models with modelplotr
In this blog we explain the four most valuable evaluation plots to assess the business value of a predictive model. These plots are the cumulative gains, cumulative lift, response and cumulative response. Since these visualisations are not included in most popular model building packages or modules in R and Python, we show how you can easily create these plots for your own predictive models with our modelplotr R package and our modelplotpy Python module (Prefer Python? Read all about modelplotpy here!). This will help you to explain your model’s business value in layman’s terms to non-techies.
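To make two of these plots concrete, here is a rough sketch of how cumulative gains and cumulative lift could be computed by hand. It does not use modelplotr or modelplotpy; the function name, the equal-sized grouping and the toy scores and labels are all assumptions made for illustration.

```python
def cumulative_gains_and_lift(scores, labels, n_groups=4):
    """Return (gains, lift) per group for binary labels (1 = target)."""
    # Rank cases from the highest to the lowest predicted score.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ranked = [label for _, label in pairs]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    n = len(ranked)
    gains, lift = [], []
    for g in range(1, n_groups + 1):
        cutoff = round(n * g / n_groups)     # cases selected so far
        gain = sum(ranked[:cutoff]) / total_pos  # share of targets captured
        gains.append(gain)
        lift.append(gain / (cutoff / n))     # versus selecting at random
    return gains, lift


# Hypothetical model scores and true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
gains, lift = cumulative_gains_and_lift(scores, labels)
print(gains)  # [0.5, 0.75, 0.75, 1.0]
print(lift)   # [2.0, 1.5, 1.0, 1.0]
```

In this toy data, selecting the top quarter of cases by score captures half of all targets, which is twice what a random selection of the same size would be expected to capture.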
Automated Email Reports with R
R is an amazing tool to perform advanced statistical analysis and create stunning visualizations. However, data scientists and analytics practitioners do not work in silos, so these analyses have to be copied and emailed to senior managers and partner teams. Cut-copy-paste sounds great, but if it is a daily or periodic task, it is more useful to automate the reports. So in this blog post, we are going to learn how to do exactly that.
Building an Interactive Globe Visualization in R
This post describes how to use the threejs package to plot data on a globe, allowing rotation and zoom. Location markers are added as lines, allowing geographic data to be visualized.
Communicating results with R Markdown
In my training as a consultant, I learned that long hours of analysis were typically followed by equally long hours of preparing for presentations. I had to turn my complex analyses into recommendations, and my success as a consultant depended on my ability to influence decision makers. I used a variety of tools to convey my insights, but over time I increasingly came to rely on R Markdown as my tool of choice. R Markdown is easy to use, allows others to reproduce my work, and has powerful features such as parameterized inputs and multiple output formats. With R Markdown, I can share more work with less effort than I did with previous tools, making me a more effective data scientist. In this post, I want to examine three commonly used communication tools and show how R Markdown is often the better choice.
Understand how to transfer your paragraph to vector by doc2vec
In the previous story, word2vec, introduced by Mikolov et al. (2013), was discussed. Mikolov and Le later released a transformation to sentence/document vectors. It is another breakthrough in embeddings, as we can use a vector to represent a sentence or document. Mikolov et al. call it ‘Paragraph Vector’.
After reading this article, you will understand:
• Paragraph Vector Design
• Take Away
Capsule Neural Networks – Part 2: What is a Capsule?
What are these ‘Capsules’ in Capsule Neural Networks about? This post will give you the complete intuition and insights you need in simple language (and with dog faces), and later the technical details to understand them in depth.
Review: SSD – Single Shot Detector (Object Detection)
This time, SSD (Single Shot Detector) is reviewed. SSD needs only a single shot to detect multiple objects within an image, whereas region proposal network (RPN) based approaches such as the R-CNN series need two shots: one for generating region proposals and one for detecting the object in each proposal. Thus, SSD is much faster than two-shot RPN-based approaches.
Algorithms for hyperparameter optimisation in Python
Hyperparameters generally have a significant impact on the success of machine learning algorithms. A poorly configured ML model may perform no better than chance, while a well-configured one could achieve state-of-the-art results.
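One of the simplest hyperparameter optimisation algorithms is random search, sketched below with the standard library only. This is an illustration, not necessarily one of the algorithms the article covers: `toy_loss` is a made-up stand-in for a real validation loss, and the parameter names and ranges are assumptions.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample configurations uniformly and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    # Hypothetical loss surface with its minimum at learning_rate=0.1, reg=0.01.
    return (params["learning_rate"] - 0.1) ** 2 + (params["reg"] - 0.01) ** 2


space = {"learning_rate": (0.0, 1.0), "reg": (0.0, 0.1)}
best_params, best_score = random_search(toy_loss, space, n_trials=200)
print(best_params, best_score)
```

In practice the objective would train a model and return its validation loss, which is what makes each trial expensive and motivates smarter search strategies.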
5 Bite-Sized Data Science Summaries
In the spirit of teamwork, the Next Rewind video series asked a bunch of people to pick up to five favorite talks from Google Cloud Next SF 2018 and discuss them on camera in no more than five minutes.
Applying Machine Learning to classify an unsupervised text document
Text classification is a problem where we have a fixed set of classes/categories and any given text is assigned to one of these categories. In contrast, text clustering is the task of grouping a set of unlabeled texts in such a way that texts in the same group (called a cluster) are more similar to each other than to those in other clusters.
Time Series Forecasting with RNNs
In this article I want to give you an overview of an RNN model I built to forecast time series data. The main objectives of this work were to design a model that can not only predict the very next time step but also generate a sequence of predictions, and that can use multiple driving time series together with a set of static (scalar) features as its inputs.
A brief Introduction to Support Vector Machine
Support Vector Machine (SVM) is one of the most popular machine learning classifiers. It falls under the category of supervised learning algorithms and uses the concept of a margin to separate classes. It can give better accuracy than KNN, decision trees and Naive Bayes classifiers, and hence is quite useful.
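The margin idea can be illustrated with a small sketch of a linear SVM's decision value, hinge loss and margin width. This follows the standard linear soft-margin formulation rather than anything specific to the article, and the weights `w`, bias `b` and sample points are hypothetical values chosen by hand, not a trained model.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, x, y):
    """Zero when the point is on the correct side and outside the margin."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical separating hyperplane.
w, b = [3.0, 4.0], -5.0

print(hinge_loss(w, b, [2.0, 1.0], +1))   # well outside the margin
print(hinge_loss(w, b, [1.0, 0.5], +1))   # exactly on the decision boundary
print(margin_width(w))
```

Training an SVM amounts to choosing `w` and `b` so that the margin is as wide as possible while the total hinge loss stays small.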
Association Rule Mining
Association Rule Mining is one of the ways to find patterns in data. It finds:
• features (dimensions) which occur together
• features (dimensions) which are ‘correlated’
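The two kinds of patterns above are usually quantified with support, confidence and lift. The sketch below computes them for a single rule over a made-up set of shopping baskets; the function names and the data are illustrative assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(transactions, set(antecedent) | set(consequent))
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    confidence = s_both / s_a   # how often the rule holds when it applies
    lift = confidence / s_c     # > 1 suggests a positive association
    return s_both, confidence, lift


# Hypothetical baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

s, c, l = rule_metrics(transactions, {"bread"}, {"butter"})
print(s, c, l)
```

Here bread and butter occur together in half of the baskets, and a lift above 1 indicates that butter is more likely in baskets that contain bread than in baskets overall.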
Automating interpretable feature engineering for predicting CLV
This post will demonstrate the effectiveness of automating interpretable feature engineering with Deep Feature Synthesis. If you haven’t yet, please read this great post by William Koehrsen providing the explanation of DFS before continuing. If you are only interested in the problem and the results, then you can simply skip the methodology section.
Why and How to Use Pandas with Large Data
Pandas has been one of the most popular and favourite data science tools in the Python programming language for data wrangling and analysis. Data is unavoidably messy in the real world, and Pandas is seriously a game changer when it comes to cleaning, transforming, manipulating and analyzing data. In simple terms, Pandas helps to clean the mess.
Visualizing intermediate activation in Convolutional Neural Networks with Keras
In this article we’re going to train a simple Convolutional Neural Network using Keras with Python for a classification task. For that we will use a very small and simple set of images consisting of 100 pictures of circle drawings, 100 pictures of squares and 100 pictures of triangles, which I found on Kaggle. These will be split into training and testing sets (folders in the working directory) and fed to the network.
Some Important Data Science Tools that aren’t Python, R, SQL or Math
If you ask any Data Scientist what you need to know to succeed in the field, they’ll likely tell you some combination of the above. Every single DS job description mentions Python or R (sometimes even Java lol), SQL and math with some Spark, AWS/cloud experience mixed in and topped off with a healthy portion of buzzwords.
Google’s AdaNet Uses Neural Networks to Build Better Neural Networks
Ensemble learning is a machine learning technique that combines multiple models into an ensemble that can outperform each of the individual models. The main thesis behind ensemble learning is rooted in the famous No-Free-Lunch Theorem (NFLT), which states that, given all possible data distributions in an environment, we can’t expect any model to do better than random. In other words, models that might seem like the perfect solution for a problem might perform poorly given a change in the data distribution. If that’s the case, why not combine different models that perform well for a specific task and expect that the performance of the group will be more resilient to changes in the data distribution? That’s the essence of ensemble learning. Famously, several quant hedge funds use ensemble learning to combine multiple market trading models into a single ensemble, hoping that it will perform better under unknown market conditions.
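A tiny sketch of the idea, using majority voting (one common way to combine models; it is not AdaNet's method): three hand-written toy classifiers each make a different mistake, and the vote cancels those mistakes out. The models and the ground-truth rule are invented for illustration.

```python
from collections import Counter


def majority_vote(models, x):
    """Return the label predicted by most of the models."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


def truth(x):
    # Hypothetical ground truth: class 1 for inputs of 5 and above.
    return int(x >= 5)


# Three imperfect toy models, each wrong on a different input.
models = [
    lambda x: int(x >= 4),             # wrong on x = 4
    lambda x: int(x >= 6),             # wrong on x = 5
    lambda x: int(x >= 5 and x != 8),  # wrong on x = 8
]


def accuracy(predict, inputs):
    return sum(predict(x) == truth(x) for x in inputs) / len(inputs)


inputs = list(range(10))
individual = [accuracy(m, inputs) for m in models]
ensemble = accuracy(lambda x: majority_vote(models, x), inputs)
print(individual, ensemble)
```

The gain depends on the models making different errors; three models that fail on the same inputs would gain nothing from voting.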
Getting Started with Airflow Using Docker
Lately I’ve been reading intensively on data engineering after being inspired by this great article by Robert Chang providing an introduction to the field. The underlying message of the article really resonated with me: when most people think of data science they immediately think about the stuff being done by very mature tech companies like Google or Twitter, like deploying uber-sophisticated machine learning models all the time.
Preventing Machine Learning Bias
Machine learning algorithms are increasingly used to make decisions around assessing employee performance and turnover, identifying and preventing recidivism, and assessing job suitability. These algorithms are cheap to scale and can sometimes be cheap to develop in an environment filled with good quality research and tooling. Unfortunately, one thing these algorithms do not automate is structuring your data and training pipeline so that it does not lead to bias and negative self-reinforcing loops. This article will cover solutions for removing bias from your algorithms using bias-aware rather than blindness-based approaches.
Residual Networks (ResNets)
In earlier posts, we saw the implementation of LeNet-5, AlexNet, and VGG16, which are deep convolutional neural networks. In theory, we can similarly build our own deep neural network with more than 100 layers, but in reality such networks are hard to train. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun introduced the concept of Residual Networks (ResNets) in their paper Deep Residual Learning for Image Recognition. ResNets allow much deeper networks to be trained efficiently.
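The core of a residual block is the skip connection: the block outputs F(x) + x instead of F(x) alone. The sketch below shows this with plain Python lists and fully connected layers instead of convolutions; the helper names, layer sizes and weights are simplifying assumptions, not the paper's architecture.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    f = linear(w2, relu(linear(w1, x)))
    # Skip connection: add the block's input back onto its output.
    return relu([fi + xi for fi, xi in zip(f, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights F(x) is 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))
```

Because a block can fall back to the identity this easily, stacking more blocks need not make the network harder to optimise, which is the intuition behind training very deep ResNets.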
11 websites to find free, interesting datasets
If you’re new to the data space, have recently learned a new skill, or are just trying to build a more robust data science/analyst portfolio, a perfect way of solidifying your skills is to do some mini-projects focused on them. Below we outline a few places where you can find publicly available data for your next project. If you’re interested in practicing real data scientist and analyst interview questions, feel free to sign up for our email newsletter, where we send a few curated questions per week to help you prepare for interviews at top companies.