If you are a beginner who hasn’t yet tried a hand at data science projects, datasets are what will carry you from the starting point of ‘no experience’ to that exceptionally desired destination called ‘expert’. Websites offering free datasets on a wide range of topics have a number of advantages: with them, you can easily brush up your skills and develop your own style of working, which matters a great deal today. From there, you can confidently build an excellent data science/analyst CV, land the job of your dreams, and eventually feel like a data science king or queen. Sounds great, doesn’t it? So why wait any longer? Without further ado, let’s get started!
1. data.world
2. Kaggle
4. FiveThirtyEight
4. BuzzFeed
5. Data.gov
6. Socrata OpenData
7. Quandl
8. Reddit or r/datasets
9. UCI Machine Learning Repository
PyTorch BigGraph is a tool to create and handle large graph embeddings for machine learning. Currently there are two approaches in graph-based neural networks:
• Directly use the graph structure and feed it to a neural network, so that the graph structure is preserved at every layer. Graph CNNs use this approach; see, for instance, my post or this paper on the topic.
• Most graphs, however, are too large for that, so it is also reasonable to create a large embedding of the graph and then use it as features in a traditional neural network.
PyTorch BigGraph handles the second approach, and so will we below. Just for reference, let’s talk about the size aspect for a second. Graphs are usually encoded by their adjacency matrix. If you have a fully connected graph with 3,000 nodes, i.e. an edge between every pair of nodes, you end up with around 9,000,000 entries in your matrix. Even stored sparsely, this apparently overwhelms most GPUs, according to the paper linked above.
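A quick back-of-the-envelope sketch makes the size argument concrete. The node counts below are illustrative (3,000 is the example from the text):

```python
# Dense adjacency-matrix memory for a fully connected graph.
def adjacency_bytes(n_nodes, bytes_per_entry=4):
    """One float32 entry per ordered node pair."""
    return n_nodes * n_nodes * bytes_per_entry

for n in (3_000, 1_000_000):
    gb = adjacency_bytes(n) / 1e9
    print(f"{n:>9,} nodes -> {n * n:>16,} entries, {gb:,.3f} GB dense")
```

The quadratic growth is the point: at a million nodes the dense matrix alone is terabytes, which is why an embedding-based approach becomes necessary.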
From theory to practice: learn the underlying principles of the Perceptron and implement it on a dataset with Stochastic Gradient Descent.
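A minimal sketch of the idea, not the article’s exact implementation: the classic perceptron update applied one sample at a time (the stochastic part), on a made-up toy dataset.

```python
import numpy as np

def perceptron_sgd(X, y, lr=0.1, epochs=20):
    """Train a perceptron with per-sample (stochastic) updates.
    X: (n_samples, n_features); y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update weights only when the sample is misclassified.
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

def predict(X, w, b):
    return np.where(X @ w + b >= 0, 1, -1)

# Tiny linearly separable toy data (hypothetical):
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_sgd(X, y)
print(predict(X, w, b))  # matches y on this separable toy set
```

On linearly separable data the perceptron is guaranteed to converge; on non-separable data the loop above would simply keep cycling, which is one of the limitations the theory covers.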
Gathering data is a vital part of any machine learning project, but many tutorials tend to use data that already exists in a convenient format. This is great for example cases, but not great for learning the whole process. In the real world, not all data can be found on Google… at least not yet. This is the part of the tutorial where we collect all of the statistics necessary for training our neural net for NHL prediction, but more importantly, I will show you how you can apply these concepts and collect any data you want from the world wide web. I am going to break this down into a couple of digestible parts so that you can create the web scraper yourself and also show you how to easily read through the storm of text that is HTML.
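To preview the core idea of reading through that storm of HTML, here is a minimal sketch using BeautifulSoup. The markup and table id below are made up; a real scraper would first fetch a live page, e.g. with `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for a stats page (hypothetical structure).
html = """
<table id="team-stats">
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Maple Leafs</td><td>46</td></tr>
  <tr><td>Canadiens</td><td>44</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = []
# Skip the header row, then pull the cell text out of each data row.
for tr in soup.find("table", id="team-stats").find_all("tr")[1:]:
    team, wins = (td.get_text() for td in tr.find_all("td"))
    rows.append((team, int(wins)))

print(rows)  # [('Maple Leafs', 46), ('Canadiens', 44)]
```

The same pattern (locate a container, iterate its rows, extract cell text) carries over to almost any tabular data on the web.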
Anyone working with data knows that real-world data is often patchy, and cleaning it takes up a considerable amount of your time (80/20 rule, anyone?). Having recently moved from Pandas to Pyspark, I was used to the conveniences that Pandas offers and that Pyspark sometimes lacks due to its distributed nature. One of the features I have learned to particularly appreciate is the straightforward way of interpolating (or in-filling) time series data that Pandas provides. This post demonstrates this capability in a straightforward and easily understandable way, using the example of sensor readings collected in a set of houses. The full notebook for this post can be found on my GitHub.
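As a quick sketch of the Pandas side of this convenience: time-based linear interpolation over gaps in a sensor series. The readings below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings at 10-minute intervals, with gaps.
idx = pd.date_range("2019-01-01", periods=6, freq="10min")
readings = pd.Series([20.0, np.nan, np.nan, 23.0, np.nan, 25.0], index=idx)

# Interpolate linearly with respect to the timestamps.
filled = readings.interpolate(method="time")
print(filled)
```

Replicating exactly this one-liner in Pyspark is what takes the extra work, since a distributed DataFrame has no notion of "the previous row" without an explicit window.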
In this blog, we give a quick hands-on tutorial on how to train the ResNet model in TensorFlow. While the official TensorFlow documentation does have the basic information you need, it may not entirely make sense right away, and it can be a little hard to sift through.
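For orientation, here is a minimal sketch (not the tutorial’s exact code) of the end-to-end training call using the Keras-bundled ResNet50 on dummy stand-in data; a real run would use an actual dataset such as CIFAR-10.

```python
import numpy as np
import tensorflow as tf

# Build a ResNet50 from scratch (no pretrained weights), sized for a quick demo.
model = tf.keras.applications.ResNet50(
    weights=None,             # random initialization, train from scratch
    input_shape=(32, 32, 3),  # small inputs to keep the demo fast
    classes=10,
)
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Dummy data standing in for a real image dataset.
x = np.random.rand(8, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(8,))
model.fit(x, y, epochs=1, batch_size=4, verbose=0)
print(model.output_shape)  # (None, 10)
```

The official tutorial builds the architecture layer by layer; reaching for `tf.keras.applications` first is a reasonable sanity check that your data pipeline and training loop work before you dig into the residual blocks themselves.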
If you ever struggled with tuning Machine Learning (ML) models, you are reading the right piece.
Hyper-parameter tuning refers to the problem of finding an optimal set of hyper-parameter values for a learning algorithm.
Usually, the process of choosing these values is a time-consuming task.
Even for simple algorithms like Linear Regression, finding the best set of hyper-parameters can be tough. With Deep Learning, things get even worse.
Some of the parameters to tune when optimizing neural nets (NNs) include:
• learning rate
• momentum
• regularization
• dropout probability
• batch normalization
In this short piece, we talk about the best practices for optimizing ML models. These practices come in handy mainly when the number of parameters to tune exceeds two or three.
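One such practice is random search: instead of exhaustively evaluating every grid point, sample a fixed budget of random combinations. A hedged sketch, where the evaluation function is a stand-in for "train the model and return validation accuracy" (the search space values are made up):

```python
import random

search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "momentum":      [0.0, 0.5, 0.9, 0.99],
    "dropout":       [0.1, 0.3, 0.5],
}

def evaluate(params):
    # Placeholder: a real version would train a model with these
    # hyper-parameters and return a validation score.
    rng = random.Random(str(sorted(params.items())))
    return rng.random()

random.seed(0)
best_params, best_score = None, float("-inf")
for _ in range(20):  # 20 random trials instead of the full 4*4*3 = 48 grid
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, round(best_score, 3))
```

The budget (20 trials here) is fixed regardless of how many hyper-parameters you add, which is exactly why random search scales better than grid search past two or three dimensions.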
Machine Learning adoption among large enterprises is at an all-time high. If you’re not investing in the people and tools needed to support machine learning models, you’re probably behind your competitors.
Read this ebook to learn from real-world data science practitioners, who present their unique perspectives and advice on handling six common problems:
• Reconciling disparate interfaces
• Resolving environment dependencies
• Ensuring close collaboration among all ML stakeholders
• Building or renting adequate ML infrastructure
• Meeting the scalability needs of your application
• Enabling smooth deployment of ML projects
A curated list of applied machine learning and data science notebooks and libraries across different industries. The code in this repository is in Python (primarily using Jupyter notebooks) unless otherwise stated. The catalogue is inspired by awesome-machine-learning.
While most of us are still ‘wow’ing at the early applications of machine learning, the field continues to evolve at quite a promising pace, introducing us to more advanced approaches like Deep Learning. This branch, by the way, attracts more attention than all other ML algorithms combined. Why is DL so good? It is simply great in terms of accuracy when trained with a huge amount of data, and it plays a significant role in filling the gap when a scenario is challenging even for the human brain. Logically enough, this has contributed to a whole slew of new frameworks. Just a few years ago, none of today’s leaders other than Theano were even around. Now the choice is so wide that working out which framework suits you best costs real time and energy. Well, that’s why I’m writing this post. Without further ado, let’s get started.
1. TensorFlow
2. PyTorch
3. Sonnet
4. Keras
5. MXNet
6. Gluon
7. Swift
8. Chainer
9. DL4J
10. ONNX
Have you ever had a moment when you realized something you thought was so obviously true, suddenly became so obviously false, and in that instant your whole understanding of the world changed? I think of the day I realized my parents were actually fallible human beings. Up until then, part of my subconscious was convinced they were unquestionable demigods. It wasn’t until my 30s that I started to really question the validity of my opinions and recognize that I was biased towards their viewpoints – both on the world and about me. I gave more weight to their ideas on politics, religion, morals, and even my own capabilities and characteristics than I did to the opinions or facts presented by my experiences. That was the moment I began making it a point to think harder. To examine why I believed something and attempt to form less parentally biased opinions of my own.
We are inching closer to one of the most anticipated events in India – the General Elections! Everyone is hooked to the latest news and developments (and trust me, there is something happening every single day). This is a great time to be a data scientist. Why? There is so much data being generated thanks to these developments. We can come up with tons of use cases – visualizing sentiments, predicting sentiments, building models to predict the winners, etc. Data Science offers tools, techniques, and algorithms to analyze that textual, audio, and video data and infer the nation’s psychology, behavior, and feelings before, during, and after the elections. So, I decided to use a few awesome ML techniques to predict moods using Twitter data.
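As a toy preview of mood scoring, here is a tiny lexicon-based scorer. The word lists and tweets are made up; a real pipeline would collect tweets via the Twitter API and use a trained classifier or a library such as TextBlob or VADER.

```python
# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"great", "win", "hope", "good", "happy"}
NEGATIVE = {"bad", "lose", "angry", "sad", "fail"}

def mood_score(tweet):
    """Positive word count minus negative word count."""
    words = [w.strip(".,!?") for w in tweet.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "Great rally today, so much hope!",
    "Angry crowds, a sad day for the campaign",
]
print([mood_score(t) for t in tweets])  # [2, -2]
```

Lexicon scoring ignores negation and sarcasm, which is precisely why ML-based approaches tend to do better on real election chatter.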
In this article, I want to explain one of the most important concepts in machine learning and data science, one we encounter right after training a machine learning model. It is a must-know topic.