A Journey Through Spark

Bastian Haase is an alum of the Insight Data Engineering program in Silicon Valley, now working as a Program Director at Insight Data Science for the Data Engineering and DevOps Engineering programs. In this blog post, he shares his experience of getting started with open source software.


Deep Quantile Regression

One area that deep learning has not explored extensively is uncertainty in estimates. Most deep learning frameworks currently focus on producing a single best estimate, as defined by a loss function. Occasionally, something beyond a point estimate is required to make a decision, and this is where a distribution would be useful. Bayesian statistics lends itself to this problem really well, since a distribution, rather than a single value, is inferred. However, Bayesian methods have so far been rather slow and expensive to apply to large datasets. For decision making, most people actually need quantiles rather than true uncertainty in an estimate. For instance, when measuring a child's weight at a given age, the weight of individuals will vary; what would be interesting (for argument's sake) is the 10th and 90th percentile. Note that uncertainty is different from quantiles: I could still ask for a confidence interval on the 90th percentile itself. This article focuses purely on inferring quantiles.
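As a minimal sketch of the idea (not the article's exact code; the pinball_loss helper, network shape, and training call are illustrative assumptions using TensorFlow's Keras API), a network can be trained to output a chosen quantile by minimizing the pinball loss:

```python
import tensorflow as tf
from tensorflow import keras

def pinball_loss(q):
    """Pinball (quantile) loss for quantile level q in (0, 1).

    Under-predictions are penalized with weight q and over-predictions
    with weight (1 - q), so minimizing the expected loss yields the
    q-th conditional quantile."""
    def loss(y_true, y_pred):
        e = y_true - y_pred
        return tf.reduce_mean(tf.maximum(q * e, (q - 1.0) * e))
    return loss

# A small regression network trained to output the 90th percentile.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=pinball_loss(0.9))
# model.fit(x_train, y_train, epochs=200)  # x_train, y_train assumed given
```

Fitting the same network once with q=0.1 and once with q=0.9 yields the two percentiles mentioned above; one model per quantile keeps the sketch simple.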


50 Most Popular Python Projects in 2018

1) TensorFlow Models
2) Keras
3) Flask
4) scikit-learn
5) Zulip
6) Django
7) Rebound
8) Google Images Download
9) youtube-dl
10) System Design Primer
11) Mask R-CNN
12) Face Recognition
13) snallygaster
14) Ansible
15) Detectron
16) asciinema
17) HTTPie
18) You-Get
19) Sentry
20) Tornado
21) Magenta
22) ZeroNet
23) Gym
24) Pandas
25) Luigi
26) spaCy (by Explosion AI)
27) Theano
28) TFLearn
29) Kivy
30) Mailpile
31) Matplotlib
32) YAPF (by Google)
33) Cookiecutter
34) HTTP Prompt
35) speedtest-cli
36) Pattern
37) Gooey (Beta)
38) Wagtail CMS
39) Bottle
40) Prophet (by Facebook)
41) Falcon
42) Mopidy
43) Hug
44) SymPy
45) Dash
46) Visdom
47) Luminoth
48) Pygame
49) Requests
50) Statsmodels


Deep Learning Tips and Tricks

Below is a distilled collection of conversations, messages, and debates I've had with peers and students on how to optimize deep models. If you have tricks you've found impactful, please share them!


Overview and benchmark of traditional and deep learning models in text classification

This article is an extension of a previous one I wrote when I was experimenting with sentiment analysis on Twitter data. At the time, I explored a simple model: a two-layer feed-forward neural network trained with Keras. The input tweets were represented as document vectors resulting from a weighted average of the embeddings of the words composing the tweet. The embedding I used was a word2vec model I trained from scratch on the corpus using gensim. The task was binary classification, and with this setup I was able to achieve 79% accuracy. The goal of this post is to explore other NLP models trained on the same dataset and then benchmark their respective performance on a given test set. We'll go through different models, from simple ones relying on a bag-of-words representation to heavier machinery deploying convolutional/recurrent networks, and see if we can score more than 79% accuracy!
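As a rough sketch of that baseline (gensim 4 and TensorFlow's Keras API assumed; a plain average stands in for the article's weighted average, and tokenized_tweets and labels are placeholder names, not the post's code):

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

# tokenized_tweets (list of token lists) and labels (0/1 array) assumed given.
w2v = Word2Vec(sentences=tokenized_tweets, vector_size=200, window=5, min_count=2)

def doc_vector(tokens, model):
    """Average the embeddings of a tweet's in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(t, w2v) for t in tokenized_tweets])

# Two-layer feed-forward classifier on the document vectors.
clf = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation="sigmoid"),
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
clf.fit(X, labels, validation_split=0.1, epochs=10)
```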


Marginal Effects for Regression Models in R

Regression coefficients are typically presented in tables that are easy to read, but the estimates themselves can be difficult to interpret. This is especially true for interaction or transformed terms (quadratic or cubic terms, polynomials, splines), particularly in more complex models. In such cases, the coefficients are no longer directly interpretable, and marginal effects are far easier to understand. Specifically, visualizing marginal effects makes it possible to grasp intuitively how predictors and outcome are associated, even for complex models.
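The article itself works in R; as a cross-language illustration of the same idea, here is a hypothetical Python sketch using statsmodels, where predictions over a grid play the role of marginal effects for a transformed term (df, outcome, age, and income are placeholder names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df is a hypothetical DataFrame with a binary `outcome` and predictors.
model = smf.logit("outcome ~ age + I(age**2) + income", data=df).fit()

# The age and age^2 coefficients are hard to read on their own; instead,
# predict the outcome over an age grid, holding income at its mean.
grid = pd.DataFrame({
    "age": np.linspace(df["age"].min(), df["age"].max(), 50),
    "income": df["income"].mean(),
})
grid["p_outcome"] = model.predict(grid)
print(grid.head())  # p_outcome plotted against age shows the association
```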


Predict Customer Churn with Gradient Boosting

Customer churn is a key predictor of the long-term success or failure of a business. But when it comes to all this data, what's the best model to use? This post shows that gradient boosting is the most accurate way of predicting customer attrition. I'll show you how you can create your own data analysis using gradient boosting to identify and save those at-risk customers!
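As a hedged sketch of that kind of analysis using scikit-learn's GradientBoostingClassifier (the churn.csv file, the churned column, and the hyperparameters are placeholders, not the post's actual data or settings):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical churn dataset: one row per customer, binary `churned` label.
df = pd.read_csv("churn.csv")
X = df.drop(columns=["churned"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, random_state=42)
gbm.fit(X_train, y_train)

# Rank customers by predicted churn risk to target retention offers.
risk = gbm.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, risk))
```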