Essentials of Deep Learning: Visualizing Convolutional Neural Networks in Python

One of the most debated topics in deep learning is how to interpret and understand a trained model – particularly in the context of high-risk industries like healthcare. The term “black box” has often been associated with deep learning algorithms. How can we trust the results of a model if we can’t explain how it works? It’s a legitimate question. Take the example of a deep learning model trained to detect cancerous tumours. The model tells you that it is 99% sure that it has detected cancer – but it does not tell you why or how it made that decision. Did it find an important clue in the MRI scan? Or was it just a smudge on the scan that was incorrectly detected as a tumour? This is a matter of life and death for the patient, and doctors cannot afford to be wrong.

Automate R processes

Last week we updated the cronR R package and released it to CRAN, allowing you to schedule any R code at whatever time you like. The package was updated to comply with stricter CRAN policies regarding writing to folders. Along the way, the RStudio add-in of the package was also updated. It now looks as shown below and is tailored to data scientists who want to automate basic R scripts.

Multiple Versions of R

Data scientists prefer using the latest R packages to analyze their data. To ensure a good user experience, you will need a recent version of R running on a modern operating system. If you run R on a production server – and especially if you use RStudio Connect – plan to support multiple versions of R side by side so that your code, reports, and apps remain stable over time. You can support multiple versions of R concurrently by building R from source. Plan to install a new version of R at least once per year on your servers.

Efficiency of random swap clustering

The random swap algorithm aims to solve clustering by a sequence of prototype swaps, fine-tuning the exact location of the prototypes by k-means. This randomized search strategy is simple to implement and efficient. It reaches good quality clustering relatively fast, and if iterated longer, it finds the correct clustering with high probability. In this paper, we analyze the expected number of iterations needed to find the correct clustering. Using this result, we derive the expected time complexity of the random swap algorithm. The main results are that the expected time complexity has (1) linear dependency on the number of data vectors, (2) quadratic dependency on the number of clusters, and (3) inverse dependency on the size of the neighborhood. Experiments also show that the algorithm is clearly more efficient than k-means and almost never gets stuck in an inferior local minimum.
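
The swap-then-fine-tune loop described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's implementation: the function names, the choice of two k-means iterations per trial, and the rule of keeping a swap only when the sum of squared errors decreases are all assumptions made for this example.

```python
import numpy as np

def assign(data, centroids):
    """Return, for each data vector, the index of its nearest centroid."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def sse(data, centroids, labels):
    """Sum of squared distances from each vector to its assigned centroid."""
    return float(((data - centroids[labels]) ** 2).sum())

def kmeans_steps(data, centroids, steps=2):
    """Fine-tune centroids with a few k-means iterations (assumed: 2)."""
    centroids = centroids.copy()
    for _ in range(steps):
        labels = assign(data, centroids)
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids, assign(data, centroids)

def random_swap(data, k, iterations=200, seed=0):
    """Hypothetical helper: random swap clustering on an (n, d) float array."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(data), size=k, replace=False)
    centroids, labels = kmeans_steps(data, data[start].astype(float))
    best = sse(data, centroids, labels)
    for _ in range(iterations):
        trial = centroids.copy()
        # Swap: move one randomly chosen prototype onto a random data vector.
        trial[rng.integers(k)] = data[rng.integers(len(data))]
        trial, trial_labels = kmeans_steps(data, trial)
        cost = sse(data, trial, trial_labels)
        if cost < best:  # keep the swap only if it improves the clustering
            centroids, labels, best = trial, trial_labels, cost
    return centroids, labels, best
```

Because a trial is accepted only when it lowers the error, the returned error can never be worse than that of the initial solution; more iterations simply give the search more chances to find a successful swap.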

Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction

The goal of this project is to encourage users to try and experiment with the seq2seq neural network architecture by solving simple toy problems in signal prediction. Seq2seq architectures are normally used for more sophisticated purposes than signal prediction, such as language modeling, but this project is a useful tutorial before moving on to more complicated tasks. The project contains 4 exercises of gradually increasing difficulty. I take for granted that readers already have at least a basic knowledge of RNNs and of how they can be shaped into an encoder and a decoder of the simplest form (without attention). To learn more about RNNs in TensorFlow, you may want to visit this other project of mine: https://…/LSTM-Human-Activity-Recognition The current project is a series of examples I first built in French, and I haven’t had the time to regenerate all the charts with proper English text. I built this project for the practical part of the third hour of a ‘master class’ conference that I gave at the WAQ (Web At Quebec) in March 2017: https://…/deep-learning-avec-tensorflow
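
To make the signal prediction setup concrete, here is a small sketch of how toy training data for an encoder and a decoder could be laid out. It does not reproduce the project's exercises: the function name, the sine signal, and the window lengths are assumptions chosen for illustration. The encoder would read the past window and the decoder would be trained to emit the future window.

```python
import numpy as np

def make_batch(batch_size, past_len, future_len, seed=0):
    """Hypothetical helper: build (encoder input, decoder target) pairs.

    Each example is a sine wave with a random phase, split into a past
    window (fed to the encoder) and a future window (decoder target).
    Shapes follow the common (batch, time, features) layout.
    """
    rng = np.random.default_rng(seed)
    total = past_len + future_len
    t = np.linspace(0.0, 4.0 * np.pi, total)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(batch_size, 1))
    signals = np.sin(t[None, :] + phases)          # (batch, total)
    encoder_inputs = signals[:, :past_len, None]   # (batch, past_len, 1)
    decoder_targets = signals[:, past_len:, None]  # (batch, future_len, 1)
    return encoder_inputs, decoder_targets
```

For example, `make_batch(8, 10, 5)` yields 8 sequences in which the model sees 10 time steps and must predict the next 5.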

Data governance and the death of schema on read

In the olden days of data science, one of the rallying cries was the democratization of data. No longer were data owners at the mercy of enterprise data warehouses (EDWs) and extract, transform, load (ETL) jobs, where data had to be transformed into a specific schema (“schema on write”) before it could be stored in the enterprise data warehouse and made available for use in reporting and analytics. This data was often most naturally expressed as nested structures (e.g., a base record with two array-typed attributes), but warehouses were usually based on the relational model. Thus, the data needed to be pulled apart and “normalized” into flat relational tables in first normal form. Once stored in the warehouse, recovering the data’s natural structure required several expensive relational joins. Or, for the most common or business-critical applications, the data was “de-normalized,” in which formerly nested structures were reunited, but in a flat relational form with a lot of redundancy. This is the context in which big data and the data lake arose. No single schema was imposed. Anyone could store their data in the data lake, in any structure (or no consistent structure). Naturally nested data was no longer stripped apart into artificially flat structures. Data owners no longer had to wait for the IT department to write ETL jobs before they could access and query their data. In place of the tyranny of schema on write, schema on read was born. Users could store their data in any schema, which would be discovered at the time of reading the data. Data storage was no longer the exclusive province of the DBAs and the IT departments. Data from multiple previously siloed teams could be stored in the same repository.
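
The contrast between the two approaches can be illustrated with a small sketch. The records, field names, and helper functions below are invented for this example. `normalize` pulls a nested record apart into flat rows, as a schema-on-write pipeline would, while `infer_schema` discovers whatever fields are present only when the stored data is read.

```python
import json

# Raw records stored as-is; the second one has an extra "team" field.
raw_lines = [
    '{"id": 1, "name": "Ada", "phones": ["555-0100", "555-0101"], "emails": ["ada@example.com"]}',
    '{"id": 2, "name": "Grace", "phones": [], "emails": ["grace@example.com"], "team": "compilers"}',
]

def normalize(record):
    """Schema on write: split a base record with two array-typed
    attributes into three flat row sets in first normal form."""
    base = {"id": record["id"], "name": record["name"]}
    phones = [{"id": record["id"], "phone": p} for p in record.get("phones", [])]
    emails = [{"id": record["id"], "email": e} for e in record.get("emails", [])]
    return base, phones, emails

def infer_schema(lines):
    """Schema on read: discover field names and value types at read time."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema
```

Note that `normalize` silently drops the unexpected `team` field because the target tables have no column for it, whereas `infer_schema` picks it up. Recovering the original nested record from the flat rows would require joining the three row sets on `id`.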

Interpreting predictive models with Skater: Unboxing model opacity

Over the years, machine learning (ML) has come a long way, from its existence as experimental research in a purely academic setting to wide industry adoption as a means for automating solutions to real-world problems. But oftentimes, these algorithms are still perceived as alchemy because of the lack of understanding of the inner workings of these models (see Ali Rahimi, NIPS ’17). There is often a need to verify the reasoning of such ML systems to hold algorithms accountable for the decisions predicted. Researchers and practitioners are grappling with the ethics of relying on predictive models that might have unanticipated effects on human life, such as the algorithms evaluating eligibility for mortgage loans or powering self-driving cars (see Kate Crawford, NIPS ’17, “The Trouble with Bias”). Data scientist Cathy O’Neil has recently written an entire book filled with examples of poor interpretability as a dire warning of the potential social carnage from misunderstood models, e.g., modeling bias in criminal sentencing or using dummy features with human bias while building financial models.

Exploring Vitamin D deficiency in the United States: NHANES 2001-2010

In the latest issue of the British Medical Journal (BMJ), I read a paper by Budhathoki et al. that investigated the association between vitamin D (serum 25-hydroxyvitamin D) and cancer in the Japanese population. The authors reported that high levels of vitamin D were associated with a low risk of cancer. There is a large evidence base supporting the beneficial role of vitamin D in lowering the risk of chronic diseases and mortality. This motivated me to explore the levels of vitamin D in the U.S. using data from the National Health and Nutrition Examination Survey (NHANES).