Configuration and Intercomparison of Deep Learning Neural Models for Statistical Downscaling

Deep learning techniques (in particular convolutional neural networks, CNNs) have recently emerged as a promising approach for statistical downscaling due to their ability to learn spatial features from huge spatio-temporal datasets. However, existing studies are based on complex models, applied to particular case studies and using simple validation frameworks, which makes it difficult to properly assess the (possible) added value offered by these techniques. As a result, these models are usually seen as black boxes, generating distrust among the climate community, particularly in climate change problems. In this paper we undertake a comprehensive assessment of deep learning techniques for continental-scale statistical downscaling, building on the VALUE validation framework. In particular, different CNN models of increasing complexity are applied for downscaling temperature and precipitation over Europe, comparing them with a few standard benchmark methods from VALUE (linear and generalized linear models) which have been traditionally used for this purpose. Besides analyzing the adequacy of different components and topologies, we also focus on their extrapolation capability, a critical point for their possible application in climate change studies. To do this, we use a warm test period as a surrogate for possible future climate conditions. Our results show that, whilst the added value of CNNs is mostly limited to the reproduction of extremes for temperature, these techniques do outperform the classic ones for the case of precipitation for most aspects considered. This overall good performance, together with the fact that they can be suitably applied to large regions (e.g. continents) without worrying about the spatial features being considered as predictors, can foster the use of statistical approaches in international initiatives such as CORDEX.

Fuzzy Forests: Extending Random Forest Feature Selection for Correlated, High-Dimensional Data

In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking the importance of features in high-dimensional classification and regression problems. Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when the number of features, p, is much larger than the sample size, n (p ≫ n). We introduce our implementation of fuzzy forests in the R package, fuzzyforest. Fuzzy forests works by taking advantage of the network structure between features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of the package WGCNA (weighted gene coexpression network analysis, alternatively known as weighted correlation network analysis) to form modules of features such that the modules are roughly uncorrelated. Then recursive feature elimination random forests (RFE-RFs) are used on each module, separately. From the surviving features, a final group is selected and ranked using one last round of RFE-RFs. This procedure results in a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model.
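The first step described above, partitioning features into modules with high within-module and low between-module correlation, can be illustrated with a small Python sketch. This is not WGCNA and not the fuzzyforest implementation: as a simplifying assumption it links two features whenever their absolute Pearson correlation reaches a threshold and takes the connected components as modules. The function names, the 0.7 threshold, and the synthetic data are all hypothetical, and the RFE-RF screening steps are not shown.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_modules(columns, threshold=0.7):
    """Group feature indices into modules: connected components of the graph
    that links features whose absolute correlation is at least `threshold`.
    (A stand-in for WGCNA module detection, assumed for illustration only.)"""
    p = len(columns)
    parent = list(range(p))  # union-find forest over feature indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                parent[find(i)] = find(j)

    modules = {}
    for i in range(p):
        modules.setdefault(find(i), []).append(i)
    return sorted(modules.values())


# Synthetic data: two independent latent signals, three noisy copies of each.
rng = random.Random(0)
n = 200
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
columns = [[z + rng.gauss(0, 0.3) for z in z1] for _ in range(3)]
columns += [[z + rng.gauss(0, 0.3) for z in z2] for _ in range(3)]

modules = correlation_modules(columns)
print(modules)
```

On this synthetic data the features derived from the same latent signal end up in the same module. In the full procedure, each module would then be screened separately before a final selection round.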

Cambridge Semantics – Genomics Data Demo

A recent genomics demo by Cambridge Semantics that is very good and highlights many key features and functions of their product, Anzo.

Unsupervised Learning Algorithms in One Picture

Unsupervised learning algorithms are ‘unsupervised’ because you let them run without direct supervision. You feed the data into the algorithm, and the algorithm figures out the patterns. The following picture shows the differences between three of the most popular unsupervised learning algorithms: principal component analysis (PCA), k-means clustering and hierarchical clustering. The three are closely related, because data clustering is a type of data reduction; PCA can be viewed as a continuous counterpart of k-means (see Ding & He, 2004).
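The PCA and k-means relationship can be made concrete with a small sketch. For two clusters, the result attributed to Ding & He is commonly summarized as: the sign of each point's score on the first principal component gives a continuous relaxation of the 2-means cluster assignment. The pure-Python example below checks this on two well-separated synthetic blobs. The data, function names, and seeding choice are assumptions for illustration, and on overlapping or unbalanced data the two labelings need not agree exactly.

```python
import math
import random


def first_pc_scores(points, iters=100):
    """Scores of centered 2-D points on the first principal component,
    found by power iteration on the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    vx, vy = 1.0, 1.0
    for _ in range(iters):
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return [x * vx + y * vy for x, y in centered]


def two_means(points, iters=50):
    """Lloyd's algorithm for k = 2, seeded with the first and last points."""
    centers = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min((0, 1), key=lambda k: math.dist(p, centers[k])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Two well-separated synthetic blobs of 50 points each.
rng = random.Random(1)
blob_a = [(rng.gauss(-3, 0.5), rng.gauss(-1, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(3, 0.5), rng.gauss(1, 0.5)) for _ in range(50)]
points = blob_a + blob_b

pca_labels = [int(s > 0) for s in first_pc_scores(points)]
kmeans_labels = two_means(points)

# Cluster labels are arbitrary, so compare up to swapping 0 and 1.
matches = sum(a == b for a, b in zip(pca_labels, kmeans_labels))
agreement = max(matches, len(points) - matches) / len(points)
print(f"agreement between PC1 sign and 2-means labels: {agreement:.2f}")
```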

Bridging the Digital Divide

Although data is about as prevalent as electricity, it can be difficult to explain adequately how critical it is to the modern world. From business operations to tackling the environmental crisis, data is the key to unlocking insight and developing intelligent solutions across every sector. Although Big Data has been in the news for at least a couple of decades, other types of data are now getting air time as well. Open data, external data, training data, dark data: these are new contours to an already multi-faceted conversation, and it’s understandable that the general public is getting overwhelmed.

LongCART – Regression tree for longitudinal data

Longitudinal changes in a population of interest are often heterogeneous and may be influenced by a combination of baseline factors. A longitudinal tree (that is, a regression tree built from longitudinal data) can be very helpful for identifying and characterizing the sub-groups with distinct longitudinal profiles in a heterogeneous population. This blog presents the capabilities of the R package LongCART for constructing a longitudinal tree according to the LongCART algorithm (Kundu and Harezlak 2019). In addition, this package can also be used to formally evaluate whether any particular baseline covariate affects the longitudinal profile via a parameter instability test. In this blog, the construction of a longitudinal tree is illustrated step by step with an R dataset, and the results are explained.

Vortimo

Vortimo is software that organizes information on webpages that you’ve visited. It records the pages you go to, extracts data from them, and enriches the extracted data. It augments the pages in your browser by allowing you to tag objects and by decorating objects it deems important. It then arranges the data in a UI. Vortimo supports switching between cases/projects seamlessly. You can also generate PDF reports based on the aggregated information and meta information.

Managing Delivery Networks: A Use Case For Graph Databases

One of the biggest competitive advantages we have as an e-commerce platform is the maintenance and expansion of our own logistics network. This network allows us to control how and when we deliver to customers and, amongst other aspects, ensures that we are the leading e-commerce platform in South Africa. In this article, I will provide an analysis of the unique problem faced in facilitating reliable deliveries to customers and how we use a graph database to deliver a performant and scalable solution.

DeepPavlov

DeepPavlov is an open source framework for chatbot and virtual assistant development. It has comprehensive and flexible tools that let developers and NLP researchers create production-ready conversational skills and complex multi-skill conversational assistants.

What is Machine Learning on Code?

As IT organizations grow, so do the size of their codebases and the complexity of their ever-changing developer toolchains. Engineering leaders have very limited visibility into the state of their codebases, software development processes, and teams. By applying modern data science and machine learning techniques to software development, large enterprises have the opportunity to significantly improve their software delivery performance and engineering effectiveness. In the last few years, a number of large companies such as Google, Microsoft and Facebook, and smaller companies such as JetBrains and source{d}, have been collaborating with academic researchers to lay the foundation for Machine Learning on Code.

How Data Labeling Facilitates AI Models

AI-based models depend heavily on accurate, clean, well-labeled, and well-prepared data to produce the desired output and cognition. These models are fed bulky datasets covering an array of probabilities and computations so that they function as smartly and capably as human intelligence.

P-Value Explained in One Picture

P-values (‘Probability values’) are one way to test if the result from an experiment is statistically significant. This picture is a visual aid to p-values, using a theoretical experiment for a pizza business.
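As a rough illustration of what a p-value measures, here is a minimal Python sketch of an exact one-sided binomial test. The scenario and numbers are hypothetical and are not taken from the picture: suppose a pizza business asks 20 customers to choose between a new recipe and the old one, 16 prefer the new one, and the null hypothesis is that customers have no preference (each choice is a coin flip). The p-value is the probability of seeing 16 or more preferences for the new recipe if that null hypothesis were true.

```python
from math import comb


def binomial_p_value(successes, trials, p_null):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 16 of 20 customers prefer the new recipe.
p_value = binomial_p_value(16, 20, 0.5)
significant = p_value < 0.05  # conventional 5% significance level

print(f"p-value = {p_value:.4f}, significant at 5%: {significant}")
```

A small p-value says that the observed result would be unusual under the null hypothesis; it is not the probability that the null hypothesis is true.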

The Ingenious Idea of Shirley Almon

I think that for folks from an economics background the Shirley Almon distributed lag model is probably commonplace, but I must admit that I came across this model (which dates back to the 1960s) only recently. I was quite impressed by its ingenuity and learnt something that I think could be applied in the context of other problems too. Before we get to the model: Shirley Almon was a researcher in economics with just two publications to her credit, one of them being the distributed lag model that she put forth. The fact that she is nonetheless considered among the most distinguished economists of her time should tell the tale of the brilliance of these works. The sad part of the story is that she was diagnosed with a brain tumor in her early thirties, which curtailed what would otherwise have been a long, illustrious career in this field and culminated in her premature demise at the age of 40.