Tools for Working with Excel and Python

Microsoft Excel is widely used in almost every industry. Its intuitive interface and ease of use for organising data, performing calculations, and analysing data sets have led to it being used in countless fields globally. Whether you're a fan of Excel or not, at some point you will have to deal with it! For many applications you won't want to do complex calculations or manage large data sets in Excel itself, but you may need to take values from Excel as inputs, produce reports in an Excel format, or provide tools to Excel users. Python can be a better choice for complex tasks, and fortunately there are many tools that let the Python developer use Excel and Python together. This post gives an overview of some of the most popular and useful tools out there to help you choose the right one for your specific application.

The performance of Intel vs. Anaconda vs. vanilla Python – my personal benchmark

Python programming is our daily bread. We develop frameworks, which are afterwards deployed on the customers' infrastructures. In some cases there is an emphasis on performance, such as in a recent case with a recommender engine that should load an individual recommendation in less than 30 ms. Faster calculation might be helpful there, especially since the use of a specific distribution requires no changes to the underlying Python code. Two weeks ago, Martin found that an Intel distribution for Python exists, so I decided to have a look. Intel claims that this distribution is faster in every way, and shares its benchmark. So rather than running only the Intel benchmark, I decided to test the distributions using my own benchmark to determine the performance on typical tasks in a Data Science pipeline.
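A timing harness for this kind of comparison can be sketched with the standard library alone. The snippet below is a minimal sketch, not the author's benchmark: the workload `sum_of_squares` and the helper `best_time` are hypothetical names chosen for illustration. Running the same script unchanged under each distribution and comparing the reported times gives a rough comparison.

```python
import timeit


def sum_of_squares(n):
    """Hypothetical stand-in workload: a pure-Python numeric loop."""
    return sum(i * i for i in range(n))


def best_time(func, *args, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeats.

    The minimum is reported because it is the run least disturbed
    by other processes on the machine.
    """
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    seconds = best_time(sum_of_squares, 100_000)
    print(f"sum_of_squares: {seconds * 1e3:.3f} ms per call")
```

A realistic benchmark would replace the stand-in workload with the operations that matter in the pipeline under test, since a distribution tuned for numerical libraries may show little difference on pure-Python loops.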

Select Star SQL

This is an interactive book which aims to be the best place on the internet for learning SQL. It is free of charge, free of ads and doesn’t require registration or downloads. It helps you learn by running queries against a real-world dataset to complete projects of consequence. It is not a mere reference page – it conveys a mental model for writing SQL. I expect little to no coding knowledge. Each chapter is designed to take about 30 minutes. As more of the world’s data is stored in databases, I expect that this time will pay rich dividends!

DNest4: Diffusive Nested Sampling in C++ and Python

In probabilistic (Bayesian) inferences, we typically want to compute properties of the posterior distribution, describing knowledge of unknown quantities in the context of a particular dataset and the assumed prior information. The marginal likelihood, also known as the ‘evidence’, is a key quantity in Bayesian model selection. The diffusive nested sampling algorithm, a variant of nested sampling, is a powerful tool for generating posterior samples and estimating marginal likelihoods. It is effective at solving complex problems including many where the posterior distribution is multimodal or has strong dependencies between variables. DNest4 is an open source (MIT licensed), multi-threaded implementation of this algorithm in C++11, along with associated utilities including: (i) ‘RJObject’, a class template for finite mixture models; and (ii) a Python package allowing basic use without C++ coding. In this paper we demonstrate DNest4 usage through examples including simple Bayesian data analysis, finite mixture models, and approximate Bayesian computation.
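To make the term 'marginal likelihood' concrete, here is a minimal sketch for a one-parameter coin-flip model, where the integral of likelihood times prior can be computed directly. It is not DNest4 and not nested sampling: the model, the uniform prior, and the function names are assumptions chosen for illustration. Nested sampling methods target problems where such direct integration is not practical.

```python
from math import comb


def likelihood(theta, heads, flips):
    """Binomial likelihood of `heads` successes in `flips` trials."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)


def evidence_midpoint(heads, flips, grid=10_000):
    """Approximate Z, the integral of likelihood * prior over theta in [0, 1].

    Assumes a uniform prior (density 1), so Z reduces to the
    integral of the likelihood; the midpoint rule is used.
    """
    width = 1.0 / grid
    total = sum(likelihood((i + 0.5) * width, heads, flips) for i in range(grid))
    return total * width


def evidence_exact(flips):
    """Closed form for this model under the uniform prior: Z = 1 / (flips + 1)."""
    return 1.0 / (flips + 1)


if __name__ == "__main__":
    print("numerical:", evidence_midpoint(heads=7, flips=10))
    print("exact:    ", evidence_exact(flips=10))
```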

Automated General-to-Specific (GETS) Regression Modeling and Indicator Saturation for Outliers and Structural Breaks

This paper provides an overview of the R package gets, which contains facilities for automated general-to-specific (GETS) modeling of the mean and variance of a regression, and indicator saturation (IS) methods for the detection and modeling of outliers and structural breaks. The mean can be specified as an autoregressive model with covariates (an ‘AR-X’ model), and the variance can be specified as an autoregressive log-variance model with covariates (a ‘log-ARCH-X’ model). The covariates in the two specifications need not be the same, and the classical linear regression model is obtained as a special case when there are no dynamics, and when there are no covariates in the variance equation. The four main functions of the package are arx, getsm, getsv and isat. The first function estimates an AR-X model with log-ARCH-X errors. The second function undertakes GETS modeling of the mean specification of an ‘arx’ object. The third function undertakes GETS modeling of the log-variance specification of an ‘arx’ object. The fourth function undertakes GETS modeling of an indicator-saturated mean specification allowing for the detection of outliers and structural breaks. The usage of two convenience functions for export of results to EViews and Stata is illustrated, and LaTeX code of the estimation output can readily be generated.

What are these birds? Complement occurrence data with taxonomy and traits information

Thanks to the second post of the series, where we obtained data from eBird, we know what birds were observed in the county of Constance. Now, not all species' names mean a lot to me, and even if they did, there are a lot of them. In this post, we shall use rOpenSci's packages for accessing taxonomy and trait data in order to summarize some characteristics of the bird population of the county: armed with scientific and common names of birds, we have access to plenty of open data!

Internationalization of shiny apps has never been easier!

Have you ever created a multilingual Shiny app? It is very likely that the answer is no, because Shiny just doesn't have any good tools for that. At Appsilon we came across the internationalization problem many times, so we decided to make a tool which makes life easier when it comes to multilingual apps. shiny.i18n is the new kid on the block and still under rapid development, but the 0.1.0 version is already ready to go.

Complete guide to Association Rules (1/2)

Looking back at the multitude of concepts that were introduced to me in the statistics boot camp, there is a lot to write and share. I chose to start with Association Rules for two reasons. First, this was one of the concepts which I enjoyed learning the most, and second, there are limited resources available online to get a good grasp of it. In Part 1 of the blog, I will introduce some key terms and metrics aimed at giving a sense of what ‘association’ in a rule means and some ways to quantify the strength of this association. Part 2 will focus on mining these rules from a list of thousands of items using the Apriori algorithm.
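The blurb does not name the metrics it covers. Three commonly used ones for quantifying a rule A -> B are support, confidence and lift, and the minimal sketch below computes them on a made-up set of baskets. The function names and the toy data are assumptions for illustration, not taken from the post.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once; otherwise the
    division below raises ZeroDivisionError.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support.

    Values above 1 suggest the items co-occur more often than
    they would if they were independent.
    """
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Made-up toy baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

if __name__ == "__main__":
    print("support:   ", support(baskets, {"bread", "milk"}))
    print("confidence:", confidence(baskets, {"bread"}, {"milk"}))
    print("lift:      ", lift(baskets, {"bread"}, {"milk"}))
```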

Lite Intro into Reinforcement Learning

This is a brief introduction to Reinforcement Learning (RL), going through the basics in simplified terms. We start with a brief overview of RL and then get into some practical examples of techniques for solving RL problems. In the end you may even think of places where you can apply these techniques. I think we can all agree that building our own Artificial Intelligence (AI) and having a robot do chores for us is cool.

How to Create Animated Graphs in Python

Matplotlib and Seaborn are nice Python libraries for creating great-looking plots. But these plots are all static, and it's hard to depict changes in data values in a dynamic and pleasing way. How nice would it be if, in your next presentation, video or social media post, you could present developments in the data using a short video clip? And even better, you can still keep using Matplotlib, Seaborn or any other library that you like to use for your plots!

Importing Data to R: The First Step Towards Your Data Science Project

The aim of this article is to provide you with a quick look-up guide for your first step towards a data science project.

Big Data Architecture Style

A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.