One of the major aspects of training your machine learning model is avoiding overfitting. An overfit model has low accuracy on data it has not seen before. This happens because your model is trying too hard to capture the noise in your training dataset. By noise we mean the data points that don’t really represent the true properties of your data, but arise from random chance. Learning such data points makes your model more flexible, at the risk of overfitting. The concept of balancing bias and variance is helpful in understanding the phenomenon of overfitting.
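As a rough illustration of that trade-off, the sketch below fits polynomials of increasing degree to a small noisy sample and compares the error on the training points with the error on fresh points. The sine-shaped signal, the noise level, the sample sizes and the chosen degrees are all assumptions made for the example, not something taken from the post.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Draw n noisy samples of a smooth signal (an illustrative choice)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # fresh data from the same process

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The training error can only go down as the degree grows, because each larger model contains the smaller ones. The error on fresh data is the number to watch: a large gap between the two is the usual sign that the model has started to fit noise.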
In any model development exercise, a considerable amount of time is spent in understanding the underlying data, visualizing relationships and validating preliminary hypotheses (broadly categorized as Exploratory Data Analysis, or EDA). A key element of EDA involves visually analyzing the data to glean valuable insights and understand underlying relationships and patterns in the data. While EDA is more a philosophy than a defined set of procedures and techniques, there is a certain set of standard analyses that you would most likely perform as part of EDA to gain an initial understanding of the data.
Recently, I’ve been working on two problems that might be related to semiotic issues in predictive modeling (i.e., how can we plot the coefficient values of a regression model instead of showing a standard regression table?).
Recently, there has been a lot of work on automating machine learning, from selecting an appropriate algorithm to feature selection and hyperparameter tuning. Several tools are available (e.g. AutoML and TPOT) that can aid the user in performing hundreds of experiments efficiently. Likewise, deep neural network architectures are usually designed by experts through a trial-and-error approach. Although this approach has resulted in state-of-the-art models in several domains, it is very time-consuming. Lately, due to the increase in available computing power, researchers are employing Reinforcement Learning and Evolutionary Algorithms to automatically search for optimal neural architectures.
One of the great features of R is the possibility to quickly access web services. While some companies have the habit and policy of documenting their APIs, there is still a large chunk of undocumented but great web services that help the regular data scientist. In the following short post, I will show how we can turn a simple web service into a nice R function. The example I am going to use is the Linguee translation service, DeepL. Just like Google Translate, DeepL features a simple text field. When a user types in text, the translation appears in a second text box. Users can choose between the languages.
I revisited my previous post on creating beautiful time series calendar heatmaps in ggplot, moving the code into the tidyverse.
General linear models are one of the most widely used statistical tools in the biological sciences. This may be because they are flexible enough to address many different problems, because they provide useful outputs about both statistical significance and effect sizes, or just because they are easy to run in many common statistical packages. The maths underlying general linear models (and generalized linear models, which are a related but different class of model) may seem mysterious to many, but is actually pretty accessible. You would have learned the basics in high school maths. We will cover some of those basics here.
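To make the "high school maths" point concrete, here is a minimal sketch of the simplest general linear model, a straight line y = intercept + slope * x, fitted by ordinary least squares. The simulated data, the true intercept of 2.0 and the true slope of 0.5 are assumptions chosen for illustration, and plain NumPy stands in for whichever statistical package you would normally use.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Simulated data: a straight line plus random scatter (illustrative values).
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Design matrix: a column of ones for the intercept and a column for x.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: pick the coefficients that minimise the
# sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

fitted = X @ coef
residuals = y - fitted

# Proportion of the variance in y explained by the line.
r_squared = 1.0 - residuals.var() / y.var()

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, R^2 = {r_squared:.2f}")
```

The slope plays the role of the effect size here: it is the estimated change in y for a one-unit change in x. Adding more predictors only means adding more columns to `X`; the fitting step stays the same.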