Anomaly Detection: A Survey (September 2009)

Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
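As a flavor of the kind of basic technique such a survey catalogs, here is a minimal sketch of simple statistical anomaly detection via a z-score threshold (the specific method and data are our illustrative choice, not taken from the survey):

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean.

    Note: a single extreme outlier inflates the standard deviation itself,
    which is why a fairly loose threshold is used in this tiny example.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) > threshold * std]

data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0]  # 25.0 is an obvious outlier
print(zscore_anomalies(data))  # → [25.0]
```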

Factsheets for AI Services

We believe several elements or pillars form the basis for trusted AI systems.
• Fairness: AI systems should use training data and models that are free of bias, to avoid unfair treatment of certain groups.
• Robustness: AI systems should be safe and secure, and not vulnerable to tampering with, or compromise of, the data they are trained on.
• Explainability: AI systems should provide decisions or suggestions that can be understood by their users and developers.
• Lineage: AI systems should include details of their development, deployment, and maintenance so they can be audited throughout their lifecycle.

Statistics Synonyms for Variables

Depending on the context, an independent variable is sometimes called a ‘predictor variable’, regressor, covariate, ‘controlled variable’, ‘manipulated variable’, ‘explanatory variable’, exposure variable (see reliability theory), ‘risk factor’ (see medical statistics), ‘feature’ (in machine learning and pattern recognition) or ‘input variable.’ In econometrics, the term ‘control variable’ is usually used instead of ‘covariate’.
Depending on the context, a dependent variable is sometimes called a ‘response variable’, ‘regressand’, ‘criterion’, ‘predicted variable’, ‘measured variable’, ‘explained variable’, ‘experimental variable’, ‘responding variable’, ‘outcome variable’, ‘output variable’ or ‘label’.
‘Explanatory variable’ is preferred by some authors over ‘independent variable’ when the quantities treated as independent variables may not be statistically independent or independently manipulable by the researcher. If the independent variable is referred to as an ‘explanatory variable’ then the term ‘response variable’ is preferred by some authors for the dependent variable.
‘Explained variable’ is preferred by some authors over ‘dependent variable’ when the quantities treated as ‘dependent variables’ may not be statistically dependent. If the dependent variable is referred to as an ‘explained variable’ then the term ‘predictor variable’ is preferred by some authors for the independent variable.
Variables may also be referred to by their form: continuous, binary/dichotomous, nominal categorical, and ordinal categorical, among others.
An example is provided by the analysis of trend in sea level by Woodworth (1987). Here the dependent variable (and variable of most interest) was the annual mean sea level at a given location for which a series of yearly values were available. The primary independent variable was time. Use was made of a covariate consisting of yearly values of annual mean atmospheric pressure at sea level. The results showed that inclusion of the covariate allowed improved estimates of the trend against time to be obtained, compared to analyses which omitted the covariate.
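Woodworth's actual data are not reproduced here, but the effect of including a covariate can be sketched on synthetic data: regressing a simulated sea-level series on time alone versus on time plus a pressure covariate, the covariate soaks up variance and shrinks the residuals while the trend estimate stays near the true value (all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
t = np.arange(n, dtype=float)               # time (years), primary independent variable
pressure = rng.normal(0.0, 1.0, n)          # synthetic pressure anomaly (covariate)
true_trend = 0.2
sea_level = true_trend * t - 1.5 * pressure + rng.normal(0.0, 0.3, n)

# Fit without the covariate: sea_level ~ intercept + time
X1 = np.column_stack([np.ones(n), t])
b1, res1, *_ = np.linalg.lstsq(X1, sea_level, rcond=None)

# Fit with the covariate: sea_level ~ intercept + time + pressure
X2 = np.column_stack([np.ones(n), t, pressure])
b2, res2, *_ = np.linalg.lstsq(X2, sea_level, rcond=None)

print(f"trend without covariate: {b1[1]:.3f}, residual SS: {res1[0]:.2f}")
print(f"trend with covariate:    {b2[1]:.3f}, residual SS: {res2[0]:.2f}")
```

The residual sum of squares drops sharply once the covariate is included, mirroring the improved trend estimates reported in the sea-level analysis.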

Helping Machines Be Conversational

Stanford University researchers have released the Conversational Question Answering (CoQA) dataset to help machines better gather and provide information in conversations with humans. The dataset includes 127,000 questions from 8,000 different conversations. These conversations are drawn from seven different types of text, including children's stories, high school English exams, and Reddit. AI models often struggle to answer questions across different domains (e.g., news stories vs. English exams), and the researchers found that humans significantly outperformed reading comprehension models in answering the questions.

Practical Apache Spark in 10 minutes. Part 6 – GraphX

In our last post, we explained the basics of streaming with Spark. Today, we want to talk about graphs and explore Apache Spark's GraphX tool for graph computation and analysis. Note that GraphX's API is available only in Scala. A graph is a structure that consists of vertices and edges between them. Graph theory finds application in various fields such as computer science, linguistics, physics, chemistry, social sciences, biology, and mathematics, among others. Graph analysis problems can be quite complex, but many convenient modern tools and libraries exist for these purposes. In this post, we will consider the following example graph: cities are the vertices, and the distances between them are the edges. You can see the Google Maps illustration of this structure in the figure below.
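GraphX itself is Scala-only, but the cities-and-distances structure can be sketched in plain Python to show what a vertex/edge-weight graph looks like and what one typically computes on it. The city names and distances below are made up for illustration, and the shortest-path routine is a standard Dijkstra, not GraphX code:

```python
import heapq

# Toy road graph: cities as vertices, distances (km, made up) as edge weights.
edges = {
    "Kyiv":    {"Lviv": 540, "Odesa": 480, "Kharkiv": 480},
    "Lviv":    {"Kyiv": 540, "Odesa": 790},
    "Odesa":   {"Kyiv": 480, "Lviv": 790, "Kharkiv": 700},
    "Kharkiv": {"Kyiv": 480, "Odesa": 700},
}

def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm over the adjacency-dict graph."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, city = heapq.heappop(queue)
        if city == goal:
            return d
        if d > dist.get(city, float("inf")):
            continue  # stale queue entry
        for nxt, w in graph[city].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return float("inf")

print(shortest_distance(edges, "Lviv", "Kharkiv"))  # → 1020, via Kyiv
```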

Solving Some Image Processing Problems with Python libraries – Part 2

In this article, a few more popular image processing problems, along with their solutions, will be discussed. Python image processing libraries will be used to solve these problems.

Operators in Python

This tutorial covers the different types of operators in Python, operator overloading, and precedence and associativity.
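The tutorial itself is not reproduced here, but operator overloading and precedence can be sketched in a few lines. In Python, overloading is done by defining dunder methods such as `__add__` and `__mul__` on a class (the `Vec` class below is our own toy example):

```python
class Vec:
    """A tiny 2-D vector illustrating operator overloading via dunder methods."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):          # overloads the + operator
        return Vec(self.x + other.x, self.y + other.y)

    def __mul__(self, scalar):         # overloads * for scalar multiplication
        return Vec(self.x * scalar, self.y * scalar)

    def __repr__(self):
        return f"Vec({self.x}, {self.y})"

# Precedence: * binds tighter than +, so this evaluates as v + (w * 2).
v, w = Vec(1, 2), Vec(3, 4)
print(v + w * 2)  # → Vec(7, 10)
```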

Introduction to Bioconductor

By reading this tutorial, you got familiar with the Bioconductor project and its fundamental role in biological research. Besides, the types of Bioconductor packages and some of the most important ones were mentioned. Next, a short workflow for installing and using the project was explained. You learned how to work with the Biostrings and GenomicRanges packages, and as a result, you got familiar with DNA sequences and genome ranges. In another example, you applied enrichment analysis to a set of human genes, and thereby learned the concepts of enrichment analysis and GO. Finally, various ways of getting help while working with Bioconductor were pointed out. If you would like to know more about the project, don't hesitate to see Bioconductor_courses.

Linear compression in python: PCA vs unsupervised feature selection

We illustrate the application of two linear compression algorithms in python: Principal component analysis (PCA) and least-squares feature selection. Both can be used to compress a passed array, and they both work by stripping out redundant columns from the array. The two differ in that PCA operates in a particular rotated frame, while the feature selection solution operates directly on the original columns. As we illustrate below, PCA always gives a stronger compression. However, the feature selection solution is often comparably strong, and its output has the benefit of being relatively easy to interpret – a virtue that is important for many applications.
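The article's own code is not shown here, but the comparison can be sketched in plain numpy on synthetic, nearly low-rank data: rank-k PCA (via SVD) is the optimal rank-k approximation by the Eckart–Young theorem, while keeping k original columns and reconstructing the rest by least squares is also a rank-k compression, so its error can only match or exceed PCA's:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d, k = 200, 5, 2
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))   # nearly rank-2 data
Xc = X - X.mean(axis=0)

# PCA compression: project onto the top-k right singular vectors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = (Xc @ Vt[:k].T) @ Vt[:k]
err_pca = np.sum((Xc - X_pca) ** 2)

# Feature selection: keep k original columns, reconstruct the rest by
# least squares (exhaustive search over column subsets; fine for small d).
best_err = np.inf
for cols in combinations(range(d), k):
    S = Xc[:, cols]
    coef, *_ = np.linalg.lstsq(S, Xc, rcond=None)
    best_err = min(best_err, np.sum((Xc - S @ coef) ** 2))

print(f"PCA error: {err_pca:.3f}, feature-selection error: {best_err:.3f}")
```

On data like this, the two errors come out close, consistent with the observation that feature selection is often comparably strong while keeping interpretable columns.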

Explore How to Detect and Address Machine Learning, AI Bias

Artificial intelligence. Machine learning. It is tempting to assume that fields like these are free of bias, since so little else is these days. But that may not be the case. According to a new report from Alegion, AI bias is in fact a reality, and often one that is hard to avoid.

CodeR: an LSTM that writes R Code

Everybody talks about them, many people know how to use them, few people understand them: Long Short-Term Memory Neural Networks (LSTM). At STATWORX, with the beginning of the hype around AI and projects with large amounts of data, we also started using this powerful tool to solve business problems. In short, an LSTM is a special type of recurrent neural network – i.e. a network able to access its internal state to process sequences of inputs – which is really handy if you want to exploit some time-like structure in your data. Use cases for recurrent networks range from guessing the next frame in a video to stock prediction, but you can also use them to learn and produce original text. And this shall already be enough information about LSTMs from my side. I won’t bother you with yet another introduction into the theory of LSTMs, as there are more than enough great blog posts about their architecture (Kudos to Andrej Karpathy for this very informative piece of work which you should definitely read if you are not already bored by neural networks :)).
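Without rehashing the theory, the "internal state" an LSTM carries can be made concrete with a minimal numpy forward pass of one standard LSTM cell (this is a generic textbook sketch with random weights, not STATWORX's model or a trained network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell.

    W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,).
    Gate order in the stacked weights: input, forget, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:hidden])             # input gate
    f = sigmoid(z[hidden:2 * hidden])    # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])          # candidate cell state
    c = f * c_prev + i * g               # new internal (cell) state
    h = o * np.tanh(c)                   # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # run a length-5 input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

The cell state `c` is what lets the network carry information across the sequence, which is exactly the "time-like structure" the post alludes to.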

Couplings of Normal variables

Just to play a bit with the gganimate package, and to celebrate National Coupling Day, the above plot shows different couplings of two univariate Normal distributions, Normal(0,1) and Normal(2,1). That is, each point is a pair (x,y) where x follows a Normal(0,1) and y follows a Normal(2,1). Below I'll recall briefly how each coupling operates, in the Normal case. The code is available at the end of the post.
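The post's own code is in R (gganimate), and it doesn't specify here which couplings it animates; as a Python sketch of three standard couplings of Normal(0,1) and Normal(2,1) — independent, comonotonic, and antitonic, our choice of examples — note that each construction changes the joint behavior of (x, y) while leaving both marginals intact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0.0, 1.0, n)          # marginal: Normal(0, 1)

# Independent coupling: draw y separately from Normal(2, 1).
y_indep = rng.normal(2.0, 1.0, n)

# Comonotonic ("common noise") coupling: y = x + 2 also has marginal
# Normal(2, 1), but the pair is now perfectly correlated.
y_common = x + 2.0

# Antitonic coupling: y = 2 - x, again Normal(2, 1), perfectly anti-correlated.
y_anti = 2.0 - x

for name, y in [("independent", y_indep), ("comonotonic", y_common), ("antitonic", y_anti)]:
    print(f"{name}: mean={y.mean():.2f}, corr={np.corrcoef(x, y)[0, 1]:.2f}")
```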

Machine learning and AI technologies and platforms at AWS

Dan Romuald Mbanga walks through the ecosystem around the machine learning platform and API services at AWS.