Maritime anomaly detection: A review

The surveillance of large sea areas normally requires the analysis of large volumes of heterogeneous, multidimensional and dynamic sensor data, in order to improve vessel traffic safety and maritime security and to protect the environment. Early detection of conflict situations at sea provides critical time to take appropriate action, possibly before problems occur. To provide an overview of the state of the art in the analysis of maritime data for situational awareness, this study presents a review of maritime anomaly detection. The articles found are categorized into four groups: (a) data, (b) methods, (c) systems, and (d) user aspects. We present a comprehensive summary of the works found in each category and, finally, outline possible paths of investigation and challenges for maritime anomaly detection.


Hands-on Tutorial on Python Data Processing Library Pandas – Part 2

Pandas is a Python package for data processing, introduced in part one. It is one of the most commonly used libraries when programming machine learning in Python. This article is the second tutorial in the pandas series. We recommend reading the first, introductory pandas tutorial here before exploring this one.


Hands-on Tutorial on Python Data Processing Library Pandas – Part 1

Pandas is a Python package for data processing. It is one of the most commonly used libraries when programming machine learning in Python. This article is an introductory tutorial to it. Pandas provides fast, flexible and expressive data structures with the goal of making work with “relational” or “labeled” data simple and intuitive. It is intended to be a high-level building block for practical data analysis in Python.
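As a minimal sketch of what “relational” or “labeled” data looks like in practice, the snippet below builds a small DataFrame, reads a row by its label, and aggregates by a column. The table, its column names and its values are made up for illustration and do not come from the tutorial.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
sales = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima"],
        "units": [3, 5, 4, 1],
    },
    index=["a", "b", "c", "d"],  # row labels
)

# Label-based access: select a row by its index label.
row_b = sales.loc["b"]

# Relational-style operation: total units per city (similar to SQL GROUP BY).
units_per_city = sales.groupby("city")["units"].sum()

print(row_b)
print(units_per_city)
```

The same pattern (label-based selection followed by a grouped aggregation) carries over to larger tables loaded from files or databases.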


Google’s Interpretation Of GDPR Puts Publishers In An Untenable Position: Forrester Recommends Rebellion

Google, in order to protect itself, is dictating the terms of the consent that publishers must obtain . . . . . . which leads to the question of liability. Most publishers that work with Google to monetize their sites have contracts in place with the platform indemnifying Google from liability, which made sense when publishers were in control of the data. But under Google’s announced implementation of GDPR rules, publishers assume all liability for obtaining explicit consumer consent but cede control over its use in the Google platform. And that doesn’t make sense. Google is a platform partner for the vast majority of publishers. While solid numbers are elusive, largely because Google doesn’t want its market dominance to be quantified (does the word “monopoly” resonate?), it is safe to say that at least 75% of all US-based publishers work with Google; the situation in Europe is less extreme, by all reports. There are other platforms with which to work, but there is no guarantee that those Google alternatives won’t dictate even more onerous terms.


Big Data Platforms Evolve for Analytics, Machine Learning

These days Cloudera, Hortonworks, and MapR promote themselves as platforms for analytics, data science, and machine learning, incorporating many of the open source technologies in one place to make them easier for enterprises to consume. (MapR describes itself as a converged data platform integrating Hadoop, Spark, and Apache Drill along with other data technologies. Hortonworks describes itself as a connected data platform and solution.) All these companies have repositioned themselves as providing much more than just open source storage technologies for big data needs. That’s a move echoed by the changing focus of what enterprise organizations want to do with their data programs.


From Machine Learning to Machine Reasoning

A plausible definition of ‘reasoning’ could be ‘algebraically manipulating previously acquired knowledge in order to answer a new question’. This definition covers first-order logical inference or probabilistic inference. It also includes much simpler manipulations commonly used to build large learning systems. For instance, we can build an optical character recognition system by first training a character segmenter, an isolated character recognizer, and a language model, using appropriate labelled training sets. Adequately concatenating these modules and fine tuning the resulting system can be viewed as an algebraic operation in a space of models. The resulting model answers a new question, that is, converting the image of a text page into a computer readable text. This observation suggests a conceptual continuity between algebraically rich inference systems, such as logical or probabilistic inference, and simple manipulations, such as the mere concatenation of trainable learning systems. Therefore, instead of trying to bridge the gap between machine learning systems and sophisticated ‘all-purpose’ inference mechanisms, we can instead algebraically enrich the set of manipulations applicable to training systems, and build reasoning capabilities from the ground up.


Machine Learning Breaking Bad – addressing Bias and Fairness in ML models

Looking ahead to 2018, rising awareness of the impact of bias, and the importance of fairness and transparency, means that data scientists need to go beyond simply optimizing a business metric. We will need to treat these issues seriously, in much the same way we devote resources to fixing security and privacy issues.


Modelling Time Series Processes using GARCH

When techniques like linear regression or time series analysis were aimed at modelling the general trend exhibited by a set or series of data points, data scientists faced another question: these models can capture the overall trend, but how can one model the volatility in the data? In real life, the initial stages of a business or a new market are always volatile and change at high velocity until things calm down and become saturated. Only then can one apply statistical techniques such as time series analysis or regression, as the case may be. To go into the turbulent seas of volatile data and analyze it in a time-changing setting, ARCH models were developed.
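To make the idea of time-varying volatility concrete, here is a minimal simulation sketch of a GARCH(1,1) process using only the Python standard library. The function name, parameter values and seed are illustrative assumptions, not taken from the article. Setting beta to 0 reduces the recursion to an ARCH(1) model.

```python
import math
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n returns from a GARCH(1,1) process (illustrative sketch).

    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]
    r[t]      = sqrt(sigma2[t]) * z[t],   z[t] ~ N(0, 1)
    """
    if omega <= 0 or alpha < 0 or beta < 0:
        raise ValueError("need omega > 0 and alpha, beta >= 0")
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be < 1 for a finite long-run variance")

    rng = random.Random(seed)
    # Start the recursion at the unconditional (long-run) variance.
    sigma2 = omega / (1.0 - alpha - beta)
    returns, variances = [], []
    for _ in range(n):
        r = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        returns.append(r)
        variances.append(sigma2)
        # Next period's conditional variance reacts to the latest squared return.
        sigma2 = omega + alpha * r * r + beta * sigma2
    return returns, variances


returns, variances = simulate_garch11(1000)
```

A large shock in one period raises the conditional variance of the following periods, which produces the volatility clustering that a plain trend model does not capture.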


Data Science: 4 Reasons Why Most Are Failing to Deliver

1. Silos of knowledge.
2. Friction in model deployment.
3. Tool and technology mismatch.
4. Model liability.


Slopegraphs and R — A pleasant diversion

I try to at least scan the R-bloggers feed every day. Not every article is of interest to me, but I often have one of two different reactions to at least one article. Sometimes it is an “ah ha” moment because the article is right on point for a problem I have now or have had in the past and the article provides a (better) solution. Other times my reaction is more of an “oh yeah”, because it is something I have been meaning to investigate, or something I once knew, but the article brings a different perspective to it.


How Many Factors to Retain in Factor Analysis

When running a factor analysis, one often needs to know how many components / latent variables to retain. Fortunately, many methods exist to statistically answer this question. Unfortunately, there is no consensus on which method to use. Therefore, the n_factors() function, available in the psycho package, performs the method agreement procedure: it runs all the routines and returns the number of factors with the highest consensus.
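The consensus step can be sketched as a simple vote over the numbers suggested by the individual methods. This Python snippet is not the psycho package's implementation (n_factors() is an R function). The helper name, the example values and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def consensus_n_factors(suggestions):
    """Return the factor count proposed by the largest number of methods.

    `suggestions` maps a method name to the number of factors it suggests.
    Ties are broken in favour of the smaller count (an assumption, chosen
    here for parsimony).
    """
    if not suggestions:
        raise ValueError("need at least one method's suggestion")
    votes = Counter(suggestions.values())
    top = max(votes.values())
    return min(n for n, count in votes.items() if count == top)


# Hypothetical outputs of individual retention rules, for illustration only.
example = {"kaiser": 3, "parallel_analysis": 2, "scree_elbow": 2, "vss": 1}
best = consensus_n_factors(example)
print(best)
```

In this made-up example two of the four rules agree on two factors, so two is returned.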