Trajectory Data Mining: An Overview

The advances in location-acquisition and mobile computing techniques have generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles and animals. Many techniques have been proposed for processing, managing and mining trajectory data in the past decade, fostering a broad range of applications. In this article, we conduct a systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics. Following a roadmap from the derivation of trajectory data, to trajectory data preprocessing, to trajectory data management, and to a variety of mining tasks (such as trajectory pattern mining, outlier detection, and trajectory classification), the survey explores the connections, correlations and differences among these existing techniques. This survey also introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors, to which more data mining and machine learning techniques can be applied. Finally, some public trajectory datasets are presented. This survey can help shape the field of trajectory data mining, providing a quick understanding of this field to the community.
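As a toy illustration of the kind of transformation the survey describes (this example is not from the survey itself; the function name and grid scheme are made up for illustration), a trajectory of coordinate points can be discretized into a grid and summarized as a visit-count matrix, to which standard matrix-based mining techniques can then be applied:

```python
import numpy as np

def trajectory_to_matrix(points, grid_size=4, bounds=(0.0, 1.0)):
    """Discretize a trajectory of (x, y) points into a grid-cell visit-count matrix."""
    lo, hi = bounds
    counts = np.zeros((grid_size, grid_size), dtype=int)
    for x, y in points:
        # Map each coordinate into a grid cell, clamping to the last cell at the edge.
        i = min(int((x - lo) / (hi - lo) * grid_size), grid_size - 1)
        j = min(int((y - lo) / (hi - lo) * grid_size), grid_size - 1)
        counts[i, j] += 1
    return counts

traj = [(0.1, 0.1), (0.15, 0.12), (0.6, 0.7)]
m = trajectory_to_matrix(traj)  # 4x4 matrix; two visits fall in the same cell
```

Stacking such matrices over time windows yields the tensor representations the survey mentions.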


Approximate Query Processing

In modern large-scale analytical database systems, the keywords are speed, accuracy, and interactivity. These are the criteria that count for data analysts and business users who need to run analytical queries with complex conditions over large amounts of data and return aggregated results that can support decision making. For example, a sales director of a company handling millions of transactions per day might want to know the total revenue of all transactions for a specific product category in a certain period of time, where the buyer and seller are in specific regions and the product has parts manufactured in yet another region. Yet long running times for such analytical queries, even in leading commercial database systems, remain among the major challenges to overcome. Prof Ke Yi, an expert in database systems and algorithms, has solved a problem that has taxed the community for over 15 years, enabling responses to queries to be given in seconds rather than minutes or hours. Prof Yi and his team’s novel algorithm allows the database to return approximate results in a very short time and to keep improving their accuracy as more time is spent. Working on datasets of 100 gigabytes and larger, their ‘Wander Join’ algorithm achieves more effective sampling through random walks, returning results with the same accuracy (for example, 95% confidence and 1% error) in one-hundredth of the time compared with prior solutions on the same hardware. The algorithm has been integrated into PostgreSQL and Spark, as well as Oracle through PL/SQL, demonstrating its viability in a variety of settings and bringing the sought-after goal of interactive data analysis a step closer.
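The general online-aggregation idea behind such systems can be sketched in a few lines: draw random samples, maintain a running estimate of the aggregate, and report a confidence interval that tightens as more samples arrive. This is a deliberately simplified uniform-sampling sketch, not Prof Yi's actual algorithm; Wander Join itself samples join paths via random walks over indexes:

```python
import random
import statistics

def approximate_sum(values, n_samples, seed=0):
    """Estimate sum(values) from a uniform sample; returns (estimate, 95% CI half-width)."""
    rng = random.Random(seed)
    n = len(values)
    # Sample with replacement; each sampled value is an unbiased estimate of the mean.
    samples = [values[rng.randrange(n)] for _ in range(n_samples)]
    mean = statistics.fmean(samples)
    estimate = mean * n                               # scale the mean up to the total
    half_width = 1.96 * statistics.stdev(samples) / n_samples ** 0.5 * n  # CLT-based CI
    return estimate, half_width

data = list(range(1, 10001))                # true sum = 50,005,000
est, hw = approximate_sum(data, 2000)       # close to the truth from 2,000 samples
```

More samples shrink `half_width` at a rate of 1/sqrt(n_samples), which is why accuracy keeps improving the longer the query runs.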


Inference enterprise models: An approach to organizational performance improvement

We demonstrate that our success in solving a set of increasingly complex challenge problems is achieved by modeling an inference enterprise (IE) using inference enterprise models (IEMs). As part of a sponsored research competition, we created a multimodeling inference enterprise modeling (MIEM) process to achieve winning scores on a spectrum of challenge problems related to insider threat detection. We present in general terms the motivation for and description of our MIEM solution. We then present the results of applying MIEM across a range of challenge problems, with a detailed illustration for one challenge problem. Finally, we discuss the science and promise of IEM and MIEM, including the applicability of MIEM to a spectrum of inference domains.


5 Reasons Data Analytics are Falling Short

When it comes to big data, possession is not enough. Comprehensive intelligence is the key. But traditional data analytics paradigms simply cannot deliver on the promise of data-driven insights. Here's why.
PROBLEM #1 – Too much data
PROBLEM #2 – Restrictive data pre-modeling
PROBLEM #3 – The price tag
PROBLEM #4 – Time-to-results
PROBLEM #5 – How much big data is actually used


Long Running Tasks With Shiny: Challenges and Solutions

One of the great additions to the R ecosystem in recent years is RStudio’s Shiny package. With it, you can easily whip up and share a user interface for a new statistical method in just a few hours. Today I want to share some of the methods and challenges that come up when the actual computation of a result takes a non-trivial amount of time (e.g. >5 seconds).
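The core pattern behind most of these solutions is language-agnostic: hand the long computation to a background worker and let the UI poll for completion instead of blocking. Here is a minimal sketch of that pattern in Python (in Shiny itself one would reach for R packages such as future and promises):

```python
from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor(max_workers=1)

def slow_computation(x):
    time.sleep(0.1)          # stand-in for a >5 second model fit
    return x * x

# Kick off the work without blocking the "UI" thread...
future = executor.submit(slow_computation, 7)

# ...then poll for completion, as a reactive UI would on a timer tick.
while not future.done():
    time.sleep(0.01)         # the UI stays responsive during each wait
result = future.result()     # 49, retrieved once the worker finishes
```

The polling loop is where a real UI framework would keep redrawing and handling input; only the submit/poll/collect structure carries over.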


A Certification for R Package Quality

There are more than 12,000 packages for R available on CRAN, and many others available on Github and elsewhere. But how can you be sure that a given R package follows best development practices for high-quality, secure software? Based on a recent survey of R users related to challenges in selecting R packages, the R Consortium now recommends a way for package authors to self-validate that their package follows best practices for development. The CII Best Practices Badge Program, developed by the Linux Foundation’s Core Infrastructure Initiative, defines a set of criteria that open-source software projects should follow for quality and security.


Rj Editor – Analyse your data with R in jamovi

Rj is a new module for the jamovi statistical spreadsheet that allows you to use the R programming language to analyse data from within jamovi. Although jamovi is already built on top of R, and all the analyses it provides are written in R, to date it hasn´t been possible to enter R code directly. Rj changes that.


Beyond Basic R – Introduction and Best Practices

We queried more than 60 people who have taken the USGS Introduction to R class over the last two years to understand what other skills and techniques are desired, but not covered in the course. Though many people have asked for an intermediate level class, we believe that many of the skills could be best taught through existing online materials. Instead of creating a stand-alone course, we invested our time into compiling the results of the survey, creating short examples, and linking to the necessary resources within a series of blog posts. This is the first in a series of five posts called Beyond Basic R.


Data’s Day of Reckoning

Data science, machine learning, artificial intelligence, and related technologies are now facing a day of reckoning. It is time for us to take responsibility for our creations. What does it mean to take responsibility for building, maintaining, and managing data, technologies, and services? Responsibility is inevitably tangled with the complex incentives that surround the creation of any product. These incentives have been front and center in the conversations around the roles that social networks have played in the 2016 U.S. elections, recruitment of terrorists, and online harassment. It has become very clear that the incentives of the organizations that build and own data products haven't aligned with the good of the people using those products.
Here's a checklist we've developed for developers working on data-driven applications:
• Have we listed how this technology can be attacked or abused?
• Have we tested our training data to ensure it is fair and representative?
• Have we studied and understood possible sources of bias in our data?
• Does our team reflect diversity of opinions, backgrounds, and kinds of thought?
• What kind of user consent do we need to collect and use the data?
• Do we have a mechanism for gathering consent from users?
• Have we explained clearly what users are consenting to?
• Do we have a mechanism for redress if people are harmed by the results?
• Can we shut down this software in production if it is behaving badly?
• Have we tested for fairness with respect to different user groups?
• Have we tested for disparate error rates among different user groups?
• Do we test and monitor for model drift to ensure our software remains fair over time?
• Do we have a plan to protect and secure user data?
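One concrete way to act on the two "testing for fairness" items in the checklist above (this example is illustrative, not from the article) is to compute error rates separately for each user group and compare them:

```python
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of (group, prediction, label) triples -> {group: error_rate}."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, pred, label in records:
        totals[group] += 1
        if pred != label:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

data = [("A", 1, 1), ("A", 0, 1), ("B", 1, 1), ("B", 1, 1)]
rates = error_rates_by_group(data)   # a large gap between groups flags a fairness problem
```

Tracking these per-group rates over time is also a simple way to monitor for the model drift mentioned in the checklist.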


Bivariate Distribution Heatmaps in R

Learn how to visually show the relationship between two features, how they interact with each other and where data points are concentrated.
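The underlying computation is a joint (2-D) histogram of the two features, whose bin counts become the heatmap's cell intensities. The post itself works in R; the same idea sketched in Python with NumPy (the variables and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.5, size=1000)   # two correlated features

# Bin both features jointly; counts[i, j] is how many points fall in cell (i, j).
counts, xedges, yedges = np.histogram2d(x, y, bins=20)

# With matplotlib, one would render this via plt.imshow(counts.T, origin="lower").
```

Dense regions of the joint distribution show up as high-count cells, which is exactly what the heatmap visualizes.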


Scraping Reddit with Python and BeautifulSoup 4

In this tutorial, you’ll learn how to get web pages using requests, analyze web pages in the browser, and extract information from raw HTML with BeautifulSoup.
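The extraction step can be sketched in a self-contained way by parsing an inline HTML string (the markup below is made up for illustration and does not match Reddit's actual page structure):

```python
from bs4 import BeautifulSoup

html = """
<div class="post"><a class="title" href="/r/python/1">Post one</a></div>
<div class="post"><a class="title" href="/r/python/2">Post two</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns every tag matching the given name and attribute filters.
title_links = soup.find_all("a", class_="title")
titles = [a.get_text() for a in title_links]
links = [a["href"] for a in title_links]
```

In the tutorial's setting, the `html` string would instead come from a `requests.get(...)` response body.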


10 Top Barriers to AI Success

Enterprises are eager to experience the potential benefits from deploying artificial intelligence, but if they want to be successful, they’ll need to overcome these technological, organizational and cultural challenges.
1. Fear
2. Lack of Stakeholder Buy-In
3. Lack of Budget
4. Data Challenges
5. Bias
6. Talent Shortages
7. Lack of IT Infrastructure
8. Unproven Technology
9. Underwhelming Performance
10. Regulation


Google’s AutoML: Cutting Through the Hype

To announce Google's AutoML, Google CEO Sundar Pichai wrote, ‘Today, designing neural nets is extremely time intensive, and requires an expertise that limits its use to a smaller community of scientists and engineers. That's why we've created an approach called AutoML, showing that it's possible for neural nets to design neural nets. We hope AutoML will take an ability that a few PhDs have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.’ (emphasis mine)


An Opinionated Introduction to AutoML and Neural Architecture Search

Researchers from CMU and DeepMind recently released an interesting new paper, called Differentiable Architecture Search (DARTS), offering an alternative approach to neural architecture search, a very hot area of machine learning right now. Neural architecture search has been heavily hyped in the last year, with Google's CEO Sundar Pichai and Google's Head of AI Jeff Dean promoting the idea that neural architecture search and the large amounts of computational power it requires are essential to making machine learning available to the masses. Google's work on neural architecture search has been widely and adoringly covered by the tech media (see here, here, here, and here for examples).


What do machine learning practitioners actually do?

There are frequent media headlines about both the scarcity of machine learning talent (see here, here, and here) and about the promises of companies claiming their products automate machine learning and eliminate the need for ML expertise altogether (see here, here, and here). In his keynote at the TensorFlow DevSummit, Google's head of AI Jeff Dean estimated that there are tens of millions of organizations that have electronic data that could be used for machine learning but lack the necessary expertise and skills. I follow these issues closely since my work at fast.ai focuses on enabling more people to use machine learning and on making it easier to use.


Data Engineer vs Software Engineer Departments

People often treat these as completely separate entities, which is not true if the company is data driven. When we think about data pipelines we typically imagine some bearded guy with eye circles that builds and maintains ETLs. This is valid for cases related to the integration of mature external products (Google and Facebook ads, Zendesk, etc.) into your analytics environment. Data engineers can handle different tasks independently from the software engineering department. These can be pulling data from clear, well-documented, stable APIs, designed for retrieval of historical data from a 3rd party. Plus, today there´s a range of companies that build their business around syncing your data for you. At the same time, most of the startups are software-first companies that build both customer facing and internal software, and this is an entirely different story.