Data Analytics
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.
Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. Data analytics is used in many industries to allow companies and organizations to make better business decisions and in the sciences to verify or disprove existing models or theories.
Definition …
Active Learning in Python (ALiPy)
Supervised machine learning methods usually require a large set of labeled examples for model training. However, in many real applications, unlabeled data are plentiful but labeled data are limited, and acquiring labels is costly. Active learning (AL) reduces the labeling cost by iteratively selecting the most valuable data and querying their labels from the annotator. This article introduces ALiPy, a Python toolbox for active learning. ALiPy provides a module-based implementation of the active learning framework, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods. In the toolbox, multiple options are available for each component of the learning framework, including data processing, active selection, label query, results visualization, etc. In addition to implementations of more than 20 state-of-the-art active learning algorithms, ALiPy also lets users easily configure and implement their own approaches under different active learning settings, such as AL for multi-label data, AL with noisy annotators, AL with different costs and so on. The toolbox is well-documented and open-source on GitHub, and can be easily installed through PyPI. …
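The query loop described above can be sketched in a few lines. This is a minimal, generic illustration of pool-based active learning with margin-based uncertainty sampling, not ALiPy's API: the nearest-centroid model, the `oracle` function standing in for the annotator, and the toy one-dimensional pool are all assumptions made for the example.

```python
def fit_centroids(labeled_points):
    """Return the mean feature value per class from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled_points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def margin(x, centroids):
    """Gap between the two nearest centroid distances; small means uncertain."""
    distances = sorted(abs(x - c) for c in centroids.values())
    return distances[1] - distances[0]


def oracle(x):
    """Hypothetical annotator: the true label is 1 when x >= 60."""
    return int(x >= 60)


def active_learning(pool, seed_indices, budget):
    """Query `budget` labels, always picking the most uncertain pool item."""
    # The seed set must contain at least two classes for the margin to exist.
    labeled = {i: oracle(pool[i]) for i in seed_indices}
    for _ in range(budget):
        centroids = fit_centroids([(pool[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        query = min(unlabeled, key=lambda i: margin(pool[i], centroids))
        labeled[query] = oracle(pool[query])  # pay for one label
    return labeled


pool = list(range(0, 101, 5))  # toy 1-D pool: 0, 5, ..., 100
labeled = active_learning(pool, seed_indices=[0, 20], budget=5)
print(sorted(pool[i] for i in labeled))
```

The queried points cluster around the decision boundary, which is the intended effect: labels are spent where the current model is least certain rather than uniformly across the pool.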
Ensemble Forecast Framework (ENFF)
An accurate load forecast is always important for the power industry and energy players as it enables stakeholders to make critical decisions. In addition, its importance is further increased with growing uncertainties in the generation sector due to the high penetration of renewable energy and the introduction of demand side management strategies. An incremental improvement in grid-level demand forecast of anomalous days can potentially save millions of dollars. However, due to an increasing penetration of renewable energy resources and their dependency on several meteorological and exogenous variables, accurate load forecasting of anomalous days has now become very challenging. To improve the prediction accuracy of load forecasting, an ensemble forecast framework (ENFF) is proposed with a systematic combination of three predictors, namely Elman neural network (ELM), feedforward neural network (FNN) and radial basis function (RBF) neural network. These predictors are trained using global particle swarm optimization (GPSO) to improve their prediction capability in the ENFF. The outputs of the individual predictors are combined using a trim aggregation technique that removes forecasting anomalies. Real recorded data from the New England ISO grid is used for training and testing of the ENFF for anomalous days. The forecast results of the proposed ENFF indicate a significant improvement in prediction accuracy in comparison to autoregressive integrated moving average (ARIMA) and back-propagation neural networks (BPNN) based benchmark models. …
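The abstract does not spell out the trim aggregation rule, so the following is only one plausible reading, offered as a sketch: at each time step, discard the forecast farthest from the median of the predictors and average the rest. The function name `trim_aggregate`, the `n_trim` parameter, and the sample load values are hypothetical.

```python
from statistics import median


def trim_aggregate(forecasts, n_trim=1):
    """Combine forecasts step by step, dropping the n_trim most outlying values.

    forecasts: list of equal-length lists, one list per predictor.
    Assumed rule (not taken from the paper): outlyingness is the absolute
    distance from the per-step median across predictors.
    """
    if not 0 <= n_trim < len(forecasts):
        raise ValueError("n_trim must leave at least one predictor")
    combined = []
    for step_values in zip(*forecasts):
        center = median(step_values)
        ranked = sorted(step_values, key=lambda v: abs(v - center))
        kept = ranked[:len(step_values) - n_trim]
        combined.append(sum(kept) / len(kept))
    return combined


# Made-up hourly load forecasts (MW) from three predictors.
elman = [100.0, 110.0, 120.0]
fnn = [102.0, 150.0, 118.0]
rbf = [98.0, 112.0, 60.0]
ensemble = trim_aggregate([elman, fnn, rbf])
print(ensemble)
```

In this toy input, the spike of 150.0 in the second step and the dip of 60.0 in the third are excluded before averaging, so a single misbehaving predictor does not drag the ensemble forecast.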
Cross-Dimensional Self-Attention (CDSA)
Many real-world applications involve multivariate, geo-tagged time series data: at each location, multiple sensors record corresponding measurements. For example, an air quality monitoring system records PM2.5, CO, etc. The resulting time-series data often has missing values due to device outages or communication errors. In order to impute the missing values, state-of-the-art methods are built on Recurrent Neural Networks (RNN), which process each time stamp sequentially, prohibiting the direct modeling of the relationship between distant time stamps. Recently, the self-attention mechanism has been proposed for sequence modeling tasks such as machine translation, significantly outperforming RNN because the relationship between any two time stamps can be modeled explicitly. In this paper, we are the first to adapt the self-attention mechanism for multivariate, geo-tagged time series data. In order to jointly capture the self-attention across multiple dimensions, including time, location and the sensor measurements, while maintaining low computational complexity, we propose a novel approach called Cross-Dimensional Self-Attention (CDSA) to process each dimension sequentially, yet in an order-independent manner. Our extensive experiments on four real-world datasets, including three standard benchmarks and our newly collected NYC-traffic dataset, demonstrate that our approach outperforms the state-of-the-art imputation and forecasting methods. A detailed systematic analysis confirms the effectiveness of our design choices. …
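To make concrete what "modeling the relationship between any two time stamps explicitly" means, here is a minimal sketch of plain scaled dot-product self-attention along the time dimension only. It is not CDSA itself: there are no learned query, key, or value projections (the inputs are used directly), only one head, and no handling of the location or measurement dimensions. The toy `series` values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with identity projections (an assumption).

    x: list of T feature vectors, one per time stamp, each of length d.
    Returns (outputs, weights), where weights[i][j] is how much time stamp i
    attends to time stamp j, regardless of how far apart they are.
    """
    d = len(x[0])
    weights = []
    for query in x:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in x
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


series = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 time stamps, 2 features
outputs, weights = self_attention(series)
for row in weights:
    print([round(w, 3) for w in row])
```

Every pair of time stamps gets its own weight in a single step, in contrast to an RNN, where information from a distant time stamp must pass through every intermediate state.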
If you did not already know
Posted Thursday, 10 Dec 2020 in What is ...