Persistent Identifier Kernel Information
A Persistent Identifier (PID) is a widely used long-term unique reference to a digital object. Handle, one of the main persistent identifier schemes in use, implements a central global registry to resolve PIDs. The value of a Handle can vary in size and type without any restriction on the user side. However, widespread use of Handle raises challenges for users and curators in managing and correlating different PIDs. In this research paper, we propose an idea about the value of a Handle, called Persistent Identifier Kernel Information, which is the critical metadata describing the minimal information for identifying the PID object. In addition, an API service called the Collection API works together with PID Kernel Information to manage the Backbone Provenance relationships among different PIDs. This paper is an early research exploration describing the strengths and weaknesses of the Collection API and PID Kernel Information. …
Modified Sequential Probability Ratio Test (MSPRT)
In an MSPRT design, the maximum sample size of an experiment is fixed prior to the start of the experiment; the alternative hypothesis used to define the rejection region of the test is derived from the size of the test (Type I error) and the maximum available sample size (N); and the targeted Type II error (equal to 1 minus the power) is also prespecified. Given these values, the MSPRT is defined in a manner very similar to Wald’s initial proposal. This test can reduce the average sample size required to perform statistical hypothesis tests at the specified levels of significance and power. …
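To make the reference to Wald's proposal concrete, here is a minimal sketch of a classical Wald SPRT for Bernoulli data, truncated at a maximum sample size. It is not the MSPRT itself: the function name `truncated_sprt` is hypothetical, the alternative `p1` is supplied by the caller rather than derived from the test size and N, and the sketch simply reports "undecided" if no boundary is crossed by `max_n`.

```python
import math


def truncated_sprt(observations, p0, p1, alpha, beta, max_n):
    """Wald's SPRT for Bernoulli data, stopped after at most max_n samples.

    Illustrative only: p1 is chosen by the caller here, whereas the MSPRT
    derives its alternative from the test size and the maximum sample size.
    """
    # Wald's approximate decision boundaries on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))

    llr = 0.0
    n = 0
    for n, x in enumerate(observations[:max_n], start=1):
        # Add the log-likelihood ratio contribution of one observation.
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    # No boundary crossed within the sample budget (assumed fallback).
    return "undecided", n


decision, samples_used = truncated_sprt([1] * 20, 0.5, 0.8, 0.05, 0.2, 20)
```

The early return is what lets a sequential test stop before the full sample budget is used when the data are decisive.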
Tell Me Something New (TMSN)
We present a novel approach for parallel computation in the context of machine learning that we call ‘Tell Me Something New’ (TMSN). This approach involves a set of independent workers that use broadcast to update each other when they observe ‘something new’. TMSN does not require synchronization or a head node and is highly resilient against failing machines or laggards. We demonstrate the utility of TMSN by applying it to learning boosted trees. We show that our implementation is 10 times faster than XGBoost and LightGBM on the splice-site prediction problem. …
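The broadcast-on-improvement idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function `simulate_tmsn`, the `min_gain` threshold, and the use of a plain numeric score as "something new" are all hypothetical, and real workers would run asynchronously on separate machines.

```python
def simulate_tmsn(candidate_streams, min_gain=0.0):
    """Simulate workers that broadcast only when they find an improvement.

    candidate_streams[w] is the sequence of scores worker w discovers.
    Hypothetical sketch: workers are stepped in turn here purely so the
    example is deterministic; no head node coordinates them.
    """
    n_workers = len(candidate_streams)
    best = [float("-inf")] * n_workers  # each worker's current best score
    broadcasts = []                     # log of (worker, score) messages

    rounds = max(len(stream) for stream in candidate_streams)
    for step in range(rounds):
        for w, stream in enumerate(candidate_streams):
            if step >= len(stream):
                continue  # a finished or failed worker does not block others
            score = stream[step]
            if score > best[w] + min_gain:
                # "Something new": update locally and tell everyone else.
                best[w] = score
                broadcasts.append((w, score))
                for other in range(n_workers):
                    if other != w and score > best[other]:
                        best[other] = score
    return best, broadcasts


best, broadcasts = simulate_tmsn([[0.1, 0.5, 0.2], [0.3, 0.4]])
```

In this toy run the second worker's stream ends early, yet the first worker keeps making progress, which mirrors the claim that no worker has to wait for a laggard.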
Hierarchical Temporal Convolutional Network (HierTCN)
Recommender systems that can learn from cross-session data to dynamically predict the next item a user will choose are crucial for online platforms. However, existing approaches often use out-of-the-box sequence models which are limited by speed and memory consumption, are often infeasible for production environments, and usually do not incorporate cross-session information, which is crucial for effective recommendations. Here we propose Hierarchical Temporal Convolutional Networks (HierTCN), a hierarchical deep learning architecture that makes dynamic recommendations based on users’ sequential multi-session interactions with items. HierTCN is designed for web-scale systems with billions of items and hundreds of millions of users. It consists of two levels of models: The high-level model uses Recurrent Neural Networks (RNN) to aggregate users’ evolving long-term interests across different sessions, while the low-level model is implemented with Temporal Convolutional Networks (TCN), utilizing both the long-term interests and the short-term interactions within sessions to predict the next interaction. We conduct extensive experiments on a public XING dataset and a large-scale Pinterest dataset that contains 6 million users with 1.6 billion interactions. We show that HierTCN is 2.5x faster than RNN-based models and uses 90% less data memory compared to TCN-based models. We further develop an effective data caching scheme and a queue-based mini-batch generator, enabling our model to be trained within 24 hours on a single GPU. Our model consistently outperforms state-of-the-art dynamic recommendation methods, with up to 18% improvement in recall and 10% in mean reciprocal rank. …
If you did not already know
19 Wednesday Jan 2022
Posted in What is ...