Normalized Setwise Levenshtein Distance (NSLD) google
This work tackles the problem of fuzzy joining of strings that naturally tokenize into meaningful substrings, e.g., full names. Tokenized-string joins have several established applications in the context of data integration and cleaning. This work is primarily motivated by fraud detection, where attackers slightly modify tokenized strings, e.g., names on accounts, to create numerous identities that she can use to defraud service providers, e.g., Google, and LinkedIn. To detect such attacks, all the accounts are pair-wise compared, and the resulting similar accounts are considered suspicious and are further investigated. Comparing the tokenized-string features of a large number of accounts requires an intuitive tokenized-string distance that can detect subtle edits introduced by an adversary, and a very scalable algorithm. This is not achievable by existing distance measure that are unintuitive, hard to tune, and whose join algorithms are serial and hence unscalable. We define a novel intuitive distance measure between tokenized strings, Normalized Setwise Levenshtein Distance (NSLD). To the best of our knowledge, NSLD is the first metric proposed for comparing tokenized strings. We propose a scalable distributed framework, Tokenized-String Joiner (TSJ), that adopts existing scalable string-join algorithms as building blocks to perform NSLD-joins. We carefully engineer optimizations and approximations that dramatically improve the efficiency of TSJ. The effectiveness of the TSJ framework is evident from the evaluation conducted on tens of millions of tokenized-string names from Google accounts. The superiority of the tokenized-string-specific TSJ framework over the general-purpose metric-spaces joining algorithms has been established. …

Sparse Matrix-Based SPARQL (SM-based SPARQL,gSMat) google
Resource Description Framework (RDF) has been widely used to represent information on the web, while SPARQL is a standard query language to manipulate RDF data. Given a SPARQL query, there often exist many joins which are the bottlenecks of efficiency of query processing. Besides, the real RDF datasets often reveal strong data sparsity, which indicates that a resource often only relates to a few resources even the number of total resources is large. In this paper, we propose a sparse matrix-based (SM-based) SPARQL query processing approach over RDF datasets which con- siders both join optimization and data sparsity. Firstly, we present a SM-based storage for RDF datasets to lift the storage efficiency, where valid edges are stored only, and then introduce a predicate- based hash index on the storage. Secondly, we develop a scalable SM-based join algorithm for SPARQL query processing. Finally, we analyze the overall cost by accumulating all intermediate results and design a query plan generated algorithm. Besides, we extend our SM-based join algorithm on GPU for parallelizing SPARQL query processing. We have evaluated our approach compared with the state-of-the-art RDF engines over benchmark RDF datasets and the experimental results show that our proposal can significantly improve SPARQL query processing with high scalability. …

Natural Language Interaction (NLI) google
Natural Language Interaction (NLI) enables people to interact with any connected device using normal, everyday language. It understands the meaning of conversational input, and reacts accordingly, creating value and enhancing the user experience. …

Neural SLAM google
We present an approach for agents to learn representations of a global map from sensor data, to aid their exploration in new environments. To achieve this, we embed procedures mimicking that of traditional Simultaneous Localization and Mapping (SLAM) into the soft attention based addressing of external memory architectures, in which the external memory acts as an internal representation of the environment. This structure encourages the evolution of SLAM-like behaviors inside a completely differentiable deep neural network. We show that this approach can help reinforcement learning agents to successfully explore new environments where long-term memory is essential. We validate our approach in both challenging grid-world environments and preliminary Gazebo experiments. …

Advertisements