TRE: A Regex Engine with Approximate Matching

TRE is a regex engine that allows for approximate matching. It does this by using calculating the Levenshtein Distance (number of insertions, deletions, or substitutions it would take to make the strings equal), as it searchs for a match. The re::engine::TRE is a Perl wrapper around the TRE engine. It swaps out the default Perl regex engine with the TRE engine within the lexical scope that it is used.


KBpedia is a comprehensive knowledge structure for promoting data interoperability and knowledge-based artificial intelligence, or KBAI. The KBpedia knowledge structure combines seven ‘core’ public knowledge bases – Wikipedia, Wikidata,, DBpedia, GeoNames, OpenCyc, and UMBEL – into an integrated whole. KBpedia’s upper structure, or knowledge graph, is the KBpedia Knowledge Ontology. We base KKO on the universal categories and knowledge representation theories of the great 19th century American logician, polymath and scientist, Charles Sanders Peirce. KBpedia, written primarily in OWL 2, includes 55,000 reference concepts, about 30 million entities, and 5,000 relations and properties, all organized according to about 70 modular typologies that can be readily substituted or expanded. We test candidates added to KBpedia using a rigorous (but still fallible) suite of logic and consistency tests – and best practices – before acceptance. The result is a flexible and computable knowledge graph that can be sliced-and-diced and configured for all sorts of machine learning tasks, including supervised, unsupervised and deep learning.

AI technologies used in Robotics

• Computer Vision
• Natural language processing (NLP)
• Edge computing
• Complex event process
• Transfer learning and AI
• Hardware acceleration for AI
• Reinforcement learning
• GANs
• Mixed reality
• Emotion research – affective computing

Machine Learning Basics – Random Forest (video tutorial in German)

few colleagues of mine and I from are currently working on developing a free online course about machine learning and deep learning. As part of this course, I am developing a series of videos about machine learning basics – the first video in this series was about Random Forests.

User-adaptive statistical dialogue management using OpenDial

In Spoken Dialogue Systems, two techniques are currently used to create an optimal dialogue policy: hand-crafted rules and statistical procedures basing on machine learning. However, both types are not sufficient in complex areas where only limited training data is available. This thesis thus examines a hybrid approach to dialogue management that intents to combine the benefits of both rule-based and statistical methods. For this purpose, probabilistic rules are employed which depend on unknown parameters. Afterwards, these parameters are trained with supervised learning. Furthermore, the dialogue manager is designed to be adaptive to the user’s cultural background and emotional condition as this is supposed to have a crucial influence on the conversational behaviour. The configuration is then investigated in the context of the KRISTINA domain. The conducted experiments reveal that it is possible to include emotional and cultural features in the dialogue management.

A Data Hub is not the same as a Data Lake or Data Warehouse

Despite our best efforts we still receive lots of inquiries from organizations that confuse and conflate data hubs with data lakes and data warehouses. In truth, the term ‘data hub’ is the where the issue has come from. About two years ago a couple of us were looking at a number of topics and research areas and we connected the dots. The connective tissue was that idea that organisations ought to be conspicuously determining, ‘from where should our organizations maintain data (and analytics) governance policy that are executed, or the trusted source of data itself?’ It turns out that so many organizations had assumed all data and/or policy should be centralized (think ERP), or widely distributed at the edge (think standalone business application or IoT edge device). It turns out that a ‘new’ approach, once that permits the notion of intermediate nodes to store/execute the policy or trusted data itself somewhere within all the spaghetti of systems, yields real efficiencies and drives agility. Those ‘nodes’ we introduced as ‘data hubs’, knowing that the word ‘hub’ was already well used. It turns out the idea is real useful and popular, once you get over the nomenclature challenges. But for some reason, too many folks confuse these nodes with larger sets of collections of data – transaction and stream data – often referred to as data warehouses or data lakes. Data hubs tend to be small in comparison to data warehouses and data lakes; data hubs don’t store transaction information.

7 Best Practices for Machine Learning on a Data Lake

Download This Report to Learn About the Data Requirements for Advanced Analytics on a Data Lake, and Best Practices Such Analytics With a Focus on Machine Learning.
1. Understand the use Cases for Machine Learning
2. Consider Cloud-Based Platforms for Your Data Lake
3. Prepare Your Lake’s Data for Multiple Forms of Analytics
4. Utilize Machine Learning for Data Exploration and Analysis in the Data Lake
5. Create ‘labs’ for Machine Learning in the Data Lake
6. Prepare for Operationalizing Your Machine Learning Models
7. Become Aware of Deep Learning

Introduction to PyTorch for Deep Learning

In this tutorial, you’ll get an introduction to deep learning using the PyTorch framework, and by its conclusion, you’ll be comfortable applying it to your deep learning models. Facebook launched PyTorch 1.0 early this year with integrations for Google Cloud, AWS, and Azure Machine Learning. In this tutorial, I assume that you’re already familiar with Scikit-learn, Pandas, NumPy, and SciPy. These packages are important prerequisites for this tutorial.

Text Preprocessing in Python: Steps, Tools, and Examples

In this paper, we will talk about the basic steps of text preprocessing. These steps are needed for transferring text from human language to machine-readable format for further processing. We will also discuss text preprocessing tools.

Mastering the Learning Rate to Speed Up Deep Learning

Efficiently training deep neural networks can often be an art as much as a science. Industry-grade libraries like PyTorch and TensorFlow have rapidly increased the speed with which efficient deep learning code can be written, but there are still a lot of work required to create a performant model. Let’s say, for example, you want to build an image classifier model. A convolutional neural network would be the proper approach to utilize deep learning. But how many layers go in your network? How much momentum and weight decay should you use? What’s the best dropout probability?

Learning Concepts with Energy Functions

We’ve developed an energy-based model that can quickly learn to identify and generate instances of concepts, such as near, above, between, closest, and furthest, expressed as sets of 2d points. Our model learns these concepts after only five demonstrations. We also show cross-domain transfer: we use concepts learned in a 2d particle environment to solve tasks on a 3-dimensional physics-based robot.

Integrating R and Telegram

Some models (deep learning) take a long time to finish. Even some data preparation scripts. We can be notified that the process ended by Telegram sending messages from R.

Using httr to Detect HTTP(s) Redirects

In this short note I will write about the httr package and my need to detect whether or not an HTTP request had been redirected or not – it turns out this is quite easy. Along the way I will also show how to access information of an HTTP-conversation other than the actual content to be retrieved.

Source and List: Organizing R Shiny Apps

Keeping R Shiny code organized can be a challenge. One method to organize your Shiny UI and Server code is to use a combination of R’s list and source functions. Another method to organize you’re Shiny code is through modularization techniques. Here though, we’re going concentrate on the list and source options.

Causal mediation estimation measures the unobservable

I put together a series of demos for a group of epidemiology students who are studying causal mediation analysis. Since mediation analysis is not always so clear or intuitive, I thought, of course, that going through some examples of simulating data for this process could clarify things a bit.

How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm.

In my last blogpost about Random Forests I introduced the Bootcamp. The next part I published was about Neural Networks and Deep Learning. Every video of our bootcamp will have example code and tasks to promote hands-on learning. While the practical parts of the bootcamp will be using Python, below you will find the English R version of this Neural Nets Practical Example, where I explain how neural nets learn and how the concepts and techniques translate to training neural nets in R with the H2O Deep Learning function.

Predict Search Relevance Using Machine Learning for Online Retailers

Large online retailers typically use query-based search to help consumers find information/products on their websites. They are able to use technology to provide users with a better experience. Because they understand the importance of search relevance, and that long and/or unsuccessful searches can turn their users away because users are accustomed to and expect instant, relevant search results like they get from Google and Amazon.

AI Series: Data Scientists, the modern alchemists.

In first place, data scientists need to understand the nature of the problem they have to solve. In machine learning there are mainly 3 types of problems: Classification, Regression and Clustering. Classification tasks involve the ability to assign the input data to categorial labels, like ‘yes’ or ‘no’, ‘True’ or ‘False’ or more complex ones like face recognition by assigning a face to the name of the person it belongs to. Regression tasks are similar to the classification tasks, but the prediction is related to a continuous value rather than a category of objects. Teaching an algorithm to predict how the prices related to a specific product or service will change under a specific set of circumstances is a regression problem. Clustering problems are closer to the traditional data mining tasks where the need is to analyze unlabeled data to discover specific and hidden patterns allowing the extraction of powerful insights like in the case of product recommendation.

Security and Privacy considerations in Artificial Intelligence & Machine Learning – Part 5: When the attackers use AI.

In the last post, we reviewed some constructive use cases of AI & ML for security-related scenarios. However, just as we, the defenders, have started increasingly using AI for improving security so have the attackers to make attacks more sophisticated! Let us look at how AI & ML are getting deployed and leveraged as a part of the attackers’ armory. Like earlier, we will begin with cybersecurity scenarios but then also touch on broader issues around ‘weaponization’ and worries emerging from AI going rogue on us.

Learning to make Recommendations

Recommender Systems are machine learning systems that help users discover new product and services. Every time you shop online, Recommendation system is guiding towards the most likely product the user is going to buy. Recommendation system in today’s world is an essential feature as users are often overwhelmed by choice and recommendation system help users quickly find the product they love. It leads to more sales as well as make users happier. These systems make companies dollars as recommending a product to a user might yield a sale or make the user think about whether to purchase a product or not. It’s like a salesman recommending product but a virtual one without any human curation but a better curation based on your history. Recommender Systems is like a friend suggesting you.

Slopes as features: making time-sensitive predictions

A lot of the projects I work on are time-bound in one way or another. My clients need to know the churn rate next week, the risk of fraud next month, their anticipated revenue next quarter. But what features does a model need to do this well? Feature engineering is one of the most creatively challenging aspects of a data science project. When you follow a tutorial or read a book, it’s easy to forget that someone had to go through the difficult work of creating the features you use in your model. In practice, creating features from raw data requires a great amount of foresight and some intuition about what would help your model do its job best. One the best ways I’ve found to increase the accuracy of a time-based predictive model (especially one that’s trained on an imbalanced data set) is to use slopes.

Scaling Machine Learning at Uber with Michelangelo

In September 2017, we published an article introducing Michelangelo, Uber’s Machine Learning Platform, to the broader technical community. At that point, we had over a year of production experience under our belts with the first version of the platform, and were working with a number of our teams to build, deploy, and operate their machine learning (ML) systems. As our platform matures and Uber’s services grow, we’ve seen an explosion of ML deployments across the company. At any given time, hundreds of use cases representing thousands of models are deployed in production on the platform. Millions of predictions are made every second, and hundreds of data scientists, engineers, product managers, and researchers work on ML solutions across the company. In this article, we reflect on the evolution of ML at Uber from the platform perspective over the last three years. We review this journey by looking at the path taken to develop Michelangelo and scale ML at Uber, offer an in-depth look at Uber’s current approach and future goals towards developing ML platforms, and provide some lessons learned along the way. In addition to the technical aspects of the platform, we also look at the important organizational and process design considerations that have been critical to our success with ML at Uber.