What is a Transformer?

New deep learning models are introduced at an increasing rate and sometimes it’s hard to keep track of all the novelties. That said, one particular neural network model has proven to be especially effective for common natural language processing tasks. The model is called a Transformer and it makes use of several methods and mechanisms that I’ll introduce here. The papers I refer to in the post offer a more detailed and quantitative description.


21 Must-Know Open Source Tools for Machine Learning you Probably Aren’t Using (but should!)

I love the open-source machine learning community. The majority of my learning as an aspiring and then as an established data scientist came from open-source resources and tools. If you haven’t yet embraced the beauty of open-source tools in machine learning – you’re missing out! The open-source community is massive and has an incredibly supportive attitude towards new tools and embracing the concept of democratizing machine learning.
1. Uber Ludwig
2. KNIME
3. Orange
4. MLFlow
5. Apple’s CoreML
6. TensorFlow Lite
7. TensorFlow.js
8. Hadoop
9. Spark
10. Neo4j
11. SimpleCV
12. Tesseract OCR
13. Detectron
14. StanfordNLP
15. BERT as a Service
16. Google Magenta
17. LibROSA
18. Google Research Football
19. OpenAI Gym
20. Unity ML Agents
21. Project Malmo


A review on sentiment discovery and analysis of educational big-data

Sentiment discovery and analysis (SDA) aims to automatically identify the underlying attitudes, sentiments, and subjectivity towards a certain entity such as learners and learning resources. Due to its enormous potential for smart education, SDA has been deemed as a powerful technique for identifying and classifying sentiments from multimodal and multisource data over the whole process of education. For big educational data streams, SDA faces challenges in unimodal feature selection, sentiment classification, and multimodal fusion. As such, a large body of studies in the literature explores diverse approaches to SDA for educational applications. This paper provides a self-contained, uniform overview of the SDA techniques for education. In particular, we focus on prominent studies in unimodal sentiment features and classifications (e.g., text, audio, and visual). In addition, we present a novel SDA framework of multimodal fusions, together with description of their crucial components. Based on this framework, we review different approaches to SDA on education from the perspectives of approaches and applications. After comprehensively reviewing the SDA techniques on education, we present the trends and prospectives of the future SDA research under ubiquitous education contexts.


Comparing Model Evaluation Techniques Part 1: Statistical Tools & Tests

Evaluating a model is just as important as creating the model in the first place. Even if you use the most statistically sound tools to create your model, the end result may not be what you expected. Which metric you use to test your model depends on the type of data you’re working with and your comfort level with statistics.


Improving the MLModel Base Class

In the previous blog post in this series I showed an object oriented design for a base class that does Machine Learning model prediction. The design of the base class was intentionally very simple so that I could show a simple example of how to use the base class with a scikit-learn model. I showed an easy way to publish schema metadata about the model inputs and outputs, and how to write model deserialization code so that it is hidden from the users of the model. I also showed how to hide the implementation details of the model by translating the user’s input to the model’s input so that the user of the model doesn’t have to know how to use pandas or numpy. In this blog post I will continue to make improvements to the MLModel class and the example that I used in the previous post. In this blog post I will make the iris example code from the previous post into a full python package with many features that will make the iris model easier to install and use from other python packages. I will also continue to improve the MLModel base class. In general, I want to show how to make ML code easier to install and use.


Natural Language Processing Classification Using Deep Learning And Word2Vec

I experienced machine learning algorithms before for different problematics like predictions of money exchange rate or image classification. I had to work on a project recently of text classification, and I read a lot of literature about this subject. The case of NLP (Natural Language Processing) is fascinating. When you begin to think about it, you realized that it’s not that simple, and before the classification, there’s still this question: ‘How the hell can an algorithm read words ?’. One solution is to transform words into vectors to have a numerical representation of them. This solution is far from new, and a few years ago, an article presented the Google Word2Vec unsupervised algorithm: Efficient Estimation of Word Representations in Vector Space (Mikolov & al., 2013). Many documentation about it can be found, but the point of this article is to detail from A to Z how to build machine learning algorithm for text classification. I will demonstrate how to use Word2Vec with the pre-trained Google news Dataset, and how to train it yourself with your data. I will then demonstrate two techniques; one is to do the mean of your document words, and the other is to keep your data like they are, which keep more information, but it’s a bit more complicated and requires more time to train. So it depends on you, what you think is better in your case and with your data.


Build a Scalable Search Engine for Fuzzy String Matching

In fuzzy matching our goal is to score string A to string B in terms of how close they are together. We want to find similar strings. For example ‘mayor’ could be very close to ‘major’, or something like ‘threat’ very close to a typo like ‘thraet’, but also ‘Christoph Alexander Ostertag’ could be very close to ‘Christoph Ostertag’. There exist a variety of algorithms for calculating those distances.


PDF Processing with Python

Being a high-level, interpreted language with a relatively easy syntax, Python is perfect even for those who don’t have prior programming experience. Popular Python libraries are well integrated and provide the solution to handle unstructured data sources like Pdf and could be used to make it more sensible and useful PDF is one of the most important and widely used digital media. used to present and exchange documents. PDFs contain useful information, links and buttons, form fields, audio, video, and business logic.


Sarcasm Detection: Step towards Sentiment Analysis

Humans have a social nature. Social nature means that we interact with each other in positive, friendly ways, and it also means that we know how to manipulate others in a very negative way. Sarcasm, which is both positively funny and negatively nasty, plays an important part in human social interaction. But sometimes it is difficult to detect whether someone is making fun of us with some irony. So to make it easy we built something which helps you in detecting sarcastic text, but before getting into much more detail, let’s define sarcasm:- Sarcasm is ‘a sharp, bitter, or cutting expression or remark’. The use of irony to mock or convey contempt. So now it’s time to charge up and start the journey of sarcasm detection.


Breaking down Correlation

Correlation is the first step in finding relationships between quantities and deserves some attention . Correlation is defined as the association between quantities , for eg, the sales might increase when income of people increases.


How to Use R in AWS Lambda

Cloud solutions rule the world of modern computing. Even the biggest players use solutions provided by Amazon (AWS stands for Amazon Web Services), Google or other cloud providers instead of establishing their own infrastructure. Such a solution saves time (and money), but the number of tasks that can be transferred outside is bigger than just using external servers. In fact, we can use various serverless solutions to deploy our application (e.g. Google App Engine), analyse real-time streaming data (e.g. Amazon Kinesis) and solve many other problems.


Time Series Analysis of SFO Traffic Growth

This is a writeup of a machine learning project I completed. In this post I will walk through the process of my project, and present results on what we can expect for the future of SFO.


Advancing Semi-supervised Learning with Unsupervised Data Augmentation

Success in deep learning has largely been enabled by key factors such as algorithmic advancements, parallel processing hardware (GPU / TPU), and the availability of large-scale labeled datasets, like ImageNet. However, when labeled data is scarce, it can be difficult to train neural networks to perform well. In this case, one can apply data augmentation methods, e.g., paraphrasing a sentence or rotating an image, to effectively increase the amount of labeled training data. Recently, there has been significant progress in the design of data augmentation approaches for a variety of areas such as natural language processing (NLP), vision, and speech. Unfortunately, data augmentation is often limited to supervised learning only, in which labels are required to transfer from original examples to augmented ones.


A Novel Idea of Utilizing A/B Testing ‘Internally’

Finally, it’s time for the hard work to pay off: As a data scientists, you helped the business team develop an advanced data-driven decision-making tool, with anticipated impact of improved overall working efficiency, as well as providing recommendations in specific business questions. However, what is the best way to measure these fabulous impacts? This is exactly the situation that my Practicum team is faced with, as we are wrapping up the year-long analytics project with our client Hilti. Thankfully, with the background knowledge of A/B testing, I found a novel yet applicable approach.


NLP(AI) Reality… decluttered

Poem generators, tweet generators, news generators, chat-bots, machines cracking exams… phew… Natural Language Processing (NLP) has gotten far far away. Or are we in an echo chamber of media hype? What is the reality? Reality is where the developers are… what are they coding, what issues arise, what conversations happen? Top developer focused sites like Stack Overflow has a great picture of this. It is also a great lead indicator of what is coming (solutions of tomorrow being developed today). Let’s declutter to get NLP reality. Here is the summary, basis tags & titles of developer issue queries raised. 11000 queries were analyzed across 1867 tags as of June 2019.
Advertisements