Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)

Have you ever been inside a well-maintained library? I'm always incredibly impressed with the way the librarians keep everything organized by name, content, and other topics. But if you gave these librarians thousands of books and asked them to arrange each book on the basis of its genre, they would struggle to accomplish this task in a day, let alone an hour! This wouldn't be a problem, however, if those books came in a digital format, right? All the arrangement seems to happen in a matter of seconds, without requiring any manual effort. All hail Natural Language Processing (NLP).

pandas: powerful Python data analysis toolkit

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with ‘relational’ or ‘labeled’ data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
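As a flavor of what that "labeled data" convenience looks like in practice, a tiny example with made-up data:

```python
import pandas as pd

# A small labeled dataset (made-up values)
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"],
                   "temp_c": [4.0, 22.5, 6.1]})

# Split-apply-combine with label-based access in two lines
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city["Oslo"])  # mean of the two Oslo rows
```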

New Startup Combines Code Analysis and Machine Learning to Speed Software Modernization

source{d}, the company enabling Machine Learning for large-scale code analysis, announced the public beta of source{d} Engine and the public alpha of source{d} Lookout. Combining code retrieval, language-agnostic parsing, and git history tools with familiar APIs, source{d} Engine simplifies code analysis. source{d} Lookout is a service for assisted code review that enables running custom code analyzers on GitHub pull requests. According to a recent Thomson Reuters report, by now all established companies in traditional industries like finance, retail, and manufacturing have become technology companies. While source code is now a large part of every company's assets, that asset often remains underutilized. Large-scale code analysis and Machine Learning on Code is the next logical step for companies as they progress on their digital transformation and IT automation journeys. source{d}, the only open-core company building a tech stack for Code as Data and Machine Learning on Code (ML on Code), turns code into an analyzable and productive asset across an enterprise's source code repositories, facilitating the adoption of Inner Source practices at large traditional companies.

Bookreview: A Data Scientist’s Guide to Acquiring, Cleaning and Managing Data in R

Overall, the book was worth the time spent with it. I recommend it particularly for people like me who in principle know their R but have never given much thought to the data handling side of it. It will also be a good read for R beginners who want to benefit from the authors' treasure trove of experience and their view on the big picture of data handling in context. However, in spite of the claim on the book's cover, one should not expect an entirely systematic approach to data handling.

Solving multiarmed bandits: A comparison of epsilon-greedy and Thompson sampling

The multi-armed bandit (MAB) is a classic problem in decision sciences. Effectively, it is one of optimal resource allocation under uncertainty. The name is derived from old slot machines that were operated by pulling an arm; they are called bandits because they rob those who play them. Now, imagine there are multiple machines and we suspect that the payout rate (the payout-to-pull ratio) varies across the machines. Naturally, we want to identify the machine with the highest payout and exploit it, i.e., pull it more than the others.
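To make the exploration/exploitation trade-off concrete, here is a toy epsilon-greedy simulation, one of the two strategies the article compares, with made-up payout rates:

```python
import random

random.seed(42)
payout_rates = [0.3, 0.5, 0.7]  # hidden per-machine win probabilities (assumed)
counts = [0, 0, 0]               # pulls per arm
values = [0.0, 0.0, 0.0]         # running mean reward per arm
epsilon = 0.1                    # fraction of pulls spent exploring

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)        # explore: pick a random arm
    else:
        arm = values.index(max(values))  # exploit: pick the current best arm
    reward = 1 if random.random() < payout_rates[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

best = values.index(max(values))
print(best, [round(v, 2) for v in values])
```

Thompson sampling, the other strategy in the comparison, would instead keep a Beta posterior per arm and pull the arm with the highest sampled value on each round.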

Review — ARCNN (Deblocking / Denoising)

In this story, Artifacts Reduction CNN (ARCNN) is reviewed.

DBSCAN clustering for data shapes k-means can’t handle well (in Python)

In this post I’d like to take some content from Introduction to Machine Learning with Python by Andreas C. Müller & Sarah Guido and briefly expand on one of the examples provided to showcase some of the strengths of DBSCAN clustering when k-means clustering doesn’t seem to handle the data shape well. I’m going to go right to the point, so I encourage you to read the full content of Chapter 3, starting on page 168 if you would like to expand on this topic. I’ll be quoting the book when describing the working of the algorithm.
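As a quick taste of the contrast the post demonstrates, a minimal sketch on scikit-learn's two-moons data (scaling and default DBSCAN parameters follow the book's example, but this is my own assumption, not the post's exact code):

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moon clusters that k-means cannot separate cleanly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
dbscan_labels = DBSCAN().fit_predict(X_scaled)  # defaults: eps=0.5, min_samples=5

# DBSCAN follows the density of each moon; k-means cuts them with a straight line
print(len(set(dbscan_labels) - {-1}))  # clusters found (DBSCAN labels noise as -1)
```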

Deep Neural Networks for Regression Problems

Neural networks are well known for classification problems; for example, they are used in handwritten digit classification. But the question is: would they be fruitful for regression problems as well? In this article I will use a deep neural network to predict house pricing using a dataset from Kaggle.
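For a sense of what a neural network doing regression looks like, a minimal sketch with scikit-learn's MLPRegressor on synthetic data (the article itself uses a deep network on a Kaggle house-pricing dataset; the features and target below are stand-ins):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for house features (e.g. size, rooms, age) and price
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2]  # noiseless target for illustration

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers trained on the default squared-error objective
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on held-out data
```

The only change from a classifier is the output: a single linear unit trained on squared error instead of class probabilities.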

Understanding your Convolution network with Visualizations

The field of Computer Vision has seen tremendous advancements since Convolutional Neural Networks came into being. The incredible speed of research in this area, combined with the open availability of vast amounts of image data on the web, has given us incredible results in the past few years. The rise of large convolutional neural networks started with AlexNet in 2012, which was created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, and was the winning entry in the ImageNet Large-Scale Visual Recognition Challenge that year. Since then, there has been no looking back for researchers in this field, and the results in various areas of Computer Vision are clear proof of that. From face recognition in your phone to self-driving cars, the tremendous power of CNNs is being used to solve many real-world problems.

How Microsoft Uses Machine Learning to Help You Build Machine Learning Pipelines

Last week at its Ignite Conference, Microsoft unveiled the preview version of Automated Machine Learning (ML), a component of Azure ML that allows non-data-science experts to build machine learning pipelines. Microsoft's automated machine learning can be seen as its entrance into the popular AutoML space, which is quickly becoming one of the most active areas of research in machine learning. The work behind Microsoft's automated machine learning was outlined in a research paper published a few months ago. Building machine learning solutions in the real world is a never-ending cycle of steps such as extracting features, identifying models, and tuning hyperparameters. Each of these tasks requires specific expertise that can be overwhelming to non-data scientists. The decision-making process that data scientists follow is never formalized in ways that can be reused by other teams with less expertise. Even worse is the fact that the complexity of building a machine learning pipeline scales linearly with the complexity of the project and, more often than not, results in data science teams sacrificing the accuracy of a model in favor of delivering something that works.

Classification on a large and noisy dataset with R

Some days ago I wrote an article describing a comprehensive supervised learning workflow in R with multiple modelling using the packages caret and caretEnsemble. Back then I mentioned that the dataset I was using was kind of an easy one: it was fully numeric and perfectly filled (not a single missing value), with no categorical features and no class imbalance (of course, since it was a regression problem), and it was fairly small, with just 8 predictors. It felt almost like cheating, but it really helped, both for writing my first article and for going through the full workflow without much trouble, and finally for doing some multiple modelling using caretEnsemble. So now I've decided to take this from easy difficulty to normal difficulty.

A Checklist for working with Complex ML Problems

Many times, we encounter complex machine learning problems that are difficult to break down into simple sub-problems. Those working in startups will relate to the fact that we often have a habit of jumping from one experiment to another while trying to solve such complex use cases. At Squad, our machine learning team encountered a similar challenge. Despite doing an initial study, we at times lost track of our experiments and approaches. We recognized that to avoid frequent mistakes and to have clarity while approaching a complex business problem, we must have a standard checklist. The checklist should ensure that we don't miss critical elements while solving any complex problem. In this blog post, we will share a 3-stage checklist which can be used for approaching complex ML problems. Each stage has some checkpoints which can help in systematic problem-solving.

DeepGL on Neo4j

In 2013 Tomas Mikolov and his Google colleagues released a paper describing word2vec, and popularised the idea of generating embeddings to represent pieces of data.