Keyword Extraction with TF-IDF and scikit-learn – Full Working Example

In this era of deep learning, you may wonder why you would use TF-IDF for any task at all. The truth is that TF-IDF is easy to understand, easy to compute, and one of the most versatile statistics for showing the relative importance of a word or phrase in a document, or in a set of documents, compared to the rest of your corpus. It can be used for a wide range of tasks, including text classification, clustering and topic modeling, search, keyword extraction, and much more. In this article, you will learn how to use TF-IDF from the scikit-learn package to extract keywords from documents. Keywords are simply descriptive words or phrases that characterize your documents; for example, keywords for this article would be tf-idf, scikit-learn, keyword extraction, extract, and so on. In some applications these keywords are also referred to as topics.

The Data Science Method (DSM) – Exploratory Data Analysis

This is the third article in a series about taking your data science projects to the next level by using a methodological approach similar to the scientific method, coined the Data Science Method. This article focuses on step three, Exploratory Data Analysis. If you missed the previous articles in this series, you can start at the beginning here, or click on a step title below to read about a specific step in the process.

19 Great Articles About Natural Language Processing (NLP)

• Structuring Unstructured Big Data via Indexation
• Your Guide to Natural Language Processing (NLP)
• Comparison of Top 6 Python NLP Libraries
• Text Classification & Sentiment Analysis tutorial
• Deep Learning Research Review: Natural Language Processing
• 10 Common NLP Terms Explained for the Text Analysis Novice
• Temporal Convolutional Nets Take Over from RNNs for NLP Predictions
• How I used NLP (Spacy) to screen Data Science Resumes
• Data Science Reveals Trump Tweets are Written by Two People
• Simple introduction to Natural Language Processing
• An NLP Approach to Analyzing Twitter, Trump, and Profanity
• A Natural Language Processing (NLP) Approach to Data Exploration
• Python NLTK Tools List for Natural Language Processing
• NLP app to find great available domain names
• Scaling an NLP problem without using a ton of hardware
• Analyzing the structure and effectiveness of news headlines
• Seven tricky sentences for NLP and text mining algorithms
• Overview of Artificial Intelligence and Role of NLP

Putting AI First

Overshadowed in Davos last month by the constant buzz of the ‘Fourth Industrial Revolution’ and ‘Human-centered A.I.’, artificial intelligence could be read on the lips of nearly every CEO in attendance.[1] In sharp contrast with the diversity of the technical introductions, there seemed to be a common mindset around the bottom line: artificial intelligence is a massive opportunity for both commerce and industry. In the shadow of such euphoria, the answers to two fundamental questions seemed lost in the chatter: how will organizations implement AI, and how can human and machine intelligence be leveraged together to bring tangible benefits to the business? In this contribution, we will distinguish AI from machine learning, discuss the notion of AI First, and propose our AI Roadmap to demonstrate how bringing data science to the table can benefit organizations today and tomorrow.

Handcrafting an Artificial Neural Network

In this article, I implement fully vectorized Python code for an Artificial Neural Network and test it on multiple datasets. Dropout and L2 regularization techniques are also implemented and explained in detail. Before reading, it is highly recommended that you go through the basic workings of an Artificial Neural Network, forward propagation, and backward propagation.
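To give a flavor of what "vectorized with dropout and L2" means (this is an illustrative sketch with NumPy, not the article's actual code; the layer sizes and keep probability are arbitrary choices), here is a one-hidden-layer forward pass:

```python
# Forward pass with inverted dropout and an L2 penalty, vectorized in NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))        # 4 samples, 3 features
W1 = rng.standard_normal((3, 5)) * 0.1 # hidden layer weights
W2 = rng.standard_normal((5, 1)) * 0.1 # output layer weights
keep_prob, lam = 0.8, 0.01             # dropout keep probability, L2 strength

# Hidden layer: ReLU, then inverted dropout (scale kept units by 1/keep_prob
# so the expected activation is unchanged at test time).
A1 = np.maximum(0, X @ W1)
mask = (rng.random(A1.shape) < keep_prob) / keep_prob
A1 *= mask

out = A1 @ W2                          # linear output layer
# L2 regularization adds a weight-decay term to the cost.
l2_penalty = (lam / (2 * X.shape[0])) * (np.sum(W1**2) + np.sum(W2**2))
print(out.shape, l2_penalty)
```

The same mask must be reused in backpropagation, and the L2 term contributes `lam/m * W` to each weight gradient.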

Having an Imbalanced Dataset? Here Is How You Can Fix It.

Classification is one of the most common machine learning problems. The best way to approach any classification problem is to start by analyzing and exploring the dataset in what we call Exploratory Data Analysis (EDA). The sole purpose of this exercise is to generate as many insights from the data as possible. It is also used to find any problems that might exist in the dataset. One of the issues commonly found in datasets used for classification is the imbalanced-classes problem.
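One of the simplest fixes, sketched here with only the standard library on made-up data, is to randomly oversample the minority class until both classes are the same size:

```python
# Random oversampling: duplicate minority-class samples until balanced.
import random

random.seed(0)
# A 90/10 imbalanced toy dataset of (feature, label) pairs.
data = [(x, 0) for x in range(90)] + [(x, 1) for x in range(10)]

minority = [s for s in data if s[1] == 1]
majority = [s for s in data if s[1] == 0]

# Draw (with replacement) enough extra minority samples to match the majority.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra
random.shuffle(balanced)

counts = {0: 0, 1: 0}
for _, y in balanced:
    counts[y] += 1
print(counts)  # {0: 90, 1: 90}
```

Undersampling the majority class, class weights, and synthetic oversampling (e.g. SMOTE) are the usual alternatives when plain duplication overfits.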

New Perspectives on Statistical Distributions and Deep Learning

In this data science article, the emphasis is on science, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research that asks more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. It is based on mixture models. Mixtures have been studied and used in applications for a long time, and they are still a subject of active research. Yet you will find plenty of new material here.

3 Conditions for Data Science Project Success

Your data science project will fail unless it meets these three conditions, aka ‘Will’s Rules’, without which your project doesn’t have a chance in …
Rule #1 – Will There Be a Genuine, Enthusiastic Sponsor?
Rule #2 – Will the Stakeholders and User Community Embrace the Change?
Rule #3 – Will You Have Domain Subject Matter Experts Available to Support the Project and Finished Product?

Automating Scientific Data Analysis Part 1

1. Create the Test Plan
2. Design the Dataset to Allow Automation
3. Create a Clear File Naming System
4. Store the Resulting Data Files in a Specific Folder
5. Analyze the Results of Individual Tests
6. Include Error Checking Options
7. Store the Data Logically
8. Generate Regressions From the Resulting Dataset
9. Validate the Results
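Steps 3 and 4 can be sketched in a few lines with the standard library. The naming fields below (test id, date, condition) and the `results` folder are hypothetical examples, not the article's actual scheme:

```python
# A consistent file-naming scheme plus a dedicated results folder.
from pathlib import Path

results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

def result_filename(test_id: int, date: str, condition: str) -> Path:
    """Encode the test metadata directly in the filename so that later
    analysis scripts can recover it by parsing, without a lookup table."""
    return results_dir / f"test{test_id:03d}_{date}_{condition}.csv"

path = result_filename(3, "2020-01-15", "hightemp")
print(path.name)  # test003_2020-01-15_hightemp.csv
```

Because every field has a fixed position and separator, a downstream script can split on `_` to group files by test, date, or condition automatically.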

Scikit-Learn Decision Trees Explained

Decision trees are the most important elements of a Random Forest. They are capable of fitting complex datasets while allowing the user to see how a decision was taken. While searching the web I was unable to find one clear article that could easily describe them, so here I am writing about what I have learned so far. It’s important to note that a single decision tree is not a very good predictor; however, by creating an ensemble of them (a forest) and collecting their predictions, one of the most powerful machine learning tools can be obtained: the so-called Random Forest.
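A minimal scikit-learn sketch of both points (assuming scikit-learn is installed; the built-in iris data and the depth limit are illustrative choices): fit one tree and print its rules to see exactly how each decision is taken.

```python
# Fit a single decision tree and inspect its learned split rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree))   # human-readable if/else split rules
print(tree.score(X, y))    # training accuracy of this one tree
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` ensembles many such trees, trading that readable rule listing for better predictive power.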

An introduction to Spark GraphFrame with examples analyzing the Wikipedia link graph

The Spark GraphFrame is a powerful abstraction for processing large graphs using distributed computing. It provides a plethora of common graph algorithms, including label propagation and PageRank. Further, it provides the foundations for implementing complex graph algorithms, including a robust implementation of the Pregel paradigm for graph processing. Anyone who’s interested in working with large graphs should learn how to apply this powerful tool. In this article, I’ll introduce you to the basics of GraphFrame and demonstrate how to use this tool through several examples. These examples consider the link graph between Wikipedia articles, and I demonstrate how to analyze this graph by leveraging the GraphFrame abstraction.
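GraphFrames itself needs a running Spark cluster, so as a stand-in here is a plain-Python power-iteration PageRank on a tiny, invented "article link" graph; it computes the same quantity that GraphFrame's `pageRank()` would, just without the distribution:

```python
# Power-iteration PageRank on a toy directed link graph.
links = {
    "TF-IDF": ["NLP", "Search"],
    "NLP": ["TF-IDF"],
    "Search": ["TF-IDF", "NLP"],
}
nodes = list(links)
rank = {n: 1 / len(nodes) for n in nodes}  # start uniform
d = 0.85                                   # damping factor

for _ in range(50):
    # Each page shares its rank equally among its outgoing links.
    new = {n: (1 - d) / len(nodes) for n in nodes}
    for src, outs in links.items():
        for dst in outs:
            new[dst] += d * rank[src] / len(outs)
    rank = new

print(max(rank, key=rank.get))  # the most "linked-to" page wins
```

The equivalent GraphFrame call would build vertex and edge DataFrames and run `g.pageRank(resetProbability=0.15, maxIter=50)`, letting Spark shard the same iteration across a cluster.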

Spam Filtering System With Deep Learning

Deep learning is getting very popular in many industries, and many interesting problems can be solved with deep learning technology. In this article, I will show you how to use a deep learning model to design a highly effective spam filtering system.
Not long ago, I wrote an article about spam filtering with a traditional machine learning algorithm.
In that article, I covered everything from data exploration and data preprocessing to feature extraction and choosing the right scoring metrics for your algorithm. If you are interested, you can read the article here!
Today, I am going to focus more on the following two parts when building a spam filtering system:
1. Word Embedding
2. GRU + Bidirectional Deep Learning Model
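Part 1 in miniature: a word embedding is just a lookup table mapping each vocabulary index to a dense vector, which the GRU then consumes one position at a time. This toy version uses plain Python lists with random vectors; real systems learn the vectors during training (e.g. with a Keras `Embedding` layer) or load pretrained ones such as GloVe. The tiny vocabulary and dimensions are invented for the example:

```python
# A word embedding as a plain lookup table over a toy vocabulary.
import random

random.seed(0)
vocab = {"<pad>": 0, "free": 1, "money": 2, "meeting": 3, "tomorrow": 4}
dim = 4  # embedding dimension
embedding = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(sentence, max_len=6):
    """Map words to ids (unknowns to <pad>), pad to a fixed length,
    then look each id up in the embedding table."""
    ids = [vocab.get(w, 0) for w in sentence.split()][:max_len]
    ids += [0] * (max_len - len(ids))
    return [embedding[i] for i in ids]  # a (max_len, dim) matrix

vectors = embed("free money tomorrow")
print(len(vectors), len(vectors[0]))  # 6 4
```

The resulting fixed-shape matrix is exactly the input a GRU (bidirectional or not) expects for each message.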

Language Translation with RNNs

This post explores my work on the final project of the Udacity Artificial Intelligence Nanodegree program. My goal is to help other students and professionals who are in the early phases of building their intuition in machine learning (ML) and artificial intelligence (AI). With that said, please keep in mind that I am a product manager by trade (not an engineer or data scientist). So, what follows is meant to be a semi-technical yet approachable explanation of the ML concepts and algorithms in this project. If anything covered below is inaccurate or if you have constructive feedback, I’d love to hear from you.

Gradient-Free Optimization for GLMNET Parameters

In the post https://…/variable-selection-with-elastic-net, it was shown how to optimize the hyper-parameters of glmnet, namely alpha and gamma, by using the built-in cv.glmnet() function. However, following the logic of hyper-parameter optimization shown in the post https://…/direct-optimization-of-hyper-parameter, we can directly optimize the alpha and gamma parameters of glmnet with gradient-free optimizations, such as the Nelder-Mead simplex or particle swarm. Unlike traditional gradient-based optimizations, gradient-free optimizations can often find close-to-optimal solutions that are considered ‘good enough’ from an empirical standpoint, including in many cases that gradient-based approaches cannot solve due to noisy and discontinuous functions.
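The mechanics can be sketched with SciPy's Nelder-Mead implementation (assuming SciPy is available). The quadratic objective below is a smooth placeholder standing in for glmnet's cross-validated error, which in practice you would compute by refitting the model at each candidate (alpha, gamma); its optimum at (0.3, 0.7) is an arbitrary choice for the example:

```python
# Gradient-free minimization of a stand-in "CV error" over (alpha, gamma).
import numpy as np
from scipy.optimize import minimize

def cv_error(params):
    alpha, gamma = params
    # Hypothetical error surface; replace with the model's actual CV error.
    return (alpha - 0.3) ** 2 + (gamma - 0.7) ** 2

result = minimize(cv_error, x0=[0.5, 0.5], method="Nelder-Mead")
print(np.round(result.x, 3))  # close to [0.3, 0.7]
```

Because Nelder-Mead only ever evaluates the objective, never its gradient, the same call works even when the CV-error surface is noisy or discontinuous.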

Creating Bots That Sound Like Humans With Natural Language Processing

Sentiment Analysis with Deep Learning

In this article, I will cover the topic of sentiment analysis and show how to implement a deep learning model that can recognize and classify human emotions in Netflix reviews.