The Essential NLP Guide for data scientists (with codes for top 10 common NLP tasks)

Organizations today deal with a huge amount and a wide variety of data: calls from customers, their emails, tweets, data from mobile applications and what not. It takes a lot of effort and time to make this data useful. One of the core skills in extracting information from text data is Natural Language Processing (NLP). NLP is the art and science of extracting information from text and using it in our computations and algorithms. Given the increase in content on the internet and social media, it is one of the must-have skills for all data scientists out there. Whether you know NLP or not, this guide should serve as a ready reference. In this guide, I have provided resources and code to run the most common NLP tasks.

When to Categorize Continuous Predictor in a Regression Model?

Researchers in some fields routinely categorize continuous predictor variables, and they tend to be the same researchers who mostly use ANOVA. They often do it through median splits: values above the median form the high group and values below the median form the low group.
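As a minimal sketch of what a median split does, the snippet below labels each value of a continuous predictor as high or low. The `ages` values and the `median_split` helper are made up for illustration and do not come from the article.

```python
from statistics import median

def median_split(values):
    """Label each value 'high' if it lies above the median, else 'low'."""
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]

# Hypothetical predictor values, chosen only for illustration.
ages = [23, 35, 41, 52, 58, 64]
labels = median_split(ages)
print(labels)
```

Note that 41 and 52 end up in different groups while 23 and 41 share one, even though the first pair is closer together. That loss of information is the usual argument against categorizing a continuous predictor.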

From social media to public health surveillance: Word embedding based clustering method for twitter classification

Social media provide a low-cost alternative source for public health surveillance, and health-related classification plays an important role in identifying useful information. In this paper, we summarize the recent classification methods using social media in public health. These methods rely on the bag-of-words (BOW) model and have difficulty grasping the semantic meaning of texts. Unlike these methods, we present a word embedding based clustering method. Word embedding is one of the strongest trends in Natural Language Processing (NLP) at this moment. It learns the optimal vectors from surrounding words, and the vectors can represent the semantic information of words. A tweet can be represented as a few vectors and divided into clusters of similar words. According to similarity measures of all the clusters, the tweet can then be classified as related or unrelated to a topic (e.g., influenza). Our simulations show good performance, and the best accuracy achieved was 87.1%. Moreover, the proposed method is unsupervised. It does not require labor to label training data and can be readily extended to other classification problems or other diseases.

September Kaggle Dataset Publishing Awards Winners’ Interview

This interview features the stories and backgrounds of the September winners of our $10,000 Datasets Publishing Award: Khuram Zaman, Mitchell J, and Dave Fisher-Hickey. If you’re inspired to publish your own datasets on Kaggle and vie for next month’s prize, check out this page for more details.

Neural Network Foundations, Explained: Updating Weights with Gradient Descent & Backpropagation

In neural networks, connection weights are adjusted in order to help reconcile the differences between the actual and predicted outcomes for subsequent forward passes. But how, exactly, do these weights get adjusted?
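To make the weight update concrete, here is a minimal sketch for a single linear neuron trained with gradient descent on a squared-error loss. The function name `train_step`, the learning rate, and the training example are assumptions chosen for illustration, not taken from the article. With only one neuron, backpropagation reduces to a single application of the chain rule.

```python
def train_step(w, b, x, y_true, lr=0.1):
    """One gradient-descent update for the model y = w * x + b."""
    y_pred = w * x + b            # forward pass
    error = y_pred - y_true       # difference between predicted and actual
    loss = 0.5 * error ** 2       # squared-error loss
    grad_w = error * x            # dL/dw via the chain rule
    grad_b = error                # dL/db via the chain rule
    # Move each parameter a small step against its gradient.
    return w - lr * grad_w, b - lr * grad_b, loss

# Hypothetical single training example: input 2.0, target 4.0.
w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, y_true=4.0)
    losses.append(loss)

print(f"w={w:.4f}, b={b:.4f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.2e}")
```

Each pass shrinks the error, so the prediction on the next forward pass is closer to the target. A real network repeats the same idea layer by layer, passing gradients backwards from the output.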

Analysing Cryptocurrency Market in R

The cryptocurrency market has been growing so rapidly that, as an analyst, I was intrigued by what it comprises. In this post, I’ll explain how we can analyse the cryptocurrency market in R with the help of the package coinmarketcapr. The coinmarketcapr package is an R wrapper around the coinmarketcap API.

Linear Regression in Python; Predict The Bay Area’s Home Prices

I chose the Bay Area housing price dataset that was sourced from the Bay Area Home Sales Database and Zillow. This dataset covers homes sold between January 2013 and December 2015. It has many features to learn from. The dataset can be downloaded from here.

Machine Learning Algorithms: Which One to Choose for Your Problem

When I was starting out in data science, I often faced the problem of choosing the most appropriate algorithm for my specific problem. If you’re like me, when you open an article about machine learning algorithms, you see dozens of detailed descriptions. The paradox is that they don’t make the choice any easier. In this article for Statsbot, I will try to explain basic concepts and give some intuition for using different kinds of machine learning algorithms in different tasks. At the end of the article, you’ll find a structured overview of the main features of the described algorithms.

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses on the proximity and density of observations to form clusters. This is very different from KMeans, where an observation becomes part of the cluster represented by the nearest centroid. DBSCAN clustering can identify outliers, that is, observations which don’t belong to any cluster. Since DBSCAN also identifies the number of clusters, it is very useful for unsupervised learning when we don’t know how many clusters there could be in the data. K-Means clustering may cluster loosely related observations together. Every observation eventually becomes part of some cluster, even if the observations are scattered far apart in the vector space. Since clusters depend on the mean value of cluster elements, each data point plays a role in forming the clusters. A slight change in the data points might affect the clustering outcome. This problem is greatly reduced in DBSCAN due to the way clusters are formed.
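The behaviour described above can be sketched with a small pure-Python DBSCAN. This is an illustrative implementation under stated assumptions, not the article's code: the parameter names `eps` and `min_pts`, the `NOISE` marker, and the sample points are all chosen here for the example.

```python
from math import dist

NOISE = -1  # label for observations that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # not dense enough to start a cluster
            continue
        cluster += 1                      # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors) # j is also a core point: keep expanding
    return labels

# Two dense groups and one far-away observation (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters is never passed in; it falls out of the density parameters. The isolated point at (50, 50) is left as noise, whereas K-Means would have to assign it to one of the centroids and shift that centroid in the process.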

Neurosurgeon: collaborative intelligence between the cloud and the mobile edge

For a whole class of new intelligent personal assistant applications that process images, videos, speech, and text using deep neural networks, the common wisdom is that you really need to run the processing in the cloud to take advantage of powerful clusters of GPUs. Intelligent personal assistants on mobile devices such as Siri, Google Now, and Cortana all fall into this category, and all perform their computations in the cloud. There’s no way, so the thinking goes, that you could do this kind of computation on the device with reasonable latency and energy consumption. What Neurosurgeon shows us is that the common wisdom is wrong! In a superbly written paper, the authors demonstrate that by intelligently splitting the computation between the cloud and the mobile device, we can reach solutions that are better for everyone.
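The idea of splitting a network between device and cloud can be illustrated with a simplified sketch that tries every split point and keeps the one with the lowest end-to-end latency. This is not the paper's actual algorithm: the function `best_split`, the per-layer timings, the data sizes, and the bandwidth are all hypothetical numbers, and the sketch ignores energy consumption and the cost of sending the result back.

```python
def best_split(mobile_ms, cloud_ms, output_kb, input_kb, bandwidth_kb_per_ms):
    """Return (split, latency_ms): layers [0, split) run on the device,
    layers [split, n) run in the cloud."""
    n = len(mobile_ms)
    best = None
    for split in range(n + 1):
        on_device = sum(mobile_ms[:split])
        in_cloud = sum(cloud_ms[split:])
        if split == n:
            transfer = 0.0                       # everything ran on the device
        else:
            # Upload the raw input, or the output of the last on-device layer.
            data_kb = input_kb if split == 0 else output_kb[split - 1]
            transfer = data_kb / bandwidth_kb_per_ms
        total = on_device + transfer + in_cloud
        if best is None or total < best[1]:
            best = (split, total)
    return best

# Hypothetical 4-layer network; all numbers are invented for illustration.
mobile_ms = [5, 5, 20, 30]      # per-layer latency on the device
cloud_ms = [1, 1, 2, 3]         # per-layer latency in the cloud
output_kb = [400, 50, 20, 1]    # size of each layer's output
split, latency = best_split(mobile_ms, cloud_ms, output_kb,
                            input_kb=600, bandwidth_kb_per_ms=10)
print(split, latency)
```

With these made-up numbers, running the first two layers on the device beats both the cloud-only option (split 0) and the device-only option (split 4), because the early layers shrink the data that has to be uploaded.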