10 Areas of Expertise in Data Science

In this article, we walk you through ten areas of Data Science that are a key part of any project and that you need to master to work as a Data Scientist in most large organizations.
• Data Engineering
• Data Mining
• Cloud Computing
• Database Management
• Business Intelligence
• Machine Learning
• Deep Learning
• Natural Language Processing
• Data Visualization
• Domain Expertise

Build your First Multi-Label Image Classification Model in Python

In this article, I explain the idea behind multi-label image classification. We will then build our very own model using movie posters. You will be amazed by the impressive results our model generates. And if you’re an Avengers or Game of Thrones fan, there’s an awesome (spoiler-free) surprise for you in the implementation section. Excited? Good, let’s dive in!

Thunderstruck: Disaster CNN visualization of AC power lines

NET Centre at VŠB is trying to detect partial discharge patterns from overhead power lines by analyzing power signals. This Kaggle challenge was a fun one for any electrical power enthusiast. Ideally, we would be able to detect slowly increasing damage to the power lines before they suffer a power outage or start an electrical fire. However, there are many miles of power lines. Also, damage to a power line isn’t immediately apparent: small damage from almost anything (trees, high wind, manufacturing flaws, etc.) can be the start of cascading damage from discharges, which increases the likelihood of failure in the future. It is a great goal. If we can successfully estimate which lines need repairs, we can reduce costs while maintaining the flow of electricity. I mean, money talks.

Mathematical programming – a key habit to build up for advancing in data science

We show how, by simulating the random throw of a dart, you can compute the value of pi approximately. This is a small step towards building the habit of mathematical programming, which should be a key skill in the repertoire of a budding data scientist.
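
The dart-throwing idea can be sketched in a few lines of Python. This is a minimal illustration rather than the article's own code: the function name `estimate_pi` and its `seed` parameter are assumptions made for this example. Darts land uniformly in the unit square, and the fraction that falls inside the quarter circle of radius 1 approximates pi/4.

```python
import random
from typing import Optional


def estimate_pi(n_throws: int, seed: Optional[int] = None) -> float:
    """Estimate pi by throwing random darts at the unit square.

    Hypothetical helper for illustration; not taken from the article.
    """
    if n_throws <= 0:
        raise ValueError("n_throws must be positive")
    rng = random.Random(seed)  # local generator so results can be reproduced
    hits = 0
    for _ in range(n_throws):
        x = rng.random()  # dart position, uniform in [0, 1)
        y = rng.random()
        if x * x + y * y <= 1.0:  # dart landed inside the quarter circle
            hits += 1
    # Area of quarter circle / area of square = pi / 4
    return 4.0 * hits / n_throws


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, estimate_pi(n, seed=42))
```

The estimate is noisy: its error shrinks only in proportion to one over the square root of the number of throws, so each extra digit of accuracy costs roughly a hundred times more darts.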

Interactive Data Visualization with Vega

I’m always learning new visualization tools because this helps me identify the right one for the task at hand. When it comes to Data Visualization, d3 is usually the go-to choice, but recently I’ve been playing with Vega and I’m loving it. Vega introduces a visualization grammar. A grammar is basically a set of rules that dictate how to use a language, so we can think of Vega as a tool that defines a set of rules of how to build and manipulate visual elements. As my experience with data visualization grows, I’m finding more and more that constraints are a good thing. By introducing a visualization grammar, Vega gives us some constraints to work with. The best thing about it is that these constraints can make us feel very productive while building data visualizations. There is also Vega-Lite, a high-level grammar that focuses on rapid creation of common statistical graphics, but today we’ll stick with Vega which is a more general purpose tool. Ok, enough of introductions, let’s get an overview about how Vega works.

10 Tips to build a better modeling dataset for tree-based machine learning models

Assume there’s a business problem that can be converted to a machine learning problem with tabular data as its input and clearly defined labels and metrics (say, RMSE for regression problems or ROC AUC for classification problems). The dataset contains a bunch of categorical variables, numerical variables and some missing values, and a tree-based ML model (decision trees, random forests, or gradient boosting trees) is going to be built on top of it. Are there some tricks to improve the data before applying any ML algorithms on top of it? This process may vary a lot with the dataset. But I’d like to point out some general principles that could apply to a bunch of datasets, and also explain why. Some knowledge of tree-based ML algorithms may help the reader better digest part of the material.

End-To-End Topic Modeling in Python: Latent Dirichlet Allocation (LDA)

Topic Model: In a nutshell, it is a type of statistical model used for tagging abstract ‘topics’ that occur in a collection of documents that best represents the information in them. Many techniques are used to obtain topic models. This post aims to demonstrate the implementation of LDA: a widely used topic modeling technique.

Understanding Bayesian Inference with a simple example in R!

Last summer, the Royal Botanical Garden (Madrid, Spain) hosted the first edition of MadPhylo, a workshop about Bayesian Inference in phylogeny using RevBayes. It was a pleasure for me to be part of the organization staff with John Huelsenbeck, Brian Moore, Sebastian Hoena, Mike May, Isabel Sanmartin and Tamara Villaverde. The next edition of MadPhylo will be held June 10, 2019 to June 19, 2019 at the Real Jardín Botánico de Madrid. If you are interested in Bayesian Inference and phylogeny, you just can’t miss it! You’ll learn the RevBayes language, a programming language to perform phylogeny (and other) analyses under a Bayesian framework!

Generating Images with Autoencoders

In the following weeks, I will post a series of tutorials giving comprehensive introductions to unsupervised and self-supervised learning using neural networks for the purpose of image generation, image augmentation, and image blending. The topics include:
• Variational Autoencoders (VAEs) (this tutorial)
• Neural Style Transfer Learning
• Generative Adversarial Networks (GANs)

AI-Generated Rap Songs

I often tell my younger coworkers that the most boring way to start a blog post is, ‘This post is about …’ – unless of course you rap it!

Real-World Data Science Challenge: When Is ‘Good Enough’ Actually ‘Good Enough’

One of the biggest challenges that data scientists face when developing their analytic models is knowing when ‘good enough’ is actually ‘good enough’. And this problem is exacerbated by the flood of data (some important, most not important) being generated from IoT sensors.

Cellular Coverage and Crime: A Case Study in the UK Using Machine Learning

Mobile carriers, governments, and communities must plan and assess cellular infrastructure deployment on an ongoing basis. The current study attempts to augment this decision process by exploring the spatial relationship between cellular coverage and street crime using bootstrap and machine learning techniques. Five machine learning algorithms (i.e., Logistic Regression, Support Vector Machines, K-Nearest Neighbors, Gradient Boosting, and Gaussian Naive Bayes) are optimized to perform binary classification and predict whether or not a given area contains more (or less) than the mean number of cellular radios across the UK, based on observed change in select categories of street crime. Gaussian Naive Bayes performed the best in terms of overall predictive performance, with a precision of 92% and a recall of 97% on the target class. Due to data availability and other constraints, the current study focuses on the change in street crime that occurred between 2012 and 2014 (the roll-out of the first 4G network in the UK). The results of the study suggest that change in certain categories of street crime may be more (or less) correlated with cellular coverage than others. However, further analysis should be performed on other periods of time to broaden the scope of the conclusions, and additional socioeconomic and geographic variables should be included, such as population density and average household income. It is important to note that even though statistical correlation may be found between cellular coverage and street crime, no causal connection is made.

A-Z Machine Learning using Azure Machine Learning (AzureML)

In this course on Machine Learning using Azure Machine Learning, we will make it even more exciting and fun to learn, create and deploy machine learning models. We will go through every concept in depth. This course teaches not only the basics but also the advanced techniques of Data Processing, Feature Selection and Parameter Tuning that an experienced and seasoned Data Science expert typically deploys. Armed with these techniques, in a very short time, you will be able to match the results that an experienced data scientist can achieve.

Practical Introduction to Web Scraping in R

Are you trying to compare the prices of products across websites? Are you trying to monitor price changes every hour? Or planning to do some text mining or sentiment analysis on reviews of products or services? If yes, how would you do that? How do you get the details available on the website into a format in which you can analyse them?
• Can you copy/paste the data from their website?
• Can you see some save button?
• Can you download the data?
Hmmm... If you have these or similar questions on your mind, you have come to the right place. In this post, we will learn about web scraping using R. Below is a video tutorial which covers the initial part of this post.

How to use K-Means clustering in BigQuery ML to understand and describe your data better

BigQuery ML now supports unsupervised learning – you can apply the K-Means algorithm to group your data into clusters. Unlike supervised machine learning, which is about predictive analytics, unsupervised learning is about descriptive analytics – it’s about understanding your data so that you can make data-driven decisions.
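
To make the clustering idea concrete, here is a minimal plain-Python sketch of the standard K-Means procedure (Lloyd's algorithm). It is not BigQuery ML syntax and not the article's code: the function names, the `seed` parameter and the toy points are all assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points: Sequence[Point], k: int, n_iter: int = 100,
            seed: int = 0) -> Tuple[List[int], List[Point]]:
    """Group points into k clusters; returns (labels, centroids).

    Hypothetical helper for illustration only.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [tuple(p) for p in rng.sample(list(points), k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return new_labels, centroids


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, centroids = k_means(toy, k=2)
    print(labels)
    print(centroids)
```

No labels are supplied anywhere: the grouping comes only from distances between points, which is what makes the result descriptive rather than predictive.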

Exploring Recruitment Bias using Machine Learning and R

Using an experimental dataset for this case study, the key objectives of my work are to investigate only the Shortlisting stage of the recruitment process and:
• Conduct an exploratory data analysis of the recruitment data to determine patterns of Gender and Ethnicity through the recruitment stages
• Investigate if Gender and Ethnicity influence the applicant shortlisting process
• Apply Machine Learning to predict who will be Shortlisted and determine the key drivers
• Recommend updates to the Hiring strategy based on the findings

Econometrics behind Simple Linear Regression

One way to describe Machine Learning accurately is as mathematical optimization applied to real-world problems. Sometimes, when trying to solve real-world problems using Machine Learning, we may want to examine whether certain factors have any correlation with a certain outcome. For instance, whether a family’s weekly income has a correlation with the amount of money spent on food in a certain place over a period of time. In this particular example, the family’s weekly income is the predictor variable (independent variable X), and the amount spent on food is the response variable (dependent variable Y). Linear regression is the approach of forming a relationship between the dependent and independent variables. The simplest situation is to check whether a single factor has any relationship to a response; this is called simple linear regression. In this article, let’s look into the econometrics behind simple linear regression.
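
As a rough sketch of what fitting such a model involves, the ordinary least squares estimates for Y = a + bX can be computed directly: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The function name and the income and food figures below are invented for illustration and are not from the article.

```python
from typing import Sequence, Tuple


def fit_simple_linear_regression(x: Sequence[float],
                                 y: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least squares line y = a + b * x.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x (proportional to its variance).
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    # Sum of cross deviations (proportional to the covariance of x and y).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Made-up weekly income (X) and weekly food spending (Y) per family.
    income = [300.0, 450.0, 600.0, 750.0, 900.0]
    food = [95.0, 120.0, 150.0, 165.0, 200.0]
    a, b = fit_simple_linear_regression(income, food)
    print(f"food = {a:.2f} + {b:.4f} * income")
```

The slope reads as the expected change in food spending for each extra unit of weekly income, which describes an association in the data and not necessarily a causal effect.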