Startups bringing analytics and data science closer to you!

Data Science and analytics are changing every industry as you read this article! If you have been following Analytics Vidhya lately, I am sure you know how strongly I believe in it. You can look at some of the exciting start-ups coming out, which are using analytics and data science to solve problems more efficiently than ever before. However, finding the right talent and tools is still difficult for a lot of companies wanting to use analytics. For example, how many SMEs can use dashboards to make real-time decisions? Or how many bloggers across the world know the reading habits of their readers? This week I bring you a collection of products and companies that are making analytics more accessible to the end user. They take data from customers, do the heavy lifting and deliver results in a simple and impactful manner – what more could a business owner ask for!

Affinity Analysis: Cost Effective Data Science for Smaller Banks and Credit Unions

Much of data science is not economical for smaller businesses. Affinity Analysis is low cost and can offer new and actionable insights.

Grave Mistakes that Companies Make in Big Data Projects

• Lack of Business Case
• Minimizing Data Relevance
• Underestimating Data Quality
• Overlooking Data Granularity
• Improper Contextualization of Data
• Ignoring Data Preparation

Can Context Extraction replace Sentiment Analysis?

Sentiment analysis is hard. Most systems on the market will score somewhere around 55–65% on unseen data, even though they might be 85%+ accurate in their cross-validations.
A couple of reasons why creating a generic sentiment analyser is tough:
• There is too much variation in text across domains, leading to different meanings
• Identifying sarcasm and combinations of phrases is difficult: ‘not bad’ is not equal to ‘not’ AND ‘bad’
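The negation problem above can be made concrete with a minimal sketch. The tiny lexicon and both scorer functions below are hypothetical, purely for illustration; they show how word-level scoring gets ‘not bad’ wrong while even a crude negation rule recovers the intended reading.

```python
# Tiny illustrative lexicon: word -> sentiment score (an assumption,
# not taken from any real sentiment library).
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def naive_score(text):
    """Sum per-word scores, ignoring all context."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def negation_aware_score(text):
    """Flip the sign of a word's score when it directly follows 'not'."""
    score, negate = 0, False
    for w in text.lower().split():
        if w == "not":
            negate = True
            continue
        s = LEXICON.get(w, 0)
        score += -s if negate else s
        negate = False
    return score

# The naive scorer treats 'not bad' as 'not' AND 'bad' (negative),
# while the negation-aware scorer recovers the positive reading.
print(naive_score("not bad"))           # -1
print(negation_aware_score("not bad"))  # 1
```

Real systems handle far more than single-token negation (scope, intensifiers, sarcasm), which is exactly why generic accuracy drops so sharply on unseen domains.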

The Data Science Workflow

When dealing with data, it helps to have a well defined workflow. Specifically, whether we want to perform an analysis with the sole intent of ‘telling the story’ (Data Visualisation/Journalism) or build a system that relies on data to model a certain task (Data Mining), process matters. By defining a methodology in advance, teams are in sync and it is easier to avoid losing time trying to figure out what the next step should be. This enables a faster production of results and publication of materials. With that in mind, and following the previous blogpost about the Ashley Madison leak data analysis, we saw an opportunity to show the workflow that we are currently using. This workflow is used not only to analyse data leaks (such as the case of AshMad), but also to analyse our own internal data. It is important to mention however, that this workflow is a work in progress, in the sense that can be subjected to changes over time in order to obtain results more effectively.

Density-Based Clustering

In this blog post, I will cover a family of techniques known as density-based clustering. Compared to centroid-based clustering like K-Means, density-based clustering works by identifying ‘dense’ clusters of points, allowing it to learn clusters of arbitrary shape and identify outliers in the data.
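To make the contrast with K-Means concrete, here is a minimal pure-Python DBSCAN-style sketch (illustrative only, not an optimized implementation): points with at least `min_pts` neighbours within `eps` are core points, clusters grow from cores, and anything unreachable from a core is labelled an outlier (-1).

```python
from math import dist

def dbscan(points, eps=1.0, min_pts=3):
    """Label each point with a cluster id, or -1 for outliers."""
    labels = [None] * len(points)          # None = unvisited
    cluster = -1
    neighbours = lambda i: [j for j in range(len(points))
                            if dist(points[i], points[j]) <= eps]
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # tentative outlier
            continue
        cluster += 1                       # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point, reclaimed from outliers
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = neighbours(j)
            if len(j_neigh) >= min_pts:    # j is also a core point: expand
                queue.extend(j_neigh)
    return labels

# Two dense blobs plus one far-away outlier:
pts = [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (10.5, 10), (10, 10.5), (50, 50)]
print(dbscan(pts, eps=1.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Note that neither the number of clusters nor their shape is specified in advance; density alone determines both, which is what lets these methods find arbitrarily shaped clusters and flag outliers.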

Time Maps: Visualizing Discrete Events Across Many Timescales

Discrete events pervade our daily lives. These include phone calls, online transactions, and heartbeats. Despite the simplicity of discrete event data, it’s hard to visualize many events over a long time period without hiding details about shorter timescales.
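The core of a time map, as I understand it from the description, is simple: for each event, plot the time elapsed since the previous event against the time until the next event; log-scaled axes then make both rapid bursts and long quiet gaps visible on one chart. A minimal sketch of the coordinate computation (the event data is made up):

```python
def time_map_coords(timestamps):
    """Return (time_since_previous, time_until_next) pairs
    for each interior event in a sorted event stream."""
    ts = sorted(timestamps)
    return [(ts[i] - ts[i - 1], ts[i + 1] - ts[i])
            for i in range(1, len(ts) - 1)]

# Event times in seconds: a rapid burst followed by long quiet gaps.
events = [0, 1, 2, 3, 1000, 2000]
print(time_map_coords(events))  # [(1, 1), (1, 1), (1, 997), (997, 1000)]
```

Burst events land near the origin, isolated events land far out along both axes, and asymmetric points mark the start or end of a burst.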

Bayesian Data Analysis demos for Python

This repository contains some Python demos for the book Bayesian Data Analysis, 3rd ed by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin (BDA3). Currently there are demos for BDA3 Chapters 2, 3, 4, 5, 6, 10 and 11. Furthermore there is a demo for PyStan. All demos have also notebook versions (.ipynb). When these are viewed in github, the pre-generated figures are directly shown without need to install anything. The demos were originally written for Matlab by Aki Vehtari and translated to Python by Tuomas Sivula.

Deep Learning Libraries by Language

Distance between Latitude and Longitude Coordinates in SQL

Pretty much any language commonly used for data analysis (R, SAS, Python) can calculate the distance between two geographic coordinates with relative ease. But sometimes you don’t want to have to pull your data out of your data warehouse to do your dirty work. If you’ve got a spatially enabled version of Postgres or SQL Server, you’re in business. But if not, you’ll have to get a little messy.
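The "messy" version boils down to the standard haversine great-circle formula, sketched below in Python; the same arithmetic can be transcribed into plain SQL with the database's trig functions when no spatial extension is available. The example coordinates are approximate city centres chosen for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points,
    using a mean Earth radius of 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))

# New York (40.7128, -74.0060) to London (51.5074, -0.1278),
# roughly 5,570 km by great circle:
print(round(haversine_km(40.7128, -74.0060, 51.5074, -0.1278)))
```

Since the formula is just trig and arithmetic, it drops straight into a SELECT clause; precompute the radian conversions if you are scanning a large table.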

Clojure Machine Learning, Math & Statistical Libraries Collection

Clojure Machine Learning Libraries

Social Network Analysis reveals the alternative list of global power elite

Forbes magazine has been publishing the list of The World’s Most Powerful People since 2009. The number of people on the list is proportional to the global population, with one slot for every 100 million people on Earth. When the list started in 2009, there were 67 people on it, and the latest list, from 2014, had 72. According to Forbes, the list is calculated based on a person’s influence over lots of other people (e.g. Pope Francis; Wal-Mart CEO Doug McMillon), the financial resources they control (e.g. GDP, market capitalization, profits, assets, revenues and net worth), power in multiple spheres (e.g. Bill Gates), and active use of power (e.g. Vladimir Putin).

Most popular “Statistical Analysis and Data Mining” Papers

Statistical Analysis and Data Mining: The ASA Data Science Journal, edited by David Madigan. The most accessed papers from Statistical Analysis and Data Mining published in 2014–2015.

Recipe for Centered Horizontal Stacked Barplots (Useful for Likert scale responses)

There is a nice package and paper about this here: http://…/paper. However, the associated code is complex and uses lattice. Here is a brief recipe using base graphics that reproduces the figure above.
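The linked recipe is in R, but the centring arithmetic behind any diverging/centred stacked barplot is the same everywhere: shift each bar left by its disagreement counts plus half of its neutral counts, so the neutral category straddles zero. A minimal sketch of that computation (the response counts are made up for illustration):

```python
def centered_offsets(counts):
    """counts: [strongly disagree, disagree, neutral, agree, strongly agree].
    Returns the left edge of each segment so the neutral block is
    centred on zero when the segments are stacked left to right."""
    left = -(counts[0] + counts[1] + counts[2] / 2)
    edges = []
    for c in counts:
        edges.append(left)
        left += c
    return edges

row = [2, 5, 4, 6, 3]         # responses to one Likert question
print(centered_offsets(row))  # [-9.0, -7.0, -2.0, 2.0, 8.0]
```

Feed these edges to any horizontal bar routine (e.g. as the bar's starting x positions) and the stacked segments line up with neutral centred, which is what makes agree/disagree balance visually comparable across questions.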

Examining Website Pathing Data Using Markov Chains

A Markov model can be used to examine a stochastic process: a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
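For website pathing, that means estimating the probability of the next page from the current page alone. A minimal sketch (the page names and paths are hypothetical) of building a first-order transition table from observed sessions:

```python
from collections import Counter, defaultdict

def transition_probs(paths):
    """Estimate P(next page | current page) from observed page sequences."""
    counts = defaultdict(Counter)
    for path in paths:
        for cur, nxt in zip(path, path[1:]):
            counts[cur][nxt] += 1
    return {page: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for page, c in counts.items()}

paths = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "exit"],
]
probs = transition_probs(paths)
print(probs["home"])    # {'search': 0.666..., 'product': 0.333...}
print(probs["search"])  # {'product': 0.5, 'exit': 0.5}
```

With the transition matrix in hand, you can simulate paths, find likely drop-off pages, or compute the stationary distribution of page visits.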