TensorFlow does not change the world. But it appears to be the best, most convenient deep learning library out there.
Enough with the theory we recently published; let’s take a break and have fun with an application of statistics used in data mining and machine learning: k-means clustering. k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. We will apply this method to an image, grouping its pixels into k different clusters.
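As a rough illustration of the assign-to-nearest-mean idea described above, here is a minimal pure-Python sketch. It is not the code from the linked post: the function names (`kmeans`, `squared_distance`), the fixed random seed, and the six toy RGB "pixels" are all assumptions made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50, seed=0):
    """Plain Lloyd-style k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k observations picked at random (one common initialisation).
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centroids, labels


# Hypothetical toy "image": three reddish and three bluish RGB pixels.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 245),
          (20, 5, 250), (245, 15, 5), (15, 20, 240)]
centroids, labels = kmeans(pixels, k=2)

# Replace every pixel by its (rounded) cluster mean to get a 2-colour image.
quantized = [tuple(round(v) for v in centroids[lab]) for lab in labels]
```

For a real image you would flatten the pixel array into such a list of RGB tuples first; a vectorised library implementation would be far faster than this loop-based sketch.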
I’ve been working on a dashboard and found a data set that would look great represented as a discrete sparkline. This type of visualization is great for quickly showing trends at a glance in discrete variable datasets.
Generating data usually requires a variance-covariance matrix and is therefore restricted by a linear assumption between the variables. However, assuming linearity can miss important non-linear relationships. This post uses quantile random forests to simulate an arbitrary dataset.
1. Random Forest Method
2. Relative Importance
5. Step-wise Regression
7. Information value and Weight of evidence
This is the second post in a series of me trying to learn something new over a short period of time. The first time consisted of learning how to do machine learning in a week. This time I’ve tried to learn neural networks. While I didn’t manage to do it within a week, due to various reasons, I did get a basic understanding of it throughout the summer and autumn of 2015. By basic understanding, I mean that I finally know how to code simple neural networks from scratch on my own. In this post, I’ll give a few explanations and guide you to the resources I’ve used, in case you’re interested in doing this yourself.
In this study, I evaluate some popular deep learning toolkits. The candidates are listed in alphabetical order: Caffe, CNTK, TensorFlow, Theano, and Torch. This is a dynamic document and the evaluation, to the best of my knowledge, is based on the current state of their code. I also provide ratings in some areas because a lot of people find ratings useful. However, keep in mind that ratings are inherently subjective.
Regression analysis is used to estimate the association between a variable of interest and an outcome. The methods that we include in this category are linear regression, logistic regression, and Cox regression.
This document outlines the specific training features and the practicalities of how to use them in DeepLearning4J. It is not an introduction to recurrent neural networks, and it assumes some familiarity with both their use and terminology. If you are new to RNNs, read A Beginner’s Guide to Recurrent Networks and LSTMs before proceeding with this page.
One of the themes of the Christmas movie classic Love Actually is the interconnections between people of different communities and cultures, from the Prime Minister of the UK to a young student in London. StackOverflow’s David Robinson brings these connections to life by visualizing the network diagram of 20 characters in the movie, based on scenes in which they appear together:
The R Academy of eoda is a modular course program for the R statistical language with regular events and training sessions. Our course instructors have been working with data analysis for over 10 years. The course concept aims to train you to become an R expert. Depending on your needs and interests, you can choose from a variety of course modules. There is no strict hierarchy, and the modules can be combined individually. Our R training courses at universities and graduate centers, as well as for companies, are regularly evaluated and rated very well.
On my December to-do list, I had “write an R package to make analytic hierarchy process (AHP) easier”, but fortunately gluc beat me to it and saved me tons of time, which I spent using AHP on an actual research problem. First of all, thank you for writing the new ahp package! Next, I’d like to show everyone just how easy this package makes performing AHP and displaying the results. We will use the Tom, Dick, and Harry example that is described on Wikipedia: the goal is to choose a new employee, and you can pick either Tom, Dick, or Harry. Read the problem statement on Wikipedia before proceeding.
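For readers who have not met AHP before, the core step is turning a matrix of pairwise comparisons into priority weights. The sketch below uses the row geometric mean method, a common approximation to the principal-eigenvector weights; it is not the ahp package's API, and the comparison values are hypothetical numbers invented for illustration, not the figures from the Wikipedia example.

```python
import math


def ahp_priorities(matrix):
    """Approximate AHP priority weights via row geometric means.

    matrix[i][j] says how strongly option i is preferred over option j,
    with matrix[j][i] == 1 / matrix[i][j] and ones on the diagonal.
    """
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical judgments for three candidates on a single criterion:
# A is preferred 2x over B and 4x over C; B is preferred 2x over C.
candidates = ["A", "B", "C"]
comparisons = [
    [1.0, 2.0, 4.0],
    [1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 2, 1.0],
]

weights = ahp_priorities(comparisons)
ranking = sorted(zip(candidates, weights), key=lambda cw: cw[1], reverse=True)
```

Because this toy matrix is perfectly consistent, the geometric-mean weights coincide with the eigenvector weights; with inconsistent judgments the two methods can differ slightly, and a full AHP analysis would also report a consistency ratio.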
The comments on my post outlining recommended R usage for professional developers were universally scornful, with my recommendation to use subset receiving the greatest wrath. The main argument against using subset appeared to be that it went against existing practice; one comment linked to Hadley Wickham suggesting it was useful in an interactive session (and by implication not useful elsewhere). The commenters appeared to be knowledgeable R users, and I suspect they might have fallen into the trap of thinking that, having invested time in obtaining expertise in language intricacies, they ought to use these intricacies. Big mistake: the best way to make use of language expertise is to use it to avoid the intricacies, aiming to write simple, easy-to-understand code. Let’s use Hadley’s example to discuss the pros and cons of subset vs. array indexing (normally I have lots of data to help make my case, but usage data for R is thin on the ground).
R is not one of those languages where there is only one way of doing something; the language is blessed/cursed with lots of ways of doing the same thing. Teaching R to professional developers is easy in the sense that their fluency with other languages will enable them to soak up this small language like a sponge on the day they learn it. The problems will start a few days later, after they have been programming in another language and go back to using R; what they learned about R will have become entangled in their general language knowledge, and they will be reduced to trial and error to figure out how things work in R (a common problem I have with languages I have not used in a while is remembering whether the if-statement has a then keyword or not).
In this post, we’ll look at a simple method to identify segments of an image based on RGB color values. The segmentation technique we’ll consider is called color quantization. Not surprisingly, this topic lends itself naturally to visualization and R makes it easy to render some really cool graphics for the color quantization problem.
I am delighted to announce the launch of AV Blogathon. We are inviting bloggers to position themselves as data hackers & thought leaders in the industry. Don’t get misled by the term ‘blogger’. For us, you are a ‘blogger’ if you know the art of expressing yourself in writing. All you have to do is pen down your article and make a submission. We will not only publish the best articles on Analytics Vidhya, but also provide feedback to every writer about their article. We have grown this community through blogs, and if there is someone who would understand the value of a well-written article, it is us!
Moritz Stefaner started off 2016 with a very spiffy post on “a visual exploration of the spatial patterns in the endings of German town and village names”. Moritz was exploring some new data processing & visualization tools for the post, but when I saw what he was doing I wondered how hard it would be to do something similar in R and also used it as an opportunity to start practicing a new habit in 2016: packages vs projects.
In the previous post (https://…/the-power-of-decision-stumps), it was shown that the boosting algorithm performs extremely well even with a simple 1-level stump as the base learner and provides a better performance lift than the bagging algorithm does. However, this observation shouldn’t be generalized, as the following example demonstrates. First of all, we developed a rule-based PART model as below. Albeit pruned, this model still tends to over-fit the data, as shown in the highlighted.
Matt Parker recently showed us how to create multi-tab reports with R and jQuery UI. His example was absurdly easy to reproduce; it was a great blog post. I have been teaching myself Shiny in fits and starts, and I decided to attempt to reproduce Matt’s jQuery UI example in Shiny. You can play with the app on shinyapps.io, and the complete project is up on Github. The rest of this post walks through how I built the Shiny app.
A decision stump is a weak classification model with a simple tree structure consisting of one split, which can also be considered a one-level decision tree. Due to its simplicity, the stump often demonstrates low predictive performance. As shown in the example below, the AUC measure of a stump is even lower than that of a single attribute in a separate testing dataset.
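To make the "one split" idea concrete, here is a minimal sketch of a stump for a single numeric feature and binary 0/1 labels. It is an illustration only, not the model from the linked post: the function names (`fit_stump`, `predict`), the misclassification-count criterion, and the toy data are assumptions chosen for this example.

```python
def fit_stump(xs, ys):
    """Find the single threshold split with the fewest training errors.

    Returns (threshold, left_label, right_label, errors), or None when
    every x value is identical and no split is possible.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in thresholds:
        # Try both ways of labelling the two sides of the split.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= t else right) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[3]:
                best = (t, left, right, errors)
    return best


def predict(stump, x):
    """Classify one value with a fitted stump."""
    threshold, left, right, _ = stump
    return left if x <= threshold else right


# Hypothetical toy data: small values are class 0, large values are class 1.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

stump = fit_stump(xs, ys)
```

A stump over several attributes would repeat this search per attribute and keep the best split overall; boosting then combines many such weak learners.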
A week ago my high school friend, @XLRunner, sent me a link to the article ‘How Zach Bitter Ran 100 Miles in Less Than 12 Hours’. Zach’s effort was rewarded with the American record for the 100 mile event. This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Association of Athletics Federations (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). Distances longer than 42km do not fall in the IAAF event list, but they are tracked by various other organisations.
Top recent deep learning papers on arXiv are presented, summarized, and explained with the help of a leading researcher in the field.