R vs Python: head to head data analysis
There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit. We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head to head matchup started!
Visualising a Circular Density
This afternoon, Jean-Luc asked me for some help with an old post I published, minuit, l’heure du crime, and with some graphs published a few days later in another post, where I used a different visualisation.
Unsupervised Learning on Neural Network Outputs
The outputs of a trained neural network contain much richer information than just a one-hot classifier. For example, a neural network may give an image of a dog a one-in-a-million probability of being a cat, but that is still much larger than its probability of being a car. To reveal the hidden structure in these outputs, we apply two unsupervised learning algorithms, PCA and ICA, to the outputs of a deep convolutional neural network trained on the ImageNet of 1000 classes. The PCA/ICA embedding of the object classes reveals their visual similarity, and the PCA/ICA components can be interpreted as common features shared by visually similar object classes. As an application, we show that the learned PCA/ICA can be useful for zero-shot learning. Our new zero-shot learning method outperforms previous state-of-the-art methods on the ImageNet of over 20000 classes.
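As a rough illustration of the idea (not the paper's code), the sketch below runs PCA on a handful of made-up classifier outputs. The four class names, the probability values, and the helper `top_principal_component` are all assumptions chosen for illustration; the paper works with a real network and 1000 classes. The leading component is found by plain power iteration on the sample covariance matrix.

```python
import math

# Hypothetical softmax outputs over the classes [dog, cat, car, truck],
# one row per input image. These numbers are invented for illustration.
outputs = [
    [0.90, 0.08, 0.01, 0.01],  # image of a dog
    [0.10, 0.87, 0.02, 0.01],  # image of a cat
    [0.01, 0.01, 0.90, 0.08],  # image of a car
    [0.01, 0.02, 0.09, 0.88],  # image of a truck
]


def top_principal_component(rows, iters=200):
    """Return (column means, leading principal direction) of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Non-uniform start: the all-ones vector lies in the null space of the
    # covariance whenever every row sums to the same constant (as softmax does).
    v = [float(j + 1) for j in range(d)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data: no variance to explain
            break
        v = [x / norm for x in w]
    return means, v


means, component = top_principal_component(outputs)

# One-dimensional embedding: project each centered output onto the component.
projections = [sum((r[j] - means[j]) * component[j] for j in range(len(r)))
               for r in outputs]

for name, p in zip(["dog", "cat", "car", "truck"], projections):
    print(f"{name:>5}: {p:+.3f}")
```

On this toy data the first component places the two animal images on one side of zero and the two vehicle images on the other, which is the kind of similarity structure the abstract describes. The overall sign of a principal component is arbitrary, so only the relative grouping is meaningful.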
Book Review: Data Mining with Rattle and R
Wrapping things up, Data Mining with Rattle and R is not just about how to use Rattle to solve Data Mining problems. It also digs quite deep into a number of Data Mining and Machine Learning algorithms. As such it’s also a pretty handy reference. If you are looking for a way of transitioning from a point-and-click style of analysis to R, then I think that installing Rattle and getting hold of a copy of this book would be a good place to start. I think that the key thing to consider with regard to this book is that it does not set out to be an encyclopaedia on Machine Learning. Its focus is on using Rattle (and in the process it provides the necessary background information).
In this repository hosted on GitHub, the datadolph.in team is sharing all of the R codebase that it developed to analyze large quantities of data. The datadolph.in team has benefited tremendously from fellow R bloggers and other open source communities and is proud to contribute all of its codebase to the community. The codebase includes ETL and integration scripts for:
• R-Solr Integration
• R-Mongo Interaction
• R-MySQL Interaction
• Fetching, cleansing and transforming data
• Classification (identify column types)
• Default chart generation (based on simple heuristics and matching a dimension with a measure)
Survey of review spam detection using machine learning techniques
Online reviews are often the primary factor in a customer’s decision to purchase a product or service, and are a valuable source of information that can be used to determine public opinion on these products or services. Because of their impact, manufacturers and retailers are highly concerned with customer feedback and reviews. Reliance on online reviews gives rise to the potential concern that wrongdoers may create false reviews to artificially promote or devalue products and services. This practice is known as Opinion (Review) Spam, where spammers manipulate and poison reviews (i.e., making fake, untruthful, or deceptive reviews) for profit or gain. Since not all online reviews are truthful and trustworthy, it is important to develop techniques for detecting review spam. By extracting meaningful features from the text using Natural Language Processing (NLP), it is possible to conduct review spam detection using various machine learning techniques. Additionally, reviewer information, apart from the text itself, can be used to aid in this process. In this paper, we survey the prominent machine learning techniques that have been proposed to solve the problem of review spam detection and the performance of different approaches for classification and detection of review spam. The majority of current research has focused on supervised learning methods, which require labeled data, and labeled data is scarce when it comes to online review spam. Research on methods for Big Data is of interest, since there are millions of online reviews, with many more being generated daily. To date, we have not found any papers that study the effects of Big Data analytics for review spam detection. The primary goal of this paper is to provide a strong and comprehensive comparative study of current research on detecting review spam using various machine learning techniques and to devise methodology for conducting further investigation.
Resolving transactional access and analytic performance trade-offs
While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems – to eke out extra performance – will be harder to justify.
Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks
A series of experiments on ten real-world datasets shows that PBP is significantly faster than other techniques, while offering competitive predictive abilities. Our experiments also show that PBP provides accurate estimates of the posterior variance on the network weights.