20 Big Data Repositories You Should Check Out
This is an interesting listing created by Bernard Marr. I would add the following great sources:
Awesome Data Science
An open-source Data Science repository for learning data science and applying it to real-world problems.
Active Data Mining, Data Science blogs
Here are 85 or so active (recently updated) data mining, data science, and machine learning blogs.
How to Transition from Excel to R
In today’s increasingly data-driven world, business people are constantly talking about how they want more powerful and flexible analytical tools, but they are usually intimidated by the programming knowledge these tools require and by the learning curve they must overcome just to reproduce what they already know how to do in the programs they’re accustomed to using. For most business people, the go-to tool for doing anything analytical is Microsoft Excel. If you’re an Excel user and you’re scared of diving into R, you’re in luck. I’m here to slay those fears! With this post, I’ll provide you with the resources and examples you need to get up to speed doing in R some of the basic things you’re used to doing in Excel. I’m going to spare you the countless hours I spent researching how to do this stuff when I first started, so that you feel comfortable enough to continue using R and learning about its more sophisticated capabilities.
Mapping the world with tweets
A few days ago, I collected 30 minutes of tweets from all around the world. I used the twitteR and streamR packages for this. The nice thing about those tweets is that they have geo-information associated with them. Not all of them, of course, but more than enough.
Customer segmentation – LifeCycle Grids with R
I want to share a very powerful approach for customer segmentation in this post. It is based on the customer’s lifecycle, specifically on the frequency and recency of purchases. The idea of using these metrics comes from RFM analysis. Recency and frequency are very important behavior metrics. We are interested in frequent and recent purchases because frequency affects a client’s lifetime value and recency affects retention. Therefore, these metrics can help us understand the current phase of the client’s lifecycle. When we know each client’s phase, we can split the customer base into groups (segments) in order to:
• understand the state of affairs,
• use the marketing budget effectively through accurate targeting,
• use different offers for every group,
• use email marketing effectively,
• and, finally, increase customers’ lifetime and value.
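To make the idea of a lifecycle grid concrete, here is a minimal sketch in Python (the post itself works in R). The purchase log, the reporting date, and the bucket boundaries are all made-up assumptions for illustration, not values taken from the post. Each customer is placed in a cell defined by how many purchases they have made (frequency) and how many days have passed since the last one (recency).

```python
from collections import Counter
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date).
purchases = [
    ("a", date(2015, 1, 5)),
    ("a", date(2015, 2, 20)),
    ("a", date(2015, 3, 1)),
    ("b", date(2014, 11, 2)),
    ("c", date(2015, 2, 27)),
    ("c", date(2015, 3, 2)),
]
report_date = date(2015, 3, 3)  # assumed reporting date


def frequency_segment(n_orders):
    # Assumed bucket boundaries; tune them to your own data.
    if n_orders == 1:
        return "1 purchase"
    if n_orders == 2:
        return "2 purchases"
    return "3+ purchases"


def recency_segment(days_since_last):
    # Assumed bucket boundaries; tune them to your own data.
    if days_since_last <= 7:
        return "0-7 days"
    if days_since_last <= 30:
        return "8-30 days"
    return "31+ days"


def lifecycle_grid(purchases, report_date):
    """Count customers in each (frequency, recency) cell."""
    orders = Counter()
    last_purchase = {}
    for customer, day in purchases:
        orders[customer] += 1
        last_purchase[customer] = max(day, last_purchase.get(customer, day))
    grid = Counter()
    for customer, n_orders in orders.items():
        days = (report_date - last_purchase[customer]).days
        grid[(frequency_segment(n_orders), recency_segment(days))] += 1
    return grid


grid = lifecycle_grid(purchases, report_date)
for cell, n_customers in sorted(grid.items()):
    print(cell, n_customers)
```

Each cell of the resulting grid is a candidate segment, so a different offer or email campaign can be attached to, say, frequent but lapsed customers than to recent first-time buyers.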
How-to go parallel in R – basics + tips
Today is a good day to start parallelizing your code. I’ve been using the parallel package since its integration with R (v. 2.14.0), and it’s much easier than it first seems. In this post I’ll go through the basics of implementing parallel computations in R, cover a few common pitfalls, and give tips on how to avoid them. The common motivation behind parallel computing is that something is taking too long. For me that means any computation that takes more than 3 minutes; this is because parallelization is incredibly simple and most tasks that take time are embarrassingly parallel. Here are a few common tasks that fit the description:
• Bootstrapping
• Cross-validation
• Multivariate Imputation by Chained Equations (MICE)
• Fitting multiple regression models
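Bootstrapping is a good illustration of why such tasks are called embarrassingly parallel: every replicate is independent of the others, so the replicates can simply be handed out to worker processes. The sketch below uses Python's standard `concurrent.futures` module as a stand-in for R's parallel package, which the post covers. The sample data, the function names, and the worker count are assumptions made up for this example.

```python
import random
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

# Hypothetical sample; in practice this would be your own data.
DATA = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9]


def bootstrap_mean(seed):
    """Draw one resample with replacement and return its mean."""
    rng = random.Random(seed)  # per-task seed keeps replicates reproducible
    resample = rng.choices(DATA, k=len(DATA))
    return mean(resample)


def bootstrap_means(n_replicates, workers=2):
    """Run the independent replicates across worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(bootstrap_mean, range(n_replicates)))


if __name__ == "__main__":
    means = sorted(bootstrap_means(1000))
    # Rough 95% percentile interval for the mean.
    print(means[25], means[974])
```

Giving each replicate its own seed means the result does not depend on which worker happens to run which task. For a computation this small, the cost of starting processes will likely outweigh the gain; the pattern pays off when each task is expensive.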
Data Mining finds JASBUG, a Critical Security Vulnerability
We explain how JASBUG, a critical Microsoft security vulnerability that existed for 15 years, was detected with similarity search and regular expression inference.