17 Trends in Big Data and Data Science
1. The rise of data plumbing
2. The rise of the data plumber, system architect, and system analyst
3. Use of data science in unusual fields such as astrophysics
4. The death of the fake data scientist
5. The rise of right-sized data (as opposed to big data)
6. Putting more intelligence (sometimes called AI or deep learning) into rudimentary big data applications
7. Increased awareness of data security and protection
8. The rise of mobile data exploitation
9. The rise of the ‘automated statistician’
10. Predictive modeling without models
11. High performance computing (HPC)
12. Increased collaboration between government agencies worldwide to standardize data and share it
13. Forecasting space weather and natural events
14. Use of data science for automated content generation
15. Measuring yield of big data or data science initiatives
16. Digital health: diagnostic/treatment offered by a robot
17. Analytic processes (even in batch mode) accessible from your browser anywhere on any device.
50 Excellent Data Visualization Examples for Presentations
Data presentation should be elegant, detailed, and beautiful. There are many ways to show data: pie charts, tables, histograms, and bar graphs. However, to send a clear and effective message to your readers, you often need more than a simple table or histogram. Some data visualization techniques present your data in ways that are more intelligent, beautiful, and original than you might expect. We have gathered some of the most attractive and unique data visualization examples.
Gold Mine or Blind Alley? Functional Programming for Big Data & Machine Learning
Functional programming is touted as a solution for big data problems. Why is it advantageous? Why might it not be? And who is using it now?
Model Management and Deployment
Have you experienced, or thought about, how corporations manage analytical assets that are mission critical to the business? A bank or a telecom service provider may have more than 100 predictive model assets developed over time, and faces the important question of how to effectively manage, store, share, or archive these assets. A very basic version of this in R is sketched below.
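As a minimal sketch of one way to version and archive a fitted model in R (the asset name, file layout, and metadata fields here are illustrative assumptions, not a reference to any particular organization's system):

# Fit an example model (illustrative data and formula)
model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

# Save the model object together with simple version metadata
meta <- list(
  name     = "churn_scorecard",   # hypothetical asset name
  version  = "1.0.0",
  trained  = Sys.time(),
  features = c("mpg", "wt")
)
saveRDS(list(model = model, meta = meta), file = "churn_scorecard_v1.0.0.rds")

# Later: reload the archived asset and score new data with it
asset <- readRDS("churn_scorecard_v1.0.0.rds")
predict(asset$model, newdata = mtcars[1:5, ], type = "response")

In practice a real model inventory would add access control, audit trails, and performance monitoring on top of this kind of versioned storage.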
How to Become a Data Scientist for Free
Here is my cheat sheet of becoming a Data Scientist for Free:
1. Understand Data
2. Understand Data Scientist
3. Watch these 13 Ted Videos …
4. Watch this video of Hans Rosling to understand the power of Visualization
5. Listen to weekly podcasts by Partially Derivative on Data Sciences
6. Look at University of Washington’s Intro to Data Science
7. Explore this GitHub Link and try to read as much as you can
8. Check out Measure for America to gain an understanding of how data can make a difference
9. Read the free book – ‘Field Guide to Data Sciences’
10. Religiously follow this infographic on how to become a data scientist
11. Read this blog to master your R skills
12. Read this blog to master your statistics skills
13. Read this wonderful practical intro to data sciences
14. Try to complete this open source data science Masters program
15. Do this Machine Learning course at Coursera
16. Complete this Data Science Specialization on Coursera
17. Complete this Data Mining Specialization
18. Check out these industry specific courses/links on data sciences
19. Cloud computing specialization training is a must
20. Do these courses on Mining Massive Datasets and Process Mining
21. 27 best data mining books for free
22. Try to read Data Science Central once a day
23. Try to compete in as many Kaggle competitions as you can
24. These statistics-driven courses will help you differentiate yourself from other applicants
25. Follow the following on Twitter for Predictive Analytics …
26. Follow the following on Twitter for Big Data and Data Sciences …
The Science of Crawl (Part 3): Prioritization
This is Part 3 in our series ‘The Science of Crawl’, addressing the technical challenges around growing and maintaining a web corpus. In Part 1, we introduced a funnel for deduplicating web documents within a search index, considering the dual problems of exact-duplicate and near-duplicate web document identification; by chaining together several methods of increasing specificity, we arrived at a system that provides sufficient precision and recall with minimal computational trade-offs. In Part 2, we visited the problem of balancing resource allocation between crawling new pages and revisiting stale ones.
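To make the funnel idea concrete, here is a toy illustration of the chaining pattern (not the production system described in the series): a cheap exact-hash check first, then a more expensive near-duplicate check, here Jaccard similarity over word shingles, applied only to the documents that survive the first stage.

library(digest)  # for content hashing

docs <- c(a = "the quick brown fox jumps over the lazy dog",
          b = "the quick brown fox jumps over the lazy dog",   # exact duplicate of a
          c = "the quick brown fox leaps over the lazy dog")   # near duplicate of a

# Stage 1: cheap exact-duplicate check via content hash
hashes <- vapply(docs, digest, character(1), algo = "md5")
exact_dupes <- duplicated(hashes)

# Stage 2: costlier near-duplicate check on the survivors,
# using Jaccard similarity over word shingles
shingles <- function(x, k = 3) {
  w <- strsplit(x, "\\s+")[[1]]
  vapply(seq_len(max(length(w) - k + 1, 1)),
         function(i) paste(w[i:min(i + k - 1, length(w))], collapse = " "),
         character(1))
}
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

survivors <- docs[!exact_dupes]
jaccard(shingles(survivors["a"]), shingles(survivors["c"]))  # high similarity flags a near duplicate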
Regression – covariate adjustment
Linear regression is one of the key concepts in statistics [wikipedia1, wikipedia2]. However, people often confuse the meaning of the parameters of linear regression: the intercept tells us the average value of y at x = 0, while the slope tells us how much change in y we can expect, on average, when we change x by one unit – exactly as in a linear function, except that we speak of averages here because of noise.
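A quick illustration of this interpretation in R, using a built-in dataset rather than anything from the original post:

# Regress stopping distance on speed from the built-in cars dataset
fit <- lm(dist ~ speed, data = cars)
coef(fit)
# (Intercept)       speed
#  -17.579095    3.932409
# The intercept is the average dist at speed = 0 (an extrapolation here), and the
# slope says dist increases by about 3.9 feet per 1 mph of speed, on average.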
Useful R snippets
In this post we collect several R one- or few-liners that we consider useful. As our minds tend to forget these little fragments we jot them down here so we will find them again.
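A few examples of the kind of one- or few-liners meant here (these particular snippets are illustrations of the genre, not necessarily the ones collected in the original post):

# Drop rows containing any NA
clean <- na.omit(airquality)

# Frequency table of a variable, sorted in decreasing order
sort(table(mtcars$cyl), decreasing = TRUE)

# Apply a function to every numeric column of a data frame
sapply(Filter(is.numeric, iris), mean)

# Read every CSV in a directory into one list (the path is hypothetical)
# files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# dfs   <- lapply(files, read.csv)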
Crawling facebook with R
So, let’s crawl some data out of facebook using R. Don’t get too excited though, this is just a weekend what-if project. Anyway, as an example, I want to download some photos where I’m tagged.
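A rough sketch of the kind of call involved, assuming you already have a valid user access token and that the Graph API photos edge is available to it (the endpoint, fields, and token handling are simplified here and may differ from what the original post does):

library(httr)
library(jsonlite)

token <- "YOUR_ACCESS_TOKEN"  # assumed: obtained separately from Facebook

# Ask the Graph API for photos the user is tagged in
resp   <- GET("https://graph.facebook.com/me/photos",
              query = list(access_token = token, fields = "id,source"))
photos <- fromJSON(content(resp, as = "text"))$data

# Download each photo file (URLs and availability depend on permissions)
for (i in seq_len(nrow(photos))) {
  download.file(photos$source[i],
                destfile = paste0("photo_", photos$id[i], ".jpg"), mode = "wb")
}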
Time series cross-validation: an R example
I was recently asked how to implement time series cross-validation in R. Time series people would normally call this ‘forecast evaluation with a rolling origin’ or something similar, but it is the natural and obvious analogue to leave-one-out cross-validation for cross-sectional data, so I prefer to call it ‘time series cross-validation’.
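A bare-bones version of a rolling origin in base R, using a simple AR(1) model for the one-step forecasts (swap in whatever model you actually care about; the window size is an arbitrary choice here):

y      <- as.numeric(AirPassengers)   # example series
k      <- 60                          # minimum training window
errors <- numeric(length(y) - k)

for (i in k:(length(y) - 1)) {
  train <- y[1:i]                            # everything up to the forecast origin
  fit   <- arima(train, order = c(1, 0, 0))  # simple AR(1); any model works here
  fc    <- predict(fit, n.ahead = 1)$pred    # one-step-ahead forecast
  errors[i - k + 1] <- y[i + 1] - fc         # out-of-sample error at this origin
}

sqrt(mean(errors^2))  # rolling-origin RMSE

Each iteration moves the forecast origin forward by one observation, so every error is computed on data the model has never seen – the time-series analogue of leaving one observation out.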
Cohort Analysis and LifeCycle Grids mixed segmentation with R
This is the third post about LifeCycle Grids. You can find the first post, on the purpose of LifeCycle Grids and the A-to-Z process for creating and visualizing them with the R programming language, here. And here is the second post, about adding monetary metrics (customer lifetime value – CLV – and customer acquisition cost – CAC) to the LifeCycle Grids.
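For readers who have not seen the earlier posts: at its core a LifeCycle Grid is a cross-tabulation of customers by frequency and recency bins. A stripped-down sketch, with made-up data and arbitrary bin boundaries rather than those used in the series, looks roughly like this:

set.seed(1)
customers <- data.frame(
  frequency = sample(1:10, 200, replace = TRUE),   # number of purchases
  recency   = sample(0:90, 200, replace = TRUE)    # days since last purchase
)

# Cut both dimensions into segments and cross-tabulate
customers$freq_seg <- cut(customers$frequency, breaks = c(0, 1, 3, 10),
                          labels = c("1", "2-3", "4+"))
customers$rec_seg  <- cut(customers$recency, breaks = c(-1, 30, 60, 90),
                          labels = c("0-30", "31-60", "61-90"))
table(customers$freq_seg, customers$rec_seg)   # the LifeCycle Grid counts

The segmentation and CLV/CAC layers discussed in the posts are built on top of exactly this kind of grid.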
Modeling Count Time Series with tscount Package
The example below shows how to estimate a simple univariate Poisson time series model with the tscount package. While the model estimation is straightforward and yields very similar parameter estimates to the ones generated with the acp package (https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i), the prediction mechanism is a bit tricky.
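For orientation, a minimal call with tscount might look like the following; the model specification is kept to a single lagged observation, and the data here is simulated rather than the series used in the post.

library(tscount)

set.seed(2015)
y <- rpois(200, lambda = 5)   # simulated count series, just for illustration

# Univariate Poisson model with one lagged observation (INGARCH-type)
fit <- tsglm(y, model = list(past_obs = 1), distr = "poisson")
summary(fit)

# One-step-ahead prediction -- the part the post describes as tricky
predict(fit, n.ahead = 1)$pred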