Ramsay and Silverman’s Functional Data Analysis is a tremendously useful book that deserves to be more widely known. It’s full of ideas of neat things one can do when part of a dataset can be viewed as a set of curves – which is quite often. One of the methods they’ve developed is called Functional ANOVA. It can be understood, I think, as a way of formulating a hierarchical prior, but the way they introduce it is more as a way of finding and visualising patterns of variation in a bunch of curves. Ramsay and Silverman use classical penalised likelihood estimation techniques, but I thought it’d be useful to explore functional ANOVA in a Bayesian framework, so here’s a quick explanation of how you can do that armed with R and INLA (Rue et al., 2009).
Tableau today released a new visualization tool for iPad, called Vizable. This is a completely new app built specifically for exploring data using touch. It is based on a new approach to visual analysis that focuses on the data and task, rather than providing a chart toolbox.
I hope you’ve followed my previous articles on ensemble modeling. In this article, I’ll share a crucial trick for building models with ensemble learning. This trick answers the question, ‘How do you choose the right models for your ensemble?’ Are you ready? Let’s begin!
If you’re a data science enthusiast, you’ve doubtless read many of the classic books — both popular science and academic — that cover the field. But what if you want to expand your data science horizons but don’t have time to read a full book? Podcasts are a great way to enhance your education without sacrificing precious time. You can listen to them while commuting, cleaning, or performing any number of other tasks. Here are six of the best data-science-oriented podcasts to listen to if you’re a fan of popular data science books:
Data mining emerged in computer science to solve an important problem: the huge amount of data now available, whether openly on the web or as the big data a specific organization holds about its customers’ transactions and preferences. This availability of data, and its rapid growth, has created a vast store of unknown information. Information is what we extract from data, and because the volume of data is so large, exploring it for information would be difficult, or even impossible, without computational methods. Computer science, programming, databases, artificial intelligence, and statistics together gave rise to this new field, called data mining.
Stream Processing and In-Stream Analytics are two rapidly emerging and widely misunderstood data science technologies. In this article we’ll focus on their basic characteristics and some business cases where they are useful.
Collaborative filtering is the process of filtering for information using techniques involving collaboration among multiple agents. Applications of collaborative filtering typically involve very large data sets. This article covers some good tutorials on collaborative filtering that we came across in Python, Java, and R.
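To make the idea concrete, here is a minimal sketch of user-based collaborative filtering with cosine similarity. The rating matrix and the helper names are illustrative, not taken from any of the linked tutorials:

```python
import numpy as np

# Toy user-item rating matrix (rows: users, cols: items); 0 means unrated.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def predict(R, user, item):
    """Predict a rating as a similarity-weighted average over other users."""
    num = den = 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        s = cosine_sim(R[user], R[other])
        num += s * R[other, item]
        den += abs(s)
    return num / den if den else 0.0

# User 0 hasn't rated item 2; the most similar user (user 1) rated it low,
# so the prediction is pulled toward a low rating.
pred = predict(R, user=0, item=2)
```

Real systems replace the dense matrix with sparse structures and precomputed neighbourhoods, but the weighted-average core is the same.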
Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
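For a feel of what an SVM optimises, here is a minimal sketch of a linear SVM trained with the Pegasos subgradient method on synthetic data. The data, regularisation constant, and epoch count are illustrative choices, not part of the original article:

```python
import numpy as np

# Two well-separated Gaussian clusters with labels -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)),
               rng.normal(loc=2, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

def train_svm(X, y, lam=0.01, epochs=200):
    """Pegasos: stochastic subgradient descent on the hinge loss."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (w @ X[i]) < 1:   # margin violated: move toward the point
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                        # margin satisfied: just shrink w
                w = (1 - eta * lam) * w
    return w

w = train_svm(X, y)
accuracy = (np.sign(X @ w) == y).mean()
```

Kernelised SVMs extend this to nonlinear decision boundaries by replacing the dot product with a kernel function.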
Machine learning is a field that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an algorithm, and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it’s also starting to be widely used, making it an extremely practical skill to learn. In this tutorial, we’ll guide you through the basic principles of machine learning, and how to get started with machine learning in Python. Luckily for us, Python has an amazing ecosystem of libraries that make machine learning easy to get started with. We’ll be using the excellent Scikit-learn, Pandas, and Matplotlib libraries in this tutorial.
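The basic Scikit-learn workflow the tutorial walks through (split, fit, score) can be sketched in a few lines; the synthetic dataset and choice of logistic regression here are mine, not the tutorial’s:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem: label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set, fit on the rest, and evaluate held-out accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)
```

Every Scikit-learn estimator follows this same fit/predict/score interface, which is what makes the library so approachable for beginners.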
In my previous post, I introduced the concept of smoothing using Fourier basis functions and applied them to temperature data. It is important to note that a similar kind of analysis can be replicated using B-splines (see this page). In this post, I extend the concept to another type of basis function: Gaussian radial basis functions. Radial basis functions are part of a class of single hidden layer feedforward networks which can be expressed as a linear combination of radially symmetric nonlinear basis functions. Each basis function forms a localized receptive field in the input space. The most commonly used function is the Gaussian basis.
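As a minimal sketch of the idea (the centers, width, and toy signal below are my illustrative choices, not the post’s temperature data), one can build a design matrix of Gaussian basis functions and fit the coefficients by least squares:

```python
import numpy as np

# Noisy samples of a smooth 1-D signal.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Gaussian radial basis functions: one localized bump per center.
centers = np.linspace(0, 1, 10)
width = 0.1

def gaussian_basis(x, centers, width):
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

Phi = gaussian_basis(x, centers, width)         # design matrix, 100 x 10
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares coefficients
y_hat = Phi @ coef                              # smoothed curve

# How close is the smooth to the true underlying signal?
rmse = np.sqrt(np.mean((y_hat - np.sin(2 * np.pi * x)) ** 2))
```

The number of centers and the width control the smoothness trade-off, playing the role that the number of Fourier terms played in the previous post.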
My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that’s pretty complicated. So we went looking for a better solution.
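For readers unfamiliar with the statistic involved, here is a minimal pure-Python sketch of the Kaplan-Meier estimator (the durations and event flags are made-up churn data, not the company’s):

```python
import numpy as np

# Subscription lifetimes in days; event=1 means churned, 0 means censored
# (the customer was still active when we looked).
durations = np.array([5, 6, 6, 8, 10, 12, 15, 15, 20, 22])
events    = np.array([1, 1, 0, 1,  1,  0,  1,  1,  0,  1])

def kaplan_meier(durations, events):
    """Return (times, survival): S(t) steps down at each observed event time."""
    order = np.argsort(durations)
    durations, events = durations[order], events[order]
    times, surv = [], []
    s = 1.0
    at_risk = len(durations)
    for t in np.unique(durations):
        mask = durations == t
        d = events[mask].sum()            # churn events at time t
        if d > 0:
            s *= 1 - d / at_risk          # multiply in conditional survival
            times.append(t)
            surv.append(s)
        at_risk -= mask.sum()             # events and censored leave the risk set
    return times, surv

times, surv = kaplan_meier(durations, events)
```

The product-of-conditional-survivals structure is what makes the estimator awkward to express in plain SQL, which is exactly the problem the article goes on to solve.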
An autoencoder to calculate word embeddings based on counts and the Hellinger distance.
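The Hellinger distance at the heart of this approach compares two count vectors as probability distributions; here is a minimal sketch with illustrative co-occurrence counts (not from the linked work):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two count vectors, normalized to
    probability distributions; always in [0, 1]."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

# Identical co-occurrence profiles are at distance 0;
# dissimilar profiles get a positive distance bounded by 1.
d_same = hellinger([10, 5, 1], [10, 5, 1])
d_diff = hellinger([10, 5, 1], [1, 5, 10])
```

Because the distance operates on square roots of probabilities, it is well suited to the sparse, skewed count distributions typical of word co-occurrence data.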
Recruit Ponpare is Japan’s leading joint coupon site, offering huge discounts on everything from hot yoga, to gourmet sushi, to a summer concert bonanza. The Recruit Coupon Purchase Prediction challenge asked the community to predict which coupons a customer would buy in a given period of time using past purchase and browsing behavior.
The Caterpillar Tube Pricing competition challenged Kagglers to predict the price a supplier would quote for the manufacturing of different tube assemblies using detailed tube, component, and volume data. Team Shift Workers finished in 3rd place by combining a diverse set of approaches different members of the team had used before joining forces. Like other teams in the competition, they found XGBoost to be particularly powerful on this dataset.
Most software systems evolve over time. New features are added and old ones pruned. Fluctuating user demand means an efficient system must be able to quickly scale resources up and down. Demands for near zero-downtime require automatic fail-over to pre-provisioned back-up systems, normally in a separate data centre or region. On top of this, organizations often have multiple such systems to run, or need to run occasional tasks such as data-mining that are separate from the main system but require significant resources or talk to the existing system. When using multiple resources, it is important to make sure they are efficiently used — not sitting idle — but can still cope with spikes in demand. Balancing cost-effectiveness against the ability to quickly scale is a difficult task that can be approached in a variety of ways.
In my last post I described how I built a shiny application called “DFaceR” that used Chernoff Faces to plot multidimensional data. To improve application response time during plotting, I needed to split large datasets into more manageable “pages” to be plotted. Rather than take the path of least resistance and use either numericInput or sliderInput widgets that come with shiny to interact with paginated data, I wanted nice page number and prev/next buttons like on a dataTables.js table. In this post, I describe how I built a custom shiny widget called pager-ui to achieve this.