Ensemble methods can provide much-needed robustness and accuracy to both supervised and unsupervised problems. Machine learning will continue to evolve as computational power becomes cheaper and the volume of data keeps growing. In such a scenario, there is a limit to the improvement you can achieve with a single model by tweaking its variables. Ensemble modeling follows the philosophy of ‘Unity in Strength’, i.e. a combination of diversified base models strengthens a weak model. The success of ensemble techniques spans multiple disciplines, such as recommendation systems, anomaly detection, stream mining, and web applications, where the need to combine competing models is ubiquitous.
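To make the idea concrete, here is a minimal sketch of an averaging ensemble in R. The simulated data, the choice of a linear model plus an rpart tree as base learners, and the rmse() helper are illustrative assumptions, not part of the original post.

```r
set.seed(42)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- with(df, 2 * x1 - x2 + rnorm(n))

train <- df[1:150, ]
test  <- df[151:200, ]

# Two deliberately different base learners
m_lm   <- lm(y ~ x1 + x2, data = train)
m_tree <- rpart::rpart(y ~ x1 + x2, data = train)

# The ensemble: a simple average of the base-model predictions
p_lm   <- predict(m_lm, newdata = test)
p_tree <- predict(m_tree, newdata = test)
p_ens  <- (p_lm + p_tree) / 2

# Compare out-of-sample root-mean-square errors
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(lm = rmse(test$y, p_lm),
  tree = rmse(test$y, p_tree),
  ensemble = rmse(test$y, p_ens))
```

Even this naive average often edges out the weaker of the two base models, which is the "unity in strength" effect in miniature.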
My previous post covered the basics of logistic regression. We must now examine the model to understand how well it fits the data and generalizes to other observations. The evaluation process involves the assessment of three distinct areas – goodness of fit, tests of individual predictors, and validation of predicted values – in order to produce the most useful model. While the following content isn’t exhaustive, it should provide a compact ‘cheat sheet’ and guide for the modeling process.
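As a rough illustration of those three areas, the sketch below fits a toy logistic model with glm() and runs one quick check for each. The simulated data, the 0.5 cutoff, and the in-sample accuracy check are assumptions for demonstration only, not the full set of diagnostics the post discusses.

```r
set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * df$x1 - 0.8 * df$x2))

fit <- glm(y ~ x1 + x2, data = df, family = binomial)

# 1. Goodness of fit: likelihood-ratio test of the fitted model
#    against the intercept-only (null) model
with(fit, pchisq(null.deviance - deviance,
                 df.null - df.residual, lower.tail = FALSE))

# 2. Tests of individual predictors: Wald z-tests on the coefficients
summary(fit)$coefficients

# 3. Validation of predicted values: in-sample accuracy at a 0.5
#    cutoff (a holdout set or cross-validation is preferable)
pred <- as.numeric(predict(fit, type = "response") > 0.5)
mean(pred == df$y)
```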
I’ve always struggled with using plotmath via the expression function in R for adding mathematical notation to axes or legends. For some reason, the most obvious way to write something never seems to work for me and I end up using trial and error in a loop with far too many iterations. So I am very happy to see the new latex2exp package available which translates LaTeX expressions into a form suitable for R graphs. This is going to save me time and frustration!
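For example, latex2exp's TeX() function converts a LaTeX string into a plotmath expression that base graphics can render directly in labels and titles; the plotted curve below is arbitrary and only there to show the labels.

```r
library(latex2exp)

x <- seq(0, 4, length.out = 100)
plot(x, exp(-x), type = "l",
     xlab = TeX("$\\lambda$"),
     ylab = TeX("$e^{-\\lambda}$"),
     main = TeX("Exponential decay: $f(\\lambda) = e^{-\\lambda}$"))
```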
Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.
In the next few posts, I’d like to cover some work to help you process aggregated proficiency testing (PT) data. Interpreting PT data from groups such as the College of American Pathologists (CAP) is, of course, a fundamental task for lab management. Comparing your lab’s results to peer group data from other users of the same instrumentation helps to ensure that your patients receive consistent results, and it provides at least a crude check that your instrument performance is “in the ballpark”. Of course, many assays show significant differences between instrument models and manufacturers that can lead to results that are not comparable as a patient moves from institution to institution (or when your own lab changes instruments!). There are a number of standardization and harmonization initiatives underway (see http://harmonization.net, for example) to address this, and understanding which assays show significant bias compared to benchmark studies or national guidelines is a critical task for laboratorians. All of this is further complicated by the fact that sample matrix can significantly affect assay results, and sample commutability is one important reason why we can’t just take, say, CAP PT survey results (not counting the accuracy-based surveys) and determine which assays aren’t harmonized.