“1. Never assume that your data is correct. By all means know that it is, but don’t assume it. Don’t trust the client on this count – to do so might be to do them a disservice.
2. Regardless of how noisy and inconsistent the bulk of your data may be, make sure that you have a small sample of “sanity-check” data that is exceptionally good, or at least whose limitations are very well understood. If the test data is solid, any problems with the training data – even if rife – may prove inconsequential. Without solid test data, you will never know.
3. Do not force (even by mild assumption) the use of sophisticated algorithms and complex models if the data does not support them. Sometimes much simpler is much better. The problem of overfitting (building unnecessarily complex models which serve only to reproduce idiosyncrasies of the training data) is well documented, but the extent of this problem is still capable of causing surprise! Let the algorithms – in collaboration with the data – speak for themselves.
4. It follows that the algorithm and parameters that provide the best solution (when very rigorously cross-validated) can actually provide an indication of the quality of your data or the true complexity of the process that it embodies. If a thorough comparison of all the available algorithms suggests Nearest Centroid, Naive Bayes, or some Decision Stump, it is a good indication that the dominant signal in your data is a very simple one.
5. In situations like this, machine learning algorithms can actually be used to clean the source data, if there’s a business case for it. Again, a small super-high-quality test set is essential in order to validate the efficacy of this cleaning process.”
Justin Washtell (November 16, 2014)
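As a minimal illustration of points 3 and 4, the sketch below cross-validates a few of the simple learners Washtell names against a more complex model. The data here is a synthetic stand-in (via scikit-learn's `make_classification`); with your own feature matrix and labels substituted in, a simple model scoring within noise of the complex one would hint that the dominant signal is a simple one.

```python
# Sketch: cross-validate simple baselines against a complex model.
# The synthetic X, y below are placeholders for your own data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; replace with your real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           random_state=0)

candidates = {
    "nearest centroid": NearestCentroid(),
    "naive bayes": GaussianNB(),
    "decision stump": DecisionTreeClassifier(max_depth=1),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name:>16}: {scores.mean():.3f} +/- {scores.std():.3f}")

# If the stump or Naive Bayes is within noise of the random forest,
# the dominant signal in the data is probably a simple one (point 4).
```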
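Point 5 does not prescribe a particular cleaning method; one common reading is sketched below, again with scikit-learn and simulated data. Out-of-fold predictions flag training labels that the model confidently contradicts, and the small trusted test set (here the hypothetical `X_trusted`, `y_trusted`) is used to check whether dropping those examples actually helps.

```python
# Sketch of point 5: flag suspect labels via out-of-fold predictions,
# then use a small trusted test set to check whether cleaning helped.
# X_noisy, y_noisy, X_trusted, y_trusted are placeholder names.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# Stand-in data: a noisy training pool and a small trusted test set.
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_noisy, y_noisy = X[:500], y[:500].copy()
X_trusted, y_trusted = X[500:], y[500:]
flip = np.random.RandomState(1).rand(len(y_noisy)) < 0.15
y_noisy[flip] = 1 - y_noisy[flip]          # simulate label noise

model = LogisticRegression(max_iter=1000)

# Out-of-fold predicted probabilities for every training example.
proba = cross_val_predict(model, X_noisy, y_noisy, cv=10,
                          method="predict_proba")
confidence_in_given_label = proba[np.arange(len(y_noisy)), y_noisy]
keep = confidence_in_given_label > 0.3     # drop confidently-contradicted labels

before = model.fit(X_noisy, y_noisy).score(X_trusted, y_trusted)
after = model.fit(X_noisy[keep], y_noisy[keep]).score(X_trusted, y_trusted)
print(f"trusted-set accuracy: {before:.3f} before, {after:.3f} after cleaning")
```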