A simple way to improve classification performance is to average the predictions of a large ensemble of different classifiers. This is great for winning competitions but requires too much computation at test time for practical applications such as speech recognition. In a widely ignored paper from 2006, Caruana and his collaborators showed that the knowledge in the ensemble could be transferred to a single, efficient model by training that single model to mimic the log probabilities of the ensemble average. This technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers. For example, the ensemble may give an image of a BMW a probability of one in a billion of being a garbage truck, but this is still far greater (in the log domain) than its probability of being a carrot. This ‘dark knowledge’, which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier.
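To make the mechanics concrete, here is a minimal sketch of the distillation loss described in the arXiv paper linked below: the small "student" model is trained to match the large "teacher" model's temperature-softened class probabilities, since raising the temperature inflates the tiny probabilities of wrong answers where the dark knowledge lives. The PyTorch framing, the function name `distillation_loss`, and the temperature value are illustrative assumptions, not details from the abstract.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=3.0):
    """Match the student's softened distribution to the teacher's.

    Dividing the logits by a temperature T > 1 flattens the softmax,
    exposing the relative probabilities of improbable wrong answers.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; the T*T factor
    # keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T

# Hypothetical usage: distill one batch of 10-class teacher outputs.
teacher_logits = torch.randn(32, 10)
student_logits = torch.randn(32, 10, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In the linked paper this soft-target loss is typically blended with ordinary cross-entropy on the true labels, so the student learns from both the hard labels and the teacher's dark knowledge.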
http://…/dark-knowledge-neural-network.html
http://…/geoff-hintons-dark-knowledge
http://…/1503.02531v1.pdf