“If we think of training the model as a part of it, then even after you’ve trained a model and evaluated it and found it to be good by some evaluation metric standards, when you deploy it, where it actually goes and faces users, then there’s a different set of metrics that would impact the users. You might measure: how long do users actually interact with this model? Does it actually make a difference in the length of time? Did they used to interact less and now they’re more engaged, or vice versa? That’s different from whatever evaluation metric that you used, like AUC or per class accuracy or precision and recall. … It’s probably not enough to just say this model has a .85 F1 score and expect someone who has not done any data science to understand what that means. How good are the results? What does it actually mean to the end users of the product?” Alice Zheng ( 2015 )