Tags: Machine Learning
A few years ago, I attended a very good talk about identifying influencers in social media based on textual features. To evaluate the results, the researchers employed cross-validation, a very popular technique in machine learning where the training set is split into n parts (called folds). The machine learning system is then trained and evaluated n times, each time trained on all the training data minus one fold and evaluated on the remaining fold. That way, it is possible to obtain evaluation results over an evaluation set of the same size as the training set without committing the "mortal sin" of evaluating on the training data. The technique is very useful and widely employed. However, it doesn't stop you from overfitting at the methodological level: if you run multiple experiments over the same data, you will gain enough insight into it to "overfit" it. This methodological problem is quite common, so I decided to write it down. It is also not easy to spot due to the Warm Fuzzy Feeling (TM) that comes with using cross-validation. That is, as practitioners we often feel that by using cross-validation we buy some magical insurance policy against overfitting.
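To make the mechanics concrete, here is a minimal sketch of plain n-fold cross-validation in pure Python. The names k_fold_indices, cross_validate, train_majority and accuracy are made up for this illustration, and the majority-label "learner" is only a stand-in for a real system; none of this is the code used in the talk.

```python
import random

def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into n_folds disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(data, train_fn, eval_fn, n_folds=10, seed=0):
    """Train on all folds but one, evaluate on the remaining fold, n_folds times."""
    scores = []
    for held_out in k_fold_indices(len(data), n_folds, seed):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(eval_fn(train_fn(train), test))
    return scores  # one score per fold; every row is tested exactly once

# Hypothetical stand-in learner: the "model" is just the most frequent label.
def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def accuracy(model, rows):
    return sum(label == model for _, label in rows) / len(rows)
```

Note that each row is used for testing exactly once and never by a model that saw it during training, which is where the Warm Fuzzy Feeling comes from.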
After evaluating how their system performed with all its components, the authors evaluated each component separately and then assembled the "best" system. My issue is with the claim that the performance results for this assembled "best" system were not overfitted. Taking the performance of multiple components and assembling the best combination is a type of meta-learning. Even though the performance of each individual component is cross-validated, and the selected components are arguably the best for the task, the reported performance of this combination of best components is overfit to this training data, because the same data was used both to select the components and to score the result. To obtain non-overfit numbers, either a two-level cross-validation or an evaluation on a held-out set is needed.
A two-level cross-validation goes like this: split the training data into two parts (I choose two parts to maximize the amount of training data in the remaining fold). On the first half, do a ten-fold cross-validation training the sub-components and assemble the best possible system. Then, on the second half, do a ten-fold cross-validation training and testing the full system. Then repeat, switching the halves. The final evaluation number is representative of the performance of a "best" system. Interestingly, the best system components in the first half might not be the same as in the second!
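The procedure just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the setup from the talk: "assembling the best system" is reduced to picking the best of several candidate training functions, and cv_score, two_level_cv, the toy learners and the toy data are all hypothetical names invented for the example.

```python
import random
from statistics import mean

def cv_score(rows, train_fn, eval_fn, n_folds=10):
    """Mean score of plain n-fold cross-validation over `rows`."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(eval_fn(train_fn(train), test))
    return mean(scores)

def two_level_cv(data, candidates, eval_fn, n_folds=10, seed=0):
    """candidates maps a name to a training function.
    Returns (mean score over both directions, name chosen in each direction)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    halves = [rows[:half], rows[half:]]
    scores, chosen = [], []
    for select_half, report_half in (halves, halves[::-1]):
        # Level 1: choose the best candidate using only the selection half.
        best = max(candidates, key=lambda name: cv_score(
            select_half, candidates[name], eval_fn, n_folds))
        # Level 2: score that choice on the half that played no part in selecting it.
        scores.append(cv_score(report_half, candidates[best], eval_fn, n_folds))
        chosen.append(best)
    return mean(scores), chosen

# Toy setup (assumption): x in [0, 1), label is True when x >= 0.5.
def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def train_majority(rows):
    majority = 2 * sum(y for _, y in rows) >= len(rows)
    return lambda x: majority

def train_threshold(rows):
    pos = [x for x, y in rows if y]
    neg = [x for x, y in rows if not y]
    if not pos or not neg:  # degenerate training set: fall back to majority
        return train_majority(rows)
    cut = (mean(pos) + mean(neg)) / 2  # midpoint between the class means
    return lambda x: x > cut

toy = [(i / 200, i >= 100) for i in range(200)]
score, chosen = two_level_cv(
    toy, {"majority": train_majority, "threshold": train_threshold}, accuracy)
```

The list in chosen holds one name per direction, so it also shows directly whether the two halves agreed on the "best" system.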
At any rate, this is a very small point that does not invalidate the work in that paper. It does, however, make for a nice background for the larger methodological point I discuss next.
OK, while this meta-learning issue was tough to spot, there's also a tricky overfitting problem that is even more common: when developing a system using multiple features, it is easy to keep evaluating with cross-validation and gain intuitions and insights about the data, to the point of overfitting on it in practice.
Interestingly, the meta-learning issue discussed before is a fitting metaphor for a scientist trying different features and feature variations against the same dataset: a human-in-the-middle type of meta-learning. The way to avoid it is to do this adaptation process on a small development set before moving to a large-scale evaluation (which can very well be done using cross-validation). I did that in my thesis and it works well for a build-once-and-test situation. For something with multiple iterations (like the Watson system), you'll need large amounts of data and a data release protocol, so you can ensure the new features and components are added in a healthy, non-overfitting manner.
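As a rough sketch of the development-set idea, one way to set it up is to carve off a small slice of the data once, with a fixed seed, and do all feature tinkering there. The function name split_dev_eval and the 10% default are assumptions for illustration, not what I used in my thesis.

```python
import random

def split_dev_eval(rows, dev_fraction=0.1, seed=13):
    """Split once into a small development set and a larger evaluation set.
    The fixed seed keeps the split identical across runs, so evaluation rows
    never leak into the development set between experiments."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[:n_dev], shuffled[n_dev:]
```

All the iterating on features then happens against the development rows only, and the evaluation rows are touched once, at the end, for the large-scale (possibly cross-validated) run.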