
What do we expect to see?

What do we see?

A learning setup can vary with respect to multiple things: the dataset, the classifier family (something traditional like Random Forests vs. a recent one like RoBERTa), and the text representation (there are so many embeddings to pick from, e.g., MPNet, USE). You’re thrown into such a setup with no labeled data, but you have read about this cool new AL technique - would you expect it to work?
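
To make the combinatorics concrete, here is a minimal sketch of enumerating such a setup space - the dataset, embedding and classifier names below are illustrative placeholders, not the exact grid from the paper:

```python
# Illustrative sketch: a "setup" is one point in the cross-product of
# dataset x text representation x classifier family.
# All names below are placeholders, not the paper's exact grid.
from itertools import product

datasets = ["dataset_a", "dataset_b"]        # hypothetical datasets
representations = ["MPNet", "USE"]           # pre-trained text embeddings
classifiers = ["RandomForest", "LinearSVC"]  # traditional classifiers on top of embeddings

setups = list(product(datasets, representations, classifiers))
# RoBERTa is end-to-end: it contributes settings where the representation
# and the classifier are one and the same model.
setups += [(d, "RoBERTa", "RoBERTa") for d in datasets]

print(f"{len(setups)} candidate setups (before picking a query strategy)")
```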


This is the aspect of AL that we explored. The figure below - taken from the paper - shows the cross-product of the different factors we tested. In all, there are \(350\) experiment settings. Note that RoBERTa is an end-to-end model, so in its case, both the “Representation” and “Classifier” are identical. Not counting random sampling, we tested out \(4\) query strategies (right-most box below), some traditional (“Margin” is a form of Uncertainty Sampling), some new.
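
If “Margin” is unfamiliar: it queries the unlabeled points whose top two predicted class probabilities are closest, i.e., the points the current classifier is least decisive about. A minimal sketch (the classifier, batch size and toy data below are arbitrary illustrations, not our experimental configuration):

```python
# Minimal margin (uncertainty) sampling sketch with scikit-learn.
# The classifier, batch size and synthetic data are arbitrary, for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def margin_query(clf, X_unlabeled, batch_size=25):
    """Return indices of the unlabeled points the classifier is least sure about,
    measured by the gap between its top two predicted class probabilities."""
    probs = clf.predict_proba(X_unlabeled)   # shape: (n_unlabeled, n_classes)
    top2 = np.sort(probs, axis=1)[:, -2:]    # two largest probabilities per row
    margins = top2[:, 1] - top2[:, 0]        # small margin = high uncertainty
    return np.argsort(margins)[:batch_size]  # query the smallest margins

# Usage: fit on the currently labeled pool, then pick the next batch to label.
rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(100, 8)), rng.integers(0, 3, size=100)
X_unlabeled = rng.normal(size=(1000, 8))

clf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)
to_label = margin_query(clf, X_unlabeled)
```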


Here be Dragons

AL hyperparams are like existence proofs in mathematics - “we know that for some value of these hyperparams, our algorithm knocks it out of the park!” - as opposed to constructive proofs - “Ah! But we don’t know how to actually get to that value…”.

  • Lack of experiment standards: it’s hard to compare AL techniques across papers because there is no standard for setting batch or seed sizes, or even the labeling budget (the final number of labeled points). These vary wildly in the literature (for an idea, take a look at Table 4 in the paper), and sadly, they heavily influence performance. A schematic AL loop showing these knobs follows this list.
  • I hope this post doesn’t convey the impression that I hate AL. But yes, it can be frustrating :-) I still think it’s a worthy problem, and I often read papers from the area. In fact, we have an earlier ICML workshop paper involving AL (Nguyen & Ghose, 2023). All we are saying is that it is time to scrutinize the various practical aspects of AL. Our paper is accompanied by a library that we’re releasing (still polishing things up), which will hopefully make good benchmarking convenient.
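
To make the knobs from the first bullet concrete, here is a schematic pool-based AL loop - the seed size, batch size, budget and toy data are arbitrary placeholders, not recommendations or defaults from our library:

```python
# Schematic pool-based active learning loop showing the knobs that vary across papers:
# seed size, batch size and labeling budget. All values and data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SEED_SIZE = 200        # initial randomly labeled points
BATCH_SIZE = 50        # points sent for labeling per AL iteration
LABELING_BUDGET = 500  # total number of labeled points at the end

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(5000, 16))      # stand-in for embedded text
y_oracle = rng.integers(0, 2, size=5000)  # stand-in for human labels

labeled = list(rng.choice(len(X_pool), size=SEED_SIZE, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

while len(labeled) < LABELING_BUDGET:
    clf = RandomForestClassifier(random_state=0).fit(X_pool[labeled], y_oracle[labeled])
    # Any query strategy plugs in here; this one is margin sampling again.
    probs = clf.predict_proba(X_pool[unlabeled])
    top2 = np.sort(probs, axis=1)[:, -2:]
    picks = np.argsort(top2[:, 1] - top2[:, 0])[:BATCH_SIZE]
    newly_labeled = [unlabeled[i] for i in picks]
    labeled += newly_labeled
    unlabeled = [i for i in unlabeled if i not in set(newly_labeled)]
```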