zzzotop/low-resource-data-quality-classification-demo-cat-5-epoch

Demo exploring, amongst other things, the extent to which low-resource languages (LRLs) have poorer quality data (in terms of both tagging and more general usefulness) than their high-resource counterparts (HRLs). Inspired by the estimate that the tagging error rate in the corpus used was 10% higher in the LRL than in the HRL (Zotova et al., 2020). Also demonstrates cross-lingual transfer, akin to my earlier demos.

BETO (dccuchile/bert-base-spanish-wwm-cased) fine-tuned for text classification on the Catalan portion of the Catalonia Independence Corpus (CIC) for 5 epochs. Any Catalan text entered will be classified as in favour of, against, or neutral towards Catalan independence. The dataset was preprocessed significantly, including removing the validation set and reassigning its data to the train and test sets. Learning rate 2e-5, batch size 4, weight decay 0.1.
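
As a rough sketch, the hyperparameters listed above could be collected as follows. The mapping of "batch size 4" to `per_device_train_batch_size`, the use of the `transformers` `TrainingArguments` class, and the `output_dir` name are assumptions for illustration, not a record of the actual training script.

```python
# Hyperparameters stated in this card. Mapping "batch size 4" to the
# per-device training batch size is an assumption.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "num_train_epochs": 5,
    "weight_decay": 0.1,
}


def build_training_arguments(output_dir="cic-cat-5-epoch"):
    """Build TrainingArguments from the values above.

    The output_dir default is a hypothetical name. transformers is imported
    lazily so the dictionary can be inspected without it installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

The returned object would then be passed to a `Trainer` together with the model, tokenised datasets and a metric function, none of which are shown here.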

Exists for comparison with the 20-epoch model, which I believe to be overfitted. Slightly better on shorter inputs than the 20-epoch model, but still poor on very short inputs. Performs marginally better on examples taken from the CIC itself, though.

Evaluated every epoch using F1 score with macro averaging:
5 epochs: 0.729566
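
For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each of the three stance classes counts equally regardless of how many examples it has. The sketch below is a plain-Python illustration of that definition; the label strings and toy predictions are made up for the example and are not taken from the CIC or from this model's output.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no true or predicted
        # examples is scored 0 here (a convention, chosen for simplicity).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data with hypothetical label names.
labels = ["FAVOUR", "AGAINST", "NEUTRAL"]
y_true = ["FAVOUR", "FAVOUR", "AGAINST", "AGAINST", "NEUTRAL", "NEUTRAL"]
y_pred = ["FAVOUR", "AGAINST", "AGAINST", "AGAINST", "NEUTRAL", "FAVOUR"]

score = macro_f1(y_true, y_pred, labels)
print(f"macro F1: {score:.6f}")
```

In the toy data the per-class scores are 0.5, 0.8 and 2/3, which average to roughly 0.656.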