zzzotop/low-resource-data-quality-classification-demo-cat

Demo exploring, among other things, the extent to which low-resource languages (LRLs) have poorer-quality data (in terms of both tagging and more general usefulness) than their high-resource counterparts (HRLs). Inspired by the estimate that the tagging error rate in the corpus used was 10% higher in the LRL than in the HRL (Zotova et al., 2020). Also demonstrates cross-lingual transfer, akin to my earlier demos.

BETO (dccuchile/bert-base-spanish-wwm-cased) fine-tuned for text classification on the Catalan portion of the Catalonia Independence Corpus (CIC) for 5 epochs. Any Catalan text entered will be classified as in favour of, against, or neutral towards Catalan independence. The dataset required significant preprocessing, including removing the validation set and reassigning its data to the train and test sets. Learning rate 2e-5, batch size 4, weight decay 0.1.
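The hyperparameters above could be expressed roughly as follows. This is only a sketch: it assumes the Hugging Face `Trainer` API was used, that "batch size 4" means a per-device training batch size on a single device, and the `output_dir` name is hypothetical. None of these details are confirmed by the description.

```python
# Hyperparameters as stated in the description above.
# Mapping "batch size 4" to per_device_train_batch_size is an assumption.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "weight_decay": 0.1,
    "num_train_epochs": 5,
}


def build_training_args(output_dir="beto-cic-ca"):
    """Build TrainingArguments from HYPERPARAMS.

    The output_dir default is a hypothetical name used for illustration.
    transformers is imported lazily so the constants above can be
    inspected without the library installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

Calling `build_training_args()` requires `transformers` (and a PyTorch backend) to be installed; the resulting object would then be passed to a `Trainer` together with the model and the tokenized CIC splits.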

Works best with long inputs. Seems to associate topics about change and modernity with 'FAVOR' and those about history with 'AGAINST'. Generally skews towards 'AGAINST'; probably overfitted.
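One simple way to put a number on the skew towards 'AGAINST' is to look at the share of each predicted label over a batch of inputs. The helper below is a hypothetical illustration, not part of the original demo, and the 'NEUTRAL' label name in the usage example is an assumption (only 'FAVOR' and 'AGAINST' are named above).

```python
from collections import Counter


def label_distribution(predicted_labels):
    """Return the fraction of predictions assigned to each label."""
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    if total == 0:
        # No predictions: nothing to report.
        return {}
    return {label: n / total for label, n in counts.items()}


# Usage with made-up predictions (not real model output):
example = ["AGAINST", "AGAINST", "FAVOR", "NEUTRAL"]
print(label_distribution(example))
```

Comparing this distribution with the label distribution of the gold test set would indicate whether the skew comes from the model or simply mirrors the data.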

Evaluated every epoch using F1 score with macro averaging:
5 epochs: 0.716673
10 epochs: 0.719966
20 epochs (final): 0.740322
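For reference, macro-averaged F1 computes an F1 score per class and takes their unweighted mean, so each of the three stance labels counts equally regardless of how often it occurs. The following is a minimal pure-Python sketch of that metric; the actual evaluation may well have used a library implementation, and the labels in the usage example are illustrative ('NEUTRAL' is an assumed label name).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Usage with made-up labels (not real model output):
gold = ["FAVOR", "AGAINST", "NEUTRAL", "AGAINST"]
pred = ["FAVOR", "AGAINST", "AGAINST", "AGAINST"]
print(round(macro_f1(gold, pred), 6))
```

In this toy case a class that is never predicted correctly pulls the average down sharply, which is why macro averaging is a sensible choice when a model skews towards one label.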