zzzotop/low-resource-data-quality-classification-demo-esp

Demo exploring, amongst other things, the extent to which low-resource languages (LRLs) have poorer-quality data (in terms of both tagging and more general usefulness) than their high-resource counterparts (HRLs). Inspired by the estimate that the tagging error rate in the corpus used was 10% higher in the LRL than in the HRL (Zotova et al. 2020). Also demonstrated is cross-lingual transfer, akin to my earlier demos.

BETO (dccuchile/bert-base-spanish-wwm-cased) finetuned for text classification on the Spanish portion of the Catalonia Independence Corpus (CIC) for 10 epochs, and then on the Catalan portion for 10 more. Same number of training steps. The intermediate model is on my profile. Any Catalan text entered will be classified as either in favour of, against, or neutral towards Catalan independence. Significant preprocessing of the dataset was involved, including removal of the validation set and the reassignment of its data to the train and test sets. Learning rate 2e-5, batch size 4, weight decay 0.1.
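As a rough illustration of the validation-set reassignment described above, the following is a minimal sketch in plain Python. It is not the preprocessing code actually used: the function name `fold_validation`, the `train_share` ratio, and the toy examples are all assumptions made for illustration, and the real split proportions are not stated here.

```python
import random


def fold_validation(train, validation, test, train_share=0.5, seed=0):
    """Redistribute validation examples into the train and test sets.

    `train_share` is a hypothetical ratio, not the one used for this model.
    Returns new (train, test) lists; the inputs are left unmodified.
    """
    rng = random.Random(seed)
    shuffled = list(validation)
    rng.shuffle(shuffled)
    # The first `cut` shuffled examples go to train, the rest to test.
    cut = int(len(shuffled) * train_share)
    return list(train) + shuffled[:cut], list(test) + shuffled[cut:]


# Toy (text, label) pairs standing in for corpus rows.
train = [("t1", "FAVOR")]
validation = [("v1", "AGAINST"), ("v2", "NEUTRAL"), ("v3", "FAVOR"), ("v4", "AGAINST")]
test = [("s1", "NEUTRAL")]

new_train, new_test = fold_validation(train, validation, test)
print(len(new_train), len(new_test))
```

Shuffling with a fixed seed before cutting keeps the reassignment reproducible while avoiding any ordering bias in the original validation file.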

Subject to many of the same shortcomings as its Catalan-only counterpart, but seems to perform much better qualitatively overall. These results might indicate that the data for Catalan is in fact of poorer quality, to the point that cross-lingual transfer from more useful Spanish data is a superior option, but this is impossible to say for certain as the experiment is very lazy. It may well be the case, for example, that Catalan-language examples skew more 'FAVOR' than Spanish examples, and as such finetuning on both could be greatly beneficial for the task. Unlike in demo-cat, "la independencia catalana" is a big 'AGAINST' trigger whereas "la independència de Catalunya" is a big 'FAVOR' trigger.

Evaluated every epoch using F1 score with macro averaging (numbers in parentheses count epochs on the Catalan portion):
5 epochs: 0.765449
10 epochs: 0.778278
15 (5) epochs: 0.727466
20 (10) epochs (final): 0.723115
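
For reference, macro-averaged F1 computes an F1 score per class and takes their unweighted mean, so each of the three labels counts equally regardless of how often it appears. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for the scores above; the function name `macro_f1` and the toy labels are assumptions.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); score 0.0 if the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["FAVOR", "AGAINST", "NEUTRAL"]
y_true = ["FAVOR", "FAVOR", "AGAINST", "NEUTRAL"]
y_pred = ["FAVOR", "AGAINST", "AGAINST", "NEUTRAL"]

# Per-class F1 is 2/3, 2/3 and 1, so the macro average is 7/9.
print(round(macro_f1(y_true, y_pred, labels), 4))  # 0.7778
```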