arxiv:2307.10666

A Dataset and Strong Baselines for Classification of Czech News Texts

Published on Jul 20, 2023

Abstract

Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and on relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning more than twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoders outperform selected commercially available large-scale generative language models.
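
The baselines mentioned above are built on pre-trained transformer encoders fine-tuned for each classification task. The sketch below is a rough illustration of such a baseline using the Hugging Face Trainer; the encoder (ufal/robeczech-base), file paths, column names, and label count are assumptions for illustration and are not specified in the abstract.

```python
# Minimal sketch of a CZE-NEC-style baseline: fine-tuning a Czech pre-trained
# encoder for one of the four classification tasks (here, news category).
# The model name, file paths, column names, and label count are illustrative
# assumptions; they are not taken from the paper's abstract.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "ufal/robeczech-base"  # assumption: any Czech encoder could be used
NUM_LABELS = 5                      # assumption: number of news categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS
)

# Placeholder paths; point these at the actual CZE-NEC splits once obtained.
dataset = load_dataset(
    "json", data_files={"train": "train.jsonl", "validation": "dev.jsonl"}
)

def tokenize(batch):
    # "text" is an assumed column holding the article body; the task label is
    # expected in an integer "label" column.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cze-nec-category-baseline",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```

The same setup applies to the other three tasks (news source, inferred author's gender, day of the week) by swapping the label column and adjusting NUM_LABELS.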
