posted an update Jun 3
Impressed by the work of @guipenedo @hynky @loubnabnl @anton-l @craffel @lvwerra @thomwolf on FineWeb.

LLMs are only as good as the data they were trained on, yet how pretraining datasets are actually built remains largely obscure. Our approach lifts the veil on building high-quality pretraining datasets by sharing every detail of the process, so that a wider community can build on top of it. The release includes:

- The FineWeb-Edu dataset, which outperforms all openly accessible web datasets on a number of educational benchmarks. We built it by developing a quality classifier using annotations generated by an LLM.

- A new technical report explaining in detail how to create a large, high-quality, web-scale dataset for LLM pretraining, such as FineWeb.

👉 HuggingFaceFW/blogpost-fineweb-v1
