nanochat Datasets

nanochat uses several datasets for pretraining, SFT, and evaluation. Data is handled in nanochat/dataset.py and the tasks/ directory. All datasets are either publicly available or derived from public sources.

Pretraining

FineWeb-Edu — karpathy/fineweb-edu-100b-shuffle, derived from FineWeb-Edu. ~24GB used for pretraining. Download and shard utilities are in the repo.

Chat / SFT

SmolTalk — Conglomerate from Hugging Face; primary chat data
ARC-Easy / ARC-Challenge — Multiple choice science questions
GSM8K — Grade school math

See tasks/smoltalk.py, tasks/arc.py, tasks/gsm8k.py.

Evaluation tasks

ARC — Multiple choice science reasoning (ARC-Easy, ARC-Challenge)
MMLU — Broad multiple choice across many subjects
GSM8K — Grade school math word problems
HumanEval — Python coding (despite the name, a simple coding task)
Spelling Bee — Letter counting and spelling (e.g., "how many r's in strawberry")

Tasks live in tasks/. tasks/customjson.py lets you add custom JSONL conversation data. See file structure.

Custom data

tasks/customjson.py lets you create tasks from arbitrary JSONL conversation files. dev/gen_synthetic_data.py shows synthetic data for identity. See guides.

nanochat Datasets

Pretraining

Chat / SFT

Evaluation tasks

Custom data

Related