nanochat Datasets
nanochat uses several datasets for pretraining, SFT, and evaluation. Data is handled in nanochat/dataset.py and the tasks/ directory. All datasets are either publicly available or derived from public sources.
Pretraining
FineWeb-Edu — karpathy/fineweb-edu-100b-shuffle, derived from FineWeb-Edu. ~24GB used for pretraining. Download and shard utilities are in the repo.
Chat / SFT
- SmolTalk — Conglomerate from Hugging Face; primary chat data
- ARC-Easy / ARC-Challenge — Multiple choice science questions
- GSM8K — Grade school math
See tasks/smoltalk.py, tasks/arc.py, tasks/gsm8k.py.
Evaluation tasks
- ARC — Multiple choice science reasoning (ARC-Easy, ARC-Challenge)
- MMLU — Broad multiple choice across many subjects
- GSM8K — Grade school math word problems
- HumanEval — Python coding (despite the name, a simple coding task)
- Spelling Bee — Letter counting and spelling (e.g., "how many r's in strawberry")
Tasks live in tasks/. tasks/customjson.py lets you add custom JSONL conversation data. See file structure.
Custom data
tasks/customjson.py lets you create tasks from arbitrary JSONL conversation files. dev/gen_synthetic_data.py shows synthetic data for identity. See guides.