QT V.4.6 32K Prelude - flagship tokenizer, open source on HuggingFace

Clean data is the entire architecture

Quartz is the open data infrastructure layer behind AENEA. We publish ultra-clean datasets, the exact cleaning pipelines that produce them, and enterprise-grade data services for teams building their own models.

72 Languages
27 Script Families
32K / 64K V.4 Vocabulary
204 FLORES Languages
19.7× Best Equity (vs Llama 3's 118.6×)
Tokenizers

QT V.4.6 - the flagship

QT V.4.6 32K Prelude is our current flagship tokenizer: a 32,000-token vocabulary, 72 languages, 27 script families, and the best cross-lingual equity of any QT tokenizer to date. V.4.6 fixes the Lao coverage gap present in earlier releases and ships with a cleaner training corpus. The companion V.4.1 64K remains available for larger models (500M–2B parameters) where the extra vocabulary headroom matters, but it predates the Lao fix and does not yet cover Lao. Both are Apache 2.0: free to use, modify, and deploy.

19.7× - V.4.6 equity ratio on FLORES-200 (vs Llama 3's 118.6× - 6× fairer)
239 - Lao tokens in V.4.6 (V.4.4 had 0 - byte-fallback, now fixed)
27.2 - V.4.6 Tibetan tok/word (Llama 3: 149.8 - 82% reduction)
72/27 - Languages / scripts covered by V.4.6 32K
FLORES-200 - QT V.4.6 (flagship) vs V.4.1 64K vs Llama 3 Latest

| Metric | V.4.6 32K | V.4.1 64K | Llama 3 (128K) |
|---|---|---|---|
| Vocabulary | 32,000 | 64,000 | 128,256 |
| Mean fertility (tok/word) | 3.893 | 3.780 | 5.716 |
| Equity ratio (lower = fairer) | 19.7× | 32.3× | 118.6× |
| Total tokens (204 langs) | 13,941,670 | 12,979,330 | 16,764,198 |
| Token savings vs Llama 3 | −16.8% | −22.6% | - |
| Lao (tok/word) | 13.86 | 42.90 (no coverage) | - |
| Tibetan (tok/word) | 27.21 | 33.89 | 149.79 |
| Thai (tok/word) | 12.55 | 11.74 | 14.03 |
| Tamil (tok/word) | 3.79 | 3.16 | 12.45 |
| Hebrew (tok/word) | 2.83 | 2.45 | 5.76 |

How to read this: V.4.6 32K is the flagship - it wins on the metric that matters most (equity) while using a quarter of Llama 3's vocabulary, and it covers Lao, which V.4.1 64K does not. V.4.1 64K wins on raw fertility because it has twice V.4.6's vocab budget to spend on common tokens; choose it if you're training a 500M-2B model where the extra embedding parameters are affordable and Lao is not required. For sub-500M models, V.4.6's equity and coverage wins matter more than V.4.1 64K's fertility edge.
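The two headline metrics can be sketched in a few lines. This is a minimal illustration, not the benchmark code: fertility is taken as tokens per whitespace-delimited word, and the equity ratio is assumed to be the worst per-language fertility divided by the best, since the page does not define it. The `char_tokenize` stand-in is hypothetical, and how the real benchmark counts words for scripts without spaces (Thai, Tibetan, Lao) is not stated here.

```python
from typing import Callable, Dict, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def equity_ratio(per_language: Dict[str, float]) -> float:
    """ASSUMPTION: highest (worst) fertility divided by lowest (best)."""
    return max(per_language.values()) / min(per_language.values())


def char_tokenize(text: str) -> List[str]:
    """Toy stand-in tokenizer (hypothetical): one token per non-space character."""
    return [c for c in text if not c.isspace()]


scores = {
    "lang_a": fertility(["ab cd"], char_tokenize),      # 4 tokens / 2 words
    "lang_b": fertility(["abcdef gh"], char_tokenize),  # 8 tokens / 2 words
}
print(scores, equity_ratio(scores))
```

Swapping `char_tokenize` for a real tokenizer's encode function gives per-language tok/word figures of the kind shown in the table.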

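Script Family - V.4.6 32K (flagship) vs Llama 3 tok/word

| Script Family | V.4.6 32K | Llama 3 | Langs |
|---|---|---|---|
| Arabic | 2.51 | 2.70 | 2 |
| Latin | 2.58 | 2.39 | 37 |
| Hebrew | 2.83 | 5.76 | 2 |
| Gurmukhi | 2.74 | 8.23 | 1 |
| Devanagari | 2.80 | 3.52 | 3 |
| Bengali | 3.17 | 8.07 | 1 |
| Tamil | 3.79 | 12.45 | 1 |
| Myanmar | 6.10 | 29.77 | 1 |
| Thai | 12.55 | 14.03 | 1 |
| Khmer | 13.55 | 40.91 | 1 |
| CJK | 19.94 | 19.75 | 4 |
| Tibetan | 27.21 | 149.79 | 1 |
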
V.4 Innovations Architecture

| Innovation | Impact |
|---|---|
| Two-Stage SuperBPE | Superword tokens spanning word boundaries (of the, in order to) |
| Streaming Sharded Training | Full 5 GB corpus + SuperBPE on 16 GB RAM hardware |
| Indic Script-Aware Pre-tok | Virama-aware syllable segmentation for 10 Indic scripts |
| Equity-Balanced Stage 2 | Four-bucket corpus builder oversamples underserved scripts - V.4.6 Tibetan 38.6→27.2 TPW |
| Per-Bucket Chunk Sizing | CJK gets long chunks (1000 chars), underserved scripts get short chunks (200 chars) to bound RAM |
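
Per-bucket chunk sizing can be pictured as a lookup from bucket to chunk length. The sketch below is an illustration only: the 1000 and 200 character sizes come from the table above, while the bucket keys, the default size, and the `iter_chunks` helper are assumptions, and a real pipeline would likely avoid splitting inside a combining sequence.

```python
from typing import Dict, Iterator

# Sizes for "cjk" and "underserved" are the ones quoted above.
# The bucket names and the fallback size are hypothetical.
CHUNK_CHARS: Dict[str, int] = {"cjk": 1000, "underserved": 200}
DEFAULT_CHUNK_CHARS = 500  # placeholder assumption for any other bucket


def iter_chunks(text: str, bucket: str) -> Iterator[str]:
    """Yield fixed-size character chunks, sized by the script bucket."""
    size = CHUNK_CHARS.get(bucket, DEFAULT_CHUNK_CHARS)
    for start in range(0, len(text), size):
        yield text[start:start + size]


lengths = [len(c) for c in iter_chunks("x" * 450, "underserved")]
print(lengths)  # [200, 200, 50]
```

Because each chunk is yielded lazily, peak memory is bounded by the chunk size rather than the document length.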

V.4.6 underserved language spotlight: Lao now covered at 13.86 tok/word (V.4.4 had no coverage - byte-fallback). Hebrew 2.83 tok/word (vs Llama 3's 5.76 - 51% reduction). Tamil 3.79 (vs 12.45 - 70% reduction). Tibetan 27.21 (vs 149.79 - 82% reduction). Khmer 13.55 (vs 40.91 - 67% reduction).
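
The percentages quoted above follow directly from the tok/word figures. A short check, using only numbers that appear on this page (the helper name `reduction_pct` is ours):

```python
# (V.4.6 32K tok/word, Llama 3 tok/word), copied from the figures above
pairs = {
    "Hebrew": (2.83, 5.76),
    "Tamil": (3.79, 12.45),
    "Tibetan": (27.21, 149.79),
    "Khmer": (13.55, 40.91),
}


def reduction_pct(ours: float, baseline: float) -> int:
    """Percentage reduction in tok/word relative to the baseline, rounded."""
    return round(100 * (1 - ours / baseline))


reductions = {lang: reduction_pct(a, b) for lang, (a, b) in pairs.items()}
print(reductions)  # {'Hebrew': 51, 'Tamil': 70, 'Tibetan': 82, 'Khmer': 67}

# Total-token savings over the 204 FLORES languages, V.4.6 32K vs Llama 3
total_savings = 100 * (13_941_670 / 16_764_198 - 1)
print(f"{total_savings:.1f}%")  # -16.8%
```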

Datasets

The cleanest training corpora available

Every dataset is produced by our multi-pass cleaning pipelines with MinHash dedup, lint gates, and structural validation. We publish the exact scripts alongside the data - reproducibility is non-negotiable.
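
For readers unfamiliar with MinHash dedup, here is a minimal standard-library sketch of the general technique. It is not the Quartz pipeline: the shingle length, the number of hash functions, and all function names are assumptions chosen for illustration.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams of whitespace-normalised text."""
    norm = " ".join(text.split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def _hash(gram: str, seed: int) -> int:
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, num_perm: int = 64) -> List[int]:
    """Signature: for each hash function, the minimum hash over all shingles."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


doc = "the quick brown fox jumps"
near_copy = "the  quick brown   fox jumps"  # differs only in whitespace
other = "completely unrelated words entirely"
print(estimated_jaccard(minhash(doc), minhash(near_copy)))
print(estimated_jaccard(minhash(doc), minhash(other)))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as duplicates; at corpus scale the signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.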

Wikipedia Multilingual v7.3

Open
Ultra-clean multilingual Wikipedia. 72 languages, 27 script families. Three-pass pipeline with script-aware quality filters, MinHash dedup, and --light-clean mode for tokenizer training.
Languages: 72 | Scripts: 27 families | Format: JSONL.gz

Stack Exchange Q&A v1.0

Open
23 Stack Exchange sites cleaned into instruct-format Q&A pairs. Only accepted/top-voted answers. HTML stripped, code preserved, noise removed.
Tokens: ~3.6B | Pairs: ~8.2M | Format: JSONL.gz

QT Tokenizer Family

Open
The QT V.4 tokenizer family. Flagship: V.4.6 32K Prelude (sub-500M models, Lao fixed, best equity of any QT). Companion: V.4.1 64K (500M-2B models, larger vocab headroom; predates the Lao fix). Plus V.4.4/V.4.1 32K and V.2/V.3 legacy variants. Two-stage SuperBPE with equity-balanced corpus construction, streaming sharded training, and Indic script-aware pre-tokenization. All Apache 2.0.
Flagship: V.4.6 32K | Languages: 72 + code | License: Apache 2.0
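
To make "Indic script-aware pre-tokenization" concrete, the sketch below shows one simple way to do virama-aware syllable segmentation. It is an assumption-laden illustration, not the QT pre-tokenizer: it detects viramas by Unicode character name, attaches combining marks to the preceding base character, and keeps a consonant that follows a virama in the same segment.

```python
import unicodedata
from typing import List


def is_virama(ch: str) -> bool:
    """True for characters named '<SCRIPT> SIGN VIRAMA' in Unicode."""
    return unicodedata.name(ch, "").endswith("SIGN VIRAMA")


def syllables(word: str) -> List[str]:
    """Split a word into segments without breaking virama-joined clusters."""
    out: List[str] = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(out) and (is_mark or is_virama(out[-1][-1]))
        if joins_previous:
            out[-1] += ch
        else:
            out.append(ch)
    return out


# Devanagari "namaste": NA, MA, SA + VIRAMA + TA + VOWEL SIGN E
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(len(syllables(word)))  # 3
```

A BPE merge step that treats these segments as atomic units cannot split a conjunct in the middle, which is the property the pre-tokenizer is described as providing.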

Custom Enterprise Corpora

Enterprise
Domain-specific datasets cleaned to your specification. Legal, medical, financial, and scientific corpora with full provenance and licensing.
Quality: Audited | License: Custom | SLA: Available
Open Source

The pipelines that produce the data

We don't just publish datasets. We publish the exact cleaning scripts that created them. Fork them, adapt them, run them on your own dumps.

Python

wiki_ultra_clean v7.3

Multilingual Wikipedia pipeline. 72 languages, 27 script families including Odia, Tibetan, Thaana, N'Ko, and Tifinagh. Script-aware quality filters, MinHash dedup, --light-clean mode.
72 languages | BZ2 → JSONL | v7.3
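
A script-aware quality filter can be as simple as checking that most letters in a document belong to the expected script. The following is a rough sketch under our own assumptions, not the wiki_ultra_clean implementation: it matches on Unicode character-name prefixes, and the 0.5 threshold is a placeholder.

```python
import unicodedata


def script_fraction(text: str, script: str) -> float:
    """Share of alphabetic characters whose Unicode name starts with `script`."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(c, "").startswith(script) for c in letters)
    return hits / len(letters)


def passes_script_gate(text: str, script: str, min_fraction: float = 0.5) -> bool:
    """Hypothetical gate: keep the document if enough letters are in-script."""
    return script_fraction(text, script) >= min_fraction


tibetan = "\u0f56\u0f7c\u0f51"  # two Tibetan letters plus a vowel sign
print(script_fraction(tibetan, "TIBETAN"))          # 1.0
print(passes_script_gate("hello world", "TIBETAN"))  # False
```

Note that Unicode names do not always match common usage: Odia characters are named `ORIYA ...` and N'Ko characters `NKO ...`, so the prefix passed in has to follow the Unicode naming.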
Python

se_ultra_clean v1

Stack Exchange pipeline. Two-pass Q&A stitching, HTML-in-XML cleaning, score gates, code preservation, instruct-format output.
7z → JSONL | 49 sites | Q&A pairs
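
The cleaning steps named in this card (HTML stripping, code preservation, score gates, instruct-format output) might look roughly like this. Everything here is a hedged sketch: the dictionary keys, the `min_score` default, and the output field names are hypothetical, not the se_ultra_clean schema.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

FENCE = "`" * 3  # fence marker used to wrap preserved code blocks


class _TextWithCode(HTMLParser):
    """Drops tags, keeps text, and wraps <pre> blocks in fences."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.parts.append(f"\n{FENCE}\n")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.parts.append(f"\n{FENCE}\n")
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(body: str) -> str:
    parser = _TextWithCode()
    parser.feed(body)
    parser.close()
    return "".join(parser.parts).strip()


def to_instruct(question: Dict, answer: Dict, min_score: int = 1) -> Optional[Dict]:
    """Score gate plus instruct-format record (field names are assumptions)."""
    if answer["score"] < min_score:
        return None
    return {"instruction": clean_html(question["body"]),
            "response": clean_html(answer["body"])}


q = {"body": "<p>How do I print in Python?</p>"}
a = {"body": "<p>Use <code>print</code>:</p><pre><code>print(1)\n</code></pre>", "score": 12}
print(to_instruct(q, a))
```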
Python

QT Tokenizer Trainer V.4

Streaming two-stage SuperBPE tokenizer training with script-aware pre-tokenization, equity-balanced Stage 2 corpus builder, per-bucket chunk sizing, parity equity tracking, and FLORES-200 benchmarking.
SuperBPE | Streaming | Equity Builder | FLORES-200
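
One way to think about an equity-balanced corpus builder is as a set of per-bucket repeat factors that pull the resampled corpus toward target shares. The sketch below is our own illustration: the page says only that there are four buckets and that underserved scripts are oversampled, so the bucket names, sizes, and target shares are all hypothetical.

```python
from typing import Dict


def oversample_factors(bucket_chars: Dict[str, int],
                       target_share: Dict[str, float]) -> Dict[str, float]:
    """Repeat factor per bucket so the resampled corpus matches target_share.

    Factors are normalised so the least-boosted bucket has factor 1.0,
    meaning nothing is downsampled.
    """
    total = sum(bucket_chars.values())
    raw = {b: target_share[b] * total / bucket_chars[b] for b in bucket_chars}
    floor = min(raw.values())
    return {b: f / floor for b, f in raw.items()}


# Hypothetical bucket sizes (in characters) and target shares
sizes = {"latin": 800, "cjk": 100, "indic": 60, "underserved": 40}
targets = {"latin": 0.4, "cjk": 0.2, "indic": 0.2, "underserved": 0.2}
factors = oversample_factors(sizes, targets)
print({b: round(f, 2) for b, f in factors.items()})
```

With these toy numbers the smallest bucket is repeated about ten times, which is the general shape of the effect described as "oversamples underserved scripts".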
In Production

Validated in live model training

The proof of a data stack is in the models it produces. Quartz-cleaned data and QT tokenizers are currently powering AENEA's most advanced training runs.

QT V.4 Tokenizer Family

Live
QT V.4.6 32K Prelude is the current flagship. Equity-balanced Stage 2 corpus construction with four-bucket script oversampling gives it the best equity of any QT tokenizer to date, and it fixes Lao coverage. V.4.1 64K is the larger-vocabulary companion for 500M–2B parameter models (predates the Lao fix). V.4.4/V.4.1 32K and V.2/V.3 legacy variants remain available. All Apache 2.0, all on HuggingFace.
Languages: 72 | Scripts: 27 | Equity: 19.7× | vs Llama 3: −16.8% tokens

Prelude-5 Training Run

Live
The QT V.4.4 32K tokenizer is now powering AENEA's Prelude-5 training run, the first model trained on equity-balanced SuperBPE. Past step 40,000 on multilingual data, the run reports loss 2.249, perplexity 9.48, and gradient norm 0.208, and it is reaching factual crystallisation 5× faster than any prior model.
Tokenizer: QT V.4.4 32K | Step: 40,000+ | Loss: 2.249 | Grad Norm: 0.208
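
The reported loss and perplexity are consistent with each other if the loss is the mean per-token cross-entropy in nats, which is the usual convention but is an assumption here:

```python
import math

loss = 2.249                 # reported training loss
perplexity = math.exp(loss)  # assumes mean cross-entropy per token, in nats
print(round(perplexity, 2))  # 9.48
```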

Factual Crystallisation Hypothesis

Discovery
Training on Quartz-cleaned data has contributed to the discovery of the Factual Crystallisation Hypothesis - the finding that gradient norm, not loss, predicts the emergence of factual recall in language models. This validates Quartz's core premise: data quality is not preprocessing, it is architecture.
Threshold: ~0.27 grad norm | Predictor: Grad norm, not loss | Validates: Ultra-clean data thesis
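
For anyone wanting to track the same signal in their own runs, a global gradient norm is commonly computed as the L2 norm over all parameter gradients. The sketch below assumes that definition; the page does not say how the norm is measured or which side of the ~0.27 threshold marks the transition, so the code only records the comparison. The gradient values are made up.

```python
import math
from typing import Iterable, Sequence

THRESHOLD = 0.27  # approximate value quoted above


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """L2 norm over all gradient values, one flat sequence per parameter tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


# Toy gradients for two parameter tensors (illustrative numbers only)
norm = global_grad_norm([[0.1, -0.1], [0.12, 0.08]])
print(round(norm, 3), norm < THRESHOLD)
```

Logging this value next to the loss at every step is enough to plot both curves and see where they diverge.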
Enterprise

Production-grade data at scale

For teams training models commercially. We handle the cleaning, deduplication, licensing, and quality assurance - you focus on architecture.

Quartz Enterprise

Custom cleaning pipelines, domain-specific corpora, ongoing data delivery, and dedicated support for teams building production models.

Custom Corpora
Domain-specific datasets cleaned to your quality spec with full provenance tracking
Pipeline Licensing
Run our cleaning infrastructure on your proprietary data, on your hardware
Ongoing Delivery
Scheduled re-cleaning as source corpora update. Fresh data, same quality guarantees

The substrate matters

Clean data isn't a feature, it's the architecture. QT V.4.6 32K Prelude - our flagship - beats Llama 3 with 1/4 the vocabulary and 6× better cross-lingual equity. Open source, Apache 2.0, free forever. Start building on Quartz.