New — QT V.4 SuperBPE Tokenizer Family Now on HuggingFace

Clean data is the entire architecture

Quartz is the open data infrastructure layer behind AENEA. We publish ultra-clean datasets, the exact cleaning pipelines that produce them, and enterprise-grade data services for teams building their own models.

72 Languages
27 Script Families
32K / 64K V.4 Vocabulary
204 FLORES Languages
31.5× Best Equity (vs Llama 3's 118.6×)
Tokenizers

QT V.4 UltraLingo — SuperBPE

The most equitable multilingual tokenizer family available. Two variants — 64K for Overture (500M–2B) and 32K for Prelude (sub-500M) — covering 72 languages across 27 scripts. Both beat Llama 3 (128K vocab) at a fraction of the vocabulary, with the 32K achieving the best equity ratio of any QT tokenizer ever built.
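
A minimal loading sketch, assuming a transformers-compatible upload; the repository id below is a placeholder, not the published path:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual QT V.4 path published on HuggingFace.
tok = AutoTokenizer.from_pretrained("quartz/qt-v4.4-ultralingo-32k")

ids = tok("in order to build multilingual models", add_special_tokens=False)["input_ids"]
print(len(ids), tok.convert_ids_to_tokens(ids))  # SuperBPE merges can span word boundaries
```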

31.5× V.4.4 32K equity (best ever — vs Llama 3's 118.6×)
126/204 Languages where V.4.1 64K beats Llama 3
22.6% Fewer tokens than Llama 3 (64K variant)
26.7 Tibetan tok/word (Llama 3: 149.8)
FLORES-200 — The QT V.4 Family vs Llama 3 Latest
| Metric | V.4.4 32K | V.4.1 64K | Llama 3 (128K) |
| Vocabulary | 32,000 | 64,000 | 128,256 |
| Mean fertility (tok/word) | 4.231 | 3.917 | 5.716 |
| Equity ratio (lower = fairer) | 31.5× | 32.3× | 118.6× |
| Total tokens (204 langs) | 14,125,437 | 12,979,330 | 16,764,198 |
| Token savings vs Llama 3 | −15.7% | −22.6% | n/a |
| Tibetan (tok/word) | 26.70 | 33.89 | 149.79 |
| Thai (tok/word) | 12.88 | 11.74 | 14.03 |
| Tamil (tok/word) | 3.88 | 3.16 | 12.45 |
| Hebrew (tok/word) | 2.87 | 2.45 | 5.76 |
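
For reference, a sketch of how these two metrics can be computed. It assumes fertility means tokens per whitespace-delimited word and that the equity ratio is the worst-served language's fertility divided by the best-served language's; the actual FLORES-200 harness may differ. Whitespace word counts also inflate fertility for scripts written without spaces (Thai, Khmer, CJK, Tibetan), which is visible in the table above.

```python
# Assumed definitions, not the exact benchmarking harness:
#   fertility    = tokens emitted per whitespace-delimited word
#   equity ratio = worst-served fertility / best-served fertility (lower = fairer)
def fertility(tokenizer, sentences):
    tokens = sum(len(tokenizer.encode(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

def equity_ratio(fertility_by_language):
    vals = list(fertility_by_language.values())
    return max(vals) / min(vals)

# Usage: equity_ratio({lang: fertility(tok, flores_sentences[lang]) for lang in langs})
```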
Script Family — V.4.1 64K vs Llama 3 tok/word
| Script Family | V.4.1 64K | Llama 3 | Langs |
| Latin | 2.29 | 2.39 | 37 |
| Cyrillic | 2.47 | 2.59 | 5 |
| Hebrew | 2.45 | 5.76 | 2 |
| Arabic | 2.10 | 2.70 | 2 |
| Devanagari | 2.58 | 3.52 | 3 |
| Bengali | 2.95 | 8.07 | 1 |
| Tamil | 3.16 | 12.45 | 1 |
| Myanmar | 6.05 | 29.77 | 1 |
| Thai | 11.74 | 14.03 | 1 |
| Khmer | 13.29 | 40.91 | 1 |
| CJK | 18.80 | 19.75 | 4 |
| Tibetan | 33.89 | 149.79 | 1 |
V.4 Innovations Architecture
| Innovation | Impact |
| Two-Stage SuperBPE | Superword tokens spanning word boundaries ("of the", "in order to") |
| Streaming Sharded Training | Full 5 GB corpus + SuperBPE on 16 GB RAM hardware |
| Indic Script-Aware Pre-tok | Virama-aware syllable segmentation for 10 Indic scripts |
| Equity-Balanced Stage 2 | Four-bucket corpus builder oversamples underserved scripts — Tibetan 38.6 → 26.7 tok/word |
| Per-Bucket Chunk Sizing | CJK gets long chunks (1,000 chars), underserved scripts get short chunks (200 chars) to bound RAM |
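
To make the Indic pre-tokenization row concrete, here is an illustrative sketch of virama-aware syllable (akshara) segmentation for Devanagari only. The regex and character ranges are a simplification of the idea, not the shipped V.4 pre-tokenizer, which covers 10 Indic scripts.

```python
import re

# Sketch: keep consonant + virama (U+094D) conjunct clusters attached to the
# following base consonant, so orthographic syllables are never split apart.
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u093C?\u094D)*"  # consonant (+nukta) + virama clusters
    r"[\u0915-\u0939\u0958-\u095F]\u093C?"             # base consonant (+ optional nukta)
    r"[\u093A-\u094C]?"                                 # optional dependent vowel sign (matra)
    r"[\u0901-\u0903]?"                                 # optional candrabindu/anusvara/visarga
    r"\u094D?"                                          # optional word-final virama (halant)
    r"|[\u0904-\u0914][\u0901-\u0903]?"                 # independent vowel (+ sign)
    r"|\S"                                              # anything else, one character at a time
)

def segment_devanagari(text: str):
    """Return the orthographic syllables of a Devanagari string."""
    return AKSHARA.findall(text)

print(segment_devanagari("संस्कृतम्"))  # ['सं', 'स्कृ', 'त', 'म्'] — conjuncts stay intact
```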

Underserved language spotlight: Hebrew 2.45 tok/word (vs Llama 3's 5.76 — 57% reduction). Tamil 3.16 (vs 12.45 — 75% reduction). Tibetan 26.70 (vs 149.79 — 82% reduction).
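
And a sketch of the equity-balanced Stage 2 builder with per-bucket chunk sizing. Bucket membership and sampling weights here are illustrative assumptions; only the principle (oversample underserved scripts, cap chunk length per bucket to bound RAM) and the 1,000 / 200 character chunk sizes come from the innovations table.

```python
import itertools
import random

# Illustrative four-bucket builder. Weights and membership are assumptions,
# not the shipped V.4 configuration; chunk sizes follow the table above.
BUCKETS = {
    "high_resource": {"weight": 1.0, "chunk_chars": 600},
    "mid_resource":  {"weight": 2.0, "chunk_chars": 600},
    "cjk":           {"weight": 1.5, "chunk_chars": 1000},  # long chunks
    "underserved":   {"weight": 4.0, "chunk_chars": 200},   # Tibetan, Khmer, Myanmar, ...
}

def chunk(text, size):
    """Fixed-size character chunks, so a single long document never dominates RAM."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def stage2_stream(docs_by_bucket, num_draws, seed=0):
    """Yield an equity-balanced stream of text chunks for Stage 2 (SuperBPE) training."""
    rng = random.Random(seed)
    names = list(BUCKETS)
    weights = [BUCKETS[n]["weight"] for n in names]
    cycles = {n: itertools.cycle(docs_by_bucket[n]) for n in names}
    for _ in range(num_draws):
        bucket = rng.choices(names, weights=weights, k=1)[0]
        yield from chunk(next(cycles[bucket]), BUCKETS[bucket]["chunk_chars"])
```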

Archive

Previous generations

Earlier QT generations remain available on HuggingFace. V.3 32K SuperBPE pioneered two-stage training. V.2 offers 64K, 96K, and 114K Code variants. V.4.1 32K is also available as the non-equity-balanced 32K option.

Datasets

The cleanest training corpora available

Every dataset is produced by our multi-pass cleaning pipelines with MinHash dedup, lint gates, and structural validation. We publish the exact scripts alongside the data — reproducibility is non-negotiable.
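
As an illustration of the dedup pass, a minimal MinHash sketch using the datasketch library. Shingle size, permutation count, and the 0.8 Jaccard threshold are generic defaults, not the exact settings of the Quartz pipelines.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, shingle=5):
    """Character-shingle MinHash signature for one document."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.8):
    """Keep only documents with no near-duplicate already in the LSH index."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        if lsh.query(sig):          # a near-duplicate is already indexed
            continue
        lsh.insert(str(idx), sig)
        kept.append(doc)
    return kept
```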

Wikipedia Multilingual v7.3

Open
Ultra-clean multilingual Wikipedia. 71 languages, 26 script families. Three-pass pipeline with script-aware quality filters, MinHash dedup, and --light-clean mode for tokenizer training.
Languages: 71 · Scripts: 26 families · Format: JSONL.gz

Stack Exchange Q&A v1.0

Open
23 Stack Exchange sites cleaned into instruct-format Q&A pairs. Only accepted/top-voted answers. HTML stripped, code preserved, noise removed.
Tokens: ~3.6B · Pairs: ~8.2M · Format: JSONL.gz

QT Tokenizer Family

Open
The QT V.4 tokenizer family: V.4.1 64K (Overture), V.4.4 32K (Prelude), plus V.2/V.3 legacy variants. Two-stage SuperBPE with equity-balanced corpus construction, streaming sharded training, and Indic script-aware pre-tokenization. 31.5× equity ratio — 3.8× fairer than Llama 3.
Sizes: 32K / 64K · Languages: 72 + code · Format: HuggingFace JSON

Custom Enterprise Corpora

Enterprise
Domain-specific datasets cleaned to your specification. Legal, medical, financial, and scientific corpora with full provenance and licensing.
Quality: Audited · License: Custom · SLA: Available
Open Source

The pipelines that produce the data

We don't just publish datasets. We publish the exact cleaning scripts that created them. Fork them, adapt them, run them on your own dumps.

Python

wiki_ultra_clean v7.3

Multilingual Wikipedia pipeline. 71 languages, 26 script families including Odia, Tibetan, Thaana, N'Ko, and Tifinagh. Script-aware quality filters, MinHash dedup, --light-clean mode.
71 languages · BZ2 → JSONL · v7.3
Python

se_ultra_clean v1

Stack Exchange pipeline. Two-pass Q&A stitching, HTML-in-XML cleaning, score gates, code preservation, instruct-format output.
7z → JSONL · 49 sites · Q&A pairs
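
A small sketch of the strip-HTML-but-preserve-code idea using BeautifulSoup: markup is dropped while the contents of <pre> blocks are kept fenced so instruct formatting can retain them. This illustrates the technique, not the se_ultra_clean implementation.

```python
from bs4 import BeautifulSoup

def html_to_text_keep_code(html: str) -> str:
    """Strip HTML markup but keep <pre> code blocks, fenced, in the output text."""
    soup = BeautifulSoup(html, "html.parser")
    for pre in soup.find_all("pre"):
        code = pre.get_text()
        pre.replace_with(f"\n```\n{code}\n```\n")   # fence the preserved code
    return soup.get_text(separator=" ").strip()
```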
Python

QT Tokenizer Trainer V.4

Streaming two-stage SuperBPE tokenizer training with script-aware pre-tokenization, equity-balanced Stage 2 corpus builder, per-bucket chunk sizing, parity equity tracking, and FLORES-200 benchmarking.
SuperBPE · Streaming · Equity Builder · FLORES-200
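
A minimal sketch of the streaming idea using the HuggingFace tokenizers trainer: shards are read through a generator, so the full corpus never sits in RAM. Shard paths and the vocabulary size are placeholders, and only Stage 1 is shown; the Stage 2 SuperBPE pass, which relaxes whitespace splits so merges can cross word boundaries, is noted in a comment but not reproduced.

```python
import glob
import gzip
import json
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def stream_shards(pattern="shards/*.jsonl.gz"):
    """Yield one text record at a time from gzip'd JSONL shards (placeholder paths)."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)["text"]

# Stage 1: ordinary subword BPE trained from the streaming iterator.
# The actual V.4 trainer follows this with a Stage 2 SuperBPE pass that drops
# whitespace pre-tokenization so merges can span word boundaries.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(stream_shards(), trainer=trainer)
tokenizer.save("qt_stage1.json")
```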
In Production

Validated in live model training

The proof of a data stack is in the models it produces. Quartz-cleaned data and QT tokenizers are currently powering AENEA's most advanced training runs.

QT V.4 Tokenizer Family

Live
Three tokenizers live on HuggingFace: V.4.1 64K (Overture, 500M–2B), V.4.4 32K (Prelude, sub-500M), and V.4.1 32K. The V.4.4 32K achieves 31.5× equity — the best of any QT tokenizer — through equity-balanced Stage 2 corpus construction with four-bucket script oversampling.
Languages: 72 · Scripts: 27 · Equity: 31.5× · vs Llama 3: −22.6% tokens

Prelude-5 Training Run

Live
The QT V.4.4 32K UltraLingo tokenizer is now powering AENEA's Prelude-5 training run, the first model trained on equity-balanced SuperBPE. Phase 1 (English) is underway, with Phase 2 (72 languages) to follow.
Tokenizer: QT V.4.4 32K · Phase 1 — English · Equity: 31.5×

Factual Crystallisation Hypothesis

Discovery
Training on Quartz-cleaned data has contributed to the discovery of the Factual Crystallisation Hypothesis — the finding that gradient norm, not loss, predicts the emergence of factual recall in language models. This validates Quartz's core premise: data quality is not preprocessing, it is architecture.
Threshold: ~0.27 grad norm · Predictor: Grad norm, not loss · Validates: Ultra-clean data thesis
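
For teams that want to watch the same signal, a minimal sketch of logging the global gradient norm during training. The ~0.27 figure is quoted from the card above; the probe mentioned in the comments is a placeholder, not a published API.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm of all parameter gradients, computed after loss.backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total ** 0.5

# In the training loop, after loss.backward():
#     gnorm = global_grad_norm(model)
#     record the value alongside loss, and schedule a factual-recall probe
#     when gnorm enters the ~0.27 region (probe and crossing direction are
#     placeholders for whatever evaluation the team runs).
```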
Enterprise

Production-grade data at scale

For teams training models commercially. We handle the cleaning, deduplication, licensing, and quality assurance — you focus on architecture.

Quartz Enterprise

Custom cleaning pipelines, domain-specific corpora, ongoing data delivery, and dedicated support for teams building production models.

Custom Corpora
Domain-specific datasets cleaned to your quality spec with full provenance tracking
Pipeline Licensing
Run our cleaning infrastructure on your proprietary data, on your hardware
Ongoing Delivery
Scheduled re-cleaning as source corpora update. Fresh data, same quality guarantees

The substrate matters

Clean data isn't a feature. It's why QT V.4 beats Llama 3 on 126 languages with half the vocabulary and 3.8× better equity. Start building on Quartz.