New — QT V.4 SuperBPE Tokenizer Family Now on HuggingFace

Clean data is the entire architecture

Quartz is the open data infrastructure layer behind AENEA. We publish ultra-clean datasets, the exact cleaning pipelines that produce them, and enterprise-grade data services for teams building their own models.

72 Languages
27 Script Families
32K / 64K V.4 Vocabulary
204 FLORES Languages
31.5× Best Equity (vs Llama 3's 118.6×)
Tokenizers

QT V.4 UltraLingo — SuperBPE

The most equitable multilingual tokenizer family available. Two variants — 64K for Overture (500M–2B) and 32K for Prelude (sub-500M) — covering 72 languages across 27 scripts. Both beat Llama 3 (128K vocab) at a fraction of the vocabulary, with the 32K achieving the best equity ratio of any QT tokenizer ever built.
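
A minimal loading sketch, assuming a transformers-compatible upload; the repository id below is a placeholder, not the published path:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual QT V.4 path published on HuggingFace.
tok = AutoTokenizer.from_pretrained("quartz/qt-v4.4-ultralingo-32k")

ids = tok("in order to build multilingual models", add_special_tokens=False)["input_ids"]
print(len(ids), tok.convert_ids_to_tokens(ids))  # SuperBPE merges can span word boundaries
```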

31.5× V.4.4 32K equity (best ever — vs Llama 3's 118.6×)
126/204 Languages where V.4.1 64K beats Llama 3
22.6% Fewer tokens than Llama 3 (64K variant)
26.7 Tibetan tok/word (Llama 3: 149.8)
FLORES-200 — The QT V.4 Family vs Llama 3 Latest
| Metric | V.4.4 32K | V.4.1 64K | Llama 3 (128K) |
| Vocabulary | 32,000 | 64,000 | 128,256 |
| Mean fertility (tok/word) | 4.231 | 3.917 | 5.716 |
| Equity ratio (lower = fairer) | 31.5× | 32.3× | 118.6× |
| Total tokens (204 langs) | 14,125,437 | 12,979,330 | 16,764,198 |
| Token savings vs Llama 3 | −15.7% | −22.6% | n/a |
| Tibetan (tok/word) | 26.70 | 33.89 | 149.79 |
| Thai (tok/word) | 12.88 | 11.74 | 14.03 |
| Tamil (tok/word) | 3.88 | 3.16 | 12.45 |
| Hebrew (tok/word) | 2.87 | 2.45 | 5.76 |
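
For reference, a sketch of how these two metrics can be computed. It assumes fertility means tokens per whitespace-delimited word and that the equity ratio is the worst-served language's fertility divided by the best-served language's; the actual FLORES-200 harness may differ. Whitespace word counts also inflate fertility for scripts written without spaces (Thai, Khmer, CJK, Tibetan), which is visible in the table above.

```python
# Assumed definitions, not the exact benchmarking harness:
#   fertility    = tokens emitted per whitespace-delimited word
#   equity ratio = worst-served fertility / best-served fertility (lower = fairer)
def fertility(tokenizer, sentences):
    tokens = sum(len(tokenizer.encode(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

def equity_ratio(fertility_by_language):
    vals = list(fertility_by_language.values())
    return max(vals) / min(vals)

# Usage: equity_ratio({lang: fertility(tok, flores_sentences[lang]) for lang in langs})
```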
Script Family — V.4.1 64K vs Llama 3 tok/word
| Script Family | V.4.1 64K | Llama 3 | Langs |
| Latin | 2.29 | 2.39 | 37 |
| Cyrillic | 2.47 | 2.59 | 5 |
| Hebrew | 2.45 | 5.76 | 2 |
| Arabic | 2.10 | 2.70 | 2 |
| Devanagari | 2.58 | 3.52 | 3 |
| Bengali | 2.95 | 8.07 | 1 |
| Tamil | 3.16 | 12.45 | 1 |
| Myanmar | 6.05 | 29.77 | 1 |
| Thai | 11.74 | 14.03 | 1 |
| Khmer | 13.29 | 40.91 | 1 |
| CJK | 18.80 | 19.75 | 4 |
| Tibetan | 33.89 | 149.79 | 1 |
V.4 Innovations Architecture
| Innovation | Impact |
| Two-Stage SuperBPE | Superword tokens spanning word boundaries ("of the", "in order to") |
| Streaming Sharded Training | Full 5 GB corpus + SuperBPE on 16 GB RAM hardware |
| Indic Script-Aware Pre-tok | Virama-aware syllable segmentation for 10 Indic scripts |
| Equity-Balanced Stage 2 | Four-bucket corpus builder oversamples underserved scripts — Tibetan 38.6 → 26.7 tok/word |
| Per-Bucket Chunk Sizing | CJK gets long chunks (1,000 chars), underserved scripts get short chunks (200 chars) to bound RAM |
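
To make the Indic pre-tokenization row concrete, here is an illustrative sketch of virama-aware syllable (akshara) segmentation for Devanagari only. The regex and character ranges are a simplification of the idea, not the shipped V.4 pre-tokenizer, which covers 10 Indic scripts.

```python
import re

# Sketch: keep consonant + virama (U+094D) conjunct clusters attached to the
# following base consonant, so orthographic syllables are never split apart.
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u093C?\u094D)*"  # consonant (+nukta) + virama clusters
    r"[\u0915-\u0939\u0958-\u095F]\u093C?"             # base consonant (+ optional nukta)
    r"[\u093A-\u094C]?"                                 # optional dependent vowel sign (matra)
    r"[\u0901-\u0903]?"                                 # optional candrabindu/anusvara/visarga
    r"\u094D?"                                          # optional word-final virama (halant)
    r"|[\u0904-\u0914][\u0901-\u0903]?"                 # independent vowel (+ sign)
    r"|\S"                                              # anything else, one character at a time
)

def segment_devanagari(text: str):
    """Return the orthographic syllables of a Devanagari string."""
    return AKSHARA.findall(text)

print(segment_devanagari("संस्कृतम्"))  # ['सं', 'स्कृ', 'त', 'म्'] — conjuncts stay intact
```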

Underserved language spotlight: Hebrew 2.45 tok/word (vs Llama 3's 5.76 — 57% reduction). Tamil 3.16 (vs 12.45 — 75% reduction). Tibetan 26.70 (vs 149.79 — 82% reduction).
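
And a sketch of the equity-balanced Stage 2 builder with per-bucket chunk sizing. Bucket membership and sampling weights here are illustrative assumptions; only the principle (oversample underserved scripts, cap chunk length per bucket to bound RAM) and the 1,000 / 200 character chunk sizes come from the innovations table.

```python
import itertools
import random

# Illustrative four-bucket builder. Weights and membership are assumptions,
# not the shipped V.4 configuration; chunk sizes follow the table above.
BUCKETS = {
    "high_resource": {"weight": 1.0, "chunk_chars": 600},
    "mid_resource":  {"weight": 2.0, "chunk_chars": 600},
    "cjk":           {"weight": 1.5, "chunk_chars": 1000},  # long chunks
    "underserved":   {"weight": 4.0, "chunk_chars": 200},   # Tibetan, Khmer, Myanmar, ...
}

def chunk(text, size):
    """Fixed-size character chunks, so a single long document never dominates RAM."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def stage2_stream(docs_by_bucket, num_draws, seed=0):
    """Yield an equity-balanced stream of text chunks for Stage 2 (SuperBPE) training."""
    rng = random.Random(seed)
    names = list(BUCKETS)
    weights = [BUCKETS[n]["weight"] for n in names]
    cycles = {n: itertools.cycle(docs_by_bucket[n]) for n in names}
    for _ in range(num_draws):
        bucket = rng.choices(names, weights=weights, k=1)[0]
        yield from chunk(next(cycles[bucket]), BUCKETS[bucket]["chunk_chars"])
```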

Archive

Previous generations

Earlier QT generations remain available on HuggingFace. V.3 32K SuperBPE pioneered two-stage training. V.2 offers 64K, 96K, and 114K Code variants. V.4.1 32K is also available as the non-equity-balanced 32K option.

Datasets

The cleanest training corpora available

Every dataset is produced by our multi-pass cleaning pipelines with MinHash dedup, lint gates, and structural validation. We publish the exact scripts alongside the data — reproducibility is non-negotiable.
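
As an illustration of the dedup pass, a minimal MinHash sketch using the datasketch library. Shingle size, permutation count, and the 0.8 Jaccard threshold are generic defaults, not the exact settings of the Quartz pipelines.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, shingle=5):
    """Character-shingle MinHash signature for one document."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.8):
    """Keep only documents with no near-duplicate already in the LSH index."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        if lsh.query(sig):          # a near-duplicate is already indexed
            continue
        lsh.insert(str(idx), sig)
        kept.append(doc)
    return kept
```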

Wikipedia Multilingual v7.3

Open
Ultra-clean multilingual Wikipedia. 71 languages, 26 script families. Three-pass pipeline with script-aware quality filters, MinHash dedup, and --light-clean mode for tokenizer training.
Languages: 71 · Scripts: 26 families · Format: JSONL.gz

Stack Exchange Q&A v1.0

Open
23 Stack Exchange sites cleaned into instruct-format Q&A pairs. Only accepted/top-voted answers. HTML stripped, code preserved, noise removed.
Tokens: ~3.6B · Pairs: ~8.2M · Format: JSONL.gz

QT Tokenizer Family

Open
The QT V.4 tokenizer family: V.4.1 64K (Overture), V.4.4 32K (Prelude), plus V.2/V.3 legacy variants. Two-stage SuperBPE with equity-balanced corpus construction, streaming sharded training, and Indic script-aware pre-tokenization. 31.5× equity ratio — 3.8× fairer than Llama 3.
Sizes: 32K / 64K · Languages: 72 + code · Format: HuggingFace JSON

Custom Enterprise Corpora

Enterprise
Domain-specific datasets cleaned to your specification. Legal, medical, financial, and scientific corpora with full provenance and licensing.
Quality: Audited · License: Custom · SLA: Available
Open Source

The pipelines that produce the data

We don't just publish datasets. We publish the exact cleaning scripts that created them. Fork them, adapt them, run them on your own dumps.

Python

wiki_ultra_clean v7.3

Multilingual Wikipedia pipeline. 71 languages, 26 script families including Odia, Tibetan, Thaana, N'Ko, and Tifinagh. Script-aware quality filters, MinHash dedup, --light-clean mode.
71 languages · BZ2 → JSONL · v7.3
Python

se_ultra_clean v1

Stack Exchange pipeline. Two-pass Q&A stitching, HTML-in-XML cleaning, score gates, code preservation, instruct-format output.
7z → JSONL · 49 sites · Q&A pairs
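
A small sketch of the strip-HTML-but-preserve-code idea using BeautifulSoup: markup is dropped while the contents of <pre> blocks are kept fenced so instruct formatting can retain them. This illustrates the technique, not the se_ultra_clean implementation.

```python
from bs4 import BeautifulSoup

def html_to_text_keep_code(html: str) -> str:
    """Strip HTML markup but keep <pre> code blocks, fenced, in the output text."""
    soup = BeautifulSoup(html, "html.parser")
    for pre in soup.find_all("pre"):
        code = pre.get_text()
        pre.replace_with(f"\n```\n{code}\n```\n")   # fence the preserved code
    return soup.get_text(separator=" ").strip()
```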
Python

QT Tokenizer Trainer V.4

Streaming two-stage SuperBPE tokenizer training with script-aware pre-tokenization, equity-balanced Stage 2 corpus builder, per-bucket chunk sizing, parity equity tracking, and FLORES-200 benchmarking.
SuperBPE · Streaming · Equity Builder · FLORES-200
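
A minimal sketch of the streaming idea using the HuggingFace tokenizers trainer: shards are read through a generator, so the full corpus never sits in RAM. Shard paths and the vocabulary size are placeholders, and only Stage 1 is shown; the Stage 2 SuperBPE pass, which relaxes whitespace splits so merges can cross word boundaries, is noted in a comment but not reproduced.

```python
import glob
import gzip
import json
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def stream_shards(pattern="shards/*.jsonl.gz"):
    """Yield one text record at a time from gzip'd JSONL shards (placeholder paths)."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)["text"]

# Stage 1: ordinary subword BPE trained from the streaming iterator.
# The actual V.4 trainer follows this with a Stage 2 SuperBPE pass that drops
# whitespace pre-tokenization so merges can span word boundaries.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(stream_shards(), trainer=trainer)
tokenizer.save("qt_stage1.json")
```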
In Production

Validated in live model training

The proof of a data stack is in the models it produces. Quartz-cleaned data and QT tokenizers are currently powering AENEA's most advanced training runs.

QT V.4 Tokenizer Family

Live
Three tokenizers live on HuggingFace: V.4.1 64K (Overture, 500M–2B), V.4.4 32K (Prelude, sub-500M), and V.4.1 32K. The V.4.4 32K achieves 31.5× equity — the best of any QT tokenizer — through equity-balanced Stage 2 corpus construction with four-bucket script oversampling.
Languages: 72 · Scripts: 27 · Equity: 31.5× · vs Llama 3: −22.6% tokens

Prelude-5 Training Run

Live
The QT V.4.4 32K UltraLingo tokenizer is now powering AENEA's Prelude-5 training run, the first model trained on equity-balanced SuperBPE. Phase 1 (English) is underway, with Phase 2 (72 languages) to follow.
Tokenizer: QT V.4.4 32K · Phase 1 — English · Equity: 31.5×

Factual Crystallisation Hypothesis

Discovery
Training on Quartz-cleaned data has contributed to the discovery of the Factual Crystallisation Hypothesis — the finding that gradient norm, not loss, predicts the emergence of factual recall in language models. This validates Quartz's core premise: data quality is not preprocessing, it is architecture.
Threshold: ~0.27 grad norm · Predictor: Grad norm, not loss · Validates: Ultra-clean data thesis
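
For teams that want to watch the same signal, a minimal sketch of logging the global gradient norm during training. The ~0.27 figure is quoted from the card above; the probe mentioned in the comments is a placeholder, not a published API.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm of all parameter gradients, computed after loss.backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total ** 0.5

# In the training loop, after loss.backward():
#     gnorm = global_grad_norm(model)
#     record the value alongside loss, and schedule a factual-recall probe
#     when gnorm enters the ~0.27 region (probe and crossing direction are
#     placeholders for whatever evaluation the team runs).
```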
Enterprise

Production-grade data at scale

For teams training models commercially. We handle the cleaning, deduplication, licensing, and quality assurance — you focus on architecture.

Quartz Enterprise

Custom cleaning pipelines, domain-specific corpora, ongoing data delivery, and dedicated support for teams building production models.

Custom Corpora
Domain-specific datasets cleaned to your quality spec with full provenance tracking
Pipeline Licensing
Run our cleaning infrastructure on your proprietary data, on your hardware
Ongoing Delivery
Scheduled re-cleaning as source corpora update. Fresh data, same quality guarantees

The substrate matters

Clean data isn't a feature. It's why QT V.4 beats Llama 3 on 126 languages with half the vocabulary and 3.8× better equity. Start building on Quartz.