QT VI.1.3 32K Prelude - new flagship tokenizer, live on HuggingFace

Clean data is the entire architecture

Quartz is the open data infrastructure layer behind AENEA. We publish ultra-clean datasets, the exact cleaning pipelines that produce them, and enterprise-grade data services for teams building their own models.


Languages

Script Families

32K / 64K

V.4 Vocabulary

204

FLORES Languages

100%

FLORES coverage (Llama 2 / Mistral: 83%)

Tokenizers

QT VI.1.3 - the new flagship

QT VI.1.3 32K Prelude is our new flagship tokenizer: a 32,000-token vocabulary, lossless coverage of all 204 FLORES-200 languages, and English tuned to stay competitive with industry 32K tokenizers. At the same 32K budget, Llama 2 and Mistral fall back to raw bytes on 34 of those languages; VI.1.3 covers every one, and it wins on the majority of the remaining 170 (107 against Llama 2, 137 against Mistral). It fixes the English over-fragmentation of earlier QT releases while staying more equitable than V.4.6 (equity ratio 6.4× vs 7.6×). The previous flagship, V.4.6 32K, remains available and still leads on the lowest-resource complex scripts; the planned 64K Overture tier is intended to combine both. All Apache 2.0: free to use, modify, and deploy.

204/204

FLORES-200 languages covered losslessly (Llama 2 & Mistral: 170)

+0.79

English chars/token gain vs V.4.6 (3.90 vs 3.11 - fragmentation fixed)

107/170

Languages where VI.1.3 beats Llama 2 (137/170 vs Mistral)

6.4×

Equity ratio on FLORES-200 (fairer than V.4.6's 7.6×)

FLORES-200 - QT VI.1.3 32K vs Llama 2 32K vs Mistral 32K Latest

Metric	VI.1.3 32K	Llama 2 32K	Mistral 32K
Languages covered (of 204)	204	170	170
Wins vs VI.1.3 (of 170 shared)	-	63	33
English (chars/token)	3.90	4.24	4.26
German (chars/token)	3.17	3.50	3.17
Russian (chars/token)	2.43	2.82	2.51
Chinese (chars/token)	0.94	0.70	0.88
Japanese (chars/token)	1.25	0.82	0.87
Arabic (chars/token)	2.11	1.09	1.11
Hindi (chars/token)	2.12	0.92	0.95
Thai (chars/token)	1.80	0.93	0.99

How to read this: higher chars/token means better compression. At the same 32K budget, VI.1.3 covers all 204 FLORES-200 languages losslessly, while Llama 2 and Mistral fall back to raw bytes on 34 (CJK, Tibetan, Telugu, Myanmar, Armenian, Lao and more). On the rows shown, VI.1.3 trails Llama 2 by about 8% on English, 9% on German and 14% on Russian - the honest cost of universal coverage - but leads both baselines on Chinese, Japanese, Arabic, Hindi and Thai, roughly doubling their efficiency on the last three. Across the 170 languages Llama 2 can tokenize at all, VI.1.3 wins on 107; against Mistral, 137.
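The percentages in the reading guide can be re-derived from the table itself. A minimal sketch, using only the published chars/token figures; the helper name `relative_gap` is ours for illustration and is not part of any QT tooling:

```python
# Chars/token from the FLORES-200 table: (VI.1.3 32K, Llama 2 32K, Mistral 32K).
CHARS_PER_TOKEN = {
    "English": (3.90, 4.24, 4.26),
    "German": (3.17, 3.50, 3.17),
    "Russian": (2.43, 2.82, 2.51),
    "Chinese": (0.94, 0.70, 0.88),
    "Japanese": (1.25, 0.82, 0.87),
    "Arabic": (2.11, 1.09, 1.11),
    "Hindi": (2.12, 0.92, 0.95),
    "Thai": (1.80, 0.93, 0.99),
}


def relative_gap(ours: float, theirs: float) -> float:
    """Fractional difference in chars/token; positive means `ours` compresses better."""
    return (ours - theirs) / theirs


for lang, (qt, llama2, mistral) in CHARS_PER_TOKEN.items():
    print(
        f"{lang:<9} vs Llama 2 {relative_gap(qt, llama2):+.0%}"
        f"   vs Mistral {relative_gap(qt, mistral):+.0%}"
    )
```

A gap near +100% corresponds to "roughly doubling" the baseline's efficiency.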

Complex scripts - V.4.6 32K (previous flagship) vs Llama 3, tokens/word (lower is better)

Script Family	V.4.6 32K	Llama 3	Langs
Arabic	2.51	2.70	2
Latin	2.58	2.39	37
Hebrew	2.83	5.76	2
Gurmukhi	2.74	8.23	1
Devanagari	2.80	3.52	3
Bengali	3.17	8.07	1
Tamil	3.79	12.45	1
Myanmar	6.10	29.77	1
Thai	12.55	14.03	1
Khmer	13.55	40.91	1
CJK	19.94	19.75	4
Tibetan	27.21	149.79	1

V.4 Innovations Architecture

Innovation	Impact
Two-Stage SuperBPE	Superword tokens spanning word boundaries (of the, in order to)
Streaming Sharded Training	Full 5 GB corpus + SuperBPE on 16 GB RAM hardware
Indic Script-Aware Pre-tok	Virama-aware syllable segmentation for 10 Indic scripts
Equity-Balanced Stage 2	Four-bucket corpus builder oversamples underserved scripts - in V.4.6, Tibetan drops from 38.6 to 27.2 tokens/word
Per-Bucket Chunk Sizing	CJK gets long chunks (1000 chars), underserved scripts get short chunks (200 chars) to bound RAM
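Per-bucket chunk sizing from the table above can be illustrated with a small sketch. The 1000- and 200-character sizes come from the table; the bucket keys, the default size for other buckets, and the function name are hypothetical, and the real trainer may split on different boundaries:

```python
from typing import Iterator

# Sizes for the two buckets named in the table.
CHUNK_CHARS = {"cjk": 1000, "underserved": 200}
# Hypothetical default for buckets the table does not describe.
DEFAULT_CHUNK_CHARS = 500


def chunk_text(text: str, bucket: str) -> Iterator[str]:
    """Yield fixed-size character chunks so per-chunk memory stays bounded."""
    size = CHUNK_CHARS.get(bucket, DEFAULT_CHUNK_CHARS)
    for start in range(0, len(text), size):
        yield text[start:start + size]


# Chunk lengths for a 450-character document in the "underserved" bucket.
print([len(c) for c in chunk_text("x" * 450, "underserved")])
```

Shorter chunks give more, smaller work units for a given corpus, which is one plausible way to bound peak RAM during streaming training.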

Choosing between them: VI.1.3 is the new flagship for English, coverage and fairness. V.4.6 (previous flagship) still leads on the lowest-resource complex scripts and is the pick for complex-script-heavy work: Tibetan 27.21 tok/word (vs Llama 3's 149.79 - 82% reduction), Tamil 3.79 (vs 12.45 - 70%), Khmer 13.55 (vs 40.91 - 67%), Hebrew 2.83 (vs 5.76 - 51%). The planned 64K Overture tier is intended to combine VI.1.3's English with V.4.6's complex-script strength.
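The reduction figures quoted for V.4.6 follow from the tokens/word table. A short check, assuming "reduction" means one minus the ratio of V.4.6 tokens/word to Llama 3 tokens/word (the function name is ours):

```python
# Tokens/word from the complex-scripts table: (V.4.6 32K, Llama 3).
TOK_PER_WORD = {
    "Tibetan": (27.21, 149.79),
    "Tamil": (3.79, 12.45),
    "Khmer": (13.55, 40.91),
    "Hebrew": (2.83, 5.76),
}


def reduction(qt: float, baseline: float) -> float:
    """Fraction of tokens saved relative to the baseline (lower tok/word is better)."""
    return 1.0 - qt / baseline


for script, (qt, llama3) in TOK_PER_WORD.items():
    print(f"{script:<8} {reduction(qt, llama3):.0%} fewer tokens per word")
```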

VI.1.3 32K - New Flagship V.4.6 32K - Complex Scripts

Previous generations

Earlier QT generations remain available on HuggingFace. V.4.6 32K was the previous flagship and still leads on the lowest-resource complex scripts. V.4.1 64K suits larger models (500M-2B parameters) where the extra vocabulary headroom matters. V.3 32K SuperBPE pioneered two-stage training. V.2 offers 64K, 96K, and 114K Code variants. V.4.1 32K is the non-equity-balanced 32K option.

V.4.1 64K V.4.4 32K V.4.1 32K V.3 32K V.2 96K V.2 64K V.2 114K Code

Datasets

The cleanest training corpora available

Every dataset is produced by our multi-pass cleaning pipelines with MinHash dedup, lint gates, and structural validation. We publish the exact scripts alongside the data - reproducibility is non-negotiable.
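For readers unfamiliar with MinHash dedup, here is a minimal pure-Python sketch of the general technique. It is not the Quartz pipeline: the shingle size, number of hashes, similarity threshold, and all function names are illustrative assumptions, and production pipelines typically add LSH banding instead of the pairwise comparison used here.

```python
import hashlib

NUM_HASHES = 64   # assumed signature length
SHINGLE_LEN = 5   # assumed character n-gram size


def shingles(text: str, k: int = SHINGLE_LEN) -> set:
    """Character k-grams of the lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text: str, num_hashes: int = NUM_HASHES) -> list:
    """One minimum per salted hash function; similar sets share many minima."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(a: list, b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if no already-kept document is a near-duplicate."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",   # same text after normalisation
    "Completely different sentence about tokenizers.",
]
print(len(dedup(docs)))
```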

Wikipedia Multilingual v7.3

Open

Ultra-clean multilingual Wikipedia. 72 languages, 27 script families. Three-pass pipeline with script-aware quality filters, MinHash dedup, and --light-clean mode for tokenizer training.

Languages 72 Scripts 27 families Format JSONL.gz

Stack Exchange Q&A v1.0

Open

23 Stack Exchange sites cleaned into instruct-format Q&A pairs. Only accepted/top-voted answers. HTML stripped, code preserved, noise removed.
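A rough idea of "HTML stripped, code preserved" can be given with the standard-library HTML parser. This is a sketch under assumptions, not the published pipeline: the class name, the choice to re-emit `<pre>` blocks as fenced text, and the handling of block tags are all illustrative.

```python
import re
from html.parser import HTMLParser

FENCE = "`" * 3  # fence marker used to keep code blocks recognisable


class CodePreservingStripper(HTMLParser):
    """Drop tags, collapse whitespace in prose, keep <pre> content verbatim."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_pre = False

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
            self.parts.append("\n" + FENCE + "\n")
        elif tag in ("p", "br", "li"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False
            self.parts.append("\n" + FENCE + "\n")

    def handle_data(self, data):
        # Code keeps its exact whitespace; prose has runs of whitespace collapsed.
        self.parts.append(data if self.in_pre else re.sub(r"\s+", " ", data))


def strip_html(html: str) -> str:
    parser = CodePreservingStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = "<p>Use <code>len</code> to count:</p><pre><code>n = len(xs)\nprint(n)</code></pre>"
print(strip_html(sample))
```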

Tokens ~3.6B Pairs ~8.2M Format JSONL.gz

QT Tokenizer Family

Open

The QT tokenizer family. New flagship: VI.1.3 32K Prelude (English fixed, all 204 FLORES-200 languages covered losslessly, competitive with industry 32K tokenizers). Previous flagship V.4.6 32K still leads on complex scripts. Companion V.4.1 64K suits 500M-2B models. Plus V.4.4/V.4.1 32K and V.2/V.3 legacy variants. Two-stage SuperBPE with equity-balanced corpus construction, streaming sharded training, and Indic script-aware pre-tokenization. All Apache 2.0.

Flagship VI.1.3 32K Coverage 204 / 204 License Apache 2.0


Custom Enterprise Corpora

Enterprise

Domain-specific datasets cleaned to your specification. Legal, medical, financial, and scientific corpora with full provenance and licensing.

Quality Audited License Custom SLA Available

Open Source

The pipelines that produce the data

We don't just publish datasets. We publish the exact cleaning scripts that created them. Fork them, adapt them, run them on your own dumps.

Python

wiki_ultra_clean v7.3

Multilingual Wikipedia pipeline. 72 languages, 27 script families including Odia, Tibetan, Thaana, N'Ko, and Tifinagh. Script-aware quality filters, MinHash dedup, --light-clean mode.

72 languages BZ2 → JSONL v7.3

Python

se_ultra_clean v1

Stack Exchange pipeline. Two-pass Q&A stitching, HTML-in-XML cleaning, score gates, code preservation, instruct-format output.

7z → JSONL 49 sites Q&A pairs

Python

QT Tokenizer Trainer V.4

Streaming two-stage SuperBPE tokenizer training with script-aware pre-tokenization, equity-balanced Stage 2 corpus builder, per-bucket chunk sizing, parity equity tracking, and FLORES-200 benchmarking.
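Script-aware pre-tokenization for Indic scripts can be illustrated with a simplified virama-aware segmenter. This is a sketch of the general idea, not the trainer's code: it treats any character with Unicode canonical combining class 9 as a virama, attaches combining marks to the preceding base character, and ignores script-specific exceptions.

```python
import unicodedata

VIRAMA_CCC = 9  # Unicode canonical combining class assigned to virama signs


def segment_syllables(text: str) -> list:
    """Group each base character with its marks and virama-joined consonants."""
    syllables = []
    prev_was_virama = False
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if syllables and (is_mark or prev_was_virama):
            # Vowel signs, nasal marks, and consonants after a virama stay attached.
            syllables[-1] += ch
        else:
            syllables.append(ch)
        prev_was_virama = unicodedata.combining(ch) == VIRAMA_CCC
    return syllables


# Devanagari "namaste": NA, MA, then the conjunct SA + VIRAMA + TA + vowel sign E.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(segment_syllables(word))
```

Keeping a conjunct such as SA + VIRAMA + TA together means a BPE merge never has to start from a dangling virama.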

SuperBPE Streaming Equity Builder FLORES-200

In Production

Validated in live model training

The proof of a data stack is in the models it produces. Quartz-cleaned data and QT tokenizers are currently powering AENEA's most advanced training runs.

QT V.4 Tokenizer Family

Live

QT VI.1.3 32K Prelude is the new flagship - all 204 FLORES-200 languages covered losslessly, English fragmentation fixed, and competitive with industry 32K tokenizers. The previous flagship V.4.6 32K still leads on complex scripts. V.4.1 64K is the larger-vocabulary companion for 500M-2B parameter models. V.4.4/V.4.1 32K and V.2/V.3 legacy variants remain available. All Apache 2.0, all on HuggingFace.

Coverage 204 / 204 Equity 6.4× vs Llama 2 wins 107/170

Prelude-5 Training Run

Live

The QT V.4.4 32K tokenizer is now powering AENEA's Prelude-5 training run, the first model trained on equity-balanced SuperBPE. Past step 40,000 on multilingual data, the run reports loss 2.249, perplexity 9.48, and gradient norm 0.208, and it is reaching factual crystallisation 5× faster than any prior model.

Tokenizer QT V.4.4 32K Step 40,000+ Loss 2.249 Grad Norm 0.208
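The reported perplexity matches the reported loss if the loss is mean cross-entropy in nats per token, which is an assumption since the unit is not stated here. The gradient-norm helper shows the usual global L2 norm over all parameter gradients; whether Prelude-5 logs exactly this quantity is also an assumption.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of mean cross-entropy measured in nats."""
    return math.exp(loss_nats)


def global_grad_norm(grads) -> float:
    """L2 norm over every gradient value, with `grads` as an iterable of flat lists."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


print(f"perplexity at loss 2.249: {perplexity(2.249):.2f}")
print(f"toy gradient norm: {global_grad_norm([[3.0], [4.0]]):.1f}")
```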

Factual Crystallisation Hypothesis

Discovery

Training on Quartz-cleaned data has contributed to the discovery of the Factual Crystallisation Hypothesis - the finding that gradient norm, not loss, predicts the emergence of factual recall in language models. This validates Quartz's core premise: data quality is not preprocessing, it is architecture.

Threshold ~0.27 grad norm Predictor Grad norm, not loss Validates Ultra-clean data thesis

Enterprise

Production-grade data at scale

For teams training models commercially. We handle the cleaning, deduplication, licensing, and quality assurance - you focus on architecture.

Quartz Enterprise

Custom cleaning pipelines, domain-specific corpora, ongoing data delivery, and dedicated support for teams building production models.

Custom Corpora

Domain-specific datasets cleaned to your quality spec with full provenance tracking

Pipeline Licensing

Run our cleaning infrastructure on your proprietary data, on your hardware

Ongoing Delivery

Scheduled re-cleaning as source corpora update. Fresh data, same quality guarantees

The substrate matters

Clean data isn't a feature, it's the architecture. QT VI.1.3 32K Prelude - our new flagship - covers all 204 FLORES-200 languages losslessly where Llama 2 and Mistral fall back to raw bytes on 34, and beats them on the majority of the rest. Open source, Apache 2.0, free forever. Start building on Quartz.

