Clean data is the entire architecture
Quartz is the open data infrastructure layer behind AENEA. We publish ultra-clean datasets, the exact cleaning pipelines that produce them, and enterprise-grade data services for teams building their own models.
QT V.4 UltraLingo — SuperBPE
The most equitable multilingual tokenizer family available. Two variants — 64K for Overture (500M–2B parameters) and 32K for Prelude (sub-500M) — covering 72 languages across 27 scripts. Both beat Llama 3 (128K vocab) at a fraction of the vocabulary, with the 32K achieving the best equity ratio of any QT tokenizer ever built.

| Metric | V.4.4 32K | V.4.1 64K | Llama 3 (128K) |
|---|---|---|---|
| Vocabulary | 32,000 | 64,000 | 128,256 |
| Mean fertility (tok/word) | 4.231 | 3.917 | 5.716 |
| Equity ratio (lower = fairer) | 31.5× | 32.3× | 118.6× |
| Total tokens (204 langs) | 14,125,437 | 12,979,330 | 16,764,198 |
| Token savings vs Llama 3 | 15.7% | 22.6% | — |
| Tibetan (tok/word) | 26.70 | 33.89 | 149.79 |
| Thai (tok/word) | 12.88 | 11.74 | 14.03 |
| Tamil (tok/word) | 3.88 | 3.16 | 12.45 |
| Hebrew (tok/word) | 2.87 | 2.45 | 5.76 |

| Script Family | V.4.1 64K (tok/word) | Llama 3 (tok/word) | Languages |
|---|---|---|---|
| Latin | 2.29 | 2.39 | 37 |
| Cyrillic | 2.47 | 2.59 | 5 |
| Hebrew | 2.45 | 5.76 | 2 |
| Arabic | 2.10 | 2.70 | 2 |
| Devanagari | 2.58 | 3.52 | 3 |
| Bengali | 2.95 | 8.07 | 1 |
| Tamil | 3.16 | 12.45 | 1 |
| Myanmar | 6.05 | 29.77 | 1 |
| Thai | 11.74 | 14.03 | 1 |
| Khmer | 13.29 | 40.91 | 1 |
| CJK | 18.80 | 19.75 | 4 |
| Tibetan | 33.89 | 149.79 | 1 |
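
For readers who want to reproduce the numbers above on their own samples, here is one straightforward way to compute them. It assumes fertility is tokens per whitespace-delimited word averaged over a sample, and that the equity ratio is the worst per-language fertility divided by the best; the exact definitions behind the QT benchmark figures may differ, and the tokenizer path below is a placeholder.

```python
from typing import Callable, Dict, List

def fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average tokens per whitespace-delimited word over a text sample."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

def equity_ratio(per_lang: Dict[str, float]) -> float:
    """Worst per-language fertility divided by the best. A ratio of 1.0 would
    mean every language pays the same token cost per word (lower = fairer)."""
    return max(per_lang.values()) / min(per_lang.values())

# Hypothetical usage with a HuggingFace tokenizer, where `samples` maps a
# language code to a list of raw sentences in that language:
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("path/to/qt-v4-ultralingo-64k")
#   per_lang = {lang: fertility(tok.tokenize, texts) for lang, texts in samples.items()}
#   print(f"mean fertility: {sum(per_lang.values()) / len(per_lang):.3f} tok/word")
#   print(f"equity ratio:   {equity_ratio(per_lang):.1f}x")
```

One caveat: if words are counted by whitespace, scripts written without spaces (Thai, Khmer, CJK, Tibetan) will show much higher tokens-per-word under any tokenizer, which is worth keeping in mind when comparing rows across script families.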

| Innovation | Impact |
|---|---|
| Two-Stage SuperBPE | Superword tokens spanning word boundaries (of the, in order to) |
| Streaming Sharded Training | Full 5 GB corpus + SuperBPE on 16 GB RAM hardware |
| Indic Script-Aware Pre-tok | Virama-aware syllable segmentation for 10 Indic scripts |
| Equity-Balanced Stage 2 | Four-bucket corpus builder oversamples underserved scripts — Tibetan 38.6→26.7 tok/word |
| Per-Bucket Chunk Sizing | CJK gets long chunks (1000 chars), underserved scripts get short chunks (200 chars) to bound RAM |
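
The published QT Tokenizer Trainer V.4 is the reference implementation of these ideas; the toy sketch below is only meant to convey the core two-stage SuperBPE mechanic. Stage 1 learns ordinary BPE merges that never cross a word boundary; Stage 2 lifts that restriction so high-frequency superword tokens such as "of the" can form. The character-level symbols and greedy merge loop are simplifications for illustration, not how the real trainer works.

```python
from collections import Counter
from typing import List, Tuple

def to_symbols(text: str) -> List[str]:
    """One symbol per character, with each word-initial character carrying a
    leading space (GPT-2 style) so word boundaries survive inside symbols."""
    symbols = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue
        prefix = " " if i > 0 and text[i - 1] == " " else ""
        symbols.append(prefix + ch)
    return symbols

def count_pairs(seqs: List[List[str]], cross_words: bool) -> Counter:
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            if not cross_words and b.startswith(" "):
                continue  # Stage 1: never merge across a word boundary
            pairs[(a, b)] += 1
    return pairs

def apply_merge(seq: List[str], pair: Tuple[str, str]) -> List[str]:
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train(corpus: List[str], n_stage1: int, n_stage2: int) -> List[Tuple[str, str]]:
    """Two-stage BPE: within-word merges first, then boundary-crossing merges."""
    seqs = [to_symbols(t) for t in corpus]
    merges = []
    for n_merges, cross in ((n_stage1, False), (n_stage2, True)):
        for _ in range(n_merges):
            pairs = count_pairs(seqs, cross_words=cross)
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            seqs = [apply_merge(s, best) for s in seqs]
    return merges

# merges = train(["out of the blue", "of the people", "in order to win"], 8, 8)
# After Stage 2, superword symbols such as " of the" can appear in `merges`.
```

A production trainer differs in many ways (byte-level alphabets, priority queues, streaming counts over shards), but flipping the boundary rule between the two stages is the essence of the superword behaviour.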
Underserved language spotlight: Hebrew 2.45 tok/word (vs Llama 3's 5.76 — 57% reduction). Tamil 3.16 (vs 12.45 — 75% reduction). Tibetan 26.70 (vs 149.79 — 82% reduction).
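
The equity-balanced Stage 2 and per-bucket chunk sizing rows in the table above describe the corpus builder behind gains like these. As a rough illustration of its general shape, the sketch below builds a balanced training stream from per-script buckets; only the 1000-character CJK and 200-character underserved chunk sizes come from the table, while the bucket names, oversampling factors, and the 500-character default are assumptions.

```python
from typing import Dict, Iterable, Iterator

# Illustrative bucket table. Only the CJK (1000-char) and underserved
# (200-char) chunk sizes come from the innovation table; everything else
# here (names, factors, the 500-char default) is an assumption.
BUCKETS: Dict[str, Dict[str, int]] = {
    "high_resource": {"oversample": 1, "chunk_chars": 500},
    "mid_resource":  {"oversample": 2, "chunk_chars": 500},
    "cjk":           {"oversample": 2, "chunk_chars": 1000},  # long chunks
    "underserved":   {"oversample": 4, "chunk_chars": 200},   # short chunks, heavier repetition
}

def chunk(text: str, size: int) -> Iterator[str]:
    """Fixed-size character chunks bound the size of any single training example."""
    for i in range(0, len(text), size):
        yield text[i : i + size]

def balanced_stream(docs_by_bucket: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Stream training chunks, repeating underserved-bucket text so Stage-2
    merge statistics are not dominated by high-resource scripts."""
    for bucket, docs in docs_by_bucket.items():
        cfg = BUCKETS[bucket]
        for doc in docs:
            for piece in chunk(doc, cfg["chunk_chars"]):
                for _ in range(cfg["oversample"]):
                    yield piece
```

Because the stream is a generator over whatever shard iterators are passed in, only one chunk is materialised at a time, the same kind of property that lets streaming sharded training fit the full corpus on 16 GB of RAM.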
Previous generations
Earlier QT generations remain available on HuggingFace. V.3 32K SuperBPE pioneered two-stage training. V.2 offers 64K, 96K, and 114K Code variants. V.4.1 is also available in a non-equity-balanced 32K variant.
The cleanest training corpora available
Every dataset is produced by our multi-pass cleaning pipelines with MinHash dedup, lint gates, and structural validation. We publish the exact scripts alongside the data — reproducibility is non-negotiable.
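
The published scripts are the canonical reference; as a rough picture of what the MinHash deduplication pass looks like, here is a minimal sketch using the datasketch library. The character-shingle scheme, 128 permutations, and 0.8 similarity threshold are illustrative choices, not the actual settings of wiki_ultra_clean or se_ultra_clean, and the lint gates and structural validation are separate passes not shown here.

```python
from typing import Dict, List
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles over whitespace-normalised, lowercased text."""
    text = " ".join(text.split()).lower()
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 1))}

def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

def dedup(docs: Dict[str, str], threshold: float = 0.8) -> List[str]:
    """Return the ids of documents that survive near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = signature(text)
        if lsh.query(sig):  # an already-indexed document is near-identical
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```
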
Wikipedia Multilingual v7.3
Stack Exchange Q&A v1.0
QT Tokenizer Family
Custom Enterprise Corpora
The pipelines that produce the data
We don't just publish datasets. We publish the exact cleaning scripts that created them. Fork them, adapt them, run them on your own dumps.
wiki_ultra_clean v7.3
se_ultra_clean v1
QT Tokenizer Trainer V.4
Validated in live model training
The proof of a data stack is in the models it produces. Quartz-cleaned data and QT tokenizers are currently powering AENEA's most advanced training runs.
QT V.4 Tokenizer Family
Prelude-5 Training Run
Factual Crystallisation Hypothesis
Production-grade data at scale
For teams training models commercially. We handle the cleaning, deduplication, licensing, and quality assurance — you focus on architecture.
Quartz Enterprise
Custom cleaning pipelines, domain-specific corpora, ongoing data delivery, and dedicated support for teams building production models.
The substrate matters
Clean data isn't a feature. It's why QT V.4 beats Llama 3 on 126 languages with half the vocabulary and 3.8× better equity. Start building on Quartz.