Clean data is the entire architecture
Quartz is the open data infrastructure layer behind AENEA. We publish ultra-clean datasets, the exact cleaning pipelines that produce them, and enterprise-grade data services for teams building their own models.
QT V.4.6 - the flagship
QT V.4.6 32K Prelude is our current flagship tokenizer: 32,000 vocabulary, 72 languages, 27 scripts, and the best cross-lingual equity of any QT tokenizer to date. V.4.6 fixes the Lao coverage gap present in earlier releases and ships with a cleaner training corpus. The companion V.4.1 64K remains available for larger models (500M–2B parameters) where the extra vocabulary headroom matters - though note it predates the Lao fix and does not yet cover Lao. Both are Apache 2.0: free to use, modify, and deploy.
| Metric | V.4.6 32K | V.4.1 64K | Llama 3 (128K) |
|---|---|---|---|
| Vocabulary | 32,000 | 64,000 | 128,256 |
| Mean fertility (tok/word) | 3.893 | 3.780 | 5.716 |
| Equity ratio (lower = fairer) | 19.7× | 32.3× | 118.6× |
| Total tokens (204 langs) | 13,941,670 | 12,979,330 | 16,764,198 |
| Token savings vs Llama 3 | −16.8% | −22.6% | - |
| Lao (tok/word) | 13.86 | 42.90 (no coverage) | - |
| Tibetan (tok/word) | 27.21 | 33.89 | 149.79 |
| Thai (tok/word) | 12.55 | 11.74 | 14.03 |
| Tamil (tok/word) | 3.79 | 3.16 | 12.45 |
| Hebrew (tok/word) | 2.83 | 2.45 | 5.76 |
How to read this: V.4.6 32K is the flagship. It wins on the metric that matters most (equity ratio, 19.7× vs 118.6× for Llama 3) while using a quarter of Llama 3's vocabulary, and it covers Lao, which V.4.1 64K does not. V.4.1 64K wins on mean fertility because it has twice V.4.6's vocabulary budget to spend on common tokens; choose it if you're training a 500M–2B model where the extra embedding parameters are affordable and Lao is not required. For sub-500M models, V.4.6's equity and coverage advantages matter more than V.4.1 64K's fertility edge.
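The derived rows in the table can be checked by hand. The sketch below is only arithmetic on the published figures: it recomputes the token savings from the 204-language totals and divides the equity ratios to show how the headline comparison with Llama 3 falls out. The dictionary and function names are our own illustration, not part of any QT tooling.

```python
# Figures copied from the comparison table above.
TOTAL_TOKENS = {
    "V.4.6 32K": 13_941_670,
    "V.4.1 64K": 12_979_330,
    "Llama 3": 16_764_198,
}
EQUITY_RATIO = {"V.4.6 32K": 19.7, "V.4.1 64K": 32.3, "Llama 3": 118.6}


def savings_vs(baseline: str, name: str) -> float:
    """Percent fewer tokens than the baseline on the same 204-language text."""
    base = TOTAL_TOKENS[baseline]
    return 100.0 * (base - TOTAL_TOKENS[name]) / base


def equity_gain_vs(baseline: str, name: str) -> float:
    """How many times smaller the equity ratio is than the baseline's."""
    return EQUITY_RATIO[baseline] / EQUITY_RATIO[name]


for name in ("V.4.6 32K", "V.4.1 64K"):
    print(
        f"{name}: {savings_vs('Llama 3', name):.1f}% fewer tokens, "
        f"{equity_gain_vs('Llama 3', name):.1f}x lower equity ratio"
    )
```

For V.4.6 32K this reproduces the 16.8% savings in the table and an equity ratio roughly six times lower than Llama 3's.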
| Script family | V.4.6 32K (tok/word) | Llama 3 (tok/word) | Languages |
|---|---|---|---|
| Arabic | 2.51 | 2.70 | 2 |
| Latin | 2.58 | 2.39 | 37 |
| Hebrew | 2.83 | 5.76 | 2 |
| Gurmukhi | 2.74 | 8.23 | 1 |
| Devanagari | 2.80 | 3.52 | 3 |
| Bengali | 3.17 | 8.07 | 1 |
| Tamil | 3.79 | 12.45 | 1 |
| Myanmar | 6.10 | 29.77 | 1 |
| Thai | 12.55 | 14.03 | 1 |
| Khmer | 13.55 | 40.91 | 1 |
| CJK | 19.94 | 19.75 | 4 |
| Tibetan | 27.21 | 149.79 | 1 |
| Innovation | Impact |
|---|---|
| Two-Stage SuperBPE | Superword tokens spanning word boundaries (of the, in order to) |
| Streaming Sharded Training | Full 5 GB corpus + SuperBPE on 16 GB RAM hardware |
| Indic Script-Aware Pre-tok | Virama-aware syllable segmentation for 10 Indic scripts |
| Equity-Balanced Stage 2 | Four-bucket corpus builder oversamples underserved scripts - V.4.6 Tibetan 38.6→27.2 TPW |
| Per-Bucket Chunk Sizing | CJK gets long chunks (1000 chars), underserved scripts get short chunks (200 chars) to bound RAM |
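To make the per-bucket chunk sizing row concrete, here is a minimal sketch of the idea, not the QT trainer's actual code. The 1000- and 200-character sizes come from the table; the bucket names, the default size for other buckets, and the `chunk_text` helper are assumptions made for illustration.

```python
from typing import Iterator

# Chunk sizes in characters. The CJK and underserved values come from the
# table above; the bucket names themselves are illustrative assumptions.
CHUNK_CHARS = {"cjk": 1000, "underserved": 200}
DEFAULT_CHUNK_CHARS = 500  # assumption, not a published Quartz setting


def chunk_text(text: str, bucket: str) -> Iterator[str]:
    """Yield fixed-size character chunks, sized according to the bucket."""
    size = CHUNK_CHARS.get(bucket, DEFAULT_CHUNK_CHARS)
    for start in range(0, len(text), size):
        yield text[start:start + size]


# Example: a 450-character underserved-script document becomes three chunks.
lengths = [len(chunk) for chunk in chunk_text("x" * 450, "underserved")]
print(lengths)
```

This naive version slices by code point, so it can cut through a combining sequence at a chunk boundary; a production trainer would likely split on a safer boundary such as whitespace or a syllable break.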
V.4.6 underserved language spotlight: Lao now covered at 13.86 tok/word (V.4.4 had no coverage - byte-fallback). Hebrew 2.83 tok/word (vs Llama 3's 5.76 - 51% reduction). Tamil 3.79 (vs 12.45 - 70% reduction). Tibetan 27.21 (vs 149.79 - 82% reduction). Khmer 13.55 (vs 40.91 - 67% reduction).
The cleanest training corpora available
Every dataset is produced by our multi-pass cleaning pipelines with MinHash dedup, lint gates, and structural validation. We publish the exact scripts alongside the data - reproducibility is non-negotiable.
Wikipedia Multilingual v7.3 (Open)
Stack Exchange Q&A v1.0 (Open)
QT Tokenizer Family (Open)
Custom Enterprise Corpora (Enterprise)
The pipelines that produce the data
We don't just publish datasets. We publish the exact cleaning scripts that created them. Fork them, adapt them, run them on your own dumps.
wiki_ultra_clean v7.3
se_ultra_clean v1
QT Tokenizer Trainer V.4
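For readers unfamiliar with the MinHash dedup step mentioned above, the following is a minimal, standard-library sketch of the general technique. It is not the published Quartz script: the shingle length, the number of hash functions, the 0.8 threshold, and all function names are assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of lowercased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, which estimates shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as near-duplicate when the estimated similarity is high."""
    return estimated_jaccard(minhash_signature(a), minhash_signature(b)) >= threshold
```

Comparing every pair this way is quadratic in the number of documents; at corpus scale the signatures are usually bucketed with locality-sensitive hashing so that only likely matches are compared.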
Validated in live model training
The proof of a data stack is in the models it produces. Quartz-cleaned data and QT tokenizers are currently powering AENEA's most advanced training runs.
QT V.4 Tokenizer Family (Live)
Prelude-5 Training Run (Live)
Factual Crystallisation Hypothesis (Discovery)
Production-grade data at scale
For teams training models commercially. We handle the cleaning, deduplication, licensing, and quality assurance - you focus on architecture.
Quartz Enterprise
Custom cleaning pipelines, domain-specific corpora, ongoing data delivery, and dedicated support for teams building production models.
The substrate matters
Clean data isn't a feature; it's the architecture. QT V.4.6 32K Prelude, our flagship, beats Llama 3 on mean fertility (3.893 vs 5.716 tok/word) with a quarter of the vocabulary and 6× better cross-lingual equity (19.7× vs 118.6×). Open source, Apache 2.0, free forever. Start building on Quartz.