
The data layer for document intelligence
Models will become true co-workers. We're building datasets to help them get there.

Knowledge work lives in documents. Helping models work through them takes data that captures the long tail of real-world complexity.

We build this from first principles, at scale. Every dataset is crafted by human experts, validated through in-house training, and refined continuously.




/ DATASETS
Featured Datasets
High quality and wide coverage across 50+ languages and 21 domains, spanning the full spectrum of real-world complexity.

Parsing
Setting the gold standard for real-world document understanding. End-to-end parsing covering layout, reading order, OCR in 50+ languages, table-to-HTML, forms, PCM formulas, chart-to-Mermaid, and more.

Layout Analysis
Layout detection built for production, with reading order intact. 19 element types across 21 domains, complexity-stratified from simple to deeply nested, distribution-matched to real traffic.

Form Understanding
First-of-its-kind form dataset with a deterministic graph schema. Text, indicators, inputs, and graphics are boxed and linked. Typed across 14 data classes for filling, parsing, and extraction.

Actuarial Modeling
Built by practicing actuaries from American insurers. Messy source documents transformed into professional Excel models with step-by-step expert reasoning traced throughout.
/ PRODUCT
Complete, living data products
We identify high-value capabilities, benchmark where models fall short, and craft the data to close the gap, validating it through in-house training and extensive iteration.
What we ship is a complete data product, with insights and tooling to drive real progress.
Core Data
Domain specialists source, annotate, and review every dataset. Layered QA + QC, with distributions built for real-world coverage.
Expansion
Synthetic expansion rigorously developed on top of core data. Rich metadata for building your own splits and crafting training recipes.
Insights
Interactive delivery platform to explore how each dataset was built. Sourcing, distributions, annotation logic, and ML learnings from our process.
Iterative
Accuracy, schemas, and coverage all improve continuously after delivery. An ongoing partnership around your evolving needs.

/ CONTACT
Get samples or build a dataset with us
Our library goes beyond what's listed here. Whether you need an off-the-shelf dataset or a custom build, we're ready to help.
How we work with you
Learn about your use case
A short call to understand your goals, use case, and data needs. From there we onboard your team onto our platform and marketplace.
Explore our datasets
Browse our datasets with your team. Detailed packets cover distributions, schemas, recipes, and more, plus sample viewers with highly representative examples.
License what you need
Pick the datasets and splits that fit your requirements. Licensing is straightforward, with terms designed for your timeline.
Start training
Production-grade data delivered in days, with continuous quality updates, expanded coverage, and direct access to our team.

Bespoke Partnerships
When your work calls for purpose-built data, we create it together. Extend existing datasets for your distribution, or design something entirely new.
We co-design the exact data recipe your team needs.
- Collaborative schema design with your ML engineers
- Dedicated delivery team with the right domain expertise
- Scale from pilot to production volume smoothly