BACKED BY Y COMBINATOR / ALT CAPITAL

The data layer for document intelligence

Models will become true co-workers. We're building datasets to help them get there.

Knowledge work lives in documents. Helping models work through them takes data that captures the long tail of real-world complexity.

We build this from first principles, at scale. Every dataset is crafted by human experts, validated through in-house training, and refined continuously.

/ DATASETS

Featured Datasets

High quality and wide coverage across 50+ languages and 21 domains, spanning the full spectrum of real-world complexity.

document understanding

Parsing

Setting the gold standard for real-world document understanding. End-to-end parsing covering layout, reading order, 50+ language OCR, table-to-HTML, forms, PCM formulas, chart-to-mermaid, and more.

20K+ Documents, 3M+ Elements, 50+ Languages, 21 Domains

document understanding

Layout Analysis

Layout detection built for production with reading-order intact. 19 element types across 21 domains, complexity-stratified from simple to deeply nested, distribution-matched to real traffic.

100K+ Pages, 1.3M+ Elements, 19 Element Types, 21 Domains

document action

Form Understanding

First-of-its-kind form dataset with a deterministic graph schema. Text, indicators, inputs, and graphics are boxed and linked. Typed across 14 data classes for filling, parsing, and extraction.

25K Forms, 350K+ Elements, 550K+ Relationships, 10+ Domains

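
To make the idea of boxed, linked, and typed form elements concrete, here is a minimal Python sketch of what such a graph could look like. It is an illustration only, not the dataset's actual schema: the class names, the `kind` values, the `"labels"` relation, and the `"date"` data class are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """One boxed item on a form page (hypothetical structure)."""
    id: str
    kind: str  # assumed values: "text", "indicator", "input", "graphic"
    box: Tuple[float, float, float, float]  # x0, y0, x1, y1 in page units
    data_class: Optional[str] = None  # assumed type label, e.g. "date"


@dataclass(frozen=True)
class Link:
    """A directed, named relationship between two elements."""
    source: str
    target: str
    relation: str  # assumed relation name, e.g. "labels"


@dataclass
class FormGraph:
    elements: Dict[str, Element] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)

    def add(self, element: Element) -> None:
        self.elements[element.id] = element

    def link(self, source: str, target: str, relation: str) -> None:
        # Reject links that point at elements the graph does not contain.
        for end in (source, target):
            if end not in self.elements:
                raise KeyError(f"unknown element id: {end}")
        self.links.append(Link(source, target, relation))

    def linked_to(self, target: str, relation: str) -> List[Element]:
        """Return the elements linked to `target` by `relation`."""
        return [
            self.elements[l.source]
            for l in self.links
            if l.target == target and l.relation == relation
        ]


# Usage: a text label linked to the input field it describes.
form = FormGraph()
form.add(Element("t1", "text", (40, 100, 120, 118)))
form.add(Element("i1", "input", (130, 100, 400, 118), data_class="date"))
form.link("t1", "i1", "labels")
```

With a structure like this, form filling or extraction can walk from an input to its label (`form.linked_to("i1", "labels")`) instead of guessing the association from pixel proximity.
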
document action

Actuarial Modeling

Built by practicing actuaries from American insurers. Messy source documents transformed into professional Excel models with step-by-step expert reasoning traced throughout.

IN PROGRESS

/ PRODUCT

Complete, living data products

We identify high-value capabilities, benchmark where models fall short, and craft the data to close the gap, validated through in-house training and extensive iteration.

What we ship is a complete data product, with insights and tooling to drive real progress.

Core Data

Domain specialists source, annotate, and review every dataset. Layered QA + QC, with distributions built for real-world coverage.

Expansion

Synthetic expansion rigorously developed on top of core data. Rich metadata for building your own splits and crafting training recipes.
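
As a rough illustration of building your own splits from record metadata, the Python sketch below filters records by metadata fields and assigns each one to train or validation with a stable hash. The field names (`id`, `language`, `domain`) and the record layout are assumptions for this example, not the delivered format.

```python
import hashlib
from typing import Dict, Iterable, List, Optional

Record = Dict[str, str]  # hypothetical record: {"id", "language", "domain"}


def assign_split(record_id: str, val_fraction: float = 0.1) -> str:
    """Deterministically map a record id to "train" or "val".

    Hashing the id keeps the assignment stable across runs and across
    dataset updates, so a record never drifts between splits.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def build_splits(
    records: Iterable[Record],
    language: Optional[str] = None,
    domain: Optional[str] = None,
    val_fraction: float = 0.1,
) -> Dict[str, List[Record]]:
    """Filter records by metadata, then group them into train/val."""
    splits: Dict[str, List[Record]] = {"train": [], "val": []}
    for record in records:
        if language is not None and record["language"] != language:
            continue
        if domain is not None and record["domain"] != domain:
            continue
        splits[assign_split(record["id"], val_fraction)].append(record)
    return splits


# Usage with made-up records.
records = [
    {"id": "doc-001", "language": "en", "domain": "finance"},
    {"id": "doc-002", "language": "de", "domain": "finance"},
    {"id": "doc-003", "language": "en", "domain": "legal"},
]
english_finance = build_splits(records, language="en", domain="finance")
```

Hashing the id rather than sampling at random is one possible design choice: it means a later delivery with more records leaves existing assignments unchanged.
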

Insights

Interactive delivery platform to explore how we built it. Sourcing, distributions, annotation logic, and ML learnings from our process.

Iterative

Accuracy, schemas, and coverage all improve continuously after delivery. An ongoing partnership around your evolving needs.

/ CONTACT

Get samples or build a dataset with us

Our library goes beyond what's listed here. Whether you need an off-the-shelf dataset or a custom build, we're ready to help.

How we work with you

01

Learn about your use-case

A short call to understand your goals, use-case, and data needs. From there we onboard your team onto our platform and marketplace.

02

Explore our datasets

Browse our datasets with your team. Detailed packets cover distributions, schemas, recipes, and more. Plus sample viewers with highly representative examples.

03

License what you need

Pick the datasets and splits that fit your requirements. Licensing is straightforward, with terms designed for your timeline.

04

Start training

Production-grade data delivered in days. With continuous quality updates, expanded coverage, and direct access to our team.

Bespoke Partnerships

When your work calls for purpose-built data, we create it together. Extend existing datasets for your distribution, or design something entirely new.

We co-design the exact data recipe your team needs.

  • Collaborative schema design with your ML engineers
  • Dedicated delivery team with the right domain expertise
  • Scale from pilot to production volume smoothly

FLOATINGPOINT

Frontier-grade datasets for document AI
