RL Environments for Domains Where Human Judgment Is Required
Custom training environments with verifiable rewards—for healthcare, insurance, coaching, and case management.
Expertise That Can't Be Written Down
Tacit knowledge is what experts know but can't articulate. A senior underwriter spots fraud in 30 seconds. A veteran case manager knows which clients need a call versus an email. Ask them how they know, and they shrug: "Experience."
This knowledge takes years to develop and walks out the door when experts leave. It isn't captured in documents, training manuals, or conversation logs—because the decision trace alone doesn't reveal the reasoning.
The Core Problem
A novice and an expert can reach the same conclusion for completely different reasons. The expert noticed three red flags and ruled them out. The novice got lucky.
You can't reverse-engineer reasoning from outcomes.
Our environments solve this by measuring the reasoning process itself—not just the conclusion. We design scenarios where expertise becomes visible through the questions asked, the information sought, and the factors weighed.
Measure Humans
Calibrate against your best practitioners. Understand what separates expert reasoning from novice pattern-matching. Build the ground truth that defines "good" for your domain.
Measure AI
Generate verifiable rewards for training. Dense signal per turn, not sparse end-of-conversation feedback. The same environments that measure humans produce training signal for models.
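A minimal sketch of the difference, with hypothetical function and field names (an illustration of the idea, not our API), assuming each scenario carries a checklist of what an expert would have covered by each turn:

```python
def per_turn_rewards(agent_turns: list[str], expert_checks: list[set]) -> list[float]:
    """Dense signal: one verifiable reward for every agent turn."""
    rewards = []
    for turn, checks in zip(agent_turns, expert_checks):
        hits = sum(1 for phrase in checks if phrase in turn.lower())
        rewards.append(hits / max(len(checks), 1))
    return rewards

def end_of_conversation_reward(agent_turns: list[str], correct_conclusion: str) -> float:
    """Sparse signal: a single number after the whole conversation."""
    return 1.0 if correct_conclusion.lower() in agent_turns[-1].lower() else 0.0
```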
You Can't RLVR What You Can't Verify
Reinforcement Learning with Verifiable Rewards (RLVR) works for math and code because you can check the answer. But most high-stakes domains—healthcare, insurance, advisory, case management—don't have ground-truth answers you can verify programmatically.
We found a way to verify what was previously unverifiable.
Small Models, Big Gains
Preliminary results show that smaller models trained with our verifiable expert signal dramatically outperform larger base models on domain reasoning tasks.
Results from Insurance Case Management training.
- Our fine-tuned 4B model more than doubles the performance of a larger 8B base model
- Outperforms a model 8× its size on domain reasoning tasks
- A 4B model fine-tuned on 106 expert scenarios outperforms a 235B model (60× its size) on domain-specific reasoning
Same architecture. Better training signal. These are early results—we're scaling scenario count now.
Custom RL Environments with Verifiable Rewards
Expert-Calibrated Ground Truth
Our scenarios contain information AI must uncover through skilled questioning. Did it ask what an expert would ask? Did it discover the factors that change the decision? Binary, verifiable, no learned reward model required.
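A sketch of what "binary and verifiable" means in practice; the class and field names below are assumptions for illustration, not our schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExpertCheck:
    name: str                   # e.g. "asked_about_prior_claims"
    trigger_phrases: set        # surface forms that count as having asked

@dataclass
class Scenario:
    hidden_factors: set                          # facts the agent must uncover
    checks: list = field(default_factory=list)   # ExpertCheck items

def verify(transcript: str, scenario: Scenario) -> dict:
    """Each check passes or fails against the transcript: binary and
    programmatically verifiable, with no learned reward model."""
    text = transcript.lower()
    results = {c.name: any(p in text for p in c.trigger_phrases) for c in scenario.checks}
    results["uncovered_all_factors"] = all(f in text for f in scenario.hidden_factors)
    return results
```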
Simulated Environments
We construct the game for your AI to play. Scenarios form a continuous space over your domain—a gym where your model works out against simulated clients and real-world constraints.
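Conceptually, the interface is a standard agent-environment loop. A minimal sketch, assuming a scripted simulated client (names are illustrative):

```python
class SimulatedClientEnv:
    """Toy sketch: the agent converses with a scripted client and earns a
    verifiable reward each turn for asking what an expert would ask."""

    def __init__(self, scenario: dict):
        self.scenario = scenario
        self.turn = 0

    def reset(self) -> str:
        self.turn = 0
        return self.scenario["opening_statement"]

    def step(self, agent_message: str):
        expected = self.scenario["expected_probes"][self.turn]
        reward = 1.0 if any(p in agent_message.lower() for p in expected) else 0.0
        reply = self.scenario["client_replies"][self.turn]
        self.turn += 1
        done = self.turn >= len(self.scenario["expected_probes"])
        return reply, reward, done
```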
Organization-Specific Training
Every organization has unique constraints, history, and affordances. We build environments calibrated to your specific context—not generic domain models that miss what makes your organization different.
From Expertise Capture to Training Signal
Domain Mapping
Understand Expertise
We work with your target domain to understand what expertise looks like—the questions experts ask, the factors they weigh, the reasoning they can't articulate.
Reward Design
Define Ground Truth
Expert-calibrated ground truth: grounded in expertise research, calibrated against human practitioners, and producing verifiable signal in judgment domains for the first time.
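One way to picture the calibration step (a sketch under assumed names, not a published procedure): the same scorer runs over transcripts from known experts and known novices, and a scenario's ground truth only ships if it separates the two groups.

```python
from statistics import mean
from typing import Callable, List

def is_calibrated(score: Callable[[str], float],
                  expert_transcripts: List[str],
                  novice_transcripts: List[str],
                  margin: float = 0.2) -> bool:
    """Accept the scenario's ground truth only if experts outscore novices
    by at least `margin` under the same scoring function."""
    expert_avg = mean(score(t) for t in expert_transcripts)
    novice_avg = mean(score(t) for t in novice_transcripts)
    return expert_avg - novice_avg >= margin
```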
Scenario Architecture
Build Environments
We build interactive scenarios with counterfactuals and branching paths. An agent exploring the scenario reveals its reasoning approach through the information it seeks.
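As a rough sketch of the structure (illustrative names, not our scenario schema): facts stay hidden until the agent probes for them, and counterfactual branches pair states where one changed fact should change the decision.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ScenarioNode:
    visible_state: str            # what the agent sees before asking anything
    hidden_facts: Dict[str, str]  # probe -> fact revealed only by that probe
    expert_decision: str          # ground-truth call at this node
    branches: Dict[str, "ScenarioNode"] = field(default_factory=dict)  # counterfactual variants

def reveal(node: ScenarioNode, probe: str) -> Optional[str]:
    """Return the fact a probe uncovers, or None if the agent never asked."""
    return node.hidden_facts.get(probe)
```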
Training Infrastructure
Generate Trajectories
Trajectory generation at scale. Dense rewards per turn. Compatible with standard RL pipelines—GRPO, PPO, DPO, TRL.
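A sketch of what that output could look like (field names are assumptions, not a published spec): one JSONL record per rollout, with a reward attached to every turn so GRPO- or PPO-style trainers get dense credit assignment.

```python
import json

def write_trajectories(rollouts, path="trajectories.jsonl"):
    """rollouts: iterable of (scenario_id, [(agent_message, reward), ...])."""
    with open(path, "w") as f:
        for scenario_id, turns in rollouts:
            record = {
                "scenario_id": scenario_id,
                "turns": [{"role": "assistant", "content": msg, "reward": r}
                          for msg, r in turns],
                "return": sum(r for _, r in turns),  # episode total for sparse-only pipelines
            }
            f.write(json.dumps(record) + "\n")
```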
In the Pipeline
Insurance Case Management
Return-to-work reasoning, social rehabilitation planning, and case progression for injury and health insurance workflows.
Clinical Triage
Patient prioritization, symptom assessment, and escalation decisions in healthcare intake settings.
Strength & Conditioning Education
Exercise prescription, periodization, and client assessment reasoning for higher education programs training the next generation of practitioners.
Nutrition Coaching
Dietary assessment, behavior change strategy, and personalized guidance across diverse client populations.
Experimental Design
Scientific research methodology—hypothesis formation, variable control, statistical power, and study design reasoning.
How We Work With Labs
Training Signal License
Access to trajectory data from our environments for your training pipelines.
- Per-domain or comprehensive access
- Continuous generation at scale
- Compatible with GRPO, PPO, DPO, TRL, OpenPipe
Custom Environment Build
We design and build RL environments for your specific target domains.
- Turnkey—we build, maintain, and run the training
- Expert network access for calibration
- Ongoing scenario development
Built By Infrastructure Veterans
We've spent decades understanding how experts actually think—not from documents, but from building systems for the world's most elite organizations.
Smartabase / Teamworks
Built the world's leading human performance operating system, used by the European Space Agency, US SOCOM, and police, fire, and government agencies across 15+ countries.
AI Infrastructure
Our Head of AI Engineering was previously VP of Engineering at Avos (Steve Chen's company after YouTube), leading the team through its Series A with NEA and Google Ventures.
Get Access
RL environments for domains where human judgment is required.