A thread-aware benchmark for inbox decision-making under real workplace pressure. It combines incident response, executive coordination, budget approvals, duplicate threads, and reply-budget tradeoffs behind a single OpenEnv-compatible API.
Play all three tasks directly from the browser. Reset an episode, inspect the observation, and submit one action at a time; after each step, review the reward, done state, and grading details.
Observation and compact episode state for the current browser session.
The current observation is rendered below so you can inspect inbox state before each step.
Reward, completion state, and raw grading details from the last action.
{"reward": 0.0, "done": true}
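A client will usually want to read this result programmatically. The following is a minimal sketch, assuming only the two fields shown above (`reward` and `done`); the `StepResult` class and `parse_step_result` helper are hypothetical names, and any other keys the server returns (such as grading details, whose field names are not given on this page) are kept untouched in `extra`.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepResult:
    reward: float
    done: bool
    # Keys other than "reward" and "done" are stored as-is, because their
    # names are not documented on this page.
    extra: dict[str, Any] = field(default_factory=dict)


def parse_step_result(raw: str) -> StepResult:
    """Parse a step response body into a StepResult."""
    payload = json.loads(raw)
    reward = float(payload.pop("reward"))
    done = bool(payload.pop("done"))
    return StepResult(reward=reward, done=done, extra=payload)


result = parse_step_result('{"reward": 0.0, "done": true}')
print(result.reward, result.done)
```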
Three deterministic tasks cover classification, ranking, and high-pressure inbox triage with thread-level dependencies.
Classify a single email into one of 5 categories
Rank 8 emails by priority order
Full inbox triage with 25 emails, threaded dependencies, duplicate handling, and a fixed response budget
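For the ranking task, it can help to check a candidate ranking locally before submitting it. The sketch below assumes that a valid ranking lists each of the 8 inbox email ids exactly once; that rule is an assumption rather than something this page states, and `validate_ranking` is a hypothetical helper, not part of the benchmark.

```python
def validate_ranking(ranking: list[str], inbox_ids: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the ranking looks usable."""
    problems = []
    seen = set()
    # Report every id that appears more than once.
    for email_id in ranking:
        if email_id in seen:
            problems.append(f"duplicate id: {email_id}")
        seen.add(email_id)
    known = set(inbox_ids)
    # Ids in the ranking that are not in the inbox.
    for email_id in sorted(seen - known):
        problems.append(f"unknown id: {email_id}")
    # Inbox ids that the ranking left out.
    for email_id in sorted(known - seen):
        problems.append(f"missing id: {email_id}")
    return problems


# Assumed id format "e1".."e8", taken from the curl example on this page.
inbox = [f"e{i}" for i in range(1, 9)]
print(validate_ranking(inbox, inbox))
print(validate_ranking(["e1", "e1", "e9"], inbox))
```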
The landing page stays minimal, but the API surface is complete, and interactive docs are available at /docs.
Runtime heartbeat and active session count.
Available benchmark tasks with difficulty and email count.
Current episode state for your session cookie.
Start a new episode and attach a session cookie.
Apply one action and receive observation, reward, and final grade details.
Typed action and observation contract for validation.
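The reset and step operations described above suggest a simple episode loop: reset once, then submit actions until the response reports `done`. The sketch below is illustrative only. The `/reset` and `/step` paths come from the curl examples on this page, while `run_episode`, `fake_post`, and `fixed_policy` are hypothetical names, and the transport is injected as a plain function so the loop can run without a live server. The shape of the reset response is not documented here, so the stand-in returns an empty dictionary.

```python
from typing import Any, Callable

# post(path, payload) -> decoded JSON response.
Post = Callable[[str, dict[str, Any]], dict[str, Any]]
# policy(latest_response) -> next action payload.
Policy = Callable[[dict[str, Any]], dict[str, Any]]


def run_episode(post: Post, reset_payload: dict[str, Any], policy: Policy,
                max_steps: int = 50) -> list[dict[str, Any]]:
    """Reset once, then step until the response reports done (or max_steps)."""
    current = post("/reset", reset_payload)
    history = []
    for _ in range(max_steps):
        action = policy(current)
        current = post("/step", action)
        history.append(current)
        if current.get("done"):
            break
    return history


def fake_post(path: str, payload: dict[str, Any]) -> dict[str, Any]:
    # Stand-in transport: /reset returns an empty response and every /step
    # ends the episode immediately with the result shown on this page.
    if path == "/reset":
        return {}
    return {"reward": 0.0, "done": True}


def fixed_policy(_response: dict[str, Any]) -> dict[str, Any]:
    # Always submits the same ranking action as the curl example.
    return {"email_id": "e1", "ranking": [f"e{i}" for i in range(1, 9)]}


history = run_episode(fake_post, {"task_type": "ranking", "seed": 42}, fixed_policy)
print(len(history), history[-1]["done"])
```

Swapping `fake_post` for a function that sends real HTTP requests (and keeps the session cookie between calls) would turn this into a live client.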
Minimal examples for calling the live environment directly from curl or any HTTP client.
curl -X POST /reset \
  -H "Content-Type: application/json" \
  -d '{"task_type":"ranking","seed":42}'

curl -X POST /step \
  -H "Content-Type: application/json" \
  -d '{"email_id":"e1","ranking":["e1","e2","e3","e4","e5","e6","e7","e8"]}'
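The same two calls can be made from Python's standard library. This is a sketch under stated assumptions: `BASE_URL` is a hypothetical placeholder because the page shows only relative paths, and `build_post`, `make_opener`, and `send` are illustrative helper names. The opener stores cookies so that a later `/step` call reuses the session cookie attached by `/reset`.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical base URL; replace it with the address of the environment.
BASE_URL = "http://localhost:8000"


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def make_opener() -> urllib.request.OpenerDirector:
    """Create an opener that stores cookies across requests."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )


def send(opener: urllib.request.OpenerDirector,
         request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payloads as the curl examples; nothing is sent until send() is called.
reset_request = build_post("/reset", {"task_type": "ranking", "seed": 42})
step_request = build_post(
    "/step",
    {"email_id": "e1", "ranking": [f"e{i}" for i in range(1, 9)]},
)
```

With a running environment, `opener = make_opener()` followed by `send(opener, reset_request)` and `send(opener, step_request)` would perform the reset and the step in one session.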