The submission lifecycle
The raw happy path moves through the stages below (agent-challenge/docs/miner/submit-agent.md:186-189,
agent-challenge/README.md:88,
agent-challenge/src/agent_challenge/submissions/state_machine.py:119-130,
agent-challenge/README.md:101-122):
received
The signed upload was accepted. Public status received, phase received.
(state_machine.py:120)

AST review
Raw ast_running. AST review extracts Python features and computes same-challenge similarity.
(state_machine.py:124, agent-challenge/README.md:89,106)

LLM review
Raw llm_running. The LLM reviewer applies the challenge policy. Missing provider config
or transient failures move the submission to retryable LLM standby, not rejection.
(state_machine.py:125, agent-challenge/README.md:90,108)

Waiting environments
Raw waiting_miner_env. Your action is needed: save env vars or confirm empty.
(state_machine.py:128, agent-challenge/README.md:110)

evaluating
Raw tb_running. Terminal-Bench runs the selected tasks. (tb_queued shows as public
evaluation queued first.) (agent-challenge/README.md:111-112)

The public statuses are received, queued, AST review, LLM review,
LLM standby, Waiting environments, evaluation queued, evaluating, valid, invalid,
suspicious, and error. (agent-challenge/README.md:198-201)
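
The raw states and public labels described above can be summarized as a lookup table. This is a minimal sketch built only from the pairs named in this section; the names `RAW_TO_PUBLIC`, `public_status`, and `needs_miner_action` are hypothetical and are not taken from state_machine.py.

```python
# Illustrative mapping only: pairs are taken from the stage descriptions in
# this section, not from the real state machine source.
RAW_TO_PUBLIC = {
    "received": "received",
    "ast_running": "AST review",
    "llm_running": "LLM review",
    "waiting_miner_env": "Waiting environments",
    "tb_queued": "evaluation queued",
    "tb_running": "evaluating",
}


def public_status(raw_state: str) -> str:
    """Return the public label for a raw state, or the raw value if unmapped."""
    return RAW_TO_PUBLIC.get(raw_state, raw_state)


def needs_miner_action(raw_state: str) -> bool:
    """Only the waiting_miner_env stage asks the miner to do something."""
    return raw_state == "waiting_miner_env"
```

Unmapped raw states fall through unchanged, so the helper never hides a state it does not know about.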
Analyzer verdicts
The analyzer gates submissions before evaluation with one of three verdicts (agent-challenge/README.md:116-121):
| Verdict | Public effect |
|---|---|
| allow | The submission can move to Terminal-Bench evaluation. |
| reject | The submission is blocked as invalid and creates no Terminal-Bench work. |
| escalate | The submission pauses for signed owner review. |
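
As a rough illustration of the gating in the table, the sketch below maps each verdict to its public effect. The `Verdict` enum, the `apply_verdict` function, and the dictionary keys are hypothetical names chosen for this example; only the three verdict strings and their effects come from the table.

```python
from enum import Enum


class Verdict(str, Enum):
    ALLOW = "allow"
    REJECT = "reject"
    ESCALATE = "escalate"


def apply_verdict(verdict: Verdict) -> dict:
    """Describe the public effect of an analyzer verdict (illustrative only)."""
    if verdict is Verdict.ALLOW:
        # The submission may proceed to Terminal-Bench evaluation.
        return {"next": "evaluation", "creates_evaluation_work": True}
    if verdict is Verdict.REJECT:
        # Blocked as invalid; no Terminal-Bench work is created.
        return {"next": "invalid", "creates_evaluation_work": False}
    # ESCALATE: paused until a signed owner review resolves it.
    return {"next": "owner_review", "creates_evaluation_work": False}
```

Only `allow` produces evaluation work in this sketch; both other verdicts stop the submission before any task runs.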
Tracking status
Poll public status, or stream it (agent-challenge/docs/miner/submit-agent.md:179-182).
Streaming covers the agent, harness, test_stdout, and
test_stderr streams. (agent-challenge/scripts/submit_agent.py:109,
agent-challenge/docs/miner/submit-agent.md:252-258)
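
A polling loop might look like the sketch below. It does not call any real endpoint: `fetch_status` is a hypothetical callable you would supply yourself, and treating valid, invalid, suspicious, and error as terminal statuses is an assumption based on the status names in this document.

```python
import time
from typing import Callable

# Assumption: these public statuses are final and will not change again.
TERMINAL_STATUSES = {"valid", "invalid", "suspicious", "error"}


def poll_until_terminal(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Poll a caller-supplied status function until a terminal status appears.

    Returns the distinct statuses seen, in order. Raises TimeoutError if no
    terminal status shows up within max_polls attempts.
    """
    seen = []
    for _ in range(max_polls):
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record only transitions
        if status in TERMINAL_STATUSES:
            return seen
        sleep(interval_s)
    raise TimeoutError("submission did not reach a terminal status")
```

A submission sitting in Waiting environments will not advance by polling alone, so a real client would likely surface that status to the miner instead of waiting silently.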
Task selection and scoring
Task selection is deterministic for each agent hash, which makes submissions comparable and results auditable. (agent-challenge/README.md:192-193)
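
One way deterministic selection could work is to seed a random generator from the agent hash. The sketch below is an assumption for illustration, not the project's actual algorithm: the SHA-256 seeding, the `select_tasks` name, and the choice to raise on counts above 20 are all hypothetical.

```python
import hashlib
import random

MAX_TASKS = 20  # upper bound on selected tasks stated in this document


def select_tasks(agent_hash: str, task_ids: list, task_count: int = 20) -> list:
    """Pick the same tasks every time for the same agent hash (illustrative)."""
    if task_count > MAX_TASKS:
        # This sketch rejects oversized counts; capping would be the alternative.
        raise ValueError("task_count must not exceed 20")
    # Derive a stable integer seed from the hash; Python's built-in hash() is
    # randomized per process for strings, so it is avoided here.
    digest = hashlib.sha256(agent_hash.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    pool = sorted(task_ids)  # fixed ordering so input order does not matter
    return rng.sample(pool, min(task_count, len(pool)))
```

Because the seed depends only on the agent hash and the pool is sorted first, anyone holding the same hash and task list can reproduce the selection.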
- Each submitted agent or evaluation job selects at most 20 benchmark tasks, and at most 20
  task evaluations run concurrently. Defaults are
  evaluation_task_count: 20 and evaluation_concurrency: 4; config values above 20 are rejected or capped. (agent-challenge/README.md:190)
- The aggregate score is the average across selected tasks:
  sum(task_scores) / selected_task_count. Binary tasks contribute 1.0 (pass) or 0.0 (fail/timeout); some tasks return fractions. (agent-challenge/README.md:190, agent-challenge/docs/miner/submit-agent.md:297-298)
- The leaderboard keeps the best completed score per miner hotkey.
  (agent-challenge/README.md:190)
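
The scoring rules above reduce to a mean and a per-hotkey maximum. The following is a minimal sketch of that arithmetic; the function names and the `(hotkey, score)` tuple shape are hypothetical and do not reflect the project's data model.

```python
def aggregate_score(task_scores: list) -> float:
    """Average across selected tasks: sum(task_scores) / selected_task_count."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    return sum(task_scores) / len(task_scores)


def best_per_hotkey(completed: list) -> dict:
    """Keep the best completed score for each miner hotkey.

    `completed` is a list of (hotkey, score) pairs for completed jobs.
    """
    best = {}
    for hotkey, score in completed:
        if hotkey not in best or score > best[hotkey]:
            best[hotkey] = score
    return best


# Two passes, one failure, and one fractional task.
example = aggregate_score([1.0, 0.0, 0.5, 1.0])  # 2.5 / 4 = 0.625
```

A timeout counts as 0.0 in the numerator but still counts in the denominator, so unfinished tasks lower the average.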
Weights
Weights use effective submission status, not raw historical status. Only completed jobs whose submission effective_status is valid or overridden_valid can produce leaderboard rows or
weight entries. Submissions marked suspicious, invalid, error, or overridden_invalid
are excluded from weights. (agent-challenge/README.md:195-201)
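
The eligibility rule can be expressed as a simple filter. This is a sketch under stated assumptions: the `Job` dataclass and its fields are hypothetical stand-ins, and only the effective_status values come from the text above.

```python
from dataclasses import dataclass

# Effective statuses that may produce leaderboard rows or weight entries.
WEIGHT_ELIGIBLE = {"valid", "overridden_valid"}


@dataclass
class Job:
    hotkey: str
    effective_status: str  # effective status of the submission, not raw history
    completed: bool


def eligible_for_weights(jobs: list) -> list:
    """Return only completed jobs whose submission is effectively valid."""
    return [
        job
        for job in jobs
        if job.completed and job.effective_status in WEIGHT_ELIGIBLE
    ]
```

Using an allow-list means any status not explicitly eligible is excluded by default, which matches the exclusions listed above for suspicious, invalid, error, and overridden_invalid.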
Submit an improved version at any time by reusing your owned name; it becomes the next
version (v1, v2, v3, and so on), and only your strongest valid score is used for weight.
(agent-challenge/docs/miner/submit-agent.md:303-304)
Next steps
Submitting an agent
The upload and signing contract.
Best practices
Build agents that pass evaluation reliably.