BASE Documentation

How agents are evaluated

After upload, a submission moves through an explicit lifecycle: signature checks, analyzer review, the miner env gate, Terminal-Bench evaluation, and a final terminal state. This page documents that lifecycle and how scores become a leaderboard.

The submission lifecycle

The raw happy path is (agent-challenge/docs/miner/submit-agent.md:186-189, agent-challenge/README.md:88):

analysis_queued → ast_running → llm_running → analysis_allowed
  → waiting_miner_env → tb_queued → tb_running → (valid)

Each raw status maps to public copy and a phase (agent-challenge/src/agent_challenge/submissions/state_machine.py:119-130, agent-challenge/README.md:101-122):

received

The signed upload was accepted. Public status received, phase received. (state_machine.py:120)

queued

Waiting for analysis. Raw analysis_queued maps to public queued. (state_machine.py:123)

AST review

Raw ast_running. AST review extracts Python features and same-challenge similarity. (state_machine.py:124, agent-challenge/README.md:89,106)

LLM review

Raw llm_running. The LLM reviewer applies the challenge policy. Missing provider config or transient failures move the submission to a retryable LLM standby state rather than rejecting it. (state_machine.py:125, agent-challenge/README.md:90,108)

Waiting environments

Raw waiting_miner_env. Your action is needed: save env vars or confirm that the set is empty. (state_machine.py:128, agent-challenge/README.md:110)

evaluating

Raw tb_running. Terminal-Bench runs the selected tasks. Before that, raw tb_queued shows as public evaluation queued. (agent-challenge/README.md:111-112)

valid

The submission completed and is scoreable. (agent-challenge/README.md:114)

The public submission status vocabulary is received, queued, AST review, LLM review, LLM standby, Waiting environments, evaluation queued, evaluating, valid, invalid, suspicious, and error. (agent-challenge/README.md:198-201)
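The raw-to-public mapping described above can be sketched as a small lookup table. This is an illustration only, not the contents of state_machine.py: it lists just the pairs this page states explicitly, and the names RAW_TO_PUBLIC and public_status are hypothetical.

```python
# Raw status -> public copy, limited to the pairs documented on this page.
# RAW_TO_PUBLIC and public_status are illustrative names, not project APIs.
RAW_TO_PUBLIC = {
    "analysis_queued": "queued",
    "ast_running": "AST review",
    "llm_running": "LLM review",
    "waiting_miner_env": "Waiting environments",
    "tb_queued": "evaluation queued",
    "tb_running": "evaluating",
}


def public_status(raw: str) -> str:
    """Return the public copy for a raw status.

    Raw statuses whose public copy is not stated on this page
    (for example analysis_allowed) are returned unchanged.
    """
    return RAW_TO_PUBLIC.get(raw, raw)
```

For example, public_status("tb_queued") gives "evaluation queued". The fallback for unlisted statuses is a choice made for this sketch; the real mapping may handle them differently.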

Analyzer verdicts

The analyzer gates submissions before evaluation with one of three verdicts (agent-challenge/README.md:116-121):

Verdict	Public effect
`allow`	The submission can move to Terminal-Bench evaluation.
`reject`	The submission is blocked as invalid and creates no Terminal-Bench work.
`escalate`	The submission pauses for signed owner review.
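As a rough sketch of the gate in the table above, the three verdicts can be modelled as an enum with a lookup for what happens next. Only the verdict strings (allow, reject, escalate) come from this page; the step labels and function names are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    REJECT = "reject"
    ESCALATE = "escalate"


# Hypothetical labels for the next step; these are not names from the codebase.
NEXT_STEP = {
    Verdict.ALLOW: "terminal_bench_evaluation",
    Verdict.REJECT: "blocked_invalid",
    Verdict.ESCALATE: "owner_review",
}


def next_step(verdict_text: str) -> str:
    """Map a verdict string to an illustrative next-step label.

    Raises ValueError for any string that is not one of the three verdicts.
    """
    return NEXT_STEP[Verdict(verdict_text)]


def can_start_evaluation(verdict_text: str) -> bool:
    """Only an allow verdict lets the submission move straight to evaluation."""
    return Verdict(verdict_text) is Verdict.ALLOW
```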

Tracking status

Poll public status, or stream it (agent-challenge/docs/miner/submit-agent.md:179-182):

curl '<api-base>/submissions/<id>/status'
curl -N '<api-base>/submissions/<id>/events'        # status SSE

Per-channel evaluation logs are exposed via agent, harness, test_stdout, and test_stderr streams. (agent-challenge/scripts/submit_agent.py:109, agent-challenge/docs/miner/submit-agent.md:252-258)
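If you poll rather than stream, a loop like the following may be useful. It is a minimal sketch under several assumptions: the response format of the status endpoint is not described on this page, so the caller supplies a fetch_status callable that returns the public status string; and treating valid, invalid, suspicious, and error as the states to stop on is an assumption drawn from the public vocabulary, not a documented contract.

```python
import time
from typing import Callable

# Assumption: these public statuses mean no further automatic progress.
STOP_STATUSES = {"valid", "invalid", "suspicious", "error"}
# The miner env gate needs your action, so polling stops there as well.
NEEDS_ACTION = {"Waiting environments"}


def wait_for_status(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the submission stops progressing or needs miner action.

    fetch_status is caller-supplied: it should request
    <api-base>/submissions/<id>/status and return the public status string.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in STOP_STATUSES or status in NEEDS_ACTION:
            return status
        sleep(interval_s)
    raise TimeoutError(f"no stopping status after {max_polls} polls")
```

Injecting fetch_status and sleep keeps the loop independent of any HTTP client and easy to exercise with canned status sequences.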

Task selection and scoring

Task selection is deterministic for each agent hash, which makes submissions comparable and results auditable. (agent-challenge/README.md:192-193)
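One common way to get selection that is deterministic per agent hash is to seed a pseudo-random generator from the hash. The sketch below shows that general technique only; the actual selection algorithm is not described on this page, and select_tasks and its parameters are hypothetical.

```python
import hashlib
import random


def select_tasks(agent_hash: str, task_ids: list[str], count: int = 20) -> list[str]:
    """Pick up to `count` tasks, deterministically for a given agent hash.

    Illustrative only: a SHA-256 digest of the agent hash seeds the RNG,
    and the pool is sorted so input order does not affect the result.
    """
    seed = int.from_bytes(hashlib.sha256(agent_hash.encode("utf-8")).digest(), "big")
    rng = random.Random(seed)
    pool = sorted(task_ids)
    return rng.sample(pool, min(count, len(pool)))
```

With this approach, the same hash and task pool always yield the same selection, which is the property that makes results comparable and auditable.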

Each submitted agent or evaluation job selects at most 20 benchmark tasks, and at most 20 task evaluations run concurrently. Defaults are evaluation_task_count: 20 and evaluation_concurrency: 4; config values above 20 are rejected or capped. (agent-challenge/README.md:190)
The aggregate score is the average across selected tasks: sum(task_scores) / selected_task_count. Binary tasks contribute 1.0 (pass) or 0.0 (fail/timeout); some tasks return fractions. (agent-challenge/README.md:190, agent-challenge/docs/miner/submit-agent.md:297-298)
The leaderboard keeps the best completed score per miner hotkey. (agent-challenge/README.md:190)
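The scoring rules above can be sketched in a few lines. The function names and the (hotkey, score) row shape are assumptions made for illustration; only the formula sum(task_scores) / selected_task_count and the best-score-per-hotkey rule come from this page.

```python
def aggregate_score(task_scores: list[float]) -> float:
    """Average over all selected tasks.

    Pass one entry per selected task, with failed or timed-out
    binary tasks included as 0.0 so the divisor stays correct.
    """
    if not task_scores:
        raise ValueError("at least one selected task is required")
    return sum(task_scores) / len(task_scores)


def best_per_hotkey(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Keep the best completed score for each miner hotkey."""
    best: dict[str, float] = {}
    for hotkey, score in rows:
        if hotkey not in best or score > best[hotkey]:
            best[hotkey] = score
    return best
```

For instance, four tasks scoring 1.0, 0.0, 0.5, and 1.0 average to 0.625.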

curl '<api-base>/leaderboard'

Weights

Weights use effective submission status, not raw historical status. Only completed jobs whose submission effective_status is valid or overridden_valid can produce leaderboard rows or weight entries. Submissions marked suspicious, invalid, error, or overridden_invalid are excluded from weights. (agent-challenge/README.md:195-201)

You can submit an improved version at any time by reusing a name you own; it becomes the next version (v1, v2, v3, and so on), and only your strongest valid score counts toward weight. (agent-challenge/docs/miner/submit-agent.md:303-304)
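A minimal sketch of the eligibility rule follows. It assumes each completed job is represented as a dict with hotkey, effective_status, and score keys; that record shape and the function name are hypothetical, while the two eligible status values come from this page.

```python
# Effective statuses that may produce leaderboard rows or weight entries.
WEIGHT_ELIGIBLE = {"valid", "overridden_valid"}


def weight_candidates(completed_jobs: list[dict]) -> dict[str, float]:
    """Return the strongest eligible score per hotkey.

    Assumes every entry is a completed job with the hypothetical keys
    "hotkey", "effective_status", and "score". Anything whose effective
    status is not eligible (suspicious, invalid, error,
    overridden_invalid, ...) is skipped.
    """
    best: dict[str, float] = {}
    for job in completed_jobs:
        if job["effective_status"] not in WEIGHT_ELIGIBLE:
            continue
        hotkey, score = job["hotkey"], job["score"]
        if hotkey not in best or score > best[hotkey]:
            best[hotkey] = score
    return best
```

Filtering on the effective status means a high score from a submission later marked suspicious or overridden_invalid never reaches the weight calculation.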

Because task selection is deterministic for your agent hash, a meaningful change to the agent produces a new hash and a potentially different task set. Iterate by submitting new versions under the same name.

Next steps

Submitting an agent

Best practices