Source: docs/scoring.md:1-6.
Primary metric: prequential bits-per-byte
During the forced-init re-execution, the challenge feeds the model fresh, single-pass batches from the locked train split and records the model’s loss on each new batch before the optimizer updates on it. Because the data is single-pass, this online (predict-then-train) loss is the prequential code-length by construction. The challenge integrates that code-length over the whole run and normalizes it by the raw UTF-8 bytes of text covered, which gives bits-per-byte (bpb).
Source: docs/scoring.md:8-26.
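As a rough illustration of the prequential accounting described above, the sketch below accumulates each batch's pre-update loss and divides by the bytes covered. It assumes the per-batch loss is a summed cross-entropy in nats, converted to bits by dividing by ln 2. The function name prequential_bpb and the (loss_nats, n_bytes) record shape are hypothetical and are not taken from the challenge code.

```python
import math
from typing import Iterable, Tuple


def prequential_bpb(batches: Iterable[Tuple[float, int]]) -> float:
    """Integrate pre-update loss over a run and normalize by raw bytes.

    Each item is (summed_loss_nats, utf8_byte_count) for one fresh batch,
    where the loss was recorded BEFORE the optimizer stepped on that batch.
    Both the record shape and the nats unit are assumptions of this sketch.
    """
    total_bits = 0.0
    total_bytes = 0
    for loss_nats, n_bytes in batches:
        total_bits += loss_nats / math.log(2)  # convert nats to bits
        total_bytes += n_bytes
    if total_bytes == 0:
        # Mirrors the idea that a zero-coverage run is failed, not scored.
        raise ValueError("zero byte coverage: run cannot be scored")
    return total_bits / total_bytes


# Two batches of 1000 bytes each; the loss falls as the model learns online.
run = [(2000 * math.log(2), 1000), (1000 * math.log(2), 1000)]
bpb = prequential_bpb(run)  # (2000 + 1000) bits over 2000 bytes
```

Because every batch is scored before it is trained on, a model that learns faster accumulates fewer bits over the same bytes, which is what the metric is meant to reward.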
From bpb to final_score
final_score is a documented monotone-decreasing transform of bpb, so a lower bpb yields a better (higher) final_score. The transform is bpb_to_final_score, which returns 1.0 / (1.0 + max(0.0, float(bpb))). The leaderboard’s ORDER BY final_score DESC therefore ranks better learners first.
Source: docs/scoring.md:32-34; src/prism_challenge/evaluator/scoring.py:151-153.
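The transform quoted above can be restated as a standalone function. Only the returned expression comes from the source; the signature and type hints here are assumptions, and the real function in scoring.py may differ in detail.

```python
def bpb_to_final_score(bpb: float) -> float:
    # Negative bpb is clamped to 0.0, so the result always lies in (0, 1].
    return 1.0 / (1.0 + max(0.0, float(bpb)))


# Lower bpb maps to a higher score.
scores = [bpb_to_final_score(b) for b in (0.0, 1.0, 3.0)]
# scores == [1.0, 0.5, 0.25]
```

The clamp keeps the score at its 1.0 ceiling for any non-positive input, and the reciprocal form keeps it strictly positive for any finite bpb.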
Compute normalization, not wall-clock
The score is compute-normalized: it is reported and normalized by tokens consumed (and, optionally, estimated FLOPs), never by wall-clock time. A faster GPU or more GPUs cannot buy a better score; wall-clock is only a safety cap on the run. This keeps scores fair across the 1-to-8 GPU range even though the scored run uses one physical GPU.
Source: docs/scoring.md:36-41; docs/scaling.md:48-67.
Tie-breaker: held-out delta over random init
When two submissions are near-equal on bpb, the challenge breaks the tie with the held-out delta on the secret val split. The delta enters final_score as a bounded tie-break term: it can only reorder submissions whose bpb is within a small epsilon of each other, so a strictly lower bpb is never ranked worse on the primary axis. When no secret val split is scored for a run, the run is graded on bpb alone with no tie-break.
Source: docs/scoring.md:43-56; src/prism_challenge/evaluator/scoring.py:24-31.
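One way a bounded tie-break term can behave is sketched below. This is not the challenge's formula: the cap TIE_BOUND, the exponential squashing, and the assumption that a larger held-out delta is better are all hypothetical choices made for illustration.

```python
import math
from typing import Optional

TIE_BOUND = 1e-6  # hypothetical cap on the tie-break contribution


def base_score(bpb: float) -> float:
    return 1.0 / (1.0 + max(0.0, float(bpb)))


def score_with_tiebreak(bpb: float, heldout_delta: Optional[float]) -> float:
    """Base score plus a term confined to [0, TIE_BOUND]."""
    base = base_score(bpb)
    if heldout_delta is None:
        # No secret val split scored: grade on bpb alone.
        return base
    # Map any non-negative delta into [0, 1] so the term stays bounded.
    squashed = 1.0 - math.exp(-max(0.0, heldout_delta))
    return base + TIE_BOUND * squashed


near_a = score_with_tiebreak(1.0, 0.0)   # same bpb, no held-out gain
near_b = score_with_tiebreak(1.0, 2.0)   # same bpb, larger held-out gain
clear = score_with_tiebreak(0.9, 0.0)    # clearly lower bpb
```

Here near_b outranks near_a because their bpb is identical, while clear outranks both because its base-score lead is far larger than TIE_BOUND.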
Anti-memorization gap (stability)
The challenge measures the train-vs-held-out gap (the converged train bpb against the held-out val bpb on the same byte basis). An excessive gap flags memorization and multiplies a penalty into final_score, so a memorizer ranks below an equivalent non-memorizing learner. The gap comparison is basis-consistent so a benign learner is not falsely flagged.
Source: docs/scoring.md:58-63.
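The gap check can be pictured as a multiplier applied to the score. The threshold and penalty values below are invented for illustration, as is the step-function shape; the source only states that an excessive gap multiplies a penalty into final_score.

```python
GAP_THRESHOLD = 0.5  # hypothetical: largest train/val gap (in bpb) treated as benign
PENALTY = 0.5        # hypothetical multiplier applied when the gap is excessive


def memorization_multiplier(train_bpb: float, val_bpb: float) -> float:
    """Return 1.0 for a benign gap, PENALTY for an excessive one.

    Both inputs must be on the same byte basis for the gap to be meaningful.
    """
    gap = val_bpb - train_bpb
    return PENALTY if gap > GAP_THRESHOLD else 1.0


benign = memorization_multiplier(train_bpb=1.0, val_bpb=1.1)
memorizer = memorization_multiplier(train_bpb=0.2, val_bpb=1.5)
```

With equal base scores, the run that produced memorizer ends up with half the final score of the run that produced benign under these assumed constants.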
Anomaly zeroing
A step-0 / smuggled-weights anomaly (an impossibly low initial loss under forced random init) drives the anti-cheat multiplier to zero, so an anomalously good bpb is flagged and zeroed rather than rewarded. A degenerate run (zero coverage, non-finite, or out-of-band bpb) is failed rather than scored.
Source: docs/scoring.md:66-70.
Scaling signals
PRISM keeps the score compute-normalized so hardware never changes the ranking, and it records a typed, observability-only compute block in the manifest — the GPUs leased (gpu_count, which is 1 for the scored nproc=1 path), the launch shape (world_size, nproc_per_node, device), and the realized parameter count. The final_score never reads gpu_count, so there is no GPU-count reward and no multi-GPU scaling bonus.
Two official scored execution modes run on the locked FineWeb-Edu data:
| Mode | Purpose | Dataset target |
|---|---|---|
| gpu_proxy_eval | Default official scored re-execution | FineWeb-Edu sample-10BT locked shards |
| full_scale_eval | Larger official scored re-execution | FineWeb-Edu sample-10BT then sample-100BT phases |
Source: docs/scaling.md:9-19, 61-67.
Leaderboard and tie-break ordering
The leaderboard ranks by final_score (so by bpb and the folded-in held-out delta). When two submissions are still equal, the final deterministic tie-break is earliest-commit-wins, then submission id, implemented as ORDER BY sc.final_score DESC, s.created_at ASC, s.id ASC. Each hotkey appears at most once: the best submission per hotkey survives.
Source: docs/scoring.md:72-77; src/prism_challenge/repository.py:506.
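The ordering and per-hotkey deduplication described above can be mimicked in memory as follows. The Row dataclass and the leaderboard function are hypothetical stand-ins for the SQL in repository.py; only the sort order (final_score descending, created_at ascending, id ascending) comes from the source. The sketch also assumes that the same ordering decides which submission is "best" for a hotkey.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Row:
    id: int
    hotkey: str
    final_score: float
    created_at: datetime


def order_key(row: Row) -> Tuple[float, datetime, int]:
    # Negate the score so an ascending sort yields final_score DESC.
    return (-row.final_score, row.created_at, row.id)


def leaderboard(rows: List[Row]) -> List[Row]:
    best: Dict[str, Row] = {}
    for row in rows:
        current = best.get(row.hotkey)
        if current is None or order_key(row) < order_key(current):
            best[row.hotkey] = row  # keep only the best row per hotkey
    return sorted(best.values(), key=order_key)


t0, t1 = datetime(2025, 1, 1), datetime(2025, 1, 2)
rows = [
    Row(1, "A", 0.5, t0),
    Row(2, "A", 0.6, t1),
    Row(3, "B", 0.6, t0),
    Row(4, "C", 0.6, t0),
]
ranked_ids = [r.id for r in leaderboard(rows)]
# ranked_ids == [3, 4, 2]
```

Rows 3 and 4 tie on score and timestamp, so the lower id wins; row 2 has the same score but was created later, so it ranks third, and row 1 is dropped because hotkey A already has a better submission.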
Weights
get_weights converts completed scores into normalized weights: one weight per hotkey, taken from that hotkey’s best final_score, normalized to sum to 1.0. Weights are always dry-run and are never written on-chain.
The legacy raw-loss term and the v1-NAS architecture/training ownership pools are retired from the score. Every number above is recomputed by the challenge from the challenge-authored prism_run_manifest.v2.json; miner-reported metrics and miner-written manifests are ignored.
Source: docs/scoring.md:80-89; src/prism_challenge/weights.py:21-31.
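The normalization step can be sketched as below. The function name normalize_weights and its input shape are hypothetical; the actual get_weights in weights.py may handle edge cases (such as an empty or all-zero score set) differently.

```python
from typing import Dict, Iterable, Tuple


def normalize_weights(scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """One weight per hotkey from its best final_score, summing to 1.0."""
    best: Dict[str, float] = {}
    for hotkey, final_score in scores:
        if final_score > best.get(hotkey, float("-inf")):
            best[hotkey] = final_score
    total = sum(best.values())
    if total <= 0.0:
        return {}  # assumption: nothing to distribute when no positive score exists
    return {hotkey: score / total for hotkey, score in best.items()}


weights = normalize_weights(
    [("A", 0.5), ("A", 0.25), ("B", 0.25), ("C", 0.25)]
)
# weights == {"A": 0.5, "B": 0.25, "C": 0.25}
```

Hotkey A's weaker 0.25 submission is ignored, so each hotkey contributes exactly one entry regardless of how many submissions it made.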
Reference studies
PRISM’s scoring cites the following studies (reproduced from the source scoring doc):

| Area | Study | PRISM implication |
|---|---|---|
| Prequential / online coding | Dawid, 1984 | Score the integrated online loss, not a final checkpoint. |
| Minimum description length | Rissanen, 1978 | Treat compression (code-length) as the learning signal. |
| Scaling laws | Kaplan et al., 2020 | Compare loss trajectories under matched compute. |
| Compute-optimal scaling | Hoffmann et al., 2022 | Normalize by tokens/compute so over/under-training does not skew ranking. |
| Dataset provenance | Penedo et al., 2024 | Freeze the data revision and shards for reproducible runs. |
Source: docs/scoring.md:91-99.