PRISM scores a single thing: a model’s ability to learn from scratch, measured as online compression. The primary metric is a prequential bits-per-byte (bpb) score that the challenge computes itself from a forced-init re-execution. A held-out delta-over-random-init breaks near-ties, and an anti-memorization gap penalizes overfitting. Lower bits-per-byte is better. Source: docs/scoring.md:1-6.

Primary metric: prequential bits-per-byte

During the forced-init re-execution, the challenge feeds the model fresh, single-pass batches from the locked train split and records the model’s loss on each new batch before the optimizer updates on it. Because the data is single-pass, this online (predict-then-train) loss is the prequential code-length by construction. The challenge integrates that code-length over the whole run and normalizes it by the raw UTF-8 bytes of text covered:
bpb = (sum over consumed tokens of -log2 p(token)) / total_bytes_covered
Because the denominator is bytes, the metric is tokenizer-agnostic. Because it integrates the whole loss curve, a single good checkpoint cannot game it. Because each token is scored before being trained on, there is no held-out leakage by construction. And because the validator forces random init, smuggled pretrained weights are inert. Source: docs/scoring.md:8-26.
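The predict-then-train loop can be illustrated with a toy learner. The following is a minimal sketch, not the challenge's evaluator: it assumes a byte-level adaptive frequency model with add-one smoothing, so each "token" is one byte and the token count equals the byte count. The function name `prequential_bpb` and the model itself are illustrative assumptions.

```python
import math


def prequential_bpb(data: bytes, alphabet_size: int = 256) -> float:
    """Prequential bits-per-byte of a toy adaptive byte model (illustrative only)."""
    if not data:
        raise ValueError("no bytes covered; bpb is undefined")
    counts = [1] * alphabet_size   # add-one prior: every byte starts equally likely
    total = alphabet_size
    code_length_bits = 0.0
    for byte in data:
        p = counts[byte] / total             # predict BEFORE seeing the byte
        code_length_bits += -math.log2(p)    # pay the code-length for this byte
        counts[byte] += 1                    # only then "train" on it
        total += 1
    # Normalize the integrated code-length by the raw bytes covered.
    return code_length_bits / len(data)


if __name__ == "__main__":
    print(prequential_bpb(b"abracadabra " * 50))
```

Because every byte is charged before the model updates on it, the early, poorly predicted bytes stay in the total. A model that learns faster accumulates fewer bits over the same data, which is the behavior the metric rewards.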

From bpb to final_score

final_score is a documented monotone-decreasing transform of bpb, so a lower bpb yields a better (higher) final_score:
final_score = 1 / (1 + bpb)        # before tie-break, penalty, and anti-cheat multiplier
This transform is implemented in source as bpb_to_final_score, which returns 1.0 / (1.0 + max(0.0, float(bpb))). The leaderboard’s ORDER BY final_score DESC therefore ranks better learners first. Source: docs/scoring.md:32-34; src/prism_challenge/evaluator/scoring.py:151-153.
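A minimal sketch of that transform, built only from the function name and return expression quoted above. The signature and type hints are assumptions, not a copy of the source file.

```python
def bpb_to_final_score(bpb: float) -> float:
    """Map bpb to a score in (0, 1]; lower bpb gives a higher score."""
    # Negative inputs are clamped to 0.0, so the score never exceeds 1.0.
    return 1.0 / (1.0 + max(0.0, float(bpb)))


if __name__ == "__main__":
    for bpb in (0.8, 1.0, 1.5):
        print(bpb, bpb_to_final_score(bpb))
```

The clamp means a negative bpb cannot push the score above 1.0, and the reciprocal keeps the ordering strictly reversed for every non-negative bpb.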

Compute normalization, not wall-clock

The score is compute-normalized: it is reported and normalized by tokens consumed (and, optionally, estimated FLOPs), never by wall-clock time. A faster GPU or more GPUs cannot buy a better score; wall-clock is only a safety cap on the run. This keeps scores fair across the 1-to-8 GPU range even though the scored run uses one physical GPU. Source: docs/scoring.md:36-41; docs/scaling.md:48-67.

Tie-breaker: held-out delta over random init

When two submissions are near-equal on bpb, the challenge breaks the tie with the held-out delta on the secret val split:
heldout_delta = bpb(random-init twin on val) - bpb(trained model on val)
A larger improvement over the random-init twin is better. The held-out delta is folded into final_score as a bounded tie-break term: it can only reorder submissions whose bpb values are within a small epsilon of each other, so a submission whose bpb is lower by more than that epsilon is never ranked worse. When no secret val split is scored for a run, the run is graded on bpb alone with no tie-break. Source: docs/scoring.md:43-56; src/prism_challenge/evaluator/scoring.py:24-31.
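One way a bounded tie-break term can behave is sketched below. The source does not give the actual folding formula, so everything here is an assumption for illustration: the additive form, the constants `TIE_BREAK_BOUND` and `DELTA_SCALE`, and the function names. The sketch only demonstrates the property described above, namely that a bonus capped at a small bound can reorder near-ties and nothing else.

```python
from typing import Optional

TIE_BREAK_BOUND = 1e-4   # hypothetical cap on the tie-break bonus (score units)
DELTA_SCALE = 8.0        # hypothetical delta (in bpb) at which the bonus saturates


def heldout_delta(random_init_val_bpb: float, trained_val_bpb: float) -> float:
    """Improvement of the trained model over its random-init twin on val."""
    return random_init_val_bpb - trained_val_bpb


def score_with_tie_break(bpb: float, delta: Optional[float] = None) -> float:
    """Base score plus a bounded bonus; no bonus when no val split was scored."""
    base = 1.0 / (1.0 + max(0.0, bpb))
    if delta is None:
        return base
    # Clamp to [0, 1] so the bonus can never exceed TIE_BREAK_BOUND.
    bonus = TIE_BREAK_BOUND * min(1.0, max(0.0, delta / DELTA_SCALE))
    return base + bonus


if __name__ == "__main__":
    near_a = score_with_tie_break(1.00001, heldout_delta(8.0, 1.2))
    near_b = score_with_tie_break(1.00000, heldout_delta(8.0, 7.0))
    print(near_a > near_b)  # near-tie: the larger delta can win
```

Since the bonus is capped, two submissions whose base scores differ by more than the cap keep their bpb ordering regardless of their deltas.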

Anti-memorization gap (stability)

The challenge measures the train-vs-held-out gap (the converged train bpb against the held-out val bpb on the same byte basis). An excessive gap flags memorization and multiplies a penalty into final_score, so a memorizer ranks below an equivalent non-memorizing learner. The gap comparison is basis-consistent so a benign learner is not falsely flagged. Source: docs/scoring.md:58-63.
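The gap check can be pictured as a multiplier applied to the score. This is a minimal sketch under stated assumptions: the threshold `max_gap`, the flat `penalty` factor, and the function name are all hypothetical, and the source does not specify the real threshold or penalty shape.

```python
def memorization_multiplier(
    train_bpb: float,
    val_bpb: float,
    max_gap: float = 0.5,   # hypothetical tolerated gap, in bpb
    penalty: float = 0.5,   # hypothetical multiplier for flagged runs
) -> float:
    """Return 1.0 for a benign gap, or a penalty factor for an excessive one."""
    # Both inputs must be on the same byte basis for the gap to be meaningful.
    gap = val_bpb - train_bpb
    return penalty if gap > max_gap else 1.0


if __name__ == "__main__":
    print(memorization_multiplier(train_bpb=1.0, val_bpb=1.1))
    print(memorization_multiplier(train_bpb=0.2, val_bpb=1.5))
```

A learner that generalizes shows nearly the same bpb on both splits and keeps its full score, while a memorizer's low train bpb paired with a high val bpb triggers the penalty.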

Anomaly zeroing

A step-0 / smuggled-weights anomaly (an impossibly low initial loss under forced random init) drives the anti-cheat multiplier to zero, so an anomalously good bpb is flagged and zeroed rather than rewarded. A degenerate run (zero coverage, non-finite, or out-of-band bpb) is failed rather than scored. Source: docs/scoring.md:66-70.
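A sketch of both checks follows. It relies on one known fact, that a uniform predictor over a vocabulary of size V has a loss of log2(V) bits per token, and on the assumption that a freshly initialized model starts near that value. The tolerance, the bpb band, and the function names are hypothetical, since the source does not state the real thresholds.

```python
import math


def anti_cheat_multiplier(
    step0_bits_per_token: float,
    vocab_size: int,
    tolerance: float = 0.5,   # hypothetical fraction of the uniform loss
) -> float:
    """Zero the score when the initial loss is implausibly low for random init."""
    uniform_bits = math.log2(vocab_size)   # loss of a uniform predictor
    return 0.0 if step0_bits_per_token < tolerance * uniform_bits else 1.0


def is_degenerate(bpb: float, bytes_covered: int, max_bpb: float = 64.0) -> bool:
    """Fail (rather than score) runs with no coverage or an unusable bpb."""
    if bytes_covered <= 0 or not math.isfinite(bpb):
        return True
    return not (0.0 < bpb <= max_bpb)      # hypothetical plausibility band


if __name__ == "__main__":
    print(anti_cheat_multiplier(2.0, vocab_size=50257))
    print(is_degenerate(float("nan"), bytes_covered=1000))
```

Keeping the two checks separate mirrors the distinction in the text: an anomaly is scored and then multiplied to zero, while a degenerate run never receives a score at all.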

Scaling signals

PRISM keeps the score compute-normalized so hardware never changes the ranking. It also records a typed, observability-only compute block in the manifest: the GPUs leased (gpu_count, which is 1 for the scored nproc=1 path), the launch shape (world_size, nproc_per_node, device), and the realized parameter count. The final_score never reads gpu_count, so there is no GPU-count reward and no multi-GPU scaling bonus. Two official scored execution modes run on the locked FineWeb-Edu data:
| Mode | Purpose | Dataset target |
| --- | --- | --- |
| gpu_proxy_eval | Default official scored re-execution | FineWeb-Edu sample-10BT locked shards |
| full_scale_eval | Larger official scored re-execution | FineWeb-Edu sample-10BT then sample-100BT phases |
Source: docs/scaling.md:9-19, :61-67.

Leaderboard and tie-break ordering

The leaderboard ranks by final_score (so by bpb and the folded-in held-out delta). When two submissions are still equal, the final deterministic tie-break is earliest-commit-wins, then submission id — implemented as ORDER BY sc.final_score DESC, s.created_at ASC, s.id ASC. Each hotkey appears at most once: the best submission per hotkey survives. Source: docs/scoring.md:72-77; src/prism_challenge/repository.py:506.
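The same ordering can be expressed in plain Python as a sort key. This is a sketch rather than the repository's query code: the `Entry` record and its field types are assumptions that mirror the columns named in the ORDER BY clause, and the per-hotkey pass assumes "best" means first under that same ordering.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    id: int              # submission id
    hotkey: str
    final_score: float
    created_at: float    # hypothetical: commit time as a Unix timestamp


def sort_key(e: Entry) -> Tuple[float, float, int]:
    # final_score DESC, created_at ASC, id ASC
    return (-e.final_score, e.created_at, e.id)


def leaderboard(entries: Iterable[Entry]) -> List[Entry]:
    """Keep the best submission per hotkey, then rank the survivors."""
    best: Dict[str, Entry] = {}
    for e in entries:
        current = best.get(e.hotkey)
        if current is None or sort_key(e) < sort_key(current):
            best[e.hotkey] = e
    return sorted(best.values(), key=sort_key)


if __name__ == "__main__":
    rows = [
        Entry(1, "a", 0.5, 10.0),
        Entry(2, "a", 0.6, 20.0),
        Entry(3, "b", 0.6, 15.0),
    ]
    print([e.id for e in leaderboard(rows)])
```

Negating final_score lets a single ascending tuple comparison reproduce the mixed DESC/ASC ordering, so the tie-break stays deterministic.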

Weights

get_weights converts completed scores into normalized weights: one weight per hotkey, taken from that hotkey’s best final_score, normalized to sum to 1.0. Weights are always dry-run and are never written on-chain. The legacy raw-loss term and the v1-NAS architecture/training ownership pools are retired from the score. Every number above is recomputed by the challenge from the challenge-authored prism_run_manifest.v2.json; miner-reported metrics and miner-written manifests are ignored. Source: docs/scoring.md:80-89; src/prism_challenge/weights.py:21-31.
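The normalization step can be sketched as follows. This is not the body of get_weights: the helper name `normalize_weights`, its input shape, and the empty result when no score is positive are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def normalize_weights(scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """One weight per hotkey from its best final_score, summing to 1.0."""
    best: Dict[str, float] = {}
    for hotkey, final_score in scores:
        # Keep only the best score seen for each hotkey.
        if final_score > best.get(hotkey, float("-inf")):
            best[hotkey] = final_score
    total = sum(best.values())
    if total <= 0.0:
        return {}   # assumption: nothing to weight when no score is positive
    return {hotkey: score / total for hotkey, score in best.items()}


if __name__ == "__main__":
    print(normalize_weights([("a", 0.5), ("a", 0.25), ("b", 0.25)]))
```

Dividing by the sum of the per-hotkey bests, rather than by the sum of all submissions, keeps a hotkey from gaining weight by submitting many times.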

Reference studies

PRISM’s scoring cites the following studies (reproduced from the source scoring doc):
| Area | Study | PRISM implication |
| --- | --- | --- |
| Prequential / online coding | Dawid, 1984 | Score the integrated online loss, not a final checkpoint. |
| Minimum description length | Rissanen, 1978 | Treat compression (code-length) as the learning signal. |
| Scaling laws | Kaplan et al., 2020 | Compare loss trajectories under matched compute. |
| Compute-optimal scaling | Hoffmann et al., 2022 | Normalize by tokens/compute so over/under-training does not skew ranking. |
| Dataset provenance | Penedo et al., 2024 | Freeze the data revision and shards for reproducible runs. |
Source: docs/scoring.md:91-99.