# Step 1 harness & operator run-log

Operator: Red (Claude/Opus), acting as harness only — **not** a reviewer. Set up and ran the
four GPT-5.5 peer-review instances per `protocol.md` §4–§6 and `runs/prompts/README.md`.
Date: 2026-06-03.

## Model & parameters
- **Model:** `gpt-5.5-2026-04-23` (version-pinned alias of `gpt-5.5`), OpenAI API.
- **Reasoning effort:** `high`. Temperature not set (reasoning model default).
- **API key:** `OPENAI_API_KEY` from `ghrbook_website/.env` (gitignored). First OpenAI use on Red.
- **Access:** OpenAI Responses API. **No hosted tools** — no `web_search`, no browsing, no
  code-interpreter. Web is therefore off *structurally*, not by instruction: the only network
  egress is the model API call itself.

## What each instance can see (and only this)
Tools given to the model (local Python functions; see `runner.py`):
- `list_docs()` → the document names below.
- `read_doc(name)` → full text of one document.
- `write_note(name, content)` → scratch memory under `runs/scratch/<ROLE>/` (model's own use).

The read-only **text mount** (`runs/_mount/`, built by `extract_mount.py` with `pdftotext -layout`):
- `proof` ← `runs/inputs/27_jan_2026_bmj-2026-089636_proof_hi.pdf` (submitted proof incl. the bound
  supplementary appendix, tables S1–S14, SAP pointer).
- `INDEX` ← `pdfs/INDEX.md` (reference number → source map).
- `ref01`…`ref24` ← the cited-reference PDFs. **22 present; ref05 and ref16 absent by design**
  (the prompts already tell reviewers to mark anything resting on them "uncheckable").

**Nothing else is reachable.** `read_doc` resolves names against the mount only and rejects
anything outside it — no path traversal, no other files. `_not-for-reviewers/`, `../zhang/`, the
raw data, analysis code, published paper, decision letter, human reviews, and authors' reply were
**never mounted** and cannot be read.

## Prompt handling (integrity)
- Each instance's instruction is the text **below the `---` rule** in `runs/prompts/<ROLE>.md`,
  sent **verbatim** as the `user` message. Operator notes above the rule were not sent.
- A fixed `developer` message (identical for all four) describes only the offline harness and the
  three tools, and bridges the prompts' "pdfs/…" wording to `read_doc`. It contains **no** content
  about the paper, the trial, or any expected findings. Full text is teed at the top of every
  transcript so it can be audited.
- One brand-new session per role; no shared state; instances cannot see each other.

## Transcripts
Teed live to `runs/R-STAT.md`, `runs/R-CLIN-A.md`, `runs/R-CLIN-B.md`, `runs/R-LIT.md`. Each contains:
header (model, effort, web-disabled, timestamp, doc list) · developer message · verbatim user
instruction · every reasoning summary the API returns · every `[tool_call]` (tool + args + size of
result) · the complete final review.
- **Note:** each `[tool_call]` line logs *which* document was read and the byte size of the result,
  **not** the document body inline (the bodies are the static mount files, identical across runs;
  inlining them 4× would bloat the logs without adding information). The model's reasoning summaries
  and final review are teed in full. If a fully-inlined log is wanted, it's a one-line change in
  `runner.py`.

## Files in this run package
- `extract_mount.py` — builds `runs/_mount/` from the proof + cited PDFs (mechanical; operator did
  not read contents).
- `runner.py` — the offline harness. `python3 runner.py <ROLE>` runs one instance; `--smoke` runs a
  cheap mechanics check.
- `runs/_mount/` — extracted read-only text (proof + refs + INDEX). **Extracted paper text — do not
  commit** (should be gitignored).
- `runs/scratch/<ROLE>/` — per-instance scratch notes (if the model wrote any).
- `runs/R-*.md` — the four transcripts (deliverable).

## Reproduce
```
cd drafts/notes/ai-peer-review
python3 extract_mount.py            # rebuild the mount
python3 runner.py R-STAT            # (and R-CLIN-A, R-CLIN-B, R-LIT)
```

## Operator attestations
- Operator did **not** read the manuscript, the reference texts, the model outputs, the human
  reviews, or anything under `_not-for-reviewers/` / `../zhang/`. Handled all as opaque files.
- Smoke test passed before the real runs (tools fire, model reads via `read_doc`, no web,
  correct trivial answer).
- All document accesses during the runs are within the mounted allow-list (verifiable from the
  `[tool_call]` lines in each transcript) — withheld-material leakage is structurally impossible.


## Update — 2026-06-03 (post-failure hardening + R-LIT optimization)

- **R-STAT first attempt FAILED** (`APITimeoutError`): the agent read 18 documents (proof + 17
  refs incl. a 300 KB reference) into one growing context; the next request exceeded the SDK's
  default timeout and the runner had no handling, so it died with no review. That attempt still
  incurred token spend (visible on the OpenAI dashboard); no usable output.
- **Hardening (all runs):** request timeout 1800 s + 4 auto-retries; graceful failure capture —
  any crash now writes `RUN FAILED / END` instead of a silent incomplete transcript.
- **Token ledger:** every run logs `api_calls` + input/output/reasoning/total tokens to a
  `===== TOKENS =====` footer in its transcript and a row in `runs/_token_ledger.csv`.
- **Two retrieval modes** (recorded in each transcript header):
  - `full-read` — R-STAT, R-CLIN-A, R-CLIN-B: `read_doc` returns whole documents.
  - `targeted-retrieval` — R-LIT: `search_doc(name, query)` returns matching passages with char
    offsets; `read_doc(name, offset, length)` reads a bounded (<=30k-char) window. The citation
    reviewer looks up the relevant passage in a cited study as needed instead of ingesting/holding
    every reference in full, so context stays bounded. Rationale vs RAG: a citation check is a
    targeted lookup *within a known target document* where exact facts/numbers matter and the
    process must be auditable; agentic keyword+window retrieval is more precise on exact facts and
    fully transparent in the transcript, with no embedding/vector-store infrastructure.
- **Re-run order:** R-STAT canary (full-read, hardened) -> R-CLIN-A, R-CLIN-B -> R-LIT (targeted).
