# AI Peer Review — Study Design and Protocol

This document describes how the study behind the note [*What Is the Role of AI in Peer Review?*](../ai-peer-review.html) was designed and run. It covers the thesis, the design principles, the blind AI review panel (Step 1), and the analysis pipeline (Steps 2–7).

**Target paper:** Zhang et al., "Prophylactic tranexamic acid for the prevention of postpartum haemorrhage in women with placenta praevia." Submitted manuscript proof dated 27 January 2026 (`bmj-2026-089636`); published *BMJ* 2026;393, [doi:10.1136/bmj-2026-089636](https://www.bmj.com/content/393/bmj-2026-089636), 13 May 2026. Trial registration NCT05811676. *The BMJ* practices [open peer review](https://www.bmj.com/content/393/bmj-2026-089636/peer-review), so the human reports, decision letters, author replies, and both manuscript versions are public.

## Thesis (an n = 1 demonstration, not a validation)

Two grounded questions, no overreach:

1. **On equal footing** — given the same materials the human reviewers had (the submitted manuscript and its cited references), how does an AI peer-review panel compare to a genuinely strong human panel: what does it catch, miss, or invent? The one affordance the AI is given that human reviewers rarely spend time on is reading every cited paper in full.
2. **With the data and code** — released only with the final paper — what does an independent reanalysis show, and which review points, human or AI, does that ground truth vindicate?

The AI panel is never given the raw data. The data-stage work (Step 5) is due diligence on the *final published* paper, not a blind peer review; keeping the two apart is what keeps the thesis defensible.

## Design principles

The pipeline is **symmetric, descriptive, and compartmentalized**:

- **Symmetric** — the AI panel and the human panel get the same analytic treatment, each handled by a separate, fresh analyst instance that never sees the other side. Neither side is handicapped; in particular, the human panel is a strong one (a journal statistical editor and established antifibrinolytics methodologists), so this is not a strawman.
- **Descriptive** — the analysis maps what was raised (agreement, disagreement, non-overlap), not what *should* have been raised. The one normative judgment kept is **groundedness**: does a point correctly describe what is actually in the manuscript or a cited source? That is auditable against documents rather than a matter of taste. The study deliberately avoids authoring a "should have caught X" yardstick, which would be circular — an AI-authored gold standard inherits the AI reviewer's blind spots and judges the humans against an AI-shaped ruler.
- **Compartmentalized** — every analysis step is a fresh model instance with a defined input set, so contamination never accumulates across the pipeline.

Where the study needs an external signal for "did this point matter," it reads the authors' reply and the submitted → published diff. What actually changed in the paper is the real-world outcome of peer review, and it is external to the analysis.
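One way to read the submitted → published diff mechanically is sketched below. It assumes both versions have already been split into sentences; the function name `changed_passages` and the placeholder sentences are illustrative assumptions, not the study's tooling. It uses Python's standard `difflib` to list only the spans that changed.

```python
import difflib

def changed_passages(submitted, published):
    """Return (tag, old_sentences, new_sentences) for every span that is not equal."""
    matcher = difflib.SequenceMatcher(a=submitted, b=published, autojunk=False)
    return [
        (tag, submitted[i1:i2], published[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Placeholder sentences, not text from the manuscript.
submitted = ["Sentence A.", "Sentence B.", "Sentence C."]
published = ["Sentence A.", "Sentence B, revised.", "Sentence C.", "Sentence D."]

for tag, old, new in changed_passages(submitted, published):
    print(tag, old, new)
```

Each changed span can then be matched by hand to the review point and author reply that prompted it; the code only locates the change, it does not judge its importance.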

## Who runs what

- **GPT-5.5** is the AI peer reviewer — the object under study. Used only in Step 1, blind.
- **Claude Opus** runs the analysis (Steps 2–7), each step a fresh instance with a scoped input set:
  - Steps 2 (AI-panel summary) and 3 (human-panel summary) are each **blind to reviewer type** — they receive de-identified review text only and never see the other panel.
  - Step 4 (compare) is blind in its first phase, then unblinds.
  - Steps 5–7 (data reanalysis, post-publication review, write-up) are non-blind by design.
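The scoped input sets above could be written down as a small manifest and checked before any step runs. The sketch below is a hypothetical illustration: the artifact labels, the `Step` class, and the `leaks` helper are assumptions made for the example, not names from the study's harness, and only Steps 1 to 3 are shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    model: str
    inputs: frozenset     # artifacts mounted for this step
    forbidden: frozenset  # artifacts this step must never see

# Hypothetical labels for the post-submission materials described in this document.
POST_SUBMISSION = frozenset({
    "published_paper", "human_reviews", "decision_letter",
    "author_reply", "raw_data", "analysis_code",
})

STEPS = [
    Step("1 AI review panel", "GPT-5.5",
         frozenset({"submitted_manuscript", "cited_pdfs"}), POST_SUBMISSION),
    Step("2 AI-panel summary", "Claude Opus",
         frozenset({"deidentified_ai_reviews"}),
         frozenset({"deidentified_human_reviews", "sealed_key"})),
    Step("3 human-panel summary", "Claude Opus",
         frozenset({"deidentified_human_reviews"}),
         frozenset({"deidentified_ai_reviews", "sealed_key"})),
]

def leaks(step):
    """Artifacts that are both mounted and forbidden; should always be empty."""
    return sorted(step.inputs & step.forbidden)

for step in STEPS:
    assert not leaks(step), f"step {step.name} sees {leaks(step)}"
```

Declaring the forbidden set per step, rather than one global list, reflects that what counts as contamination differs by step: the Step 3 analyst legitimately reads human reviews that Step 1 must never see.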

## Step 1 — the blind AI peer-review panel

Four GPT-5.5 instances reviewed the submitted manuscript independently, in fresh sessions with no shared state.

**Inputs (exactly these, mounted read-only):**

1. The **submitted manuscript proof**, which already contains the supplementary appendix (tables S1–S14, the per-protocol detail, and the pointer to the pre-specified protocol and statistical analysis plan) — exactly as the human reviewers received it at submission.
2. The **cited references of the submitted manuscript** — 24 references, 22 of them as full-text PDFs (references 5 and 16 were unavailable and any point resting on them was marked uncheckable).
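The reference bookkeeping above can be expressed as a small check. This is a minimal sketch: the reference counts and the two unavailable numbers come from the text, while the function name `citation_status` and its return format are assumptions for illustration.

```python
TOTAL_REFERENCES = 24             # references cited in the submitted manuscript
UNAVAILABLE = frozenset({5, 16})  # no full text could be provided

AVAILABLE = frozenset(range(1, TOTAL_REFERENCES + 1)) - UNAVAILABLE

def citation_status(cited_refs):
    """Classify a review point by whether every reference it rests on can be opened."""
    refs = set(cited_refs)
    unknown = refs - AVAILABLE - UNAVAILABLE
    if unknown:
        raise ValueError(f"not in the submitted reference list: {sorted(unknown)}")
    missing = sorted(refs & UNAVAILABLE)
    return ("uncheckable", missing) if missing else ("checkable", [])

print(len(AVAILABLE))              # number of references with full text
print(citation_status([3, 7]))     # rests only on available references
print(citation_status([4, 16]))    # rests partly on an unavailable reference
```

A point that rests on even one unavailable reference is treated as uncheckable here, which is the conservative reading of the rule.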

The raw data and analysis code were withheld, because they were not available to the human reviewers either (data "on reasonable request"; code promised only in the published article). They enter the study at Step 5.

**References are the submitted set, not the published set.** The submitted manuscript cited 24 references; the published version added two. Step 1 used only the 24 the authors actually submitted, so the panel saw the manuscript as the human reviewers first did.

**Web access was off, at the harness level.** The paper and its reviews have been public since 13 May 2026, so any live browsing would risk contamination; with access disabled in the harness, the instances could not browse at all. Citation checking was preserved in a stronger form: the cited PDFs were provided locally, and a dedicated literature reviewer read them to verify each citation.

**The four roles** mirror a real journal review panel, plus the one affordance available without the data:

| Reviewer | Focus |
|---|---|
| Statistician | Design and analysis: outcome definition, power, model specification, clustering, multiplicity, and whether the conclusions are supported; compares reported analyses against the pre-specified protocol and SAP. |
| Obstetrics / maternal-health methodologist | Clinical substance: whether the outcomes matter to women and clinicians, how blood loss is ascertained, adverse-event capture, and fit with the prior evidence on antifibrinolytics. |
| Generalist clinical reviewer | Interpretability, applicability, and safety: is the primary outcome clinically meaningful, is the trial powered for the outcomes it comments on, how generalizable is it, and does the framing match the evidence. |
| Literature / citation reviewer | Whether each citation is accurate and fairly used — opening the cited PDF and checking it says what the manuscript attributes to it. (No human analogue; this is the affordance.) |

No instance was told it had a human analogue. The prompts were written from each *role*, not from the points the human reviewers actually raised, to avoid manufacturing a match. The full prompt and required output format for each reviewer are reproduced verbatim at the top of its report (the four `ai-review-*.md` files). Each report ends with a "What I could not assess" section, surfacing the model's own account of its blind spots, which the study compares against what it actually missed.

**Knowledge cutoff.** GPT-5.5 has a December 2025 cutoff. The trial's registry entry and protocol predate it, so the model could have known the trial's *design*; the results, manuscript, peer reviews, and data became public only afterward (submission proof 27 January 2026, published 13 May 2026), so it could not have known the *findings*. Because the instances ran fully offline, they also could not have retrieved the published paper or its reviews. (One human reviewer disclosed using ChatGPT to polish the language of their report, a reminder that AI is already inside the human review loop.)
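The timeline argument reduces to simple date arithmetic, sketched below. The two event dates are the ones given above; reading "December 2025" as the last day of that month is an assumption (the most generous one for the model), and the names are illustrative.

```python
from datetime import date

# Assumption: "December 2025" is read as the last day of that month.
KNOWLEDGE_CUTOFF = date(2025, 12, 31)

EVENTS = {
    "submitted manuscript proof": date(2026, 1, 27),
    "publication with open peer review": date(2026, 5, 13),
}

def after_cutoff(event_date, cutoff=KNOWLEDGE_CUTOFF):
    """True if the event falls strictly after the model's knowledge cutoff."""
    return event_date > cutoff

for label, when in EVENTS.items():
    gap_days = (when - KNOWLEDGE_CUTOFF).days
    print(f"{label}: {gap_days} days after cutoff, post-cutoff={after_cutoff(when)}")
```

Even under this generous reading of the cutoff, both events fall after it, which is the basis for the claim that the findings could not have been in the training data.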

## Steps 2–7 — the analysis pipeline

**Step 1.5 — de-identification (mechanical).** Each AI review is reduced to its review body and the human reviews are stripped of names, competing-interest statements, AI-use disclosures, and journal formatting; both panels are relabeled in randomized order. A sealed key holds the mapping for unblinding at Step 4b. Blinding is best-effort and metadata-blind, not guaranteed indistinguishable — content texture still leaks (the AI panel includes a full-citation-checking role with no human analogue, and four identically structured reports read differently from human prose). The study does not hide this; it measures it at Step 4a.
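The relabeling and sealed key can be sketched as follows. This is a hypothetical illustration only: it covers the randomized relabeling, not the stripping of names or disclosures, and the function name, the label format, the review identifiers, and the number of human reviews (a placeholder) are all assumptions.

```python
import random
import string

def relabel(panels, seed):
    """Assign neutral labels in random order; return (blinded, sealed_key)."""
    rng = random.Random(seed)          # seeded so the key can be regenerated and audited
    panel_names = list(panels)
    rng.shuffle(panel_names)           # which panel becomes "A" is random
    blinded, sealed_key = {}, {}
    for p_idx, panel in enumerate(panel_names):
        panel_label = f"Panel {string.ascii_uppercase[p_idx]}"
        reviews = list(panels[panel].items())
        rng.shuffle(reviews)           # review order within a panel is random too
        for r_idx, (source_id, text) in enumerate(reviews, start=1):
            label = f"{panel_label}, Review {r_idx}"
            blinded[label] = text
            sealed_key[label] = (panel, source_id)
    return blinded, sealed_key

# Placeholder bodies; the human review count here is illustrative, not the real number.
panels = {
    "ai": {f"ai-{i}": f"<AI review body {i}>" for i in range(1, 5)},
    "human": {f"human-{i}": f"<human review body {i}>" for i in range(1, 4)},
}
blinded, sealed_key = relabel(panels, seed=2026)
```

Only `blinded` would be handed to the Step 2 to 4a analysts; `sealed_key` stays with whoever performs the de-identification until Step 4b.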

**Step 2 — AI-panel summary + a drafted "AI first decision"** *(fresh analyst, blind to type).* Given only the de-identified AI reviews: describe where the four reviews agree, disagree, and do not overlap, then, acting as an editor over the four reports, draft a first decision (accept / minor / major / reject, with rationale).
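The descriptive mapping (agree, disagree, no overlap) can be made concrete with a small sketch. It assumes each review has already been coded into topic → stance pairs, which in the study is the analyst's reading task; the topics and stances below are invented for illustration and are not findings from the actual reports.

```python
from collections import defaultdict

def map_points(reviews):
    """Sort topics into agreement, disagreement, and non-overlap across reviews.

    reviews: {review_label: {topic: stance}}
    """
    by_topic = defaultdict(dict)
    for reviewer, points in reviews.items():
        for topic, stance in points.items():
            by_topic[topic][reviewer] = stance

    result = {"agreement": [], "disagreement": [], "non_overlap": []}
    for topic, stances in sorted(by_topic.items()):
        if len(stances) == 1:
            result["non_overlap"].append(topic)       # raised by one review only
        elif len(set(stances.values())) == 1:
            result["agreement"].append(topic)         # raised by several, same stance
        else:
            result["disagreement"].append(topic)      # raised by several, stances differ
    return result

# Invented example coding, not taken from the reports.
reviews = {
    "Review 1": {"clustering": "concern", "power": "adequate"},
    "Review 2": {"clustering": "concern", "power": "concern"},
    "Review 3": {"adverse-event capture": "concern"},
}
print(map_points(reviews))
```

The mapping is purely descriptive: it records what was raised and how, and makes no claim about what should have been raised.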

**Step 3 — human-panel summary** *(fresh analyst, blind to type).* The same descriptive summary for the de-identified human reviews. No decision is drafted here; the real human first decision already exists (the BMJ decision letter) and is used directly at Step 4b. *Asymmetry to disclose:* the AI first decision is one synthetic editor instance, while the human first decision was a real BMJ editorial committee — not perfectly parallel, and stated as such.

**Step 4 — compare AI vs human** *(fresh analyst, two phases).*

- **4a (blind):** given the two summaries with sources still masked, the analyst first guesses which set is AI and which is human and states its cues (itself a finding — can a frontier model tell AI review from human review, and on what tells?), then compares the two for overlap, depth, and severity. This phase has read access to the manuscript and cited PDFs so non-overlapping points can be checked for groundedness — distinguishing a real novel catch from a confident-but-wrong one.
- **4b (unblind):** reveal the key, hand the analyst both first decisions — the real BMJ editorial decision and the Step 2 AI decision — and compare the editorial calls.
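The 4a comparison and groundedness check can be sketched as set arithmetic plus a lookup. The verdict itself still comes from the analyst reading the manuscript and cited PDFs; the enum values, function names, and point identifiers below are assumptions for illustration.

```python
from enum import Enum

class Grounded(Enum):
    SUPPORTED = "matches the manuscript or cited source"
    CONTRADICTED = "misdescribes the manuscript or cited source"
    UNCHECKABLE = "source unavailable"

def compare_panels(set_x, set_y):
    """Split the points of two masked panels into shared and panel-specific."""
    x, y = set(set_x), set(set_y)
    return {"overlap": sorted(x & y), "only_x": sorted(x - y), "only_y": sorted(y - x)}

def label_unique_point(verdict):
    """Turn a groundedness verdict on a non-overlapping point into a finding label."""
    return {
        Grounded.SUPPORTED: "novel catch",
        Grounded.CONTRADICTED: "confident but wrong",
        Grounded.UNCHECKABLE: "unresolved",
    }[verdict]

# Placeholder point identifiers, not real review points.
split = compare_panels({"point-1", "point-2", "point-3"}, {"point-2", "point-4"})
print(split)
print(label_unique_point(Grounded.SUPPORTED))
```

Keeping an explicit "unresolved" outcome avoids forcing a verdict on points that rest on sources nobody could open.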

**Step 5 — data due-diligence reanalysis** *(fresh analyst, non-blind).* Given the published paper, protocol, raw data, and the authors' code, independently reproduce and probe the published result, and adjudicate the items the AI panel deferred ("could not assess without data") and the table-arithmetic flags it raised. This establishes the data-level ground truth behind the note's claims about reproducing and stress-testing the result. (Data fidelity was pre-checked: the shared workbook reproduces the primary outcome and serious-adverse-event counts exactly — see [`data-verification.md`](data-verification.md).)
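A table-arithmetic flag of the kind adjudicated here can be checked with two small helpers, sketched below under stated assumptions: the function names are illustrative and every number in the example is an invented placeholder, not a value from the trial.

```python
def percent_matches(events, total, reported_pct, decimals=1):
    """True if events/total agrees with the reported percentage at its stated precision."""
    half_unit = 0.5 * 10 ** -decimals            # rounding tolerance, e.g. 0.05 for one decimal
    return abs(100 * events / total - reported_pct) <= half_unit + 1e-9

def column_sums_match(parts, reported_total):
    """True if the reported subgroup counts add up to the reported total."""
    return sum(parts) == reported_total

# Invented placeholder values for illustration only.
print(percent_matches(12, 200, 6.0))     # 12/200 is exactly 6.0%
print(percent_matches(12, 200, 6.5))     # off by half a percentage point
print(column_sums_match([40, 60], 100))
```

A failed check is only a flag: it may reflect a typo, a different denominator (for example, missing data), or a rounding convention, which is why the reanalysis resolves these against the raw data.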

**Step 6 — post-publication review** *(fresh analyst, non-blind).* Given everything now known, including the Step 5 reanalysis: is the published paper sound, and what did each review process miss that the data now reveal? A quality verdict informed by ground truth, not a re-run of the AI-vs-human comparison.

**Step 7 — write-up.** Compile all documents from the review process and draft the spine.

## Standing integrity rules

- Step 1 instances were blind to all post-submission materials, with web off at the harness level.
- Steps 2 and 3 analysts are blind to reviewer type and never see the other panel; Step 4 is blind until the key is revealed at 4b.
- De-identification is mechanical, and whoever performs it cannot also be a downstream analyst.
- Groundedness is checked against primary documents; the study does not author a normative "should have caught" yardstick.
- Importance is read from the authors' reply and the submitted → published diff, not asserted by the analyst.
- Source-first: no claim about what a paper, review, or reply says without reading it directly.