What Is the Role of AI in Peer Review?

A Small Illustration Using Open Peer Review

This note is a companion to a forthcoming chapter on publishing and peer review in Global Health Research in Practice. It uses a single published trial as a worked example. The trial is sound in its central finding; the point here is not to indict it but to watch peer review at work, and to ask what an AI reviewer adds to, and misses from, the job. The working pieces behind it — the four AI reviewer reports, the de-identified human reviews, and the study design and run package — are collected in the companion materials.

What is the role of artificial intelligence in peer review? The question is usually raised as a worry: reviewers pasting manuscripts into chatbots, or authors generating papers faster than anyone can vet them. These are legitimate concerns, but they are only part of the story. What are we missing about the potential of AI in peer review by limiting the conversation to these issues?

A key piece of context is that academic peer review is under tremendous strain. A 2025 feature in Nature described an “overloaded” review process and called the situation a crisis. The arithmetic behind it will make sense to anyone who has served as an editor or reviewer. The number of articles indexed in Scopus and Web of Science was roughly 47% higher in 2022 than in 2016. That growth has outpaced any increase in the number of practicing scientists, so the reviewing burden carried by each researcher has risen sharply. Reviewing is volunteer work, done for free on top of everything else, and editors increasingly report difficulty finding qualified people willing to do it. The standing pressure on the process is to go faster with fewer hands.
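As a rough check on that arithmetic, the 47% figure can be converted into an average yearly rate. The sketch below assumes steady compounding across the six yearly steps from 2016 to 2022, which is a simplification and not a claim about the year-by-year data; the function name is illustrative.

```python
def annualized_growth(total_growth: float, years: int) -> float:
    """Average yearly growth rate implied by a total fractional growth.

    Assumes steady compounding over `years` (a simplifying assumption).
    """
    return (1.0 + total_growth) ** (1.0 / years) - 1.0


# Roughly 47% more indexed articles in 2022 than in 2016: six yearly steps.
rate = annualized_growth(0.47, 2022 - 2016)
print(f"Implied average growth: {rate:.1%} per year")
```

Under that assumption the implied rate is about 6.6% a year, which is the pace the pool of willing reviewers would have had to match just to keep the per-person burden level.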

Into that strain comes a new failure mode. An audit of 2.5 million biomedical papers published between 2023 and early 2026 found citations to studies that do not exist rising more than twelvefold, from about 4 per 10,000 papers in 2023 to roughly 57 per 10,000 by early 2026, with the climb starting in mid-2024 as AI writing tools spread. By the first weeks of 2026, 1 paper in 277 carried at least one invented citation.

How those fabrications were found should reframe the question. Humans did not catch them. They were surfaced by an automated system that checked 97.1 million references against their sources. For a real and growing part of the reviewing job, a machine is already better suited to the work than a person.
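To make the mechanical nature of that check concrete, here is a minimal sketch of the idea. It is not the audit's method, which this note does not describe. It assumes a locally available list of known titles and matches on normalized title text only; the function names and the placeholder titles are hypothetical.

```python
import re


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation and whitespace so that trivial
    formatting differences do not block a match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def unverified_references(references, known_titles):
    """Return the references whose normalized title is absent from the index.

    A miss is a prompt for a human to look, not proof of fabrication:
    the index may simply be incomplete.
    """
    index = {normalize(t) for t in known_titles}
    return [ref for ref in references if normalize(ref) not in index]


# Placeholder data for illustration only.
known = ["Placeholder Study A", "Placeholder Study B"]
cited = ["placeholder study a.", "Placeholder Study C"]
print(unverified_references(cited, known))
```

A production system would need fuzzy matching, identifiers such as DOIs, and a far larger index, but the shape of the task is the same: a lookup repeated millions of times.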

That is a narrow claim. I suspect many reviewers would be glad to receive a report alongside a manuscript that says all included citations are real. But let’s go a step further. Could AI judge whether a paper was cited accurately and appropriately, true to the original report? Expert human reviewers can do this when reviewing manuscripts in their specialty, but even then there are bound to be some references in a list of 20 to 60 that a reviewer has not read. Are human peer reviewers always tracking down these unread papers to verify claims?

Verifying claims gets us into the realm of judgment, not just mechanical database lookups. Let’s ask the big question. How do today’s frontier models compare with expert human reviewers across the whole task: not only checking references, but weighing methods, judging whether a result means what the authors say it does, and deciding whether a paper should be revised or rejected? A clean demonstration could show where AI assistance genuinely helps and where it might not, which is the practical question for anyone who runs a journal.

I have a stake in this. I serve as an academic editor at a journal whose policy is, in effect, that AI stays out of review, and I have come to think that policy forgoes a real opportunity to improve both the quality and the efficiency of peer review with AI assistants used deliberately and in the open. To probe it, I set up a small, deliberately limited test. I’ll tell you what I learned and what I would do if the decision about journal policy were mine.

The manuscript

I set out to find a particular kind of paper. I started with journals that have open peer review so the human reports are on the record to compare against. I also looked for papers where the data were available for re-running analyses. Finally, the paper and peer reviews had to be published online after the model’s training cutoff, so the model cannot have absorbed the outcome in advance. One recent trial meets all three conditions.

In January 2026, a team of obstetricians submitted a large randomized trial to The BMJ. The question was practical and important: does prophylactic tranexamic acid, a cheap antifibrinolytic drug, reduce postpartum hemorrhage in women with placenta previa delivering by cesarean, a group at high risk of catastrophic bleeding? The trial randomized 1,732 women across 24 maternity hospitals in China. It was published in May 2026.

The BMJ practices open peer review. The reviewers’ reports, the editors’ decision, the authors’ replies, and both versions of the manuscript are all published alongside the paper. That openness is the precondition for everything below: it let me take the exact manuscript the human reviewers saw and hand it, unaltered, to a panel of AI reviewers under controlled conditions.

[Image: Screenshot of The BMJ web page for the trial, showing the Peer review tab with a table listing the original article (27 January 2026), first decision (1 March 2026), author response (24 March 2026), and ICMJE forms (11 April 2026), each with an access-document link.]
Figure 1: The trial’s open peer-review record at The BMJ: the original submission, the editors’ first decision, the authors’ response, and the ICMJE disclosure forms are each posted in full alongside the paper. The published reviews, decisions, and both versions of the manuscript are what make the comparison in this note possible.

The setup

This is a demonstration on one paper, not a validation study. With an n of one, nothing here generalizes to a rate at which AI “beats” or “loses to” human reviewers. What one careful case can do is illustrate how the two kinds of review differ. It helps that this is a strong paper, well conducted and thoroughly reviewed, which makes it a clean test case. A weaker manuscript, or a journal without this depth of review, might expose strengths and failure modes on both sides that this comparison cannot.

I asked Anthropic’s Claude Code (Opus 4.8) to assemble a panel of four AI reviewers, each a separate, fresh instance of a frontier model, OpenAI’s GPT-5.5, to run fully offline with no web access. Each received only what the human reviewers had at submission: the submitted manuscript, its bound statistical appendix and protocol, and the full text PDFs of its cited references. They were given four roles that mirror a real review panel: a statistician, two clinical reviewers, and a citation checker. Each wrote an independent report and an editorial recommendation; all four are reproduced in full in the companion materials. The study authors submitted their manuscript to the journal in late January 2026, more than a month after the training cutoff date for the AI model.

Where the two panels agreed

I’ll start with the verdict because it is the cleanest result. Both panels reached the same editorial decision. The human review process ended in a decision to revise and resubmit. The AI panel, synthesizing its four reports, recommended major revision. Same disposition, independently reached.

They also overlapped on the substance that mattered most:

  • The composite primary outcome. The trial defined hemorrhage as calculated blood loss of at least 1,000 mL or a red-cell transfusion within two days. Both panels pressed on this: the two components mix a formula-based estimate with a bedside clinical decision that varies across hospitals, and the benefit might rest mostly on the formula.
  • Overstated safety. The trial reported four serious adverse events in each arm and concluded it found no increase in harm. Both panels objected that with eight events total the trial is far too small to rule out a clinically important excess of rare harms like thrombosis, and that the wording should say “no signal detected,” not “no increase.”
  • Limited reach. Both flagged that the trial was powered for its composite endpoint, not for the outcomes that matter most to women—death, hysterectomy, intensive care.

On the core scientific reading of the paper, in other words, the machine panel and the human panel were not far apart.
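The safety objection can be put in numbers. The sketch below computes a crude risk ratio with a Wald-type 95% interval on the log scale for four serious adverse events in each arm. The arm sizes (845 and 846) are an assumption borrowed from the primary-outcome denominators reported later in this note; the trial's safety population may differ, and this is not the trial's own analysis.

```python
import math


def risk_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Crude risk ratio (arm A over arm B) with a Wald interval on the log scale."""
    rr = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Four serious adverse events per arm; arm sizes are assumed (see lead-in).
rr, lower, upper = risk_ratio_ci(4, 845, 4, 846)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With these inputs the interval stretches from roughly a quarter of the placebo risk to roughly four times it. Data that cannot exclude a fourfold excess support “no signal detected” but not “no increase.”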

Where they diverged

The differences map to the relative strengths of machines and humans.

The AI panel was relentless on the mechanical and the verifiable. It checked the arithmetic of the trial’s own tables and found five places where the reported numbers were internally impossible: a mean blood-loss difference of -24.70 mL paired with a confidence interval running from -41.30 to 90.70, far too lopsided around its own estimate to be a real interval; a laboratory complication listed as “71 (1.1%)” when 71 is plainly not 1.1% of the sample. It opened the cited references and caught citations that did not support their claims: a stated 56% hemorrhage rate drawn from a narrow, selected cohort and presented as general, and a claim that the trial’s benefit was “substantially lower” in number-needed-to-treat than a major predecessor trial when the two are nearly identical. And its AI statistician flagged that the absolute risk difference and confidence interval the analysis plan called for were missing from the results table, which reported a relative risk only.
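Both of the table checks described above are simple enough to sketch. The first asks whether an interval sits roughly evenly around its estimate, which is expected for a mean difference but not for ratio measures, whose intervals are naturally lopsided on the original scale. The second asks whether a reported percentage matches its count. The 5% tolerance and the denominator of 1,732 (all randomized women) are assumptions made here for illustration; the table's actual denominator is not given in this note.

```python
def ci_distances(estimate, lower, upper):
    """Distances from the estimate down to the lower bound and up to the upper bound."""
    return estimate - lower, upper - estimate


def is_roughly_symmetric(estimate, lower, upper, tolerance=0.05):
    """True if the two distances differ by at most `tolerance` of the larger one.

    Only meaningful for differences (for example a mean difference), not ratios.
    """
    below, above = ci_distances(estimate, lower, upper)
    if below <= 0 or above <= 0:
        return False  # the estimate is not inside its own interval
    return abs(below - above) <= tolerance * max(below, above)


def percent_matches(count, denominator, reported_percent, decimals=1):
    """True if count/denominator rounds to the reported percentage."""
    return round(100 * count / denominator, decimals) == reported_percent


# Mean blood-loss difference and interval as quoted in the text.
print(ci_distances(-24.70, -41.30, 90.70))
print(is_roughly_symmetric(-24.70, -41.30, 90.70))

# "71 (1.1%)" against an assumed denominator of 1,732.
print(percent_matches(71, 1732, 1.1))
```

The interval extends about 17 mL below the estimate and about 115 mL above it, and 71 of 1,732 is about 4.1%, so both checks fail. Neither check says what the correct value is; each only tells the authors where to look.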

The human panel contributed the judgment calls. The journal’s statistical reviewer made the single sharpest interpretive point in either review: the trial’s effect fell short of the 20% reduction it was designed and powered to detect, so the conclusion should be moderated. The human reviews also questioned whether adverse events gathered by telephone six weeks after delivery were vulnerable to recall bias, and pressed the authors to bring the trial into line with data-sharing policy. These are the contributions of readers asking what the result means and whether it can be trusted, rather than whether every reported number is accurate.

Checking the analysis, not just the description of it

In most of public health and medicine, a reviewer evaluates a description of an analysis. The data stay on the authors’ machines, the code is rarely shared, and the methods section is taken on trust. Because the authors released this trial’s participant-level data and analysis code with the paper, I could do the rarer thing and run the analysis independently.

This trial also provided a fixed benchmark to check against: a published, peer-reviewed statistical analysis plan, refereed before the results were in. Where a plan like that exists, a reviewer’s main statistical questions become two: did the authors follow it, and did they carry it out correctly?

On the first question, the authors largely did follow their plan. The confirmatory core matched what was pre-specified: the primary outcome and its population, the log-binomial mixed model, the single interim analysis and the significance threshold it set, the decision not to adjust for multiplicity. The two senior authors attest in the paper to the trial’s fidelity to its protocol and analysis plan, and on the analyses that carry the result, the attestation holds. The departures were limited and mostly minor in effect: one pre-specified item dropped from the reporting (see footnote 1) and a figure legend that labeled a post-hoc subgroup as pre-specified.

On the second question, whether the plan was carried out correctly, having access to the data was essential, and the answer is yes. The headline result reproduces exactly: postpartum hemorrhage in 29.7% (251/845) of the tranexamic-acid group versus 35.1% (297/846) of placebo, a relative risk of 0.85 from the trial’s pre-specified model (a log-binomial mixed model with a center random effect, adjusted for maternal age and placenta-previa type). It holds up under pressure, too. My sense is that the trial is carefully conducted and faithfully reported, which is still more the exception than the rule.
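Even without the participant-level data, the published counts allow a quick sanity check. The sketch below is a crude, unadjusted calculation from the four numbers quoted above. It is not the trial's pre-specified mixed model, so agreement with the reported relative risk is a plausibility check and not a reproduction. It also computes the absolute risk difference discussed in the footnote, with a simple Wald interval; the function name is illustrative.

```python
import math


def two_arm_summary(events_t, n_t, events_c, n_c, z=1.96):
    """Crude risks, risk ratio, risk difference (Wald CI), and NNT from counts."""
    risk_t = events_t / n_t
    risk_c = events_c / n_c
    rd = risk_t - risk_c
    se = math.sqrt(risk_t * (1 - risk_t) / n_t + risk_c * (1 - risk_c) / n_c)
    return {
        "risk_treated": risk_t,
        "risk_control": risk_c,
        "risk_ratio": risk_t / risk_c,
        "risk_difference": rd,
        "rd_ci": (rd - z * se, rd + z * se),
        # Number needed to treat; undefined when the risks are equal.
        "nnt": math.inf if rd == 0 else 1 / abs(rd),
    }


summary = two_arm_summary(251, 845, 297, 846)
for name, value in summary.items():
    print(name, value)
```

The crude risks round to 29.7% and 35.1%, the crude ratio rounds to 0.85, and the risk difference is a little over five percentage points, which corresponds to a number needed to treat between 18 and 19.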

Well-meaning false positives

Placenta accreta spectrum—a dangerous condition in which the placenta grows into the wall of the uterus—was a pre-specified subgroup of this trial and part of its claim to study a high-risk population. The submitted manuscript’s bound protocol disclosed, in its version history, that the team had revised the accreta rate after the trial: an initial 39.0% drawn from administrative inpatient coding was replaced, after independent re-review of surgical and pathology records by two senior obstetricians with a third as arbiter, by a confirmed rate of 17.9%—reversing 358 of 661 original diagnoses. The re-adjudication moved toward accuracy, since administrative coding is known to overstate accreta. The published methods describe the re-review and report the corrected 17.9% rate; the full magnitude of the reversal appears only in the protocol version history bound into the submission.
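The figures in that disclosure can be checked against one another in the same mechanical spirit. The sketch below asks whether any single denominator is compatible with both rounded percentages. It assumes the confirmed count is simply 661 minus 358, meaning no cases were newly added on re-review, and it searches an arbitrary range of candidate denominators, since the denominator itself is not stated in this note.

```python
def compatible_denominators(count, reported_percent, candidates, decimals=1):
    """Denominators for which count/denominator rounds to the reported percentage."""
    return [
        n for n in candidates
        if round(100 * count / n, decimals) == reported_percent
    ]


original_diagnoses = 661
reversed_diagnoses = 358
# Assumption: no new accreta cases were added during re-review.
confirmed = original_diagnoses - reversed_diagnoses

candidates = range(1600, 1801)  # assumed search range
fits_initial = set(compatible_denominators(original_diagnoses, 39.0, candidates))
fits_confirmed = set(compatible_denominators(confirmed, 17.9, candidates))
shared = sorted(fits_initial & fits_confirmed)
print(confirmed, shared)
```

For the figures quoted above, a handful of denominators just under 1,700 satisfy both percentages, so the counts and the rates are mutually consistent. A check like this cannot say whether the re-adjudication was sound, only that its arithmetic hangs together.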

The AI panel flagged it anyway. The AI statistician noted that a pre-specified subgroup had been “changed after the trial and in response to peer-review feedback” and asked for results under both classifications; the clinical reviewers asked routine transparency questions (e.g., were the adjudicators blinded to treatment arm, were cases not initially coded as accreta also re-checked). None of the three human reviewers raised it. My hunch is that they saw the same disclosure and judged it to be a documented, sensible diagnostic re-review not worth flagging.

That is the difference between the two panels in miniature. The machine reads exhaustively and surfaces everything the documents hold—a reversed subgroup definition, a citation that overstates its source, a confidence interval that cannot be right. This will inevitably generate false positives, but an academic editor could retain or reject the flags at the desk review stage. As a tired reviewer who has read manuscripts clearly not ready for publication, wondering why the paper was even sent out for review, I would appreciate some level of automated pre-screening. AI can catch fabricated references, flag implausible values, and surface potential protocol deviations so we can focus on evaluating how well the study answers the given research question.

Key Takeaways

This is a single case, so keep in mind the associated limitations. Here’s what I think it showcases:

  1. AI review with today’s frontier models can complement, not replace, human peers. Its most reliable strengths were the tireless and the exhaustive: checking the arithmetic of every table, opening every cited paper to verify what it says, reading the protocol’s version history line by line. The AI panel was thorough on these details partly because it was prompted to be. Those are exactly the contributions human referees are least likely to finish, and used well they free a human reviewer to spend attention where it counts.

  2. The human reviewers’ distinctive contribution was judgment: whether a conclusion should be moderated when the effect fell short of the trial’s own threshold; whether telephone follow-up invites recall bias; what a result means for the women in the trial. But don’t bet against AI eventually approaching human-level peer review in this respect.

  3. False positives come with AI review. The AI panel raised more than the humans did—every table inconsistency, every shaky citation, the reversed subgroup definition. Most of it was real, and yet not decisive in this case: the arithmetic errors were fixed before print, the reclassification was disclosed and sound. That kind of AI recall is genuinely useful, but it comes with noise. I can imagine other cases where a dogged AI reviewer would surface an undisclosed protocol change that human reviewers could easily miss. And if I’m being honest, as an author reading referee reports of my own work, I always think 75% of reviewer comments are false positives. At least the AI reviewers can be told to be respectful.

  4. Open peer review is what made all of this possible. The comparison and the reanalysis depend entirely on The BMJ publishing its reviews, its decision letters, and both versions of the manuscript. Without that openness there is no way to set a human panel and a machine panel side by side, or to check either against the trial’s own data. Whatever role AI ends up playing in peer review, the case for conducting it in the open only grows.

What I would do if I ran a journal today. I would build this pre-screening at the journal level rather than leave it to individual reviewers running their own tools. A single automated pass over each submission—the arithmetic, the citations, the protocol and statistical plan against what the paper reports—would go back to the authors as questions before the manuscript ever reached a reviewer, sparing reviewers a round of work on problems the authors can fix first. For a journal without a dedicated statistical reviewer, a pass like this could raise many of the questions a statistician would, without standing in for one. Two conditions would be non-negotiable: the screen raises questions, it does not desk-reject on its own; and because submissions are confidential, any tool would have to run under terms that keep unpublished manuscripts out of model training.

Footnotes

  1. The pre-specified item the paper dropped from its reporting is the absolute risk difference. The analysis plan called for the primary outcome to be reported as a rate difference with a confidence interval, and the published table left that cell blank; the paper reports a number-needed-to-treat instead. The difference is about five percentage points (35.1% versus 29.7%).