As the number of scientific papers has grown, it has become increasingly difficult for journal editors to find qualified peer reviewers. Most editors today have had the experience of contacting 10 or even 20 potential reviewers in order to obtain the 2-3 reviews usually required by journals, which means that the final reviewers often are not the world’s top experts on the topic. This leads to an obvious question: Could large language models be used to effectively review papers?
For the last few weeks I’ve been using the latest Google Gemini 2.5 Pro for various scientific tasks, and I’ve found it to be substantially better than the other frontier models I’ve worked with in the quality of its responses. (Note that I’ve not yet tried the new Gemini for coding - all of my coding work is done either in Cursor using Claude 3.7 Sonnet or in the shell using Claude Code.) I’ve primarily been using it to help me understand topics in AI/machine learning and philosophy, and I have found it to give invariably good answers that are consistent with my understanding on topics where I am knowledgeable. So I decided to test out the Gemini model as a peer reviewer. I started by giving it a preprint that I had recently reviewed for a journal. Its review was thorough and caught many of the same issues that I had noticed, with only minor discrepancies.
How incisive are LLM-based reviews?
These early explorations led me to ask whether Gemini 2.5 Pro could identify subtle problems with a paper. I tried this on a paper that I was given to read for a workshop that I was participating in. The paper presented results from a study that looked at how knowledge of one’s APOE genotype (a risk factor for Alzheimer’s disease) relates to the likelihood of purchasing long-term care insurance. The results showed that people with at least one e4 allele were 2.3 times more likely to change their long-term care insurance compared to those with two e3 alleles (the baseline group). However, the paper failed to mention a problematic result in the same table: those with at least one e2 allele and no e4 alleles (who are at lower risk for Alzheimer’s disease) also changed their insurance more often than the baseline group (1.55 times), though not statistically significantly. In addition, there is a critical problem with the outcome measure: a change in one’s insurance coverage doesn’t tell us whether coverage was increased or decreased, which is central to the claims made in the paper regarding genetic testing and adverse selection.
I first gave the paper to Gemini 2.5 Pro with the following prompt:
You are an expert in health policy. Please provide a scholarly review of the attached paper. Carefully scrutinize the claims made in the paper and assess how well they are supported by the evidence reported in the paper.
It provided a review that seemed very clear and thoughtful, but failed to identify the issues I noted in my reading.
This paper, "Genetic Testing For Alzheimer's And Long-Term Care Insurance," published in Health Affairs in 2010, provides a timely and insightful analysis of the potential interplay between genetic information, specifically APOE genotyping for Alzheimer's disease (AD) risk, and the long-term care (LTC) insurance market. The authors, a multidisciplinary team, bring diverse expertise to this complex issue.
Summary of Key Arguments and Findings:
The central thesis is that the APOE e4 allele, a known genetic risk factor for AD, also signals an increased likelihood of needing LTC. The paper posits that individuals learning they carry this risk variant are more likely to purchase or increase LTC insurance, potentially leading to adverse selection. It explores the implications of this for consumers, insurers, and policymakers, particularly because the Genetic Information Nondiscrimination Act (GINA) of 2008 did not extend its protections to LTC, disability, or life insurance.
The paper supports its claims with empirical evidence from two main sources:
Piedmont Health Survey of the Elderly (PHSE): Analysis of this population-based cohort (n=2,089) aimed to establish APOE's predictive value for actual nursing home admission.
REVEAL II Study: This clinical trial (n=276) examined behavioral changes, specifically in LTC insurance purchasing, among first-degree relatives of AD patients after they received their APOE genotype information.
Based on these findings and established market principles, the authors then model four scenarios of information asymmetry/symmetry regarding APOE status between individuals and insurers, and discuss potential market responses and policy implications, particularly concerning the concept of "fair" premiums.
Scrutiny of Claims and Supporting Evidence:
Claim: APOE e4 status predicts increased likelihood of needing LTC (specifically, nursing home admission).
Evidence: The PHSE data (Exhibit 1) shows that individuals with at least one e4 allele had an odds ratio (OR) of 1.48 (95% CI: 1.09-2.01) for nursing home admission compared to those with two e3 alleles, controlling for age, sex, marital status, and race. The corresponding probability of nursing home admission was 0.127 for the e4 group versus 0.101 for the e3/e3 group.
Assessment: The evidence supports this claim. The statistically significant OR, while modest, indicates an increased risk. The paper also correctly links this to AD, citing other research (Rotterdam Study, Exhibit 1) showing a much stronger association between e4 and AD development (OR 4.6). Given that a high percentage of individuals with AD eventually require nursing home care (cited as 75%), the link between APOE e4, AD, and LTC need is plausibly established. The absolute probabilities are low, but the relative risk is present.
Claim: Individuals who learn they have the APOE e4 variant are more likely to purchase or increase LTC insurance (demonstrating adverse selection).
Evidence: The REVEAL II study data (Exhibit 1) found that participants informed they had at least one e4 allele were 2.31 times more likely (95% CI: 1.11-4.81) to report changes in (or plans to change) their LTC insurance compared to those with two e3 alleles. The probability of changing LTC insurance was 0.237 for the e4 group versus 0.087 for the e3/e3 group.
Assessment: This claim is well-supported by the REVEAL II data. The OR is statistically significant and indicates a clear behavioral response to receiving risk information. This finding is crucial as it provides empirical backing for the concern about adverse selection if consumers have genetic information that insurers do not. A potential caveat, inherent in the REVEAL study design, is that participants were first-degree relatives of AD patients and thus might be more health-conscious or concerned about AD than the general population. However, for demonstrating the principle of adverse selection based on genetic information, the study is effective.
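For readers less familiar with how odds ratios relate to the group probabilities reported above, a crude conversion can be sketched as follows. Note that the function names here are my own, and that the paper's estimates are covariate-adjusted, so this unadjusted calculation will not exactly reproduce the reported group probabilities:

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)

def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability in the comparison group implied by applying a crude
    (unadjusted) odds ratio to the baseline group's probability."""
    return odds_to_prob(prob_to_odds(baseline_prob) * odds_ratio)

# Illustration with the REVEAL II baseline (two e3 alleles, p = 0.087)
# and the reported e4 odds ratio of 2.31. The crude result (about 0.18)
# differs from the reported adjusted probability of 0.237, as expected
# for a covariate-adjusted estimate.
print(round(apply_odds_ratio(0.087, 2.31), 3))
```

The gap between the crude and reported probabilities is a reminder that adjusted odds ratios cannot simply be multiplied onto baseline risks.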
Claim: The LTC insurance market could be significantly affected by the availability of APOE genetic testing, depending on who holds the information.
Evidence: This claim is primarily supported by logical deduction based on the empirical findings above and established insurance market principles, outlined in Exhibit 2 and the subsequent discussion of four scenarios (Status Quo; Only Consumers Know; Only Insurers Know; Full Information).
Assessment: The scenarios are well-reasoned and highlight the potential consequences of different informational states.
Status Quo (Neither knows APOE): Plausibly leads to insurers pricing for uncertainty, contributing to high premiums and low uptake.
Only Consumers Know: The REVEAL II data directly supports the likelihood of adverse selection, potentially destabilizing the market or harming low-risk individuals.
Only Insurers Know: Raises concerns about "adverse underwriting" and genetic discrimination, which could deter uptake even if premiums are actuarially "fair" from the insurer's perspective.
Full Information: Theoretically allows for actuarially fair premiums, but raises profound ethical questions about genetic determinism and fairness, as the paper rightly points out (genes are not a choice like smoking).
The analysis of these scenarios is a strength of the paper, providing a clear framework for policy discussion.
Claim: Policymakers will need to confront the issue of using genetic markers in LTC insurance pricing, balancing actuarial fairness with societal notions of fairness.
Evidence: This is a concluding policy argument based on the entire analysis. The paper highlights the GINA gap for LTC insurance and the tension between different definitions of "fair."
Assessment: This is a well-justified call to attention. The paper effectively lays out the complexities and the inevitable policy choices that arise with advancing genetic technology. The discussion of the CLASS Act (then under consideration) illustrates a potential public-sector response to these challenges.
Strengths of the Paper:
Timeliness and Relevance: Published shortly after GINA, it addresses a critical gap in the legislation and an emerging technological capability (DTC genetic testing was gaining traction).
Empirical Grounding: Uses data from two distinct studies to support its core premises about risk prediction and behavioral response, rather than relying solely on speculation.
Clear Framework: The four scenarios provide a structured way to understand the implications of genetic information in the market.
Policy Focus: Directly engages with policy implications and the concept of "fairness," which is central to debates about genetic information.
Multidisciplinary Perspective: The authors' diverse backgrounds likely contribute to the nuanced consideration of medical, ethical, economic, and policy aspects.
Potential Limitations (viewed from a 2024 perspective):
Evolution of the Market: The LTC insurance market has continued to evolve, with some insurers exiting and product designs changing. The DTC genetic testing landscape is also vastly different.
Predictive Power of APOE: While APOE e4 is a significant risk factor, it is not deterministic for AD, nor is AD the only reason for needing LTC. The paper acknowledges this by controlling for other factors in the PHSE analysis, but the broader context of multifactorial risk for LTC is important.
CLASS Act: The CLASS Act was ultimately repealed, so the specific policy example became moot, though the underlying challenges it sought to address remain.
Consumer Behavior Complexity: While REVEAL II showed a response, the actual uptake of genetic testing for AD risk and subsequent insurance decisions in the general population (outside of research or high-risk families) might be influenced by many other factors not explored in depth, such as cost of testing, genetic literacy, and perceived utility of the information.
Overall Assessment:
The paper makes a convincing case that APOE genotype information has the potential to significantly impact the LTC insurance market. Its claims regarding APOE as a risk predictor for LTC need and the likelihood of adverse selection based on this information are well-supported by the evidence presented from the PHSE and REVEAL II studies, respectively. The exploration of market scenarios is logical and provides a valuable framework for policy considerations.
The authors successfully highlight the tension between actuarial fairness and broader societal concerns about genetic discrimination, a dilemma that remains highly relevant. While some specific market conditions and policy proposals have changed since 2010, the fundamental issues raised by this paper concerning genetic information and insurance continue to be pertinent. It stands as a solid piece of scholarship that contributed meaningfully to the understanding of a complex emerging issue at the intersection of genetics, health policy, and insurance economics.
This certainly seems to be a level of review quality that most editors would appreciate, but it fails to identify the two issues that I noted. I pushed it a bit harder, first with this prompt:
you missed an important issue with this paper. Please examine it more closely to identify any ways that the evidence may not support the result.
This was not sufficient - it went on to discuss other minor issues (such as the magnitude of the effect) but did not notice the issues I noted. The following prompt was more successful:
there is a data point in the paper that is inconsistent with the claim. what is it?
This led it to the right answer:
You are absolutely correct to push for a more granular examination. The data point that appears inconsistent, or at least complicates the primary narrative that increased perceived risk drives insurance purchase, is found in Exhibit 1, under the REVEAL II study results for changing long-term care insurance:
Individuals with "At least one e2 trait and no e4 trait" (considered lower risk for Alzheimer's Disease) had an Odds Ratio (OR) of 1.55 [95% CI: 0.43-5.60] for changing or planning to change their long-term care insurance, with a probability of 0.149.
This is compared to the reference group, "Two e3 traits" (average risk), which had an OR of 1.00 and a probability of 0.087.
Why this is inconsistent or problematic for the paper's simpler claim:
Challenges the Simple Risk-Driven Model: If even "good news" (from an AD risk perspective) leads to an increased propensity (based on the point estimate) to seek LTC insurance, it suggests that the act of going through genetic testing and counseling for AD risk might itself make LTC planning more salient and prompt action, somewhat independently of the specific risk outcome for lower-risk individuals. This complicates a straightforward adverse selection model where only those identified as high-risk disproportionately seek coverage.
Not Highlighted or Discussed: The authors focus their textual discussion of the REVEAL II insurance findings exclusively on the e4 group: "From REVEAL II data, we find that consumers who discover that they have at least one e4 trait are 2.3 times more likely than those who had two e3 traits to increase their long-term care insurance holdings..." (page 104). They do not discuss the e2 group's insurance behavior from REVEAL II in the text.
Important Caveat (which the authors might argue):
The confidence interval for the e2 group's OR is very wide (0.43-5.60) and includes 1.0. This means the observed OR of 1.55 is not statistically significant. Therefore, one cannot definitively conclude from this data that e2 carriers are truly more likely to change their insurance.
However, the point estimate itself (1.55) is what is inconsistent with the expectation derived from a simple risk-reduction perspective. If the study had been larger and this effect had become statistically significant, it would have significantly nuanced the paper's main argument about how genetic risk information influences LTC insurance decisions, suggesting a more complex behavioral response than just high-risk individuals seeking more coverage. The fact that the point estimate leans in this counterintuitive direction, even if not statistically significant, is a subtlety worth noting.
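The significance reasoning in the model's answer boils down to a simple check: a 95% confidence interval for an odds ratio that contains 1.0 is not statistically significant at that level. A minimal sketch (function name is my own):

```python
def excludes_null(ci, null_value=1.0):
    """True if a confidence interval for an odds ratio excludes the
    null value of 1.0, i.e., the estimate is statistically significant
    at the interval's confidence level."""
    lower, upper = ci
    return not (lower <= null_value <= upper)

# The e4 group's 95% CI from REVEAL II excludes 1.0 (significant) ...
print(excludes_null((1.11, 4.81)))  # True
# ... but the e2 group's does not, despite its point estimate of 1.55.
print(excludes_null((0.43, 5.60)))  # False
```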
I pushed further on the other conceptual point:
There is also a problem with the main measure that is being reported here, which is the likelihood of changing coverage. What is the problem with that measure?
This was also successful:
You've hit on a crucial point. The problem with "likelihood of changing coverage" (or "report changes in long-term care insurance" or "increase their long-term care insurance holdings, or report planning to do so") as the main reported measure for consumer behavior in REVEAL II is its ambiguity and potential bidirectionality, especially when trying to demonstrate adverse selection.
Here's why it's problematic:
"Changing" is Not Necessarily "Increasing" or "Purchasing":
Adverse selection, in its classic sense in this context, implies that individuals who discover they are at higher risk are more likely to purchase insurance they didn't previously have, or increase the level of coverage they already possess. This is what drives up costs for the insurer if the risk isn't priced in.
The term "changing coverage" could theoretically include:
Increasing coverage or buying a new policy (supports adverse selection if done by high-risk individuals).
Decreasing coverage (e.g., if someone with an e2 allele feels less at risk and decides to reduce their existing policy).
Cancelling coverage (similar to decreasing).
Switching policies (which might or might not involve an increase in overall coverage).
Impact on Interpreting the e4 Result:
The paper states (page 104): "people who were told they had at least one e4 trait were more likely than those who did not have an e4 trait to report changes in long-term care insurance..." And later, "...are 2.3 times more likely...to increase their long-term care insurance holdings, or report planning to do so."
While the second phrasing is more specific ("increase...or report planning to do so"), the primary data in Exhibit 1 is titled "Odds ratio of changing long-term care insurance" and "Probability of changing long-term care insurance."
If "changing" for the e4 group predominantly means increasing coverage, then it supports the adverse selection argument. However, without a clear breakdown of the types of changes, we can't be certain that the entire effect is driven by increased demand. The phrasing in the abstract and some parts of the text is less precise.
Impact on Interpreting the e2 Result (Even if Not Statistically Significant):
As discussed previously, the e2 (lower AD risk) group had a point estimate OR of 1.55 for "changing" coverage, with a probability of 0.149.
If these e2 individuals were "changing" their coverage by reducing it or deciding against a purchase they were mulling, then this higher likelihood of "changing" would be a rational response to learning about lower risk.
However, if the interpretation is that any "change" reported (including by e2s) is an indicator of increased market activity spurred by genetic testing, it muddies the water regarding the specific mechanism of adverse selection (which is about high-risk individuals seeking more coverage).
"Planning to Do So" is Not Actual Behavior:
The more specific phrasing "increase their long-term care insurance holdings, or report planning to do so" includes an intention ("planning to do so"). There's often a gap between intention and actual behavior. Relying on "planning" can overestimate the true behavioral impact. The actual market impact (and thus the true extent of adverse selection) depends on actual purchases and increases in coverage, not just plans.
In essence, the main problem is that "likelihood of changing coverage" is not a sufficiently precise or unidirectional measure to unequivocally demonstrate adverse selection.
To robustly support the claim of adverse selection, the ideal measure would have been:
"Likelihood of newly purchasing LTC insurance."
"Likelihood of increasing the benefit amount of existing LTC insurance."
Without this specificity, we are left to infer that "changing" primarily means actions that increase the insurer's liability for the high-risk group. While plausible, the measure itself is not as clean as it could be and allows for alternative interpretations or confounding factors within the "change." The paper's narrative strongly pushes the adverse selection interpretation for the e4 group, but the ambiguity of the reported outcome measure in Exhibit 1 slightly weakens the directness of this evidence.
This example shows that the Gemini 2.5 Pro model is quite good at reviewing scientific papers, especially with a human in the loop. Its initial review was passable but didn’t catch a couple of important issues with the paper. With a bit of pushing, however, it was able to identify the important issues that the human reviewer (me) had identified.
In future installments of this series I will examine how the prompt matters for effective reviewing, and also examine the use of multiple models to generate AI-generated meta-reviews. Interested to hear your thoughts, and your experiences with other papers and/or other models!
I don’t mind the idea of LLMs somewhere in the review process, but I think it’s wrong to describe your example as a human in the loop, or even a loop at all. The way you describe it, you identified some issues, and then prompted and prompted until Gemini repeated them back to you. I take it a loop would mean that you improve on its response or it on yours, as opposed to one of you identifying the right response and then coaxing the other to repeat it. That doesn’t mean LLMs shouldn’t be used in these cases, but as long as we’re relying on the human to get the right answer and then bring the LLM into agreement through leading questions, I’m not sure I buy that LLMs are quite good at reviewing scientific papers, or add much value for reviewers (yet).
Really enjoying the book so far, by the way. And having it released in substack-sized chunks has been really effective, for me at least.