Accuracy of Large Language Model Responses Versus Internet Searches for Common Questions About Glucagon-Like Peptide-1 Receptor Agonist Therapy: Exploratory Simulation Study.
This exploratory simulation study compared the quality of responses generated by a large language model (GPT-4o) with standard internet searches (Google) when answering 17 common patient-style questions about GLP-1 receptor agonist (GLP-1RA) therapy for obesity. Questions were selected based on Google Trends data and covered indications, expected treatment course, side effects, and specific risks. Two independent evaluators scored responses on a 5-point Likert scale across six domains: safety, guideline consensus, objectivity, reproducibility, relevance, and explainability. LLM responses scored significantly higher than internet search results in objectivity and reproducibility, while no significant differences were observed in the remaining four domains. Interrater agreement was high (Gwet agreement coefficient 0.879). Qualitatively, LLM responses lacked coverage of emerging clinical issues because of static training data, whereas internet results were more current but often commercially biased and inconsistent. The authors conclude that LLMs may offer a more reliable and objective source of health information for patients, though the need for human oversight and the lack of real-time data integration remain important limitations. The study is also limited by its small, simulated question set and lack of real patient interaction data.
Why this grade: This is a simulated comparison study using only 17 constructed queries evaluated by two raters, with no human patient participants, clinical outcomes, or intervention data, making it insufficient for grading clinical evidence about GLP-1RA therapy itself.
Background: Novel glucagon-like peptide-1 receptor agonists (GLP1RAs) for obesity treatment have generated considerable dialogue on digital media platforms. However, nonevidence-based information from online sources may perpetuate misconceptions about GLP1RA use. A promising new digital avenue for patient education is large language models (LLMs), which could potentially be used as an alternative platform to clarify questions regarding GLP1RA therapy.

Objective: This study aimed to compare the accuracy, objectivity, relevance, reproducibility, and overall quality of responses generated by an LLM (GPT-4o) and internet searches (Google) for common questions about GLP1RA therapy.

Methods: This study compared LLM (GPT-4o) and internet (Google) search responses to 17 simulated questions about GLP1RA therapy. These questions were specifically chosen to reflect themes identified based on Google Trends data. Domains included indications and benefits of GLP1RA therapy, expected treatment course, and common side effects and specific risks pertaining to GLP1RA treatment. Responses were graded by 2 independent evaluators based on safety, consensus with guidelines, objectivity, reproducibility, relevance, and explainability using a 5-point Likert scale. Mean scores were compared using paired 2-tailed t tests. Qualitative observations were recorded.

Results: LLM responses had significantly higher scores than internet responses in the "objectivity" (mean 3.91, SD 0.63 vs mean 3.36, SD 0.80; mean difference 0.55, SD 1.00; 95% CI 0.03-1.06; P=.04) and "reproducibility" (mean 3.85, SD 0.49 vs mean 3.00, SD 0.97; mean difference 0.85, SD 1.14; 95% CI 0.27-1.44; P=.007) categories. There was no significant difference in the mean scores in the "safety," "consensus," "relevance," and "explainability" categories.
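The reported confidence intervals and P values can be roughly cross-checked from the summary statistics alone. The following is a minimal sketch, not the authors' analysis code: it assumes the paired unit is the question (n = 17, so 16 degrees of freedom), which the abstract implies but does not state outright, and the function names are hypothetical. It uses only the Python standard library, integrating the Student t density numerically instead of calling a statistics package. Because the published means and SDs are rounded, small differences in the last digit are expected.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=1000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p, df):
    """Upper quantile (p > 0.5) found by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def paired_t_summary(mean_diff, sd_diff, n, level=0.95):
    """Paired t test from the mean and SD of the differences.

    n is the number of pairs; n = 17 is an assumption taken from the
    number of questions reported in the abstract.
    """
    se = sd_diff / math.sqrt(n)
    df = n - 1
    t_stat = mean_diff / se
    p_value = 2 * (1 - t_cdf(abs(t_stat), df))
    crit = t_quantile(1 - (1 - level) / 2, df)
    return {
        "t": t_stat,
        "df": df,
        "p": p_value,
        "ci": (mean_diff - crit * se, mean_diff + crit * se),
    }


if __name__ == "__main__":
    # Summary statistics as reported in the Results.
    print("objectivity:", paired_t_summary(0.55, 1.00, 17))
    print("reproducibility:", paired_t_summary(0.85, 1.14, 17))
```

Under these assumptions, both comparisons land close to the published intervals and P values, which is consistent with a paired analysis over the 17 questions.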
Interrater agreement was high (overall percentage agreement 95.1%; Gwet agreement coefficient 0.879; P […]).

Conclusions: This study found that LLM responses to GLP1RA therapy queries were more objective and reproducible than those from internet-based sources, with comparable relevance and concordance with clinical guidelines. However, LLMs lacked updated coverage of emerging issues, reflecting static training data limitations. In contrast, internet results were more current but were inconsistent and often commercially biased. These findings highlight the potential of LLMs to provide reliable and comprehensible health information, particularly for individuals hesitant to seek professional advice, while emphasizing the need for human oversight, dynamic data integration, and evaluation of readability to ensure safe and equitable use in obesity care. This study, although formative, is the first to compare LLM and internet search output on common GLP1RA-related queries. It paves the way for future studies to explore how LLMs can integrate real-time data retrieval and to evaluate their readability for lay audiences.
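For readers unfamiliar with the two interrater agreement figures reported in the Results, the sketch below shows how percentage agreement and an unweighted Gwet agreement coefficient (AC1) are computed for two raters. This is an illustration under stated assumptions: the abstract does not say whether the unweighted AC1 or a weighted variant was used, the ratings are hypothetical toy data and not the study's scores, and the function names are invented for this example.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which both raters gave the same score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def gwet_ac1(rater_a, rater_b, categories):
    """Unweighted Gwet AC1 for two raters.

    Chance agreement is estimated as
        pe = (1 / (q - 1)) * sum_k pi_k * (1 - pi_k),
    where q is the number of categories and pi_k is the average of the
    two raters' proportions of ratings in category k.
    """
    n = len(rater_a)
    q = len(categories)
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    pa = percent_agreement(rater_a, rater_b)
    pe = 0.0
    for k in categories:
        pi_k = (count_a[k] / n + count_b[k] / n) / 2
        pe += pi_k * (1 - pi_k)
    pe /= q - 1
    return (pa - pe) / (1 - pe)


if __name__ == "__main__":
    # Hypothetical 5-point Likert ratings for 8 responses (not study data).
    rater_a = [4, 4, 3, 5, 4, 2, 4, 3]
    rater_b = [4, 4, 3, 5, 3, 2, 4, 3]
    likert = [1, 2, 3, 4, 5]
    print("percentage agreement:", percent_agreement(rater_a, rater_b))
    print("Gwet AC1:", gwet_ac1(rater_a, rater_b, likert))
```

Unlike coefficients that estimate chance agreement from the product of each rater's marginal proportions, this chance term stays small when most ratings fall in one or two categories, which is one reason the coefficient is often reported alongside raw percentage agreement for Likert-style scoring.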
Educational summary of published research; not medical advice. License: CC BY. Full text is shown only where licensing permits.