Prompting strategies on two large language models improved how the artificial intelligence (AI) interpreted pain and fatigue reported by survivors of childhood cancers for better symptom monitoring and care, according to findings published in Communications Medicine.
The study authors noted that the findings support using carefully designed prompts in chatbots to enable a more context-aware analysis of symptom narratives, and offer a scalable approach to improving support for symptom monitoring.
“About 40%-60% of a clinical encounter is a patient talking to their physician about symptoms and related health experiences,” said corresponding author I-Chan Huang, PhD, Department of Epidemiology & Cancer Control, St. Jude Children's Research Hospital. “We have provided a proof of concept that large language models could help analyze that underutilized conversational data to detect symptom severity and its functional impact and assist physician decision-making to provide better care to survivors.”
Background and Study Methods
The study authors noted that automated tools are needed to better understand and evaluate how symptoms impact daily functioning for survivors of childhood cancers.
Researchers evaluated how ChatGPT-4o and Llama-3.1 performed at classifying self-reported pain and fatigue from childhood cancer survivors under different prompting strategies. They analyzed semi-structured interviews with 30 childhood cancer survivors aged 8 to 17 years and their caregivers, yielding a total of 819 narratives relating to pain and fatigue. Experts annotated each narrative for physical, social, or cognitive functional impact to create the reference standard against which the models were evaluated.
Four prompting strategies were used: zero-shot (which gives the large language model no examples), few-shot (which gives the model a small number of labeled examples), step-by-step reasoning (chain-of-thought), and generated knowledge (which first prompts the model to produce relevant background information before classifying). The outputs from each strategy were compared with the reference standard annotations.
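The four strategies differ mainly in how the prompt is constructed before it is sent to the model. A minimal sketch of what such prompt templates might look like is below; the task wording, example narratives, and labels are all hypothetical illustrations, not taken from the study:

```python
# Illustrative prompt templates for the four prompting strategies.
# The narrative, task wording, and few-shot examples are hypothetical.

NARRATIVE = "My legs hurt so much after treatment that I stopped playing soccer."
TASK = ("Classify the functional impact of this symptom narrative "
        "as physical, social, or cognitive.")

def zero_shot(narrative: str) -> str:
    # No examples: the model sees only the task and the narrative.
    return f"{TASK}\n\nNarrative: {narrative}\nAnswer:"

def few_shot(narrative: str) -> str:
    # A handful of labeled examples precede the new narrative.
    examples = (
        "Narrative: I was too tired to finish my homework.\nAnswer: cognitive\n\n"
        "Narrative: I skipped my friend's party because of the pain.\nAnswer: social\n\n"
    )
    return f"{TASK}\n\n{examples}Narrative: {narrative}\nAnswer:"

def chain_of_thought(narrative: str) -> str:
    # Ask the model to reason step by step before giving a label.
    return (f"{TASK}\n\nNarrative: {narrative}\n"
            "Think step by step about which daily activities are affected, "
            "then give the final label.\nAnswer:")

def generated_knowledge(narrative: str) -> str:
    # First elicit relevant background knowledge, then classify using it.
    return ("First, list relevant facts about how pain and fatigue affect "
            "children's daily functioning. Then use those facts to answer.\n\n"
            f"{TASK}\n\nNarrative: {narrative}\nAnswer:")

prompt = few_shot(NARRATIVE)
```

Each function returns a complete prompt string; in practice the string would be sent to the model's chat API and the returned label compared against the expert annotation.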
Key Findings
The researchers found that generated knowledge and step-by-step reasoning outperformed both zero-shot and few-shot prompting in ChatGPT-4o and Llama-3.1 alike, across physical, social, and cognitive functional impact classifications.
“We found that simple prompts were not effective,” Dr. Huang said. “However, our more sophisticated prompting strategies performed significantly better and had a higher concurrence with our human reviewers.”
ChatGPT-4o achieved a more balanced precision and discrimination across all three areas of functional impact, while Llama-3.1 demonstrated higher sensitivity but significantly lower precision, especially for physical and social functioning.
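The trade-off described above, higher sensitivity at the cost of lower precision, falls out directly from confusion-matrix counts. A small sketch with invented numbers (not the study's actual results) shows how a model that labels narratives more aggressively catches more true cases but mislabels more along the way:

```python
# Precision vs. sensitivity from confusion-matrix counts.
# All counts below are hypothetical, not from the study.

def precision(tp: int, fp: int) -> float:
    # Of the narratives the model flagged, how many were truly positive?
    return tp / (tp + fp)

def sensitivity(tp: int, fn: int) -> float:
    # Of the truly positive narratives, how many did the model catch?
    return tp / (tp + fn)

# An "aggressive" classifier flags many narratives: few misses, many false alarms.
aggressive = dict(tp=45, fp=30, fn=5)
# A "conservative" classifier flags fewer: more misses, fewer false alarms.
conservative = dict(tp=35, fp=5, fn=15)

print(sensitivity(aggressive["tp"], aggressive["fn"]))      # 0.9
print(precision(aggressive["tp"], aggressive["fp"]))        # 0.6
print(sensitivity(conservative["tp"], conservative["fn"]))  # 0.7
print(precision(conservative["tp"], conservative["fp"]))    # 0.875
```

In this hypothetical, the aggressive model mirrors the Llama-3.1 pattern (high sensitivity, low precision) while the conservative one is closer to the balance reported for ChatGPT-4o.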
“These AI-driven approaches provide us with a new way to unlock the complex symptom information hidden in the wealth of patient-physician conversations that currently go unused,” Dr. Huang said. “By making this information easier to capture and analyze, we can help physicians better identify survivors who need additional support in real time and improve care for this growing population.”
DISCLOSURES: The study was supported by grants from the National Cancer Institute, Cancer Center Support, and the American Lebanese Syrian Associated Charities, the fundraising and awareness organization of St. Jude. For full disclosures of the study authors, visit nature.com.