I tried using ChatGPT to auto-generate OpenMRS patient summaries. The results: It’s not there yet. ⛔️

A succinct, easy-to-skim Patient Summary is a powerful tool. This is a key takeaway from every health job I’ve ever had. As a nurse, I saw firsthand that the average emergency patient has ~15 seconds before the doctor has formed an impression; and in digital health, many healthcare providers mention the value of something like a “digital resident” that summarizes the patient’s story.

More recently, we observed this need for better patient summaries during our Site Visits in Nairobi this year at OMRS ‘24.

So, I wanted to rapidly experiment with an idea for using LLMs as “:robot:Robo-Residents” to make a short patient summary.

The Summary

Hypothesis: LLMs like ChatGPT may help condense the chaos of EMR patient chart data into a short, ~3-sentence summary if given training examples; however, I suspected the output could contain:

  • (1) lies (made-up data or inappropriate, misleading, unsafe assumptions written as fact),
  • (2) hyperfixations (unhelpful detail on specific pieces of data); and
  • (3) cost-heavy requirements (I suspected the cheaper mini model would not perform as well).

Method:

  1. I used a paid version of ChatGPT and tried the following process on 3 models: 4o, 4o-mini, and o1-preview.
  2. I created 9 sample patient summaries representing a few different disciplines (general OPD, Mental Health, NCD, and primary care).
  3. I then created a prompt which included these sample summaries for training purposes.
  4. I then copy-pasted into the model all the obs data (and only the obs data) for 2 example fake patients, taken from dev3 → Legacy Admin → Manage Observations, to see what the LLM would summarize for a particular demo-data patient.
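For anyone who wants to reproduce this through an API rather than the chat UI, here is a rough sketch of the same process using the OpenAI Python client. The prompt text, example summaries, and obs export below are placeholders standing in for what I pasted into the chat, so treat it as illustrative rather than exactly what I ran.

```python
# Rough sketch only: the chat-based experiment, re-expressed as an API call.
# PROMPT_TEXT, EXAMPLE_SUMMARIES, and the obs file are placeholders for the
# prompt, training examples, and copy-pasted obs data described in this post.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEXT = "For just this chat, you are MedGPT, an AI Resident Doctor. ..."
EXAMPLE_SUMMARIES = "Devan Modi is a 65 y/o patient who was diagnosed with HTN ..."
obs_text = open("george_anderson_obs.txt").read()  # export from Manage Observations

response = client.chat.completions.create(
    model="gpt-4o",  # also tried gpt-4o-mini; o1-preview does not accept a system
                     # message, so its prompt must be folded into the user message
    messages=[
        {"role": "system",
         "content": PROMPT_TEXT + "\n\nExamples:\n" + EXAMPLE_SUMMARIES},
        {"role": "user", "content": "Summarize this patient:\n" + obs_text},
    ],
)
print(response.choices[0].message.content)
```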

Results:

  • Overall, all 3 of my hypothesized concerns were correct. The most concerning issue was the model’s insistence on making up unspecified data such as age or diagnosis, despite direct instructions not to make inferences.
  • Lies: I changed the prompt after I kept finding the model was making up ages for the patients, and I realized all my sample summaries had a patient age in them (“Mr. Name is XX years old and…”). I also added a specific line in the prompt “VERY IMPORTANT: You must not include information that cannot be determined from the data provided. Peoples’ lives and safety depends on this!” And yet, the model still added inferred data that was NOT actually in the original Obs data.
  • Cost: 4o-mini’s summary was less clinically clear than the full 4o model’s. If an implementer were going to use an LLM provider’s API service, the non-mini models would cost substantially more per query.
  • Hyperfixation and actual Clinical Value: In my clinical opinion, the summaries did not quite reach the level of clinical quality and storytelling I was looking for.

What this means for us:

  1. If any Implementer wants to use an LLM for patient summary generation, my extremely cautious thoughts would be:

    i. Explore Medical/Clinical-specific LLMs rather than general ones like ChatGPT.

    ii. Perform very, very, very careful QA, as the risk of data invention/hallucination/misleading inference is even higher than I suspected.

  2. These LLMs will keep getting better - I suspect we will see improvement before long, and it will be worth checking in again in the near future. But for now, I don’t think this is immediately worth pursuing by the Global Support Team, because it would require so much specialized attention and care that I think it is beyond our core mission. Thankfully, many companies and groups worldwide are working on LLMs for the purpose of medical summarization.

The Actual Results

Here are 2 sample patient summaries from our dummy data on dev3.

  • From the o1-preview model: George Anderson is a male patient diagnosed as HIV positive on September 4, 2024. He has experienced significant weight loss, decreasing from 86 kg in November 2023 to 69 kg in May 2024. Laboratory results indicate elevated white blood cells (11.0 × 10⁹/L) on September 4, 2024, suggesting a possible infection, and anemia with a hemoglobin level of 10.9 g/dL in May 2023. On June 21, 2024, he presented with a positive bacteriuria test and a fever of 38.0°C, indicating a likely urinary tract infection at that time. Previous notes highlight social concerns, including possible homelessness and food insecurity. His immunization history includes the Bacillus Calmette–Guérin (BCG) vaccine and oral polio vaccine administered in June 2024.

  • From the 4o model: George Anderson is a 48-year-old male recently diagnosed as HIV positive on September 4, 2024. His HIV status was confirmed on September 6, 2024, during a structured SOAP note evaluation. Despite the recent diagnosis, he has not reported any associated symptoms as of his last encounter. George’s past medical history includes multiple, unspecified issues (“Many Problems: 1 2 3 4”) and a positive urine bacteriuria test on June 21, 2024. He has received several immunizations, including the Bacillus Calmette-Guérin (BCG) vaccine and Vitamin A, administered in June 2024. Bloodwork from May 2023 shows elevated platelets (394) and slightly low hemoglobin (10.9). Notably, he had a body temperature of 38.0°C on June 21, 2024. George’s blood pressure, oxygen saturation, and weight have fluctuated across visits, with concerns about his living situation noted by a provider in May 2023, recommending he see a social worker for support.

The Prompt and Training Examples

Here is what I used as my prompt and training examples. My full notes and sources of inspiration for the summaries are here.

For just this chat, you are MedGPT, an AI Resident Doctor. Your goal is to take miscellaneous inputs from a patient file and summarize the patient’s history in about 100-150 words. Your mission is to (1) make it easy for human clinicians to quickly know the history of a patient, and (2) to ensure clinically important information about a patient is brought to the forefront rather than being buried in a chart history.

It is very important to not miss anything that may be of interest to a nurse, doctor, social worker, or other type of healthcare worker.

Patient summaries play a crucial role in improving healthcare efficiency and the continuity of care, as they provide an accurate picture of who the patient is and how they have been doing. A good patient summary should be a narrative that synthesizes the information, provides context, and alerts downstream clinicians about any follow-ups needed by the patient. There is great potential for generative AI technology to automate narrative summaries and save healthcare workers time.

Before I give you some patient data, here are some examples of the kind of summary we are looking for:

Examples:

Devan Modi is a 65 y/o patient who was diagnosed with HTN 25 years ago at age 40. Initially started on HCTZ 50mg daily. His BP was stable until 3 years ago, at age 62, when amlodipine 5mg daily was started. He’s been stable on those medications since then. He was also diagnosed with diabetes 5 years ago at age 60 and started on Metformin at that time. He was recently admitted to the hospital for PUD and was Dx’d with H pylori.

Romeo Hernandez is a 48 year old Hispanic male with a 2 month history of Rheumatoid Arthritis and a strong family history of autoimmune disease, who presents with a four month history of joint pain and weakness, a 2 month history of headaches with lightheadedness and nausea, and a one month history of subjective fever, and night sweats alongside symptoms of burning eyes and dry mouth, decreased RBC and Hemoglobin, and a CT with an enlarged parotid gland.

Ms Jane Doe is a female with a past medical history of hypertension (HTN), hyperlipidemia (HLD), migraines, gastroesophageal reflux (GERD), depression, psoriasis, left frontal arteriovenous malformation (AVM), an 8mm wide nicked anterior communicating artery aneurysm, and a fusiform 9mm ectasia of the left anterior cavernous carotid artery.

John Doe is a 44 year old man scheduled today for a routine physical examination. No medications are on file for John. His last annual physical was April 18 2023, where he was advised to continue his current exercise regimen, and a flu vaccination was offered and administered. A note was left by Dr. Smith in 2023 to review his exercise regimen at his next visit.

Mary Smith is a 59 year old woman last seen on September 21 2024, when she was diagnosed with Type 2 Diabetes Mellitus. She has been prescribed Insulin glargine, 40 units at bedtime. She was referred for telehealth check-ins to monitor blood sugar levels. She was instructed to modify her diet to lower carbohydrate intake and was advised to monitor her symptoms closely. Importantly, she has past records of hypoglycemia.

Lisa Green is a 37 year old woman last seen on August 18 2024 for knee surgery. She was prescribed Ibuprofen 400 mg every 8 hours as needed for pain. She was to receive Physical therapy twice a week for six weeks; Avoid strenuous activities; focus on gentle knee exercises; Wound check and removal of sutures, and to wear a knee brace at all times except during physical therapy. Her next follow-up is in two weeks to reassess progress.

Bob Harrison is a 70-year-old male with a history of diabetes who underwent a left total hip replacement on March 1, 2016, and has a chronic diabetic foot ulcer. After rehabilitation at a skilled nursing facility, he was referred for home health physical therapy and skilled nursing services starting on April 7, 2016. Currently, he is homebound due to unsteady gait and stairs, requires assistance with ambulation and transfers, and has a slow-healing ulcer on his left foot that was surgically debrided. The treatment plan includes physical therapy to improve mobility and balance, skilled nursing care for wound management and edema, and education for his caregiver on fall prevention and dressing changes.

Mr. Carl Smith is a man with a history of chronic homelessness and severe mental health conditions, including major depressive disorder, anxiety disorder, ADHD, intermittent explosive disorder, and a possible traumatic brain injury. He has experienced significant personal trauma, including abuse, parental neglect, early exposure to substance abuse, and the loss of both parents. Carl has struggled to maintain employment due to difficulties with anger management, concentration, and social interactions, often resulting in job loss. Physically, he suffers from degenerative disc disease and lumbar radiculopathy, leading to chronic back pain. Despite engaging in outpatient mental health treatment and receiving housing support, he continues to face significant challenges in daily functioning and has never been able to sustain gainful employment.

Alice Green is a 47-year-old woman with bipolar disorder and major depression who has experienced chronic homelessness, substance abuse, and severe mental health symptoms such as anxiety, racing thoughts, and social isolation. Her traumatic history includes childhood abuse and a violent assault, significantly impacting her ability to maintain employment and stable housing. She has been prescribed various medications over the years to manage her mental health, including Seroquel, Haldol, Depakote, Zoloft, Neurontin, Lithium, and Prozac. Currently, she is taking Seroquel but continues to struggle with daily functioning and has been unable to work since 2003 due to her mental health impairments.

VERY IMPORTANT: You must not include information that cannot be determined from the data provided. Peoples’ lives and safety depends on this!

(Then from here I copy-pasted a long list of obs.)
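(If anyone wants to script that step rather than copy-pasting from Legacy Admin, the obs could likely be pulled via the OpenMRS REST API and flattened to text, roughly as in the sketch below; the server URL, patient UUID, credentials, and query parameters are placeholders to check against your own instance.)

```python
# Sketch: pull a patient's obs from an OpenMRS instance and flatten them into plain
# text suitable for pasting into (or sending to) an LLM prompt. URL, UUID, and
# credentials are placeholders; verify the REST parameters against your own server.
import requests

BASE = "https://dev3.openmrs.org/openmrs/ws/rest/v1"   # illustrative server
PATIENT_UUID = "patient-uuid-here"                     # placeholder
auth = ("admin", "Admin123")                           # demo credentials, placeholder

resp = requests.get(f"{BASE}/obs",
                    params={"patient": PATIENT_UUID, "v": "default", "limit": 100},
                    auth=auth)
resp.raise_for_status()

# Each obs carries a human-readable "display" value like "Weight (kg): 69.0".
obs_lines = [obs.get("display", "") for obs in resp.json().get("results", [])]
obs_text = "\n".join(obs_lines)
print(obs_text)
```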


Here are some screenshots showing where I called out ChatGPT on inventing information:

(These were all with model 4o)


For completeness I also used a different fake patient, a child, and added a copy of his medication history. This definitely improved the summary quality but introduced new issues:

View the full example exchange for Joshua Johnson here:


It’s worth pointing out another factor that’s easy to miss when using the chat interface: because of the way the chat interface is designed, each individual generation of a response (a patient summary, here) could be produced by a different fine-tuned model within the selected model family. This means that if you re-generate the response, you will see different output (which is by design) that may contain different kinds of mistakes, errors, or omissions.
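If the same experiment were run through the API instead of the chat UI, this run-to-run variability could at least be reduced (not eliminated) by pinning a dated model snapshot and the sampling parameters; a minimal sketch with illustrative values:

```python
# Sketch: pinning a dated model snapshot and the sampling parameters so that
# regenerations are more comparable. This still does not guarantee determinism.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # dated snapshot rather than the floating "gpt-4o" alias
    temperature=0,              # minimize sampling randomness
    seed=42,                    # best-effort reproducibility across regenerations
    messages=[{"role": "user", "content": "Summarize this patient record: ..."}],
)
print(response.choices[0].message.content)
```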

It’s also worth noting that because OpenAI preserves and stores the content you enter, if you enter real patient data into ChatGPT, you may run afoul of any applicable patient privacy laws. (OpenAI offers an “opt-out” from having entered data, i.e. “content”, used for training, but even opting out does not guarantee they do not otherwise retain the data.)


Very interesting @grace … yes, I was going to add the privacy issues as well but @ibacher beat me to it.

At PIH we have formed an AI working group that, among other things, is coming up with the AI Guidelines for the organization.


Excellent analysis and very timely! I agree with everything already said. I think that prompt engineering could help with the organization of the summary, or perhaps using AI to perform an “orientation” prior to the encounter (rather than an overall summary) might be an interesting experiment. We would have to have a local model where patient data was not shared, I think. For example, asking for a summary of any pertinent follow-up or results that have come in since the last time I saw this patient? What was the plan at the end of the last encounter and has anything changed? etc.


It is nice. I think this task is actually possible to do with LLMs.

Sorry for duplicating my answer; I want to write it here too because I want to make sure you know about these very likely mistakes.

You cannot give examples like that.

You have to establish the format just by directing it, and that’s why I would definitely try without examples; in this case, the examples themselves can actually be a source of hallucination.

I suspect this because we also use LLMs, give zero examples, and still get a nice format. We have also never seen any hallucination, or… in a very rare case it was just not evident to us. I think it can happen if we ask the model to do something that is outside its knowledge. Also, in our case, if the patient age was specified (e.g. 58) but the report said 60, it was because the model had already used newer data.

Also, stick to the best models; you cannot go with small or mini models (so Sonnet 3.5 is not a bad pick, and maybe 4o and o1-preview). (Also, the o1-preview system prompt is sort of overwritten by the internal CoT direction, so you have to be aware of that.)
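As a sketch of what a format-directing instruction with zero examples could look like (the wording is purely illustrative), it would replace the whole Examples section of the prompt:

```python
# Illustrative only: a zero-shot, format-directing instruction that replaces the
# few-shot example summaries, so there are no example details for the model to copy.
FORMAT_ONLY_PROMPT = (
    "Summarize the patient record below in 100-150 words as one paragraph, covering: "
    "demographics (only if recorded), active problems, key recent results, social "
    "concerns, and pending follow-ups. Include only facts stated explicitly in the "
    "record; if something (such as age or a diagnosis) is not recorded, omit it "
    "rather than guessing."
)
```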

@grace This is really interesting, thank you for the detailed summary of your experiment.

My immediate thought came back to a more clearly defined problem statement:

On average, ER / ED doctors have 15 seconds to understand a patient’s history before making a decision…

Experimenting with LLMs to provide that information seems reasonable, though we can also improve the (or a) summary by design.

Could we utilise available data to create a short, (possibly) narrative-structured summary of the patient, including conditions, allergies, active meds, etc., instead of relying on a user reviewing many different tiles and pages to build a picture?
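As a rough, non-LLM sketch of that idea, the pieces could be pulled from the OpenMRS FHIR API and dropped into a fixed narrative template. The server URL, credentials, resource parameters, and field handling below are assumptions to illustrate the shape of it, not a tested implementation:

```python
# Sketch: a purely rule-based narrative snippet built from OpenMRS FHIR R4 data,
# with no LLM involved. Server URL, credentials, and field handling are illustrative.
import requests

BASE = "https://dev3.openmrs.org/openmrs/ws/fhir2/R4"  # illustrative server
PATIENT_UUID = "patient-uuid-here"                     # placeholder
auth = ("admin", "Admin123")                           # demo credentials, placeholder

def fetch(resource_type, **params):
    params["patient"] = PATIENT_UUID
    bundle = requests.get(f"{BASE}/{resource_type}", params=params, auth=auth).json()
    return [entry["resource"] for entry in bundle.get("entry", [])]

def coded_display(resource, field="code"):
    concept = resource.get(field, {})
    coding = concept.get("coding") or [{}]
    return coding[0].get("display") or concept.get("text", "?")

conditions = [coded_display(r) for r in fetch("Condition")]
allergies = [coded_display(r) for r in fetch("AllergyIntolerance")]
meds = [coded_display(r, field="medicationCodeableConcept")
        for r in fetch("MedicationRequest", status="active")]

print(f"Conditions: {', '.join(conditions) or 'none recorded'}. "
      f"Allergies: {', '.join(allergies) or 'none recorded'}. "
      f"Active medications: {', '.join(meds) or 'none recorded'}.")
```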

Thanks @grace for the topic.

Last month I attended the HL7 WGM conference in Atlanta, which had an AI track where Clinica Alemana from Chile presented their usage of AI in their facilities. I was only there for the AI track for a day, so I haven’t seen more… I’m not sure if they always have an AI track, but if that’s the case, it might be a good way to get a high-level picture of AI usage in healthcare for those who are interested.

At Clinica Alemana they have trained their own LLM (sorry, I didn’t catch or ask for more details on which model they used) with clinical data only. Since it’s a much smaller model than OpenAI’s GPT models, it’s less able to infer things and it’s not trained to do so. There are no privacy issues either, since all knowledge stays in their system. They use it to create a patient visit summary (a shorter period, as @akanter suggested) that is basically a template to be reviewed by doctors at the end of the visit. The content is clearly marked as AI-generated in their UI, and only once reviewed is it treated as reliable.

As I understand it, at this point it’s basically used to save doctors some typing and to be more concise than what humans sometimes produce :wink: Since it’s a summary of just a short period, it can be easily reviewed and edited.

The more exciting use of AI in healthcare, for me, is the automatic detection of some diseases based on factors AI has learned from thousands of patients (their CT scans, lab tests, etc.), something a doctor learns to do to some degree only after years of clinical experience. I haven’t dived much into healthcare AI in my engineering journey, other than reading some research papers months back… Maybe some day :slight_smile:


This is an interesting issue, so thanks @grace for initiating the discussion. A few thoughts:

  • the examples bias the model, as @havlikmarcell pointed out. The model puts more weight on what you show it, so if you have an age in your examples, it will also try to fit one in.
  • the instruction is very long, and models don’t have that much attention. I find that short questions and direct tasks are more accurate. For example, if you ask “What’s George’s age?”, it will admit it doesn’t have that information. So perhaps there should be an intermediate step that asks for the facts, before those are passed to a final generative step that completes some form of template (see the sketch after this list)? Is that retrieval-augmented generation (RAG)?
  • We need to experiment more with these tools because, it seems, we cannot run away from them. If one group does test them and put up the guardrails, another, less experienced group (and one without many resources) will try to use them to the detriment of patients (kind of like those embarrassing academic articles with AI-generated text that seep through peer review: AI scandal rocks academia as nearly 200 studies are found to have been partly generated by ChatGPT | Daily Mail Online).
  • I’m happy to participate in initiatives, if any, that the OpenMRS community might be taking to explore this space.
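To make the intermediate-step idea above concrete, here is a rough sketch (prompt wording, field names, and file paths are purely illustrative): the model is first asked only for facts it can point to in the obs text, marking anything missing as "unknown", and a fixed template then renders only the known fields, so nothing un-extracted can slip into the summary.

```python
# Sketch of the two-step idea discussed above: (1) structured fact extraction with
# "unknown" for anything absent, (2) deterministic templating of the known facts only.
# Field names, prompt wording, and file paths are illustrative.
import json
from openai import OpenAI

client = OpenAI()
obs_text = open("obs_export.txt").read()  # placeholder for the pasted obs data

extraction = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content":
            "Extract ONLY facts explicitly present in the text. Return JSON with the "
            "keys age, sex, diagnoses, last_weight, follow_ups. Use \"unknown\" for "
            "anything not explicitly stated."},
        {"role": "user", "content": obs_text},
    ],
)
facts = json.loads(extraction.choices[0].message.content)

def fmt(value):
    return ", ".join(value) if isinstance(value, list) else str(value)

# Deterministic templating step: fields the model marked "unknown" are simply omitted.
parts = []
for key, label in [("diagnoses", "Known diagnoses"),
                   ("last_weight", "Last recorded weight"),
                   ("follow_ups", "Follow-ups needed")]:
    if facts.get(key) not in (None, "unknown", [], ""):
        parts.append(f"{label}: {fmt(facts[key])}.")
print(" ".join(parts) or "No summarizable facts extracted.")
```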