A medical student asks ChatGPT for help with a difficult diagnosis. A nursing educator uses Claude to prepare a case study before class. A faculty team tests Gemini to help manage students’ special consideration requests around the clock.
None of this is hypothetical. It’s already happening, and it’s happening fast.
GenAI is now part of everyday health education. Learners use it to make sense of hard topics, practise clinical reasoning, and prepare for patient conversations. Educators use it to create quizzes, develop teaching resources, and cut through admin.
The speed of adoption is striking, but the evidence behind many of the claims is still thin, especially in health professions education.
Promising is not the same as proven
To be clear: GenAI almost certainly has a role to play in health education. It may help learners practise more, get feedback faster and access support when they need it most. It may help educators spend less time on repetitive work.
But “may help” is doing a lot of heavy lifting in most of these conversations.
There is some evidence. But right now, it's patchy, inconsistent, and rarely strong enough to match the claims being made. Much of the enthusiasm comes from short interventions, small samples, limited contexts, and studies that sometimes conflate performance with learning (see here, here, and here). Those are reasonable starting points. They aren't the same as robust evidence of educational value.
We must also distinguish between doing better on a task while using GenAI and actually developing knowledge or skill.
General-purpose GenAI use may improve task performance without improving sustained learning, and may sometimes even harm it.
A tool can produce fluent, well-structured feedback and still do nothing for learning. A chatbot can answer exam-style questions and still weaken a student’s judgement if they start outsourcing their thinking to it. A system that saves one educator an hour can create two hours of extra work for someone else.
So maybe there is space here to invest in pedagogically designed GenAI tools, but that's a topic for another discussion.
In health education especially, technical capability is not enough. What matters is what actually changes because of it.
We keep confusing adoption with impact
This is where the current debate slips up. We blur adoption with impact. We confuse efficiency with improvement. We mistake “promising” for “proven”.
That matters a lot, because health education is not simply about passing on information. It's about shaping judgement, professional identity, accountability and safe practice.
We’re not just teaching students “what to know”. We’re teaching them how to think, how to act, how to make decisions that affect real people, how to work with others, and why it matters.
So what would “better” evidence look like?
Here’s a useful rule of thumb: The bigger the claim, the stronger the evidence should be.
If the claim is that GenAI gives better feedback, what’s in place to ensure it’s specific and accurate? Do learners actually act on it? Does it build their ability to evaluate their own work, or does it just give them an answer? Does it develop feedback literacy over time?
If the claim is that GenAI improves performance, a short-term bump in test scores is not enough. Does the learning stick? Does it transfer to new situations? Does it support reasoning or replace it?
If the claim is about assessment – for example, using GenAI in grading or progression decisions – the bar should be higher again. Any such system should be judged for fairness, transparency and bias. A system can be consistent and still be unfair.
And if someone claims that an educational GenAI tool will ultimately benefit patients, that claim deserves very careful handling.
We need evidence that links teaching practice to clinical behaviour, and clinical behaviour to health outcomes. That is a long chain. It should not be assumed. (See ChatGPT Health, Google AI health and Claude for Healthcare and Life Sciences.)
Not every tool needs a randomised trial
None of this means GenAI tools should be locked in a drawer until someone runs a randomised controlled trial. That’s not realistic. An educator using GenAI to draft a formative quiz is a very different situation from a system using it to flag struggling students or inform high-stakes decisions about progression.
But every claim carries an obligation. The claim and the evidence should match.
This matters especially because the most important outcomes in health education are often invisible in the short term. Good health professionals aren't made by information alone. They develop over years through practice, reflection, feedback, supervision, experience and the broader conditions in which they learn and work.
Some of those processes may be improved by GenAI. Some may be narrowed by it, particularly if learning becomes too focused on what is easy to prompt, score or automate.
Responsibility is shared
Improving the evidence base is not one person’s job.
Researchers need to be clear about what they’re claiming. Educators need to distinguish between experimenting with a tool and demonstrating that it improves learning. Universities need to resist turning pilot projects into policy before the consequences are understood.
Industry and accrediting bodies need to support AI literacy without endorsing untested systems. Vendors need to make narrower and more honest claims. Learners need to be treated as partners in the conversation, not simply as end users or risks to be managed.
We also need to keep evaluating GenAI after it enters classrooms, clinics and curricula. Educational tools do not remain stable once they’re introduced into real settings. Students adapt. Educators redesign tasks. Institutions change rules. GenAI models are updated, sometimes substantially and with little warning. The context keeps changing.
What works for one cohort, discipline or assessment may not work for another. This means evaluation cannot be a one-off event. It needs to be built into implementation.
The real question
The question now is whether we're willing to hold GenAI to the same standards we expect of everything else in health and medicine.
We would not introduce a new clinical intervention on the basis that it "seems useful" or "may help". We would not accept vague claims of benefit without evidence that those benefits are real, meaningful and sustained.
Education should not be different simply because the risks are less visible or take longer to emerge.
If GenAI is to play a meaningful role in preparing future health professionals, then it needs to be judged with the same care we apply to the rest of health practice.
Not simply because it is new. But because it matters.