Kristyn Beam, the paper’s first author, has also been impressed with the AI’s capabilities — though she admits rooting against it on the test.
“I wanted it not to do well, so from that perspective I was happy,” she said. “It’s a little bit of an existential thing, where you’ve trained for decades to be able to do all these things, then a computer can just come and all of a sudden do it.”
She realizes, however, that newer versions of the model will perform better — they’re now testing the next iteration, GPT-4, against the same test and against the anesthesiology board exam — and that once humans figure out what it can and can’t do, it will be a potentially powerful tool in doctors’ offices and hospital clinics.
“I think if you move past that initial resistance and say, ‘This is coming, how can this actually help me do my job better?’ then you can move past the feeling of ‘What have these past several decades been for? What did I do all this hard work for?’” she said. “It is really important to figure out how to bring that into the clinical world and to bring it in safely, so that we’re not affecting patients in a bad way but using every tool available to us to deliver the best care we can.”
Part of that process will depend on understanding what these large language models are and why they do what they do, said Andrew Beam, who is the founding editor of a new journal, NEJM AI, focusing on AI in medicine.
These models are fundamentally prediction machines, he said, and are extraordinarily sensitive to prompts while insensitive to things a human respondent might think important, like what the user actually wants or even whether the answer is right.
For more technical requests, in fact, wrong answers may be common simply because most of the humans who answered the question online got it wrong. A workaround, he said, lies in the prompt: asking the model to answer as if it were an expert or the smartest person in the world.
“Imagine that it’s read 1,000 questions that are very similar, and imagine that it’s a very difficult question and it’s reading it on posts and nine times out of 10, the text that it sees coming next is wrong. The average completion it actually sees is wrong,” Andrew Beam said. “But it’s able to kind of triangulate and know that the really smart people actually get this right. … Often the default completion is the correct one, but in some instances, it’s not. You have to kind of trick it into giving you the correct one.”
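The "answer as an expert" workaround Beam describes can be sketched in code. A minimal illustration, assuming an OpenAI-style chat message format (role/content dictionaries); the function name and the system-prompt wording are hypothetical, not from the article or any specific API:

```python
# A minimal sketch of the "answer as an expert" prompt workaround.
# The role/content message structure follows the common chat-model
# convention; the exact framing text is illustrative only.

def build_expert_prompt(question: str) -> list[dict]:
    """Wrap a user question in a system prompt that frames the
    model as an expert, nudging it toward the completions that
    the 'really smart people' in its training data would write."""
    return [
        {
            "role": "system",
            "content": (
                "You are a leading expert in this field. Answer as the "
                "smartest person in the world would, reasoning carefully "
                "before committing to an answer."
            ),
        },
        {"role": "user", "content": question},
    ]

# Example: the same question, now framed as a request to an expert.
messages = build_expert_prompt("What is the mechanism of action of propofol?")
```

The point is not the specific wording but the shift in which completions the model considers likely: the framing steers it away from the average (often wrong) answer seen in its training text.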
Another issue is what are called “hallucinations”: when the answer isn’t in its data set, the large language model can make things up, including sources that are formatted to look quite convincing but are entirely imaginary.
It’s important to be aware of these limitations, but Andrew Beam said he doesn’t think they’ll be problems for long. None of them are problems of fundamental theory, he said, and workarounds are already being devised. Creating prompts that result in correct answers has been recognized as important enough that “prompt engineering” has become a new job description.
“I think of it almost like incantations where you have to say the right mystical phrase to the AI to get it to do the thing you want it to do,” Andrew Beam said. “A lot of people don’t realize that it will just happily make things up that sound completely, completely realistic.”
A corollary to all this, Beam said, is that it is important to know which version of a particular large language model you’re using. For example, ChatGPT 3.5, released late last year, is still freely available on the company’s website even though another version, GPT-4, is more accurate. That version is available on a subscription basis. Most users will likely be attracted to the free tool and should keep in mind its limitations, he said.
“AI has been the thing that I’ve been interested in for 15 or 20 years and it has always been something that will happen, not something that is happening,” Andrew Beam said. “I definitely feel like something is happening now. This feels qualitatively different.”