
Dr ChatGPT? Assessing the Mental Health Care Capabilities of Chatbots

Frontiers in AI & Mental Health: Research & Clinical Considerations

A recurring series exploring cutting-edge research and clinical applications of artificial intelligence in mental health treatment

by Christopher Campbell, MD Candidate (Medical University of South Carolina) (with assistance from ChatGPT-4o/Scholar AI in summarizing the research studies)

Artificial intelligence is being investigated for use in mental health from multiple angles, each offering new possibilities and challenges. One is AI's role in diagnosing mental illness and detecting relapse. Novel diagnostic capabilities are being explored that rely on analyzing vast datasets, including biometric data from smartwatches and insights derived from social media activity. Such diagnostic abilities are fascinating and will likely expand the scope of care that mental health professionals can provide. Those (and other) novel diagnostic approaches will be explored in a future issue of this column.

My focus here is AI’s ability to assess patients based on their reported symptoms and clinical presentation, similar to how a clinician would evaluate a patient in a traditional diagnostic setting.

The ability to make a diagnosis is a crucial component of mental health care, as the given diagnosis guides the next steps in treatment and clinical management. This raises key questions: Can artificial intelligence (and specifically large language models or “LLMs”) accurately interpret clinical scenarios and provide appropriate treatment recommendations? Could they eventually function as mental health care clinicians?

To answer such questions, we must first consider the core pillars essential for mental health care: distilling subjective and objective information into an appropriate assessment, formulating a suitable treatment plan, and ensuring treatment administration or coordination. Of course, clinical patient care involves other essential aspects, including establishing trust and rapport, refining the initial differential diagnosis, and providing psychotherapy. While those play vital roles, this article will focus on the initial phase of the clinical encounter: medical knowledge, diagnostic abilities, and treatment recommendations.

A variety of research studies have assessed the ability of ChatGPT and other LLMs to perform various clinical functions, including demonstrating medical knowledge, providing guidance on treatment recommendations, and generating accurate psychodynamic patient formulations. While this article focuses on AI's ability to assess and manage clinical conditions, future discussions will explore its application in other important areas of mental health care, most notably the relationship between patient and clinician, which is a cornerstone of psychotherapy.

For now, I want to provide an overview of LLMs' clinical capabilities regarding assessment, diagnostics, and treatment planning, identify their current limitations, and explore key considerations for future research and development. I will first examine LLMs' performance in general medicine to establish their broader capabilities before considering their performance in mental health domains, including their ability to create psychodynamic formulations and appropriate psychiatric assessments and plans.

LLMs and Medical Knowledge

Medical knowledge is a prerequisite for any clinician's ability to provide an appropriate assessment and plan. Therefore, assessing how well large language models perform on medical licensing exams is a key step in evaluating their potential to provide sound clinical care. To date, multiple studies have examined how well ChatGPT and other LLMs perform on the USMLE (the United States Medical Licensing Examination) and other standardized assessments of medical knowledge; their results are explored in the following section.

LLM Performance on Standardized Medical Examinations

Earlier large language models, including ChatGPT-3.5 and Med-PaLM [a large language model designed by Google specifically for medical use and trained on medical data], have been the focus of multiple studies assessing their ability to pass medical licensing examinations. These studies drew sample questions from a variety of sources, including freely available official USMLE practice tests and practice questions from widely used, evidence-based medical education resources such as the AMBOSS question bank. A study published in Nature demonstrated that Med-PaLM was the first LLM capable of passing versions of the USMLE (Singhal et al., 2023). Additionally, ChatGPT-3.5 (the most up-to-date ChatGPT model available at the time) performed at or near the passing threshold (approximately 60% correct) on USMLE examination material, with accuracy ranging from 42% to 64% depending on the specific analysis performed (Kung et al., 2023; Gilson et al., 2023).

Subsequently, similar study designs have assessed the USMLE performance of the newer, more advanced GPT-4 and Med-PaLM 2 models. One such study, published in Scientific Reports, demonstrated that GPT-4 performed significantly better than ChatGPT-3.5, achieving 90% accuracy compared with 62.5%. Notably, GPT-4 also performed better than the average human user on questions from the practice question bank, suggesting that LLMs may now outperform medical students on standardized medical knowledge questions (Brin et al., 2023).

Overall, the data indicate that LLMs currently meet or exceed medical students' performance on medical licensing examinations. This suggests that LLMs not only have the required medical knowledge but are also able to apply that knowledge appropriately in standardized testing scenarios.

LLM Performance on Clinical Reasoning Questions

The USMLE examinations encompass a wide range of question types. Questions are often one or more paragraphs long and depict diverse clinical scenarios. USMLE and other standardized test questions range from basic fact recall (first-order) to complex, multi-step clinical reasoning (third-order). While such questions sample a broad spectrum of knowledge and reasoning abilities, assessing LLMs' ability to handle higher-order clinical reasoning questions is crucial.

Multiple studies have examined GPT-4's performance on such clinical reasoning questions, demonstrating that it can outperform first- and second-year medical students on clinical reasoning questions (Strong et al., 2023). Furthermore, a study by Ali et al. (2023) demonstrated that GPT-4 could handle advanced clinical management in neurosurgery board exam case scenarios, achieving a passing score of approximately 82.6%. An additional study by Katz et al. (2024) demonstrated that GPT-4 outperformed the median clinician score in internal medicine and psychiatry; GPT-4 also performed better than a large fraction of clinicians in other disciplines of medicine.

While ChatGPT has been shown to perform well on clinical reasoning questions and cases, a study by Shikino et al. (2024) that focused on cases involving atypical disease presentations found that ChatGPT's diagnostic performance diminished as the degree of atypicality increased, highlighting a key limitation.

LLM Performance Expected to Continually Improve

The results of these studies indicate that ChatGPT and similar LLMs demonstrate sufficient medical knowledge and are proficient at applying that knowledge in clinical scenarios. Notably, the majority of these studies were conducted with now-outdated models, and significant improvements have already been demonstrated when comparing 2022 models with 2023 models (Nori et al., 2023; Katz et al., 2024). Several newer models have since been released, such as ChatGPT-4o (released in 2024) and, most recently, OpenAI's o3-mini and o3-mini-high reasoning models (released in early 2025). It is likely that if the aforementioned studies were repeated with these newer models, performance would improve further.

LLMs in Mental Health Care

Having considered the general medical capabilities of LLMs, let's shift our attention to LLM performance in mental health care specifically. While every health care discipline has its complexities and nuances, a large portion of general medicine lends itself well to algorithmic approaches to care, given the objective data (in the form of specific biomarkers, imaging results, physical exam findings, and more) that drive the subsequent diagnosis, assessment, and plan. Though objective data are also used in mental health, the field largely specializes in working with the patient's subjective experience and psychosocial history. Additionally, psychodynamic approaches to mental health care require insight into and interpretation of family dynamics, transference, dreams, defenses, and other complex phenomena. Although algorithmic approaches have value in mental health care, the field also demands a significant degree of nuance and an individualized approach to understanding and addressing the complex interplay of lived and subjective experience unique to each patient.

Ultimately, mental health clinicians work with written and spoken language, so LLMs' performance with language is worth exploring. One might speculate that the relative lack of objective data would limit the usefulness of LLMs in mental health care. At the same time, it is plausible that LLMs may perform well in mental health domains, given their established capabilities with language.

To examine these issues more deeply, I selected the following studies to highlight LLMs' abilities to form appropriate mental health assessments and plans.

The first study, Assessing the potential of ChatGPT for psychodynamic formulations in psychiatry: An exploratory study, examined the ability of ChatGPT-4 to create a psychodynamic case formulation based on a previously published psychoanalytic case report (Hwang et al., 2024). ChatGPT analyzed the case and was given four distinct prompts, each with varying levels of specificity, to generate a psychodynamic formulation. The responses were then evaluated for appropriateness by five independently working psychiatrists[1]. ChatGPT performed remarkably well, achieving 80-100% appropriateness across evaluators. Such findings are important because they demonstrate that a generic chatbot with no specific psychoanalytic training can formulate appropriate psychodynamic responses. This technology could be used for various purposes, including aiding practicing therapists in their own formulations or helping students who are learning the basics of these modes of thought.

A subsequent study, ChatGPT is not ready to provide mental health assessments and interventions, completed in July 2023, more broadly assessed ChatGPT's capability to function as a psychiatric provider by forming an assessment and plan for three increasingly complex patient scenarios. Each scenario involved a patient presenting with difficulty sleeping, with progressively more complex clinical pictures, psychosocial circumstances, and comorbid conditions. The assessment and plan ChatGPT created for each scenario were evaluated for effectiveness, accuracy, and appropriateness by a panel of psychiatrists from various psychiatric institutions. Ultimately, the study found that ChatGPT provided relatively appropriate recommendations for the least complex scenario but failed to provide appropriate responses for the two subsequent, more complex cases.

While this study demonstrated that ChatGPT was not able to provide appropriate and safe treatment plans across all levels of patient complexity, it is possible that ChatGPT's responses would be more clinically relevant if it were given a more tailored and specific set of prompts. Further, given the significant advances in LLM technology in the nearly two years since this study's release, newer models will likely demonstrate improved clinical capabilities. As such, ongoing research is necessary to monitor the rapidly developing capabilities of newer LLM models.

Conclusion

The studies offered here demonstrate that LLMs are now capable of applying medical knowledge to perform at or near human levels on various types of standardized testing. Additionally, LLMs currently display some proficiency in forming appropriate assessments and treatment plans in mental health care, though these abilities are limited, especially in cases of increased complexity or atypicality (Shikino et al., 2024; Dergaa et al., 2024). Models can be expected to continue improving in their clinical abilities as the technology powering them advances.

Regardless of the potential of AI to ultimately perform at or above human levels in mental health care, a vital question remains: even if we can replace aspects of clinical care with AI, should we?

There are a variety of ways in which AI can be integrated into mental health care, such as note writing, diagnostic assistance, assessment and plan formulation, augmentation of psychotherapy, and monitoring for mental illness and relapse.

The case could be made that AI's scalability justifies its widespread implementation: the net good, in the form of decreased cost and increased accessibility, would outweigh not using it at all. But there is a multitude of risks, implications, and long-term consequences to that implementation. One of particular importance in mental health is the implicit devaluation of the human-to-human relationship. That alone may indicate that the negative impact of fully integrating AI into mental health care would outweigh the apparent benefits.

Ultimately, it's the mental health care community and the trained clinicians at the front lines of patient treatment who should lead determinations about the degree to which clinicians should or should not incorporate LLMs and other AI-based technology into clinical care. The answers are complex and cannot be adequately addressed in this brief piece. However, I think we need outcome data from research studies exploring varying degrees of integration (from minimal LLM assistance to full LLM care management), patient perspectives and preferences on the matter (to ensure respect for patient autonomy), and clarity on the legal and malpractice implications of delegating care decisions to autonomous non-human agents. Such research must be continuously evaluated by clinicians and policymakers to advance, slow, or redirect the integration of AI into mental health care in ways that best align with the needs of our patients.

[1] The study does not specify whether any or all of the psychiatrists were educated and trained (or licensed) in psychoanalysis or psychodynamic therapy. While I suggest that this materially bears on the value of the study, the research and outcomes are still of interest.

 

References:

Ali, R., Tang, O. Y., Connolly, I. D., Fridley, J. S., Shin, J. H., Zadnik, P. L., Cielo, D., Oyelese, A. A., Doberstein, C. E., Telfeian, A. E., Gokaslan, Z. L., & Asaad, W. F. (2023). Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Neurosurgery, 93(5). https://doi.org/10.1227/neu.0000000000002551

Brin, D., Sorin, V., Vaid, A., Soroush, A., Glicksberg, B. S., Charney, A., Nadkarni, G. N., & Klang, E. (2023). Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports, 13(1). https://doi.org/10.1038/s41598-023-43436-9

Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., & Chartash, D. (2023). How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education, 9(9), e45312. https://doi.org/10.2196/45312

Hwang, G., Lee, D. Y., Seol, S., Jung, J., Choi, Y., Her, E. S., An, M. H., & Park, R. W. (2024). Assessing the potential of ChatGPT for psychodynamic formulations in psychiatry: An exploratory study. Psychiatry Research, 331, 115655. https://doi.org/10.1016/j.psychres.2023.115655

Katz, U., Cohen, E., Shachar, E., Somer, J., Fink, A., Morse, E., Shreiber, B., & Wolf, I. (2024). GPT versus Resident Physicians — A Benchmark Based on Official Board Scores. NEJM AI, 1(5). https://doi.org/10.1056/aidbp2300192

Shikino, K., Shimizu, T., Otsuka, Y., Tago, M., Takahashi, H., Watari, T., Sasaki, Y., Iizuka, G., Tamura, H., Nakashima, K., Kunitomo, K., Suzuki, M., Aoyama, S., Kosaka, S., Kawahigashi, T., Matsumoto, T., Orihara, F., Morikawa, T., Nishizawa, T., & Hoshina, Y. (2024). Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases with Atypical Presentation: Descriptive Research. JMIR Medical Education, 10, e58758. https://doi.org/10.2196/58758

Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., & Tseng, V. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2), e0000198. https://doi.org/10.1371/journal.pdig.0000198

Google Research. (n.d.). Med-PaLM. https://sites.research.google/med-palm/

Nori, H., King, N., McKinney, S. M., Carignan, D., & Horvitz, E. (2023). Capabilities of GPT-4 on Medical Challenge Problems. Microsoft Research. https://www.microsoft.com/en-us/research/publication/capabilities-of-gpt-4-on-medical-challenge-problems/

Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Babiker, A., Schärli, N., Chowdhery, A., Mansfield, P., Demner-Fushman, D., et al. (2023). Large language models encode clinical knowledge. Nature, 620. https://doi.org/10.1038/s41586-023-06291-2

Strong, E., DiGiammarino, A., Weng, Y., Kumar, A., Hosamani, P., Hom, J., & Chen, J. H. (2023). Chatbot vs Medical Student Performance on Free-Response Clinical Reasoning Examinations. JAMA Internal Medicine, 183(9), 1028–1030. https://doi.org/10.1001/jamainternmed.2023.2909

University of Arizona Libraries. (2023). What is a large language model (LLM)? https://ask.library.arizona.edu/faq/407985

 

 

 

CAI Report Editor