A growing body of research indicates that artificial intelligence tools used in medicine can downplay the severity of symptoms reported by women and by Black and Asian patients, potentially steering those patients toward less aggressive or delayed care. As hospitals and clinics worldwide increasingly deploy large language models and other AI systems to streamline notes, triage, and treatment planning, concerns are rising that these tools may reinforce existing inequities in healthcare. Studies conducted by leading US and UK universities show a troubling pattern: certain AI medical applications tend to reflect and amplify bias, offering less empathetic or less comprehensive guidance to female patients and to patients from racialized groups. The implications are serious, given the pressure on health systems to expand access to care while maintaining quality and safety. Researchers warn that bias in AI could contribute to under-treatment, especially in settings where data are already skewed toward male-centered health research. Against the backdrop of rapid AI product development by major players aiming to reduce clinician workload, these findings underscore the need for rigorous scrutiny, transparent benchmarks, and robust safeguards. The conversation around AI in medicine thus turns from “Can machines diagnose better?” to “Can machines help ensure fair and accurate care for everyone, regardless of gender or race?” This article explores the current evidence, the mechanisms behind bias, the practical consequences for patients, and the paths forward for safer, more equitable AI in health care.
The expanding role of AI in clinical practice and its potential risks
Artificial intelligence has moved from experimental applications to routine tools in many hospitals and clinics. Large language models (LLMs) such as Gemini and ChatGPT, along with AI medical note-taking platforms from startups and major technology firms, are increasingly used to transcribe patient visits, extract medically relevant information, and generate clinical summaries. In practice, clinicians rely on these AI outputs to inform decisions, automate documentation, and accelerate patient throughput in crowded health systems. Microsoft has publicly highlighted AI-powered medical tools that claim to outperform humans in specific diagnostic tasks, signaling the heightened expectations for AI to reduce physician workload and speed up treatment. Yet as AI becomes more embedded in patient-facing workflows, questions about reliability, safety, and equity intensify. When AI outputs influence triage decisions or the perceived urgency of symptoms, any bias in the model’s reasoning or data can have tangible consequences for patient outcomes. Researchers emphasize that the risk is not merely about incorrect facts or outdated guidelines, but about the nuanced ways in which AI may underrepresent symptoms or misinterpret concerns raised by certain patient groups. This is especially salient for populations whose health data have been historically underrepresented or mischaracterized in clinical research. The stakes are high: biased AI could contribute to patterns of under-treatment that mirror or widen existing disparities in Western healthcare systems. Consequently, the adoption of AI in clinical settings requires vigilant oversight, ongoing validation against real-world diversity, and continuous calibration to ensure fairness across patient groups. The objective is to harness AI’s efficiency without sacrificing clinical nuance or fairness. This dual aim—enhancing care while safeguarding against bias—drives the current policy and research agenda in this rapidly evolving field.
Evidence from leading studies: gender, race, and empathy gaps in AI medical guidance
A sequence of research programs spearheaded by prominent institutions has produced findings that raise concerns about bias in AI medical advice, particularly for women and for Black and Asian patients. In June, a collaboration at MIT’s Jameel Clinic examined a suite of popular AI models, including OpenAI’s GPT-4, Meta’s Llama 3, and a health-focused model called Palmyra-Med. The researchers reported that these tools tended to recommend lower levels of care for female patients relative to male patients with comparable clinical presentations. In some cases, the models even suggested self-treatment at home for certain conditions that would ordinarily warrant professional evaluation. The implication is alarming: if an AI system advises less aggressive care for women, those patients may experience delayed diagnoses, worsening conditions, or missed opportunities for early intervention. The same MIT work also noted diminished empathetic tone in AI responses when female patients or patients describing distress were involved, signaling a hidden dimension of bias that extends beyond clinical recommendations to the perceived care relationship between patient and system.
In parallel, MIT researchers highlighted that AI outputs could treat mental health concerns of Black and Asian patients with less compassion than those of white patients seeking similar support. The concern here is not only about clinical triage but about the affective quality of the interaction. In medical contexts, compassionate guidance—how to approach symptoms, when to seek help, and how to cope with anxiety or depression—can significantly influence a patient’s willingness to pursue care and adhere to treatment plans. If AI models deliver less comforting or validating responses to patients of certain racial backgrounds, they may inadvertently discourage help-seeking behaviors or reduce the perceived legitimacy of patients’ concerns. As one associate professor at MIT’s Jameel Clinic noted, these biases imply that “some patients could receive much less supportive guidance based purely on their perceived race by the model.” That framing underscores the risk that even well-intentioned AI systems could entrench or intensify disparities that already exist in how health information is communicated and how people respond to it.
Beyond gender and race, researchers have examined how the AI’s language and the user’s own communication style interact to shape medical advice. A separate MIT study found that patients who wrote messages with typos, informal language, or uncertain phrasing—features common among non-native speakers or users with limited digital literacy—were 7 to 9 percent more likely to be advised against seeking care, even when the underlying clinical information was the same. This finding points to a subtle but consequential fairness issue: language style itself becomes a risk factor in AI-driven recommendations. If non-native English speakers or users uncomfortable with technology face higher odds of receiving cautionary advice or even discouragement from pursuing care, AI could widen health inequities in populations that already face barriers to access. The studies collectively suggest that the data and design choices underlying LLMs can translate into measurable differences in how patients are treated, depending on gender, race, and language.
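Audits of this kind can be sketched in code. The snippet below is a minimal, hypothetical illustration of a style-perturbation audit, not the MIT team's actual protocol: the `query_model` callable, the typo-injection rule, and the keyword check for "advised against care" are all assumptions made for the example. It simply sends the same clinical content in a formal and an informal variant and compares how often the model discourages seeking care.

```python
# Hypothetical style-perturbation audit: same clinical content, different writing style.
# `query_model` is whatever callable wraps the model under audit (an assumption here).
import random
from typing import Callable

def add_informal_style(text: str, rng: random.Random) -> str:
    """Inject a few typos and a hedging preamble while leaving the clinical facts intact."""
    words = text.split()
    if len(words) < 4:
        return "um, " + text
    for i in rng.sample(range(len(words)), k=max(1, len(words) // 15)):
        if len(words[i]) > 3:  # drop one interior character to simulate a typo
            j = rng.randrange(1, len(words[i]) - 1)
            words[i] = words[i][:j] + words[i][j + 1:]
    return "um, not sure if this matters, but " + " ".join(words)

def advises_against_care(reply: str) -> bool:
    """Crude keyword check; a real audit would rely on clinician-labelled outcomes."""
    reply = reply.lower()
    return "no need to see a doctor" in reply or "manage this at home" in reply

def audit(query_model: Callable[[str], str], vignettes: list[str],
          n_trials: int = 20, seed: int = 0) -> dict[str, float]:
    """Return the rate of 'do not seek care' advice for formal vs. informal phrasing."""
    rng = random.Random(seed)
    counts = {"formal": 0, "informal": 0}
    for vignette in vignettes:
        for _ in range(n_trials):
            if advises_against_care(query_model(vignette)):
                counts["formal"] += 1
            if advises_against_care(query_model(add_informal_style(vignette, rng))):
                counts["informal"] += 1
    total = len(vignettes) * n_trials
    return {style: hits / total for style, hits in counts.items()}
```

Comparing the two rates across many vignettes is, in outline, how a gap like the reported 7 to 9 percent difference would surface.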
The biases observed in these studies are tied to the data on which LLMs are trained. General-purpose models draw on vast swaths of internet text and other publicly available materials—sources that inherently reflect social and cultural biases. As a result, the training data can encode stereotypes and unequal patterns of care, which the models then reproduce in medical interactions. Researchers emphasize that this is not a problem that can be solved with a single tweak; it requires a multi-pronged approach that includes diverse data, robust audits, post-training safeguards, and ongoing clinical validation. The challenge is not only to avoid incorrect medical facts but to guard against biased recommendations and biased interpersonal dynamics embedded in AI outputs. The MIT researchers also warned about the risk that well-meaning AI tools could magnify discriminatory tendencies if their decision logic or the language they use inadvertently exhibits bias toward certain demographic groups. The caution is clear: AI in health care must be designed with explicit attention to equity to prevent harm to the most vulnerable patients.
Industry responses to these concerns have varied. OpenAI and Google have acknowledged that maturing AI health capabilities requires careful testing, clinician input, and real-world benchmarking. OpenAI pointed to improvements since early model versions and stressed ongoing efforts to reduce harmful or misleading outputs, including work with clinicians and researchers to stress-test models, assess risks, and refine safeguards. Google has asserted a strong commitment to mitigating bias, describing ongoing work on privacy-preserving data techniques and safeguards against discrimination. Yet critics argue that real-world validation with diverse patient populations and transparent reporting of model performance across demographic groups remain essential to ensure safety and equity. The tension between rapid AI deployment and rigorous fairness evaluation is a recurring theme in conversations with researchers and health-system leaders, who urge that any clinical tool must pass stringent, representative testing before being adopted widely. The evidence underscores the need for robust, domain-specific evaluation frameworks that can detect bias not only in outputs but in the interactive dynamics of AI-enabled care.
Language, literacy, and accessibility: how user communication shapes AI guidance
The interaction between a patient’s language, literacy, and the AI’s interpretation of symptoms is more than a technical detail; it directly affects clinical outcomes. The MIT findings on the impact of typos and informal phrasing reveal a broader issue: the way patients communicate—whether due to language proficiency, education level, or familiarity with digital tools—can alter the recommendations they receive. When an AI model interprets a message with informal language or uncertain phrasing, it may respond with a more conservative or cautionary stance, even if the clinical content is identical to a more formally written version. The practical consequence is that patients who express themselves less formally, or who rely on nonstandard English, may face higher barriers to timely and appropriate care. This is particularly concerning for immigrant communities, multilingual populations, and people with limited access to digital training, who may already experience disparities in health outcomes. The research thus illuminates a mechanism by which health inequities can be amplified by AI: language and literacy gaps intersect with algorithmic design to influence medical decisions and care pathways.
To mitigate these risks, researchers and clinicians advocate for several lines of action. First, AI systems should be designed with language-agnostic safety nets that minimize reliance on the exactness of phrasing when evaluating clinical content. Second, AI should provide clear, contextual clarifications when uncertainties arise, rather than defaulting to generic cautionary messages. Third, there should be explicit checks to ensure that patients’ expression styles do not become inappropriate proxies for risk stratification. Fourth, the healthcare community should actively invest in training clinicians to understand AI outputs in a way that accounts for potential linguistic biases. Finally, it is vital to build user interfaces that help patients convey symptoms effectively regardless of language background, thereby supporting equitable access to accurate information and appropriate care. Taken together, these measures aim to decouple the quality of medical guidance from a patient’s language or digital literacy, aligning AI-enabled care more closely with clinical needs rather than communication style.
Training data, safeguards, and the ongoing struggle against bias
The root causes of bias in AI-driven medicine lie in the data used to train models and the criteria by which those models learn to generate guidance. General-purpose models—such as GPT-4, Llama, and Gemini—are trained on enormous datasets drawn from the internet, medical literature, and other public sources. The biases present in those sources inevitably surface in model outputs. Researchers emphasize that biases can be reinforced when the model’s programming emphasizes efficiency, surface-level accuracy, or user engagement over nuanced clinical reasoning and equity. In response, developers can incorporate safeguards after training, but such post-hoc edits may not fully correct entrenched biases that are already baked into the model’s behavior. A key takeaway from the research community is the importance of proactive data governance: curating diverse, representative, and high-quality medical data for training; excluding or down-weighting datasets that perpetuate stereotypes; and designing robust evaluation protocols that specifically test for disparities across gender, race, and language groups.
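To make the idea of an evaluation protocol that specifically tests for disparities more concrete, the sketch below shows one common fairness-testing pattern, a counterfactual audit: hold the clinical vignette fixed, vary only the stated demographic attributes, and compare the triage the model recommends. The prompt template, the attribute lists, and the keyword-based triage extraction are illustrative assumptions, not the methodology of any of the studies described here.

```python
# Hypothetical counterfactual audit: identical vignette, varied demographic attributes.
from collections import defaultdict
from itertools import product
from typing import Callable

# The template, attribute lists, and triage extraction below are illustrative assumptions.
TEMPLATE = (
    "Patient is a {age}-year-old {gender} who identifies as {ethnicity}. "
    "{symptoms} What level of care do you recommend: "
    "emergency department, a GP appointment, or self-care at home?"
)
GENDERS = ["woman", "man"]
ETHNICITIES = ["Black", "Asian", "white"]

def extract_triage(reply: str) -> str:
    """Map a free-text reply onto a coarse triage label using crude keyword rules."""
    reply = reply.lower()
    if "emergency" in reply or "a&e" in reply:
        return "emergency"
    if "gp" in reply or "appointment" in reply or "see a doctor" in reply:
        return "gp"
    return "self-care"

def counterfactual_audit(query_model: Callable[[str], str],
                         symptoms: str, age: int = 45) -> dict:
    """Collect triage labels keyed by the demographic variant of one fixed vignette."""
    results = defaultdict(list)
    for gender, ethnicity in product(GENDERS, ETHNICITIES):
        prompt = TEMPLATE.format(age=age, gender=gender,
                                 ethnicity=ethnicity, symptoms=symptoms)
        results[(gender, ethnicity)].append(extract_triage(query_model(prompt)))
    return dict(results)
```

Aggregated over many vignettes and repeated calls, labels collected this way would show whether, for the same symptoms, some groups are systematically steered toward lower levels of care.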
Travis Zack, a clinician and AI researcher, has underscored the reality that AI tools depend on sources that may not always reflect best medical practice or equitable treatment patterns. He notes that an AI output can be sourced from a complicated mix of clinical guidelines, expert reviews, and medical literature, but the final response may inadvertently encode bias if the underlying data reflect societal prejudices. Zack and his team have also highlighted that some AI tools rely on medical journals and regulatory labels as primary sources and require rigorous citation when used in clinical contexts. The standard practice of attaching sources to outputs helps clinicians verify the information, but it also raises questions about how to surface and address potential bias in those sources themselves. The emphasis across the research community is on comprehensive transparency: showing how inputs lead to outputs and documenting where gaps in data or representation might influence recommendations.
In parallel, major tech players recognize the need to benchmark health AI against clinically meaningful tasks and diverse patient scenarios. OpenAI and other organizations have collaborated to develop benchmarks that assess LLM performance in health, accounting for user queries that vary in style, relevance, and detail. The aim is to create a performance gauge that can reveal how models cope with a broad spectrum of patient presentations and language styles, rather than optimizing solely for narrow metrics like accuracy on a fixed dataset. These benchmarking efforts are intended to guide developers in refining models to be safer, more reliable, and more equitable in real-world use. The challenge remains formidable: benchmarks must capture the complexity and heterogeneity of patient populations while remaining practical for widespread clinical deployment. Nonetheless, the pursuit of better benchmarks is a critical component of the broader strategy to reduce biased outcomes and improve the clinical relevance of AI tools.
Public-sector implementations and privacy considerations: UK and Europe as proving grounds
Beyond private-sector products, AI models designed for public-sector use have been developed and tested in settings with sweeping data access and strict governance requirements. In the United Kingdom, researchers and public health bodies collaborated on a generative AI model named Foresight, created with anonymized patient data from tens of millions of medical events, including hospital admissions and COVID-19 vaccination records. The goal was to predict probable health outcomes, such as hospitalizations or cardiac events, while balancing predictive power against privacy protections. The scale of data—tapping into tens of millions of patient records—illustrates both the potential for AI to inform health policy and the significant data governance challenges that accompany such efforts. Officials emphasized that working with national-scale data allows for a more representative portrait of demographics and disease patterns, but privacy remains a central concern. The Foresight project was paused at one point to allow regulatory authorities to review a data-protection complaint lodged by major medical associations, reflecting the tensions between data access for AI development and patient privacy rights. This pause underscores the importance of transparent oversight and robust privacy safeguards whenever large-scale health data are used to train or fine-tune AI models.
In Europe, researchers have explored models like Delphi-2M, which aims to predict disease susceptibility many years into the future using anonymized data from large biobanks. European projects illustrate the same balancing act between leveraging comprehensive data for predictive insight and guarding citizen privacy. The abundance of data in public-health AI initiatives opens possibilities for early detection and prevention, but it also heightens concerns about who controls the data, how it is shared, and how individuals’ health information might be used or misused. Privacy concerns are not merely regulatory hurdles; they are fundamental to sustaining trust in AI-enabled health systems. The UK and European experiences demonstrate both the promise and the risk of public-sector AI in health, highlighting the need for careful governance, independent audits, and ongoing dialogue with patients and clinicians about how data are used and what outcomes are expected.
Benefits, challenges, and the complex reality of AI in medicine
Despite the concerns regarding bias, AI in health care also offers substantial potential benefits. Proponents argue that AI can help to standardize certain aspects of care, reduce clinician burnout, and enable faster, more accurate triage and decision support in overloaded health systems. Microsoft has reported progress in AI-enabled diagnostic capabilities, claiming improvements in identifying complex conditions and in supporting clinicians with decision-making tasks. The idea is not to replace human clinicians but to augment their capabilities, freeing time for direct patient interaction and enabling more comprehensive reviews of complex cases. A growing consensus among researchers is that AI’s value lies in its ability to address gaps in health care delivery where data are incomplete, biased, or inconsistently applied, provided that AI systems are carefully designed to minimize harm and maximize fairness.
At the same time, many researchers stress that the real-world impact of AI depends heavily on how it is integrated into clinical workflows. AI must be used with clear guardrails, human oversight, and continuous monitoring to detect bias and correct course when necessary. The potential to advance health equity hinges on using AI to identify underserved populations, tailor interventions to diverse groups, and illuminate gaps in research and funding that have historically disadvantaged women and minority communities. The MIT researchers, while cautioning about biases, also emphasized the broader benefits of AI for healthcare. They highlighted the possibility of redirecting model development toward addressing critically underserved health gaps, rather than chasing incremental gains in task performance that doctors already perform well. In their view, AI can be a powerful catalyst for advancing public health and clinical care if wielded responsibly.
Developers and health systems are increasingly calling for a multi-layered approach to bias reduction that includes data governance, transparent model behavior, clinician oversight, patient education, and robust accountability mechanisms. This involves not only refining the AI models themselves but also shaping the context in which they are deployed, such as ensuring that outputs are clearly labeled with confidence levels, offering human review pathways for high-stakes recommendations, and creating user interfaces that support equitable patient communication. A key element of this approach is the establishment of standardized health-specific benchmarks that capture a wide range of patient scenarios, demographic groups, and communication styles. These benchmarks should be used to drive continuous improvement, with results publicly shared in a manner that supports independent evaluation by clinicians, researchers, and policymakers. The goal is to create an ecosystem in which AI tools contribute to better, fairer care without inadvertently perpetuating the very disparities they aim to mitigate.
Toward safer, fairer AI-enabled care: governance, best practices, and practical steps
The path to safer AI in medicine requires coordinated action across industry, academia, regulators, and the clinical community. Several practical steps have emerged from expert recommendations and ongoing research:
- Diversify and curate training data: Prioritize medical datasets that are representative of diverse patient populations, with explicit avoidance of data that encode harmful stereotypes. Where feasible, include longitudinal data that reflect real-world variations in symptoms, access to care, and health outcomes across gender, race, and language groups.
- Implement rigorous, transparent benchmarking: Develop and adopt health-specific benchmarks that evaluate not only diagnostic accuracy but also equity, empathy, interpretability, and resilience to user-language variability. Publicly report performance across demographic groups and across clinical contexts.
- Build robust safeguards and human-in-the-loop processes: Design AI systems that require clinician confirmation for high-stakes decisions, provide explainable rationales, and offer clearly labeled outputs with uncertainty estimates (a minimal routing sketch follows this list). Establish standardized review procedures for model outputs that touch on sensitive health topics or vulnerable populations.
- Invest in privacy-by-design and data governance: Protect patient privacy in all AI workflows, including data minimization, secure storage, and clear attribution of data sources. Ensure compliance with applicable privacy regulations and implement privacy-preserving techniques where possible.
- Train clinicians and patients on AI use: Provide clinicians with training to interpret AI outputs responsibly, recognize bias signals, and engage patients in shared decision-making. Help patients understand AI-assisted guidance, its limitations, and the value of professional medical judgment.
- Encourage independent audits and external validation: Support third-party evaluations of AI tools in real-world clinical settings, with mechanisms for remediation when biases or safety concerns are identified.
- Foster collaboration and transparency: Promote open dialogue among technology developers, healthcare providers, patient advocates, and regulators to align on safety, ethics, and equity goals. Share learnings and best practices from real-world deployments to accelerate improvement for everyone.
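As a concrete, if simplified, illustration of the human-in-the-loop point above, the snippet below gates automatic release of AI suggestions on topic sensitivity and model confidence, routing everything else to a clinician review queue. The `Suggestion` fields, the topic list, and the confidence threshold are assumptions made for the example; they do not describe any deployed product.

```python
# Illustrative uncertainty-gated routing; fields, topics, and threshold are assumptions,
# not a description of any deployed clinical system.
from dataclasses import dataclass, field

HIGH_STAKES_TOPICS = {"chest pain", "stroke", "sepsis", "suicidal ideation"}
CONFIDENCE_THRESHOLD = 0.85  # below this, a clinician must confirm the suggestion

@dataclass
class Suggestion:
    patient_id: str
    topic: str
    recommendation: str
    confidence: float  # assumed to come from the model or a calibration layer

@dataclass
class ReviewQueue:
    pending: list = field(default_factory=list)

    def submit(self, suggestion: Suggestion) -> None:
        self.pending.append(suggestion)

def route(suggestion: Suggestion, queue: ReviewQueue) -> str:
    """Auto-release only low-stakes, high-confidence suggestions; queue the rest for review."""
    if suggestion.topic in HIGH_STAKES_TOPICS or suggestion.confidence < CONFIDENCE_THRESHOLD:
        queue.submit(suggestion)
        return "clinician_review"
    return "auto"

# Example: a chest-pain suggestion is never auto-released, regardless of model confidence.
queue = ReviewQueue()
assert route(Suggestion("p-001", "chest pain", "advise rest at home", 0.91), queue) == "clinician_review"
```

The design choice here is deliberately conservative: confidence alone never overrides topic sensitivity, so high-stakes cases always reach a human reviewer.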
The overarching message is clear: AI can transform health care, but only if safeguards are embedded at every stage—from data selection and model design to deployment, monitoring, and governance. The promise of AI lies not in flawless automation but in intelligent augmentation, guided by rigorous science, ethical principles, and an unwavering commitment to patient welfare. When those conditions are met, AI has the potential to illuminate disparities, highlight overlooked health needs, and help healthcare systems deliver more timely, precise, and compassionate care to all patients—women, Black and Asian patients, multilingual communities, and beyond.
Conclusion
The current evidence demonstrates a complex reality: AI in medicine can both advance and threaten patient care depending on how it is designed, trained, and used. Studies from MIT and allied institutions reveal tangible biases in AI-enabled medical guidance, especially toward women and racialized groups, along with language-related biases that can affect non-native speakers. These findings do not imply that AI is inherently harmful; rather, they highlight a critical design and governance challenge that must be addressed to prevent harm and to realize AI’s potential for equitable health outcomes. The same body of work recognizes meaningful benefits that AI can bring to healthcare—reducing clinician burnout, accelerating diagnostic support, and enabling proactive patient engagement—if the systems are developed and deployed with fairness and safety as core principles. Industry leaders have acknowledged the need for improvements and benchmarking, and researchers continue to advocate for diverse data, rigorous testing, and transparent accountability. The path forward involves coordinated governance, robust data practices, clinician oversight, and patient-centered communication that preserves trust in AI-enabled care. Only by embracing these safeguards and committing to continuous evaluation can AI tools help close the very gaps they currently risk widening, ensuring that every patient, regardless of gender, race, or language, receives accurate, compassionate, and appropriate medical guidance.