Web Conference 2024 · Oral

Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Yiqiao Jin¹*, Mohit Chandra¹*, Gaurav Verma¹, Yibo Hu¹, Munmun De Choudhury¹, Srijan Kumar¹

* Equal contribution

¹ Georgia Institute of Technology

Proceedings of the ACM Web Conference 2024 (WWW '24)  ·  May 13–17, 2024  ·  Singapore

5.82× more incorrect responses in non-English languages
50.5% consistency drop in Hindi vs. English
4 languages × 3 health Q&A datasets evaluated
XLingHealth: the first cross-lingual healthcare benchmark
Abstract

Equity in AI-Mediated Health Information

Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stake domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XLingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages, including English, Spanish, Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XLingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all.

Overview of the XLingEval framework and the XLingHealth benchmark for assessing cross-lingual LLM behavior in healthcare.
Contributions

Three Core Contributions

XLingEval Framework

A unified framework that evaluates LLMs along three healthcare-critical axes: correctness, consistency, and verifiability — combining algorithmic metrics with expert human evaluation. The framework provides a reproducible methodology applicable to any multilingual LLM deployment.

XLingHealth Benchmark

The first cross-lingual healthcare benchmark, spanning four major world languages (English, Spanish, Chinese, and Hindi) over three expert-annotated Q&A datasets: HealthQA, LiveQA, and MedicationQA. The datasets are professionally translated with rigorous quality controls.

Empirical Cross-Lingual Disparity

Comprehensive evidence that LLMs are markedly less safe and reliable in non-English healthcare conversations, with the largest gaps in Hindi and Chinese — calling for equitable, language-aware LLM development and policy reform.

Dataset

The XLingHealth Benchmark

Table 1 · XLingHealth dataset statistics across languages and datasets.

XLingHealth is the first benchmark purpose-built for evaluating multilingual LLM behavior in healthcare. It unifies three established expert-annotated Q&A corpora, each covering a different facet of consumer health information-seeking.

  • HealthQA

    1,134 dev-set Q&A pairs sourced from Patient.info, covering diverse consumer health topics across primary care and specialty medicine.

  • LiveQA

    246 consumer health Q&A pairs collected from NIH-affiliated trusted platforms, reflecting real-world health questions submitted by the public.

  • MedicationQA

    690 drug-related queries submitted to MedlinePlus, paired with reference answers drawn from authoritative medical references.

All three datasets are professionally translated into Spanish, Simplified Chinese, and Hindi with quality controls to ensure medical accuracy and cultural appropriateness.
Methodology

The XLingEval Framework

Figure 2 · Evaluation pipelines for the three XLingEval criteria across languages.
01

Correctness

Compare LLM responses against expert ground-truth answers using both LLM-judge comparative analysis (with chain-of-thought prompting) and human evaluation by medical annotators across all four languages. Responses are categorized along a four-point scale from comprehensive and appropriate to incorrect or misleading, enabling fine-grained cross-lingual comparison.
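Once each response has a categorical judgment, cross-lingual comparison reduces to comparing label distributions. The following is a minimal sketch, not the authors' code: the label strings `"comprehensive"` and `"incorrect"` are hypothetical shorthand for the two ends of the scale, the helper names are invented for illustration, and pooling all non-English judgments before taking the ratio is an assumption about how such a ratio could be computed.

```python
from collections import Counter


def label_shares(labels):
    """Map each judgment label to its share of all judgments."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def incorrect_rate_ratio(judgments, reference="en", incorrect_label="incorrect"):
    """Pooled non-reference incorrect rate divided by the reference-language rate."""
    ref_labels = judgments[reference]
    ref_rate = ref_labels.count(incorrect_label) / len(ref_labels)
    if ref_rate == 0:
        raise ValueError("reference language has no incorrect responses")
    # Pool the judgments of every language other than the reference one.
    other = [lab for lang, labs in judgments.items() if lang != reference for lab in labs]
    other_rate = other.count(incorrect_label) / len(other)
    return other_rate / ref_rate


# Made-up judgments for illustration only; not data from the paper.
toy = {
    "en": ["comprehensive", "comprehensive", "comprehensive", "incorrect"],
    "hi": ["comprehensive", "incorrect", "incorrect", "comprehensive"],
}
print(label_shares(toy["hi"]))
print(incorrect_rate_ratio(toy))
```

A ratio above 1 means incorrect responses are more frequent outside the reference language.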

02

Consistency

Vary the sampling temperature τ and probe whether the model produces stable answers to the same question, scoring at the surface (n-gram overlap, response length), semantic (BERTScore, sentence-embedding similarity) and topic (LDA, HDP) levels. Consistency across sampling runs serves as a proxy for model reliability and confidence calibration.
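As an illustration of the surface-level check, the sketch below scores a set of sampled responses to one question by their mean pairwise n-gram overlap. It is an assumption-laden example rather than the framework's implementation: Jaccard overlap is one possible choice of n-gram metric, the function names are hypothetical, and whitespace tokenization would need to be replaced by character-level or segmenter-based tokenization for Chinese.

```python
from __future__ import annotations

from itertools import combinations


def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pairwise_ngram_consistency(responses: list[str], n: int = 2) -> float:
    """Mean Jaccard overlap of n-gram sets over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    scores = []
    for a, b in pairs:
        ga, gb = ngrams(a, n), ngrams(b, n)
        union = ga | gb
        # Two responses too short to contain any n-gram are treated as identical.
        scores.append(len(ga & gb) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Responses sampled for the same question (toy strings, not model output).
samples = ["take with food", "take with food", "take with water"]
print(pairwise_ngram_consistency(samples))
```

A score of 1.0 means every sampled response shares the same n-grams; lower values indicate less stable wording across runs. The semantic and topic levels would swap the overlap function for an embedding- or topic-based similarity.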

03

Verifiability

Cast the LLM as a discriminator that must authenticate medical claims, measuring macro-precision, macro-recall, macro-F1, accuracy, and AUC over correct versus incorrect or irrelevant answer pairs. This axis tests whether the model can distinguish trustworthy from untrustworthy health information — a critical safety property for deployment.
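For reference, the macro-averaged metrics can be computed from binary verdicts as sketched below. This is a plain-Python illustration under stated assumptions: labels are encoded as 1 (claim accepted as correct) and 0 (rejected), macro-F1 is taken as the unweighted mean of per-class F1 scores, and AUC is omitted because it requires confidence scores rather than hard verdicts. The function name is hypothetical.

```python
from __future__ import annotations


def macro_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Macro-precision, macro-recall, macro-F1, and accuracy for hard labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # Undefined ratios (no predictions or no instances of a class) count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
        "accuracy": accuracy,
    }


# Toy verdicts: two correct claims, two incorrect ones; one correct claim is rejected.
print(macro_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Macro averaging weights both classes equally, so a model that accepts nearly every claim is penalized even when correct claims dominate the data.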

Findings

What We Found

Finding 01

Correctness drops sharply in non-English languages

GPT-3.5 produces 18.12% fewer comprehensive and appropriate answers and is 5.82× more likely to give an incorrect response in non-English languages. The performance degradation persists for the open-source MedAlpaca model, indicating the gap is not model-specific but reflects a systemic bias in LLM training toward English-language data.

Table 2 · GPT-3.5 correctness across languages
Table 3 · MedAlpaca correctness across languages
Finding 02

Consistency degrades dramatically in Chinese and Hindi

Compared with English, semantic consistency drops by 9.1% in Spanish, 28.3% in Chinese, and 50.5% in Hindi. This gradient tracks the relative representation of each language in LLM pretraining corpora and suggests that non-English speakers receive less predictable, less stable healthcare information — with Hindi speakers facing the steepest reliability penalty.

Figure 3 · Consistency metrics across the three datasets
Finding 03

Verifiability is weaker for under-represented languages

GPT-3.5 reaches Macro-F1 ≈ 0.85 on HealthQA in English and Spanish but only ≈ 0.73 and ≈ 0.65 in Chinese and Hindi respectively — drops of 14.6% and 23.4%. The model's ability to distinguish trustworthy from untrustworthy medical claims deteriorates markedly as language resource availability decreases, posing a direct patient safety concern.
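The relative drops quoted here follow the usual formula (reference − value) / reference. The snippet below applies it to the rounded scores given above as a quick sanity check; it assumes English is the reference. Because only approximate scores are shown on this page, the results land close to, but not exactly on, the reported 14.6% and 23.4%, which are presumably computed from unrounded values.

```python
def relative_drop(reference: float, value: float) -> float:
    """Relative drop of `value` with respect to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


# Rounded Macro-F1 scores quoted in the text (English as the reference).
print(round(relative_drop(0.85, 0.73), 1))  # Chinese: 14.1
print(round(relative_drop(0.85, 0.65), 1))  # Hindi: 23.5
```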

Figure 4 · HealthQA verifiability results across languages
Figure 5 · LiveQA verifiability results across languages
Table 4 · Average verifiability performance across models, languages, and datasets
Implications

What This Means

Equity and Access

Healthcare information delivered by LLMs is not equally trustworthy across languages, putting non-English speakers at systematically higher risk of receiving incorrect, inconsistent, or unverifiable medical guidance — a direct equity and patient safety concern.

Likely Causes

The disparity tracks pretraining corpus composition: LLMs are disproportionately trained on English text, with limited exposure to nuanced medical content in other languages. Addressing the gap requires deliberate multilingual data curation and evaluation-driven fine-tuning.

Beyond Healthcare

The correctness / consistency / verifiability lens introduced by XLingEval applies directly to other high-stakes multilingual dialogue domains — legal advice, financial guidance, and educational tutoring — wherever language-based inequity can cause real-world harm.

Citation

Cite This Work

@inproceedings{jin2024better,
  title={Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries},
  author={Jin, Yiqiao and Chandra, Mohit and Verma, Gaurav and Hu, Yibo and De Choudhury, Munmun and Kumar, Srijan},
  booktitle={Proceedings of the ACM Web Conference 2024},
  pages={2627--2638},
  year={2024}
}