A HIPAA-Aware Benchmark and Evaluation Harness for Clinical LLMs to Quantify Hallucination, Bias, and PHI Leakage

Authors

  • Valentina Palama, MSc in Computer Information Systems (Prairie View A&M University), USA

Keywords

Clinical large language models, HIPAA compliance, hallucination detection, algorithmic bias, protected health information leakage, healthcare AI evaluation

Abstract

The growing use of large language models (LLMs) in clinical practice has raised serious concerns about their reliability, fairness, and regulatory compliance. In healthcare settings, model hallucinations may undermine clinical decision-making, algorithmic bias may create health inequities, and unintended disclosure of protected health information (PHI) may violate privacy rules. Although clinical LLM assessment is receiving more attention, current benchmarks focus largely on overall performance and do not address these safety and compliance risks holistically within a HIPAA-aware framework. This paper proposes a unified benchmark and evaluation harness, developed specifically for clinical LLMs, to systematically quantify hallucination, bias, and PHI leakage. The framework includes clinically grounded tasks, de-identified and synthetic datasets, and automated detection tools aligned with HIPAA privacy categories. By combining several evaluation dimensions into a single, reproducible harness, the benchmark enables comparative evaluation of clinical LLMs on safety, fairness, and privacy measures. The findings show significant variability in model behavior, indicating trade-offs between clinical capability and risk exposure. The work contributes a practical evaluation system to support the responsible deployment, regulation, and monitoring of LLMs in healthcare settings.

Published

18-08-2025

How to Cite

Palama, V. (2025). A HIPAA-Aware Benchmark and Evaluation Harness for Clinical LLMs to Quantify Hallucination, Bias, and PHI Leakage. Well Testing Journal, 34(S3), 830–849. Retrieved from https://welltestingjournal.com/index.php/WT/article/view/274

Section

Research Articles
