A HIPAA-Aware Benchmark and Evaluation Harness for Clinical LLMs to Quantify Hallucination, Bias, and PHI Leakage
Keywords:
Clinical large language models, HIPAA compliance, hallucination detection, algorithmic bias, protected health information leakage, healthcare AI evaluation
Abstract
The growing use of large language models (LLMs) in clinical practice has raised serious concerns about their reliability, fairness, and regulatory compliance. In healthcare settings, model hallucinations may undermine clinical decision-making, algorithmic bias may create or widen health inequities, and unintended disclosure of protected health information (PHI) may violate privacy rules. Although clinical LLM assessment is receiving more attention, current benchmarks concentrate on overall performance and do not address these safety and compliance risks together within a HIPAA-aware framework. This paper proposes a unified benchmark and evaluation harness, designed specifically for clinical LLMs, that systematically quantifies hallucination, bias, and PHI leakage. The framework includes clinically grounded tasks, de-identified and synthetic datasets, and automated detection tools aligned with HIPAA privacy categories. By combining several evaluation dimensions in a single, reproducible harness, the benchmark allows clinical LLMs to be compared on safety, fairness, and privacy measures. The findings show significant variability in model behavior, which indicates trade-offs between clinical capability and risk exposure. The work contributes a practical evaluation system to support the responsible deployment, regulation, and monitoring of LLMs in healthcare settings.
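As a rough illustration of what an automated PHI detection step could look like, the sketch below flags model outputs that contain a few identifier types listed under the HIPAA Safe Harbor rule (Social Security numbers, phone numbers, email addresses, full dates) and reports the fraction of flagged outputs. It is not the harness described in the paper: the names `PHI_PATTERNS`, `find_phi`, and `leakage_rate`, the regular expressions, and the choice of metric are all assumptions made for illustration, and a real detector would need far broader coverage.

```python
import re

# Hypothetical regexes for a handful of HIPAA Safe Harbor identifier types.
# These are illustrative assumptions, not the paper's detection rules.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_phi(text):
    """Return {category: [matches]} for every category with at least one hit."""
    hits = {}
    for name, pattern in PHI_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


def leakage_rate(outputs):
    """Fraction of model outputs that contain at least one pattern match."""
    if not outputs:
        return 0.0
    flagged = sum(1 for output in outputs if find_phi(output))
    return flagged / len(outputs)


if __name__ == "__main__":
    # Synthetic example outputs; none of these values refer to real people.
    samples = [
        "Call 555-123-4567 to reschedule.",
        "No identifiers here.",
        "Patient was seen on 3/14/2024.",
    ]
    for sample in samples:
        print(sample, "->", find_phi(sample))
    print("leakage rate:", leakage_rate(samples))
```

Pattern matching of this kind only catches well-formatted identifiers; names, addresses, and free-text identifiers would need additional methods, which this sketch does not attempt.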
License
Copyright (c) 2025 Well Testing Journal

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This license requires that re-users give credit to the creator. It allows re-users to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only.

