LoRA-MedSim: An Enhanced Multimodal Framework for Clinical Reasoning and Realistic Patient-Doctor Interaction

Background
Present-day artificial intelligence (AI) models for medical visual question answering and report generation frequently function passively. Rather than simulating the dynamic nature of clinical workflows, they primarily respond to direct commands. This restricts their applicability in real-world healthcare settings, where complex reasoning and multi-turn dialogue are crucial.
Objective
This paper proposes a novel multimodal framework, based on Low-Rank Adaptation (LoRA), that simulates authentic doctor–patient interactions. The objective is to develop an AI assistant capable of participating in multi-turn diagnostic conversations with improved reasoning, interpreting radiological images, and addressing patient queries.
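For context, the core idea of LoRA, on which the framework is built, is to freeze the pretrained weight matrix and learn only a low-rank additive update. The display below states this in the standard notation of the original LoRA formulation; it is general background rather than a detail specific to this paper.

```latex
% LoRA: W_0 is frozen; only the low-rank factors B and A are trained
h = W_0 x + \Delta W\,x = W_0 x + \frac{\alpha}{r}\,B A\,x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Because only B and A are updated, the trainable parameter count per adapted matrix drops from dk to r(d + k), which is what makes fine-tuning a large vision–language model tractable.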
Methods
We used a curated and enriched version of the ImageCLEF VQA-Med 2019 dataset to perform LoRA fine-tuning on a large-scale vision–language model. To train the model for both visual diagnosis and natural-language interaction, we supplemented the dataset with simulated patient queries, radiological reports, and follow-up questions.
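As a concrete illustration, the snippet below sketches how LoRA fine-tuning of a vision–language model is typically set up with the Hugging Face peft library. The base checkpoint, target modules, hyperparameters, and the record structure at the end are illustrative assumptions, not the configuration reported in this paper.

```python
# A minimal sketch of LoRA fine-tuning with Hugging Face `peft`.
# Checkpoint, target modules, and hyperparameters are placeholders.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder vision-language checkpoint
model = AutoModelForVision2Seq.from_pretrained(base)
processor = AutoProcessor.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # update is scaled by alpha / r
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)     # wraps the frozen base model
model.print_trainable_parameters()        # only the adapters are trainable

# Each enriched training example pairs an image with a multi-turn dialogue
# (hypothetical record structure for illustration):
example = {
    "image": "synpic12345.jpg",           # VQA-Med 2019 image filename
    "turns": [
        {"role": "patient", "text": "What does this chest X-ray show?"},
        {"role": "doctor",  "text": "There is a right lower-lobe opacity ..."},
        {"role": "patient", "text": "Is that something serious?"},
    ],
}
```

Under a setup like this, the dialogue turns would be rendered into the model's chat format and trained with a standard causal language-modeling objective.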
Results
Compared with baseline and prompt-based methods, our model achieved higher accuracy and lower loss while producing more interpretable outputs across both textual and visual domains. Structured assessments further validated its capacity to manage complex, multi-turn diagnostic queries.
Conclusion
The proposed framework represents a significant step toward AI assistants designed for clinical settings. By integrating dialogue-based reasoning with multimodal understanding, it bridges the gap between passive AI tools and active diagnostic agents capable of engaging with patient data in real-world scenarios.