RESEARCH ARTICLE

LoRA-MedSim: An Enhanced Multimodal Framework for Clinical Reasoning and Realistic Patient-Doctor Interaction

Khadeja Fahmy1*, Mohamed Zorkany2, Abd El-Hady Ammar1
1 Department of Communication and Electronics, Faculty of Engineering, Al-Azhar University, Nasr City, Cairo 11765, Egypt
2 Department of Communication and Electronics, National Telecommunication Institute, Nasr City, Cairo 11765, Egypt
HPR 2025, 13(3), e81240023; https://doi.org/10.14440/hpr.0175
Submitted: 16 June 2025 | Revised: 11 August 2025 | Accepted: 18 August 2025 | Published: 3 October 2025
© 2025 by the Author(s). Licensee Health Psychology Research, USA. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
Abstract

Background

Present-day artificial intelligence (AI) models for medical visual question answering and report generation frequently function passively: rather than simulating the dynamic nature of clinical workflows, they primarily respond to direct commands. This limits their applicability in real-world healthcare settings, where complex reasoning and multi-turn dialogue are crucial.

Objective

This paper proposes a novel multimodal framework, based on Low-Rank Adaptation (LoRA), that simulates authentic doctor–patient interactions. The objective is to develop an AI assistant capable of engaging in multi-turn diagnostic conversations, interpreting radiological images, and addressing patient queries with improved reasoning. A hedged sketch of what such a multi-turn exchange could look like follows below.
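To make the interaction format concrete, the sketch below shows one plausible encoding of a multi-turn consultation grounded in a radiological image. The chat-style schema, field names, and dialogue content are illustrative assumptions, not the authors' released data format.

```python
# Illustrative only: a hypothetical chat-style encoding of one multi-turn
# doctor-patient consultation over a radiological image. The schema, image
# path, and wording are assumptions, not the paper's actual data format.
consultation = [
    {"role": "user", "content": [
        {"type": "image", "image": "chest_xray_0142.png"},  # hypothetical file
        {"type": "text", "text": "I've had a persistent cough for two weeks. "
                                 "What does my chest X-ray show?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The image shows an opacity in the right "
                                 "lower lobe, which can indicate consolidation."},
    ]},
    # The follow-up turn forces the model to reason over prior context
    # rather than answer a single isolated question.
    {"role": "user", "content": [
        {"type": "text", "text": "Could this be pneumonia, and what follow-up "
                                 "would you recommend?"},
    ]},
]
```

Training on sequences of this shape, rather than single question–answer pairs, is what distinguishes a dialogue agent from a passive VQA model.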

Methods

We performed LoRA fine-tuning of a large-scale vision–language model on a curated and enriched version of the VQA-Med dataset from ImageCLEF 2019. To train the model for both visual diagnosis and natural-language interaction, the dataset was supplemented with simulated patient queries, radiological reports, and follow-up questions.
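As a minimal sketch of the parameter-efficient setup described above, the snippet below applies LoRA adapters to a generic vision–language model via Hugging Face PEFT. The base checkpoint, rank, and target modules are assumptions for illustration; the abstract does not state these hyperparameters.

```python
# A minimal sketch of LoRA fine-tuning for a vision-language model using
# Hugging Face PEFT. The checkpoint name, rank, and target modules are
# assumed for illustration and are not taken from the paper.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2-VL-2B-Instruct"  # hypothetical base VLM checkpoint
processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA freezes the pretrained weights and injects small trainable low-rank
# matrices (rank r) into selected projection layers, so only a tiny fraction
# of the parameters is updated during fine-tuning.
config = LoraConfig(
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of weights train
```

From here, the adapted model can be trained with a standard supervised fine-tuning loop over the enriched image–dialogue pairs; only the adapter weights need to be stored and shared afterward.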

Results

Compared with baseline and prompt-based methods, our model demonstrated superior performance. It achieved higher accuracy and lower loss values while producing outputs that were more interpretable across both textual and visual domains. The model’s capacity to manage intricate, multi-turn diagnostic queries was further validated through structured assessments.

Conclusion

The proposed framework represents a significant advance toward AI assistants designed for clinical settings. By integrating dialogue-based reasoning with multimodal understanding, it bridges the gap between passive AI tools and active diagnostic agents capable of interacting with patient data in real-world clinical scenarios.

Keywords
Healthcare system
Artificial intelligence
Low-rank adaptation
Fine-tuning
Large language model
Visual question answering
Funding
None.
Conflict of interest
The authors declare that they have no conflict of interest.