LoRA-MedSim: An Enhanced Multimodal Framework for Clinical Reasoning and Realistic Patient-Doctor Interaction

Background
Present-day artificial intelligence (AI) models for medical visual question answering and report generation frequently function passively. Rather than simulating the dynamic nature of clinical workflows, they primarily respond to direct commands. This restricts their applicability in real-world healthcare settings, where complex reasoning and multi-turn dialogue are crucial.
Objective
This paper proposes a novel multimodal framework, based on Low-Rank Adaptation (LoRA), that simulates authentic doctor–patient interactions. The objective is to develop an AI assistant capable of participating in multi-turn diagnostic conversations with improved reasoning, interpreting radiological images, and addressing patient queries.
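For context, the core idea of LoRA, on which the framework is built, is to freeze the pretrained weight matrix and learn only a low-rank additive update. The display below states this in the standard notation of the original LoRA formulation; it is general background rather than a detail specific to this paper.

```latex
% LoRA: W_0 is frozen; only the low-rank factors B and A are trained
h = W_0 x + \Delta W\,x = W_0 x + \frac{\alpha}{r}\,B A\,x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Because only B and A are updated, the trainable parameter count per adapted matrix drops from dk to r(d + k), which is what makes fine-tuning a large vision–language model tractable.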
Methods
We used a curated and enriched version of the ImageCLEF VQA-Med 2019 dataset to perform LoRA fine-tuning on a large-scale vision–language model. To train the model for both visual diagnosis and natural-language interaction, we supplemented the dataset with simulated patient queries, radiological reports, and follow-up questions.
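As a concrete illustration, the snippet below sketches how LoRA fine-tuning of a vision–language model is typically set up with the Hugging Face peft library. The base checkpoint, target modules, hyperparameters, and the record structure at the end are illustrative assumptions, not the configuration reported in this paper.

```python
# A minimal sketch of LoRA fine-tuning with Hugging Face `peft`.
# Checkpoint, target modules, and hyperparameters are placeholders.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder vision-language checkpoint
model = AutoModelForVision2Seq.from_pretrained(base)
processor = AutoProcessor.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # update is scaled by alpha / r
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)     # wraps the frozen base model
model.print_trainable_parameters()        # only the adapters are trainable

# Each enriched training example pairs an image with a multi-turn dialogue
# (hypothetical record structure for illustration):
example = {
    "image": "synpic12345.jpg",           # VQA-Med 2019 image filename
    "turns": [
        {"role": "patient", "text": "What does this chest X-ray show?"},
        {"role": "doctor",  "text": "There is a right lower-lobe opacity ..."},
        {"role": "patient", "text": "Is that something serious?"},
    ],
}
```

Under a setup like this, the dialogue turns would be rendered into the model's chat format and trained with a standard causal language-modeling objective.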
Results
Compared with baseline and prompt-based methods, our model achieved higher accuracy and lower loss while producing more interpretable outputs across both textual and visual domains. Structured assessments further validated its capacity to manage complex, multi-turn diagnostic queries.
Conclusion
The proposed framework represents a significant step toward AI assistants designed for clinical settings. By integrating dialogue-based reasoning with multimodal understanding, it bridges the gap between passive AI tools and active diagnostic agents capable of engaging with patient data in real-world scenarios.