Abstract | In recent years, the intersection of natural language processing (NLP) and healthcare informatics has witnessed a revolutionary transformation. One of the most groundbreaking developments in this realm is the advent of large language models (LLMs), which have demonstrated remarkable capabilities in analysing clinical data. This paper explores the potential of large language models in medical text classification, shedding light on their ability to discern subtle patterns, grasp domain-specific terminology, and adapt to the dynamic nature of medical information. The research focuses on applying transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), to hospital discharge summaries to predict 30-day readmissions among older adults. In particular, we explore the role of transfer learning in medical text classification and compare domain-specific transformer models, such as SciBERT, BioBERT and ClinicalBERT. We also analyse how data preprocessing techniques affect the performance of language models. Our comparative analysis shows that removing parts of the text with a large proportion of out-of-vocabulary words improves classification results. We further investigate how the input sequence length affects model performance, varying the sequence length from 128 to 512 for BERT-based models and using a sequence length of 4096 for the Longformer. The results show that, among the compared models, SciBERT yields the best performance in the medical domain, improving hospital readmission prediction from clinical notes on MIMIC data from 0.714 to 0.735 AUROC. Our next step is to pretrain a model on a large corpus of clinical notes, with the aim of improving the adaptability of a language model to the medical domain and achieving better results on downstream tasks. |
---|
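The abstract describes fine-tuning domain-specific BERT variants on discharge summaries for 30-day readmission prediction and scoring them with AUROC. The following is a minimal sketch of that setup, assuming the Hugging Face `transformers` and `scikit-learn` libraries; the model checkpoint, batch size, learning rate, and epoch count are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: fine-tuning a domain-specific BERT encoder as a binary
# classifier for 30-day readmission from discharge-summary text.
# Hyperparameters and the checkpoint name are assumptions for illustration.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import roc_auc_score

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # SciBERT; BioBERT/ClinicalBERT are drop-in alternatives
MAX_LEN = 512                                    # the study varies this from 128 to 512 for BERT-based models


class NotesDataset(Dataset):
    """Pairs of (discharge summary text, readmitted-within-30-days label)."""

    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=MAX_LEN, return_tensors="pt")
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


def train_and_evaluate(train_texts, train_labels, val_texts, val_labels, epochs=3):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    train_loader = DataLoader(NotesDataset(train_texts, train_labels, tokenizer),
                              batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Standard fine-tuning loop: cross-entropy loss over the two classes.
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # Evaluation: probability of the "readmitted" class, scored with AUROC
    # as in the abstract's reported metric.
    model.eval()
    val_loader = DataLoader(NotesDataset(val_texts, val_labels, tokenizer), batch_size=8)
    probs = []
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            probs.extend(torch.softmax(logits, dim=-1)[:, 1].cpu().tolist())
    return roc_auc_score(val_labels, probs)
```

Swapping `MODEL_NAME` for a BioBERT or ClinicalBERT checkpoint, or for a Longformer variant with `MAX_LEN = 4096`, reproduces the kind of comparison the abstract describes, under the stated assumptions.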