| Abstract | Pain assessment is a critical challenge in healthcare, requiring accurate and objective measurement to enhance patient care. Traditional methods rely on subjective self-reporting, which lacks reliability, particularly for patients with communication difficulties. This work presents PainFusion+, a multimodal transformer architecture that integrates physiological signals and facial expressions to improve pain assessment accuracy. For physiological signals, our approach first employs convolutional neural networks to extract local patterns from short signal fragments, capturing essential pain-related features. These localized representations are then processed by a transformer encoder, which models long-range dependencies to form a comprehensive global representation. For facial video data, we leverage a frozen video transformer to extract expressive features without fine-tuning, significantly reducing computational cost. Finally, the two feature spaces are fused with a transformer encoder, enabling effective cross-modal learning. Experiments on publicly available datasets demonstrate that PainFusion+ outperforms existing models: on biomedical signal processing benchmarks it improves accuracy by over 16%, and on multimodal pain estimation it achieves 35.40% accuracy on the BioVid dataset, setting a new state of the art. |
|---|---|
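The pipeline described in the abstract — CNN-extracted local features from signal fragments, a transformer encoder for long-range dependencies, frozen video features, and transformer-based fusion — can be illustrated with a minimal NumPy sketch. All dimensions, layer widths, and the random stand-in for the frozen video backbone's output are illustrative assumptions, not the paper's actual architecture or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def conv1d(x, w, b):
    """Valid 1-D convolution. x: (T, C_in); w: (k, C_in, C_out); b: (C_out,)."""
    k = w.shape[0]
    t_out = x.shape[0] - k + 1
    return np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
                     for t in range(t_out)])

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens x: (T, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

d = 32                                    # shared embedding width (illustrative)

# 1) Physiological branch: CNN over short fragments, then a transformer encoder.
signal = rng.standard_normal((128, 4))    # e.g. 128 samples x 4 biosignal channels
w_conv = rng.standard_normal((9, 4, d)) * 0.1
local = relu(conv1d(signal, w_conv, np.zeros(d)))     # local pain-related patterns
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
physio_tokens = self_attention(local, wq, wk, wv)     # long-range dependencies

# 2) Video branch: features from a frozen video transformer (random stand-in here;
#    in the paper these come from a pretrained backbone with no fine-tuning).
video_tokens = rng.standard_normal((16, d))

# 3) Fusion: concatenate both token sets, run a transformer encoder, pool.
fused_in = np.concatenate([physio_tokens, video_tokens], axis=0)
fq, fk, fv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = self_attention(fused_in, fq, fk, fv)          # cross-modal learning
pooled = fused.mean(axis=0)                           # global multimodal embedding

# 4) Linear head over pain-intensity classes (5 classes chosen for illustration).
w_head = rng.standard_normal((d, 5)) * 0.1
probs = softmax(pooled @ w_head)
```

Because attention in step 3 runs over the concatenated physiological and video tokens, every fused token can attend across modalities, which is the cross-modal mechanism the abstract describes; the real model would stack several such layers with learned weights.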