| Abstract |
|---|
| Deep learning has significantly enhanced diagnostic accuracy in medical imaging, particularly in breast cancer detection. While unimodal analysis of mammography or histopathology images achieves strong performance, integrating multimodal data remains challenging due to the high dimensionality and heterogeneity of imaging modalities. To address these challenges, we propose the Channel Attention CNN-Transformer (CTNet), a hybrid model that combines convolutional neural networks (CNNs) with vision Transformers for improved breast cancer classification. CTNet incorporates a channel attention mechanism for modality-specific feature refinement, a Cross Attention Block (CAB) for efficient feature extraction, and a Dynamic Attention (DA) block with Transformer encoding to strengthen contextual representation. Experimental evaluation demonstrates that CTNet outperforms baseline models, achieving 99% accuracy on histology images, 93% on mammography, and 86% on the combined dataset, surpassing architectures such as EfficientNet and DenseNet. Rather than introducing complex fusion strategies, CTNet applies a straightforward element-wise fusion of modality features, showing that robust performance can still be achieved through careful architectural design. These findings highlight the potential of channel- and Transformer-based integration to advance breast cancer diagnostics, offering a practical and clinically relevant approach to multimodal analysis. |
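
To make the two mechanisms the abstract states explicitly more concrete, per-modality channel attention followed by element-wise fusion, the PyTorch sketch below shows one plausible arrangement. The module names, the squeeze-and-excitation-style attention, the reduction ratio, and the choice of addition as the element-wise operation are illustrative assumptions, not CTNet's published design.

```python
# Minimal sketch: channel attention refines each modality's feature map,
# then the refined maps are fused element-wise. All names and hyperparameters
# here are assumptions for illustration; the paper does not specify them.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Reweights feature channels via a squeeze-and-excitation bottleneck."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: one value per channel
        self.excite = nn.Sequential(          # excite: learn channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                           # refine: rescale each channel


class ElementwiseFusion(nn.Module):
    """Refines each modality's features, then fuses them element-wise."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn_mammo = ChannelAttention(channels)
        self.attn_histo = ChannelAttention(channels)

    def forward(self, mammo: torch.Tensor, histo: torch.Tensor) -> torch.Tensor:
        # Both inputs are assumed to share shape (B, C, H, W) after their
        # respective CNN backbones. Addition is one possible element-wise
        # fusion; a product would be another.
        return self.attn_mammo(mammo) + self.attn_histo(histo)


# Usage: fuse two same-shaped modality feature maps.
fusion = ElementwiseFusion(channels=256)
fused = fusion(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
print(fused.shape)  # torch.Size([2, 256, 14, 14])
```

Under this reading, the CAB and DA blocks described in the abstract would operate downstream of the fused representation; element-wise fusion keeps the output shape identical to the inputs, which is what lets such attention stages be stacked without reshaping.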