Image Classification & Segmentation Based on Enhanced CNN and Transformer Networks
Prasad Kulkarni
Bo Luo
Cuncong Zhong
Guanghui Wang
Xinmai Yang
Convolutional Neural Networks (CNNs) have significantly improved performance on various computer vision tasks, such as image recognition and segmentation, owing to their rich representational power. To further enhance CNN performance, self-attention modules have been embedded after individual layers of the network. Recently proposed Transformer-based models achieve outstanding performance by employing multi-head self-attention as their main building block. However, several challenges remain to be addressed: (1) CNN attention focuses on only a limited set of class-specific channels; (2) local transformers have a limited receptive field; and (3) U-Net-style segmentation architectures add redundant features and lack multi-scale features.
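For context, the multi-head self-attention building block referenced above can be sketched in a few lines of PyTorch. This is a generic illustration of the standard formulation; the embedding size, head count, and variable names are assumptions for the example, not the configuration used in the proposed networks.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention over a token sequence.

    Illustrative sketch only; sizes and names are assumptions,
    not the exact modules used in this work.
    """
    def __init__(self, embed_dim: int = 96, num_heads: int = 4):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # joint Q/K/V projection
        self.proj = nn.Linear(embed_dim, embed_dim)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, embed_dim)
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (b, heads, n, n) affinities
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

x = torch.randn(2, 196, 96)          # e.g., 14x14 patch tokens of dim 96
y = MultiHeadSelfAttention()(x)      # same shape: (2, 196, 96)
```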
In this work, we propose new strategies to address these issues. First, we propose a novel channel-based self-attention module that diversifies attention across the most discriminative and significant channels; the module can be embedded at the end of any backbone network for image classification. Second, to limit the noise introduced by the shallow encoder layers of a U-Net-style architecture, we replace the skip connections with an Adaptive Global Context Module (AGCM). In addition, we introduce a Semantic Feature Enhancement Module (SFEM) for multi-scale feature enhancement in polyp segmentation. Third, we propose a Multi-scaled Overlapped Attention (MOA) mechanism for local transformer-based image classification networks, which establishes long-range dependencies and enables communication between neighboring windows. Illustrative sketches of the channel-based attention and the overlapped-window attention are given below.
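The first sketch illustrates the idea of channel-based self-attention: attention weights are computed between channels (a C x C affinity map) so that discriminative channels can reinforce one another across the feature map. This is a hedged approximation; the learnable scale `gamma` and the exact affinity formulation are assumptions for the example and may differ from the proposed module.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Sketch of channel-wise self-attention over a CNN feature map.

    Computes a (C x C) channel affinity map and re-weights channels with it;
    the exact formulation of the proposed module may differ.
    """
    def __init__(self, gamma_init: float = 0.0):
        super().__init__()
        # learnable residual scale, initialized so training starts near identity
        self.gamma = nn.Parameter(torch.tensor(gamma_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                    # (b, c, hw)
        attn = torch.bmm(flat, flat.transpose(1, 2))  # (b, c, c) channel affinities
        attn = attn.softmax(dim=-1)
        out = torch.bmm(attn, flat).view(b, c, h, w)  # re-weighted channels
        return x + self.gamma * out                   # residual connection

feat = torch.randn(2, 512, 7, 7)     # e.g., backbone output before pooling
refined = ChannelSelfAttention()(feat)
```

In practice, such a module would be appended after the final convolutional stage of the backbone, before global pooling and the classifier head, which matches the "end of any backbone network" placement described above.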
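The second sketch approximates the window-communication idea behind MOA: each non-overlapping query window attends to a slightly larger, overlapping key/value window, so information flows between neighboring windows. The window size, overlap, and single-head formulation here are illustrative assumptions; the published MOA module differs in its exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OverlappedWindowAttention(nn.Module):
    """Sketch of attention with overlapping key/value windows.

    Queries come from non-overlapping k x k windows; keys/values come from
    larger (k + 2p) x (k + 2p) windows at the same stride, so neighboring
    windows exchange information. An illustrative approximation of MOA only.
    """
    def __init__(self, dim: int = 96, window: int = 7, overlap: int = 2):
        super().__init__()
        self.k, self.p = window, overlap
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, c, h, w) with h, w divisible by the window size
        b, c, h, w = x.shape
        k, p = self.k, self.p
        nh, nw = h // k, w // k                       # windows per axis
        # queries: non-overlapping k x k windows -> (b*windows, k*k, c)
        q = x.view(b, c, nh, k, nw, k).permute(0, 2, 4, 3, 5, 1)
        q = q.reshape(b * nh * nw, k * k, c)
        # keys/values: overlapping (k+2p) x (k+2p) windows at stride k
        kv = F.unfold(x, kernel_size=k + 2 * p, stride=k, padding=p)
        kv = kv.view(b, c, (k + 2 * p) ** 2, nh * nw).permute(0, 3, 2, 1)
        kv = kv.reshape(b * nh * nw, (k + 2 * p) ** 2, c)
        kp, vp = self.kv(kv).chunk(2, dim=-1)
        attn = (self.q(q) @ kp.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ vp               # (b*windows, k*k, c)
        out = out.view(b, nh, nw, k, k, c).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(b, c, h, w)

x = torch.randn(1, 96, 28, 28)
y = OverlappedWindowAttention()(x)   # (1, 96, 28, 28)
```

Because the key/value windows overlap their neighbors by `p` pixels on each side, every window can aggregate features from adjacent windows, which is one way to obtain the long-range dependencies that purely local window attention lacks.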