The SE module recalibrates the feature maps along the channel dimension to automatically obtain the importance of each channel. With these importance values, the proposed network can selectively enhance informative features that are useful for the current classification task and suppress less useful ones. The principle of the SE module is described as follows.
First, the original feature maps produced by the convolution operation are defined as \(U \in {\mathbb{R}}^{H \times W \times C}\) and can be written as \(U = [u_1, u_2, \ldots, u_C]\). The original feature maps \(U\) are first passed through a squeeze operation, which incorporates global spatial information by generating channel-wise statistics. Specifically, the \(H \times W\) spatial dimensions of the original feature maps are shrunk to generate the global spatial feature \(Z \in {\mathbb{R}}^{C}\). The \(c\)-th channel element of \(Z\) is computed by
\begin{equation}{z_c} = \frac{1}{{H \times W}}\sum\limits_{i = 1}^H {\sum\limits_{j = 1}^W {{u_c}\left( {i,j} \right)} } ,\end{equation}
where the spatial dimension of the original feature maps is \(H \times W\) and \(u_c \in {\mathbb{R}}^{H \times W}\).
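As a concrete illustration, the squeeze operation amounts to global average pooling over the spatial dimensions. The following is a minimal PyTorch sketch that assumes an \(N \times C \times H \times W\) tensor layout; the helper name squeeze_gap is introduced only for this example.
\begin{verbatim}
import torch

def squeeze_gap(U: torch.Tensor) -> torch.Tensor:
    # U holds the feature maps with shape (N, C, H, W).
    # Averaging over the spatial dimensions implements
    # z_c = (1 / (H * W)) * sum_i sum_j u_c(i, j),
    # so the returned global spatial feature Z has shape (N, C).
    return U.mean(dim=(2, 3))
\end{verbatim}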
Second, to make use of the information aggregated in the squeeze operation, the aggregation is followed by an excitation operation that aims to fully capture channel-wise dependencies. For the global spatial feature \(Z\), the channel dimension \(C\) is reduced to \(C/R\) by the first FC layer and then activated by the ReLU function. The channel dimension \(C/R\) is then returned to the channel dimension of the original feature maps by the second FC layer. Subsequently, a series of per-channel modulation weights between 0 and 1 is produced by the sigmoid activation function. In this way, the global spatial feature \(Z\) is forwarded through the two FC layers to finally generate the channel attention map \(S \in {\mathbb{R}}^{C}\), which encodes which channels to emphasize or suppress. This process is called feature recalibration, namely, the gating mechanism. A simple gating mechanism is employed to achieve this objective:
\begin{equation}S = \sigma \left( {{W_2}\delta \left( {{W_1}Z} \right)} \right),\end{equation}
where \(\delta\) denotes the ReLU function, \(\sigma\) refers to the sigmoid function, and \(W_1 \in {\mathbb{R}}^{\frac{C}{R} \times C}\) and \(W_2 \in {\mathbb{R}}^{C \times \frac{C}{R}}\) are the weights of the two FC layers, respectively. \(R\) is the reduction ratio used to reduce the channel dimension at the first FC layer in the SE module.
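For illustration, the excitation operation can be sketched in PyTorch as the sequence of two FC layers with their activations. The helper name build_excitation and the bias-free linear layers are assumptions of this sketch; the description above only specifies two FC layers followed by ReLU and sigmoid activations.
\begin{verbatim}
import torch.nn as nn

def build_excitation(C: int, R: int) -> nn.Sequential:
    # Two FC layers: C -> C/R with ReLU, then C/R -> C with sigmoid,
    # producing the channel attention map S = sigma(W2 delta(W1 Z))
    # whose entries all lie between 0 and 1.
    return nn.Sequential(
        nn.Linear(C, C // R, bias=False),  # W1: reduce channels to C/R
        nn.ReLU(inplace=True),             # delta (ReLU)
        nn.Linear(C // R, C, bias=False),  # W2: restore channels to C
        nn.Sigmoid(),                      # sigma (sigmoid)
    )
\end{verbatim}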
Finally, the output of the SE module is obtained by rescaling the feature maps \(U\) with the channel attention map \(S\):
\begin{equation}X = U \otimes S,\quad {x_c} = {F_{scale}}\left( {{u_c},{s_c}} \right) = {u_c}{s_c},\end{equation}
where \(X = [x_1, x_2, \ldots, x_C]\) and \(\otimes\) denotes channel-wise multiplication. \({F_{scale}}(u_c, s_c)\) refers to the channel-wise multiplication between the scalar \(s_c\) and the feature map \(u_c \in {\mathbb{R}}^{H \times W}\).
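Putting the squeeze, excitation, and rescaling steps together, a complete SE module could be sketched as follows. This is an approximation under the same assumptions as the earlier sketches (PyTorch, \(N \times C \times H \times W\) layout, bias-free FC layers, and an illustrative reduction ratio of \(R = 16\)), not the authors' exact implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze, excitation, and channel-wise rescaling in one module."""

    def __init__(self, C: int, R: int = 16):
        super().__init__()
        # Excitation: two FC layers with ReLU and sigmoid activations.
        self.fc = nn.Sequential(
            nn.Linear(C, C // R, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(C // R, C, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        N, C, _, _ = U.shape
        Z = U.mean(dim=(2, 3))     # squeeze: global spatial feature, (N, C)
        S = self.fc(Z)             # excitation: channel attention map, (N, C)
        # Rescaling: x_c = s_c * u_c for every channel c, i.e. X = U (x) S.
        return U * S.view(N, C, 1, 1)

# Example: recalibrate a batch of 64-channel feature maps.
U = torch.randn(8, 64, 32, 32)
X = SEBlock(C=64, R=16)(U)         # X keeps the shape (8, 64, 32, 32)
\end{verbatim}
In this sketch, reshaping \(S\) to shape \((N, C, 1, 1)\) and broadcasting it over the spatial dimensions realizes the channel-wise multiplication between the scalar \(s_c\) and the feature map \(u_c\).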