In ChoroidSeg-ViT, the ASFF module fuses the multiscale features output by the encoder, [\(F_1\), \(F_2\), \(F_3\), \(F_4\)]. For \(F_1\) and \(F_4\), the ASFF module fuses features from the adjacent layer; for the intermediate layers [\(F_2\), \(F_3\)], it integrates the multiscale features from three adjacent stages. Taking an intermediate encoder output \(F_i\) (\(i = 2, 3\)) as an example, the ASFF module first matches the dimensions of the low-resolution feature \(F_{i+1}\) and the high-resolution feature \(F_{i-1}\) to those of \(F_i\). For the low-resolution feature \(F_{i+1}\), a 1 × 1 convolution layer halves the number of channels, and an upsampling layer (bilinear interpolation) then enlarges the feature map to the size of \(F_i\), yielding the transformed feature \(F_{i+1}^T\). For the high-resolution feature \(F_{i-1}\), a 3 × 3 depthwise convolution layer halves the spatial resolution and a 1 × 1 convolution layer matches the channel number, yielding \(F_{i-1}^T\). For the current feature \(F_i\), the enhanced feature \(F_i^T\) is obtained with an ACB module and a residual connection, which mine multiscale semantic information while retaining detailed features. After dimension matching, \(F_{i-1}^T\) and \(F_{i+1}^T\) are each multiplied element-wise with \(F_i^T\) to enhance distinctive features and suppress background noise. Finally, a concatenation operation fuses the multiscale features from adjacent layers, capturing the correlations among them to produce the final fused features. The formulae are expressed as follows:
\begin{eqnarray}
F_i^T = F_i + \mathrm{ACB}(F_i), \quad (i = 1,2,3,4)
\end{eqnarray}
\begin{eqnarray}
F_{i + 1}^T = \mathrm{UP}(\mathrm{Conv}_{1 \times 1}(F_{i + 1})), \quad (i = 2,3,4)
\end{eqnarray}
\begin{eqnarray}
F_{i - 1}^T = \mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}(F_{i - 1})), \quad (i = 1,2,3)
\end{eqnarray}
\begin{eqnarray}
F_i^{ASFF} &=& \mathrm{Concat}(\mathrm{Mul}(F_{i - 1}^T, F_i), F_i^T,\nonumber\\
&& \mathrm{Mul}(F_i, F_{i + 1}^T)), \quad (i = 1,2,3,4)
\end{eqnarray}
where UP denotes a twofold upsampling operation; DWConv represents a 3 × 3 depthwise convolution; ACB denotes an asymmetric convolution module; and Mul and Concat represent pixel-by-pixel multiplication and concatenation, respectively.
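As a rough check on the dimension matching described above, the following Python sketch traces tensor shapes only (no actual convolutions) through the four formulae. Several things here are assumptions rather than details from the text: the stage shapes (channels doubling and resolution halving from one stage to the next), the stride of 2 for the depthwise convolution, the handling of \(F_1\) and \(F_4\) by simply dropping the branch whose neighbour does not exist, and all function and variable names, which are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def upsample_branch(shape: Shape) -> Shape:
    """F_{i+1} -> F_{i+1}^T: a 1x1 conv halves the channels, UP doubles H and W."""
    c, h, w = shape
    return (c // 2, h * 2, w * 2)


def downsample_branch(shape: Shape, out_channels: int) -> Shape:
    """F_{i-1} -> F_{i-1}^T: DWConv halves H and W, a 1x1 conv sets the channels.

    Assumption: the 3x3 depthwise conv uses stride 2 and padding 1, and H, W are even.
    """
    _, h, w = shape
    return (out_channels, h // 2, w // 2)


def asff_output_shape(shapes: List[Shape], i: int) -> Shape:
    """Shape of F_i^ASFF for a 1-based stage index i."""
    current = shapes[i - 1]  # F_i^T = F_i + ACB(F_i) keeps the shape of F_i
    n_branches = 1
    if i > 1:  # a higher-resolution neighbour F_{i-1} exists
        high = downsample_branch(shapes[i - 2], current[0])
        assert high == current, "Mul needs identical shapes"
        n_branches += 1
    if i < len(shapes):  # a lower-resolution neighbour F_{i+1} exists
        low = upsample_branch(shapes[i])
        assert low == current, "Mul needs identical shapes"
        n_branches += 1
    c, h, w = current
    return (c * n_branches, h, w)  # Concat stacks the branches along channels


# Hypothetical encoder output shapes, chosen only for illustration.
stage_shapes = [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]
fused_shapes = [asff_output_shape(stage_shapes, i) for i in range(1, 5)]
```

Under these assumptions, the intermediate stages concatenate three branches and the boundary stages two, so `fused_shapes` holds three times the channel count of \(F_2\) and \(F_3\) and twice that of \(F_1\) and \(F_4\), with spatial sizes unchanged.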