The purpose of developing EMReady is to optimize 3D cryo-electron microscopy (cryo-EM) density maps to facilitate better structural determination. The workflow of EMReady is depicted in Figure 1. EMReady adopts the three-dimensional (3D) Swin-Conv-UNet-based network architecture (SCUNet), which combines the advantages of conventional residual convolution for local modeling, the swin (shifted window) transformer for non-local modeling, and multiscale UNet for further enhancement of both local and non-local modeling. The swin transformer is an efficient transformer that combines self-attention of non-overlapping local windows and non-local cross-window connections by shifted window partitioning. With the SCUNet architecture, EMReady is capable of capturing the non-local features within each input density slice of size 48 Å × 48 Å × 48 Å. The local and non-local modeling of EMReady is implemented not only in the network architecture but also in the training process. Specifically, our network is trained by simultaneously minimizing the local smooth L1 distance and maximizing the non-local structural similarity (SSIM) between processed experimental and simulated target maps. Compared with the simple smooth L1 loss, incorporating the SSIM loss in the training process can effectively prevent the network from possible overfitting. To enhance EMReady's robustness to map heterogeneity, we generate target simulated maps in the training set using different local resolution values for different atoms in the structure, with these resolution values derived from Q-scores. Moreover, the training and evaluation are expanded to encompass 2.0-10.0 Å cryo-EM maps with nucleic acid molecules and cryo-electron tomography (cryo-ET) maps.
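The training objective described above can be illustrated with a minimal sketch. It is not the EMReady implementation. The equal weighting of the two terms, the smooth L1 threshold `beta`, the SSIM constants `c1` and `c2` (the common defaults for data scaled to [0, 1]), and the use of a single global SSIM over the whole box rather than a sliding window are all assumptions made for illustration.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 distance over two equal-length sequences of voxel values.

    beta is the switch point between the quadratic and linear regimes
    (assumed value, not taken from the paper).
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def global_ssim(pred, target, c1=1e-4, c2=9e-4):
    """SSIM evaluated once over the whole box (simplification: no sliding window).

    c1 and c2 are the usual stabilizers (0.01^2 and 0.03^2), assuming
    densities normalized to the range [0, 1].
    """
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


def combined_loss(pred, target):
    """Minimize the smooth L1 distance while maximizing SSIM (assumed 1:1 weighting)."""
    return smooth_l1(pred, target) + (1.0 - global_ssim(pred, target))
```

For identical inputs both terms vanish, so the loss is zero. Any mismatch in local values raises the smooth L1 term, and any mismatch in overall structure lowers the SSIM and raises the second term.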
Figure 1: Workflow of EMReady. a, Prediction module of EMReady. b, Data types supported by EMReady. c, Architecture of Swin-Conv-UNet (SCUNet). d, Training and data preprocessing modules of EMReady.
We employ the Swin-Conv-UNet (SCUNet) as the primary network architecture to improve the input density maps. The network consists of three encoder, one bottleneck, and three decoder swin-conv (SC) blocks, with skip connections between encoders and decoders at the same level. Each SC block comprises two parallel branches: a Swin Transformer block (SwinT) and a residual convolutional block (RConv). Compared with traditional CNNs, this architecture integrates the global information extraction capability of transformers with the local information extraction capability of convolutions. The window size in SwinT is set to 3, and global information interaction is facilitated through window shifting. Downsampling is performed by 3D convolutional layers and upsampling by 3D transposed convolutional layers, both with a kernel size of 2 and a stride of 2. The network takes as input density boxes of size 48 × 48 × 48 that have been interpolated to a grid spacing of 1 Å. The output of the network is improved density boxes of the same size as the input.
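The box sizes implied by this description can be checked with a short calculation. The sketch below assumes one downsampling step after each of the three encoder levels, one upsampling step before each decoder level, and no padding; these details are not stated in the text, and the helper names are hypothetical.

```python
def conv_out(n, kernel=2, stride=2):
    """Output length of an unpadded strided convolution along one axis."""
    return (n - kernel) // stride + 1


def tconv_out(n, kernel=2, stride=2):
    """Output length of an unpadded transposed convolution along one axis."""
    return (n - 1) * stride + kernel


def trace_sizes(box=48, levels=3):
    """Edge length of the feature box at each level: down `levels` times, then back up."""
    sizes = [box]
    for _ in range(levels):
        sizes.append(conv_out(sizes[-1]))
    for _ in range(levels):
        sizes.append(tconv_out(sizes[-1]))
    return sizes


sizes = trace_sizes()
print(sizes)  # [48, 24, 12, 6, 12, 24, 48]

# Every level is divisible by the SwinT window size of 3,
# so each feature box splits into whole non-overlapping windows.
print(all(s % 3 == 0 for s in sizes))  # True
```

Under these assumptions the bottleneck operates on 6 × 6 × 6 features, and the decoder restores the original 48 × 48 × 48 size, consistent with the statement that the output box matches the input box.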
When using EMReady, users only need to submit a density map for optimization. Depending on the grid spacing of the input density map, EMReady dynamically selects the appropriate model. Specifically, for density maps with a grid spacing greater than or equal to 1.0 Å, the 1.0 Å model is used, while for maps with a grid spacing less than 1.0 Å, EMReady selects the 0.5 Å model. After the model is determined, the input density map is interpolated to the corresponding grid spacing and then partitioned into a series of overlapping 48 × 48 × 48 boxes. These boxes are processed by a pre-trained Swin-Conv-UNet network, and the processed boxes are then combined to form the fully optimized density map. Users can choose to apply reverse interpolation so that the optimized density map retains the same size as the input. Users also have the option to provide an additional mask input to ensure that EMReady optimizes only the region covered by the mask.
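The model selection rule and the partitioning into overlapping boxes can be sketched as follows. The function names and the stride of 24 voxels between neighboring boxes are hypothetical choices for illustration; the text only states that the boxes overlap, not by how much or how overlapping predictions are merged.

```python
from itertools import product


def select_model_spacing(grid_spacing):
    """Return the grid spacing (in Å) of the model to use for a given input map."""
    return 1.0 if grid_spacing >= 1.0 else 0.5


def box_starts(length, box=48, stride=24):
    """Start indices of overlapping boxes along one axis.

    stride=24 (50% overlap) is an assumed value. A final box is added,
    flush with the end of the axis, so that every voxel is covered.
    """
    if length <= box:
        return [0]  # a map smaller than one box would need padding to 48
    starts = list(range(0, length - box + 1, stride))
    if starts[-1] != length - box:
        starts.append(length - box)
    return starts


def box_origins(shape, box=48, stride=24):
    """Origins of all 3D boxes for a map of the given (nx, ny, nz) grid shape."""
    return list(product(*(box_starts(n, box, stride) for n in shape)))


print(select_model_spacing(1.2))   # 1.0
print(select_model_spacing(0.83))  # 0.5
print(box_starts(100))             # [0, 24, 48, 52]
print(len(box_origins((96, 96, 96))))  # 27
```

Each origin identifies one 48 × 48 × 48 box to pass through the network. Because neighboring boxes overlap, the processed boxes must be merged where they coincide (for example by averaging, which is an assumption here) before the full map is assembled.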