Choosing the right resolution to train Deep Learning models on Histopathology images

By David Guet (Digital Pathology Specialist), Melanie Lubrano (Data Scientist) - 30 Mar 2021

Histopathology is the analysis of tissue samples under a microscope in order to establish the severity of a disease. More precisely, it concerns the examination of tissue extracted during surgery, biopsy or autopsy. The tissue extracted from the patient's body is placed in a fixative medium in order to prevent decay. It is then placed in a cassette and embedded in wax before being sliced into thin sections with a microtome. These thin tissue sections are stained using different staining protocols (H&E, chromogenic, immunofluorescence, etc.) and mounted on glass slides before observation.

For about 20 years now, slide scanner technologies have made it possible to digitize such tissue samples at the microscopic level, instead of relying on human observation under a microscope. These digitized Whole Slide Images (WSI) are very high-resolution images which usually contain billions of pixels. They require specific processing before they can be used to train deep learning models.

The most common technique for processing such high-resolution images is to break them into small patches that are later analysed and aggregated to obtain the final slide-level result. It is therefore critical to properly choose the size and the resolution of the patches.

Macro of a WSI from TCGA databank

Image resolution in computational pathology


Rayleigh’s and Abbe’s resolution criteria were developed for observations with the human eye and had a major influence on the development of optical instruments. In microscopy imaging, the choice of the image resolution for acquisition is an important prerequisite for relevant image analysis. Image resolution depends on:

  • the numerical aperture of the objective, 

  • the optical magnification, given by the objective and lenses, 

  • and the size of the pixel on the sensor.

We can also mention the important role of spectral resolution in fluorescence image analysis. To choose the relevant image resolution, one has to take into account the size of the sample, the size of the “event” one wants to discriminate, and possibly the number of channels to acquire.

In histopathology, most of the analysis is done at the cellular or subcellular level, i.e. at an order of magnitude between 1 µm and a few tens of microns. In most cases, acquisitions are performed at 20x or 40x magnification, corresponding to pixel resolutions of approximately 0.5 µm/px and 0.25 µm/px respectively. Those magnification levels correspond to the magnifications used by pathologists to review and analyse slide samples under the microscope, but are those magnifications needed for proper image analysis using deep learning algorithms?

Whole slide images have a pyramidal structure from which different levels of magnification can be extracted

Here is a code snippet in Python illustrating basic functionalities of openslide.
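A minimal sketch, assuming the openslide-python package is installed; the file name passed to the helper is purely illustrative:

```python
def scale_factor(level_downsamples, level_a, level_b):
    """Scale factor between two pyramid levels, given their downsample
    factors relative to level 0, e.g. (1.0, 4.0, 16.0)."""
    return level_downsamples[level_b] / level_downsamples[level_a]


def inspect_slide(path):
    """Open a WSI, print its pyramid metadata, and read one tile
    from a downsampled level (requires openslide-python)."""
    import openslide  # pip install openslide-python

    slide = openslide.OpenSlide(path)
    print(slide.dimensions)         # (width, height) at level 0
    print(slide.level_count)        # number of pyramid levels
    print(slide.level_downsamples)  # e.g. (1.0, 4.0, 16.0)
    print(scale_factor(slide.level_downsamples, 0, 1))

    # read_region takes level-0 coordinates and returns an RGBA PIL image
    tile = slide.read_region((0, 0), level=1, size=(512, 512)).convert("RGB")
    slide.close()
    return tile
```

Calling `inspect_slide("slide.svs")` on an actual pyramidal file would then print its dimensions and downsample factors; note that `read_region` expects coordinates expressed in the level-0 reference frame, whatever level is read.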

Handling Digital Slides


Brightfield and fluorescence WSI usually come in a proprietary format depending on the scanner vendor, or as standard TIFF.

WSI are stored in a multi-resolution, tiled format. Each level of resolution is stored in a separate “page” of the WSI. Level 0, or page 0, contains the image at full resolution. The other levels contain downsampled images, and thus lower resolutions. This pyramidal structure gives the user a high level of flexibility to zoom and pan in the image, mimicking the behavior of slide review under a microscope.

The factor between two consecutive levels of resolution is traditionally 2, but it may vary (see the example for how to determine the scale factor between two levels).


Libraries such as openslide or libvips can process such WSI formats and provide methods to directly access the downsampled levels, optimizing the processing time spent on the WSI. With traditional microscope images, the user cannot access lower resolutions of the image and has to work with the full-resolution image.


Choose your resolution


Ultimately, the user has to choose a magnification level, with a given resolution, to train their deep learning model.

Two main criteria are to be taken into consideration:


  • What are your objects of interest and how much context does your model need to predict what you want? The context is highly dependent on the study you want to perform. For instance, segmenting tumor versus stroma tissue will require a different context than quantifying biomarker expression levels in nuclei.

  • What is your computational power and how long are you willing to wait for your model to converge?

We will explore two use cases to illustrate how image resolution is involved in the design of deep learning algorithms.


01. Cell detection and/or segmentation


In this first example, we were asked to evaluate specific cell densities or organisations within a tissue. To do so, the user needs a neural network able to detect events in the tissue and process cellular-level information, i.e. at a very high resolution.

In the image below, the goal is to segment cells (delimit their boundaries) and classify them into the right category (lymphocytes in yellow, marked epithelial cells in red, unmarked epithelial cells in green). For instance, this will give us information about the proportion of marked cells (i.e. cells expressing a given biomarker) among all epithelial cells.

The Human Protein Atlas - Colorectal Cancer

Traditional convolutional neural network architectures are built with a stack of neural layers performing successive downsampling on the input image to extract relevant features from it. See the figure below. 


The ability of the model to detect separate objects will therefore be related to the downsampling factor between the input image and the encoded feature map.

For instance, in a traditional U-Net architecture [ref], the downsampling factor is equal to 16, meaning that each 16x16 pixel box of the input image is summarized into 1 value of the feature map. It is thus crucial to adapt the input image resolution to make sure only a few objects of interest fit into the encoded 16x16 bounding box. In the U-Net architecture, however, this issue is mitigated by skip connections that propagate information from higher-level feature maps to the decoder.
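As a back-of-the-envelope sketch (the layer count and pixel sizes below are illustrative, not tied to a specific implementation), the relation between pooling stages, feature-map size, and the tissue area covered by one encoded value can be computed directly:

```python
def encoder_output_size(input_px, n_downsamples, factor=2):
    """Spatial size of the feature map after repeated 2x downsampling."""
    size = input_px
    for _ in range(n_downsamples):
        size //= factor
    return size


def pixels_per_feature(n_downsamples, factor=2):
    """Side length (in input pixels) summarized by one feature-map value."""
    return factor ** n_downsamples


# A U-Net-style encoder with 4 pooling stages downsamples by 2**4 = 16:
# a 224 x 224 input is encoded into a 14 x 14 feature map.
assert encoder_output_size(224, 4) == 14
assert pixels_per_feature(4) == 16

# At 0.25 um/px (40x), one encoded value covers 16 * 0.25 = 4 um of tissue,
# smaller than a typical nucleus; at 1 um/px it would cover 16 um, so
# several small cells could collapse into a single encoded value.
tissue_um_per_feature = pixels_per_feature(4) * 0.25  # 4.0 um
```

This is why, for dense cell detection, the input resolution is usually chosen so that an object of interest spans more than one encoded box.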


02. Whole Slide Classification


The second example concerns whole slide classification, which consists of predicting a global label for the WSI. To do so, specific techniques such as Multiple Instance Learning require cutting the WSI into small patches (as the slides are too big to be fed entirely to the network). The discriminative patterns are extracted from the patches; therefore, the size and resolution of the patches will directly depend on the morphological patterns the user intends to teach the model.


If discriminative patterns are understandable at the cellular level (e.g. bigger nuclei, increased number of mitoses, etc.), you will prefer to use the highest resolution (and the lowest level).

As the size of the patches that you can feed to your network is limited (usually 224 x 224 pixels), the network will not have much context or structural information and will only see a few hundred cells (right image).

On the contrary, if the discriminative patterns for your classes are related to cell organisation and texture, you might prefer to feed the model patches that contain more context and use a lower resolution (a higher level). However, you might lose cell-level information (left image).
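The patching step itself can be sketched as a simple grid over a chosen pyramid level (the level dimensions below are hypothetical):

```python
def tile_grid(level_width, level_height, tile=224, stride=224):
    """Top-left corners of non-overlapping tile x tile patches fully
    contained in a pyramid level of the given pixel dimensions."""
    return [(x, y)
            for y in range(0, level_height - tile + 1, stride)
            for x in range(0, level_width - tile + 1, stride)]


# Hypothetical 20x level of 50,000 x 40,000 pixels:
coords = tile_grid(50_000, 40_000)  # 223 * 178 = 39,694 patch positions
```

In practice, patches falling on the glass background are usually filtered out with a tissue mask before training, which removes a large fraction of these positions.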


H&E slides can be used at different levels of magnification depending on the context of the training (TCGA data set)

For both use cases, the choice of resolution will be constrained by the computational capacity. Indeed, if you prefer to use the full resolution, you will end up with tens of thousands of patches to process, which will have a cost on the time needed to train the model.

For instance, using a resolution of 0.5 mpp (20x) instead of 0.25 mpp (40x) will reduce the number of patches extracted from the whole slide by a factor of 4, and therefore reduce training and inference time by a factor of 4 as well. Often, using a reduced resolution will not actually affect the overall performance of the model.
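This factor of 4 can be checked with a quick count (the 20 mm x 15 mm tissue area is an arbitrary example):

```python
def n_patches(width_um, height_um, mpp, tile_px=224):
    """Number of non-overlapping tile_px x tile_px patches covering a
    region of the given physical size at mpp microns per pixel."""
    return (int(width_um / mpp) // tile_px) * (int(height_um / mpp) // tile_px)


at_40x = n_patches(20_000, 15_000, mpp=0.25)  # 95,319 patches
at_20x = n_patches(20_000, 15_000, mpp=0.5)   # 23,674 patches
# halving the resolution divides the patch count (and the compute) by ~4
```

The ratio is approximate rather than exactly 4 only because of the integer number of tiles along each axis.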


Conclusion & Takeaways


Since Rayleigh and Abbe theorized resolution as the minimum distance needed to separate two distinct objects, the implications for optical microscopy and image analysis have continued to grow. Today, image resolution in tissue section acquisition is paramount not only for diagnosis by the human eye but also for deep learning-based analysis. The training of the model has to take into account the context in which the event we want to observe is located (tissue segmentation, expression level of a protein, mitosis, etc.), and the resolution of the image is therefore a key factor.


Although images are most often scanned at 20x or 40x, the standard for observation under a microscope, we can take advantage of the pyramidal structure of WSI to access specific resolutions. This will greatly improve the training time, and thus the analysis time for routine use. In many applications, the use of these lower resolutions will not affect the overall performance of the algorithm.


Finally, understanding the image architecture in digital pathology is critical to elevating the quality and robustness of deep learning analyses to the level of the pathologist's expertise.