Introduction
In 2018, breast cancer was the most commonly diagnosed cancer in women and the second leading cause of cancer deaths, with 2 million new cases and 600,000 deaths.1 Most breast cancers are carcinomas, tumors that originate from epithelial cells. Epithelial tissue lines the cavities and surfaces of the body and forms its boundary with the outside world. Its organization lets it act as a barrier and carry out vectorial transport of molecules between the compartments of the body. Epithelia also make up much of the tissue in glands. Carcinomas account for as many as 90 percent of all human cancers.
The epithelial component of the breast consists of "ductal" or "luminal" cells, which produce milk under normal conditions. In breast tissue, cancer arises predominantly from the luminal epithelial cells that line both the ducts and the milk-producing lobules, and less frequently from the outer layer of basal cells. Cancer follows from a series of somatic mutations, which are genetic alterations acquired by a cell and passed to its progeny during cell division. Because epithelial tissues renew through a large number of cell replications, they accumulate more somatic mutations, which raises the incidence of carcinoma. A carcinoma is a cancer that starts in the skin or in the tissues that line other organs, such as the epithelial component of the breast.
Breast cancer can be identified and diagnosed with the help of ultrasound and X-rays, but conclusive evidence of cancerous cells comes from whole slide images (WSIs). WSIs are gigapixels in size, so analyzing them manually is difficult and time-consuming. Deep learning has shown super-human performance in classification, object detection, and segmentation tasks; if the analysis can be automated, it can save a great deal of human effort and help to improve diagnostic accuracy.
Segmenting epithelium regions is crucial because carcinomas typically arise in the epithelium.2 Segmentation allows pathologists to separate out the relevant cancer tissue and, because it isolates the relevant regions of an image, also reduces the amount of computation required. The dataset contains breast epithelium images along with their respective mask images (ground truth) annotated by experts.3 Annotation is a difficult task, however, and non-relevant tissue often finds its way into the masks, so the quality of segmentation relies heavily on the quality of the dataset.
This paper uses U-Net, an image segmentation model, to segment the epithelium tissue in an image.4 In preprocessing, techniques such as thresholding and noise reduction (removal of small connected components) are applied to obtain finer mask images.5 Similarly, in postprocessing, denoising techniques such as connected component removal, dilation, and closing are applied to improve the accuracy of the results. A binary cross-entropy loss function measures the per-pixel classification error. The U-Net is trained with the Adam optimizer. Dice and Jaccard coefficients, accuracy, and F-score are used as performance metrics to evaluate the U-Net. Data augmentation is used to increase the size of the dataset, since the network generally learns better when more data are fed into it.
Each WSI is gigapixels in size, and analyzing an entire image manually is extremely difficult and consumes a great deal of a pathologist's time. Because diagnosing breast cancer is time-consuming and tedious, there is a growing need to automate it. This paper envisions an epithelium segmentation model that will help experts to perform diagnosis quickly and accurately.
Anant Madabhushi et al introduce the use of deep learning techniques for processing digital pathology images, with various onco-pathology diagnostic use cases such as carcinoma localization, nuclei segmentation, epithelium segmentation, tubule segmentation, lymphocyte detection, mitosis detection, and lymphoma classification.6 For the epithelium segmentation use case, the authors used 34 digital pathology training images (whole slide images, WSI, of 1000 x 1000 pixels) and 8 validation images. Each image is divided into multiple patches of 32 x 32 pixels. A user-defined threshold is applied to the grayscale image to remove stroma regions of no interest from the mask. Patches containing the edges of epithelium regions are selected so that the network can learn crisp boundaries. AlexNet, a popular convolutional neural network (CNN), is applied to classify the generated patches. When generating the output, white regions are again removed with a user-defined threshold, and positive regions smaller than 300 pixels, which are not clinically relevant, are removed. This method achieved an average F1 score of 0.84.
Segmentation is equivalent to classifying each pixel in the image. Ciresan et al used a CNN to classify each pixel from its local patch.7 Because this technique generates a very large number of patches and the CNN has to run over each one of them, the runtime is high.
U-Net offered better speed and accuracy than the other models of its time, and after its introduction image segmentation algorithms began to adopt it. U-Net consists of down-convolutions and up-convolutions. The down-convolutions capture context, whereas the up-convolutions restore spatial resolution, which enables localization and produces an output of matching size. Its concatenation layers help to prevent loss of context and to generate a clean image. Together, these processes and concatenation layers give U-Net a faster runtime and better accuracy than other CNNs.
Wouter Bulten et al propose the use of U-Net for prostate epithelium segmentation on Hematoxylin and Eosin (H&E) stained prostatectomy slides, with immunohistochemistry (IHC) as the reference standard.8 In their study, 102 tissue sections were stained with H&E and subsequently re-stained with P63 and CK8/18 IHC markers to highlight epithelial structures. The H&E and IHC images are co-registered using a non-linear image registration method. Because the images are stained differently, the co-registration pipeline follows a specialized methodology to handle the multi-modality registration problem. The pipeline first converts the images to grayscale, then applies parametric (affine) registration, non-parametric registration, and finally a patch-based registration with Normalized Gradient Fields (NGF) and curvature. The non-parametric registration uses the NGF distance, which measures the alignment of image gradients, to account for the multi-modality registration problem. Color deconvolution is applied to the IHC images to separate the P63 and CK8/18 components, and each channel is then converted to a binary mask by thresholding. The U-Net is trained with 50 training images and 12 validation images. The network's output is mapped onto the H&E version of the specimens using the registration, and the mapped masks are fed to the authors' final model as inputs.
Philipp Kainz et al proposed using the classic LeNet-5 network architecture to semantically segment glands in the Warwick-QU dataset, presented at the GlaS@MICCAI 2015 challenge.9, 10 The dataset contains 161 H&E stained images, taken at 20x magnification, of benign and malignant colorectal adenocarcinomas. The authors proposed a 3-step pipeline comprising preprocessing and two convolutional neural networks, Object-Net and Separator-Net. Object-Net predicts glands, and Separator-Net segments gland-separating structures. The preprocessing steps are color deconvolution and CLAHE equalization, which equalizes the contrast of the training images. Object-Net provides the probability of a pixel belonging to a gland or to the background, whereas Separator-Net identifies the structures that separate a gland from background objects. The model achieved an F1 score of 0.78 without Separator-Net and 0.87 with Separator-Net.
CNNs have also been used for brain tumor classification and segmentation.11 Inputs of 224 x 224 pixels are passed into the network. Convolutional layers first down-sample the image; the outputs are then flattened and passed through linear layers. Finally, a 4096-dimensional feature vector is extracted. The model attained a classification accuracy of 97.5% and a dice coefficient of 0.84 for segmentation.
Model Description
A Convolutional Neural Network (CNN) is a class of deep neural network used to analyze images. U-Net is a deep learning segmentation model developed by Olaf Ronneberger et al for fast and accurate image segmentation. It was originally developed for biomedical image segmentation but has become a standard image segmentation model across domains. CNNs are typically used for classification tasks. Segmentation can also be seen as a classification task, as each pixel needs to be classified into one of several types. Prior to U-Net, a sliding window technique was used with a CNN to classify each pixel by looking at the region around it.12 The runtime of the sliding window technique was high because the CNN had to run a huge number of times before the entire image was classified. U-Net outperforms this approach in both speed and accuracy. The U-Net model has also been used for object detection and object counting tasks, as proposed by Hao Dong et al and Falk, T et al.13, 14 For detection, critical objects can be accurately outlined and segmented using U-Net.
U-Net consists of two primary parts: down-sampling (shown on the left side of Figure 1) and up-sampling. The down-sampling part comprises 5 convolutional blocks, each consisting of two 3x3 convolutional layers and a max pooling layer. Similarly, the up-sampling part consists of 5 reverse-convolutional blocks. Each reverse-convolutional block consists of a 2x2 convolution transpose layer, a concatenate layer that concatenates the current activations with their respective down-sampled activations, and two 3x3 convolutional layers. Finally, the activations are passed through a 1x1 convolution layer and a softmax activation layer. Each convolution block is followed by a ReLU. The concatenate layer prevents loss of context and detail. The 1x1 convolutional layer maps the feature vector of each pixel to the desired number of classes. The softmax is used to calculate the class probabilities of each pixel and can be replaced with a sigmoid layer for binary classification. The entire architecture is shown in Figure 1.
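The spatial bookkeeping of the architecture described above can be traced with a small sketch. It assumes that the 3x3 convolutions are padded so that they preserve size, that each 2x2 max pooling halves the side length, that each 2x2 convolution transpose doubles it, and that the input is a 320x320 patch. None of these details is stated in this section, and the function name is hypothetical.

```python
def trace_unet_sizes(input_size=320, depth=5):
    """Trace the side length of the activations through a U-Net-like network.

    Assumptions (not from the paper): size-preserving 3x3 convolutions,
    2x2 max pooling with stride 2, 2x2 transpose convolution with stride 2.
    """
    size = input_size
    skip_sizes = []
    for _ in range(depth):
        # Two 3x3 convolutions keep the size; store it for the concatenation.
        skip_sizes.append(size)
        if size % 2 != 0:
            raise ValueError("side length must stay even to be pooled")
        size //= 2  # 2x2 max pooling
    bottleneck = size

    up_sizes = []
    for skip in reversed(skip_sizes):
        size *= 2  # 2x2 transpose convolution
        if size != skip:
            raise ValueError("up-sampled size does not match the skip size")
        # Concatenation joins channels, so the spatial size is unchanged.
        up_sizes.append(size)
    return skip_sizes, bottleneck, up_sizes


down, bottom, up = trace_unet_sizes()
print(down, bottom, up)
```

Under these assumptions the output has the same side length as the input, which is what lets the final 1x1 convolution assign a class to every input pixel.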
The performance of a CNN model during training is assessed with a loss function. Here the binary cross-entropy loss is used, since each image pixel is classified into one of two classes: presence (positive, white) or absence (negative, black) of epithelium. In most biomedical images the regions of interest are limited and occupy only a small portion of the image, which leads to an unbalanced optimization problem.
The experimental dataset is also susceptible to this unbalanced optimization problem because of sparse regions within the images. Negative (non-epithelium) regions are common within the images, so the training data become unbalanced and the model may overfit to the negative class (Figure 2). Applying dice loss, which focuses on the positive regions (those containing epithelium), counters this overfitting.15 The U-Net is trained with the Adam optimizer, a modified stochastic gradient descent optimizer, for smoother and finer learning curves.16
The binary cross-entropy and dice loss functions are added together, and the model is trained on the combined function. The optimizer is used with b1 = 0.9, b2 = 0.999, and a learning rate of 0.001; the batch size is 4. The learning rate is exponentially decayed by 0.5 every 400 training steps.
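As a rough illustration of the combined loss and the decay schedule, the following sketch uses plain Python on flattened lists of per-pixel probabilities and labels. The smoothing constant in the dice term, the equal weighting of the two terms, the stepwise ("staircase") form of the decay, and all function names are assumptions made for this sketch; the paper does not specify them.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels (probabilities are clipped)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - soft dice coefficient; `smooth` is an assumed stabilizing constant."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets):
    """Sum of the two losses, assuming equal weights."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)


def decayed_learning_rate(step, base_lr=0.001, decay_rate=0.5, decay_steps=400):
    """Halve the learning rate every `decay_steps` steps (staircase assumed)."""
    return base_lr * decay_rate ** (step // decay_steps)


targets = [1, 0, 1, 0]
print(combined_loss([0.9, 0.1, 0.8, 0.2], targets))
print(decayed_learning_rate(400))
```

The dice term depends only on the overlap with the positive pixels, so it is not diluted by a large number of correctly predicted negative pixels in the way the cross-entropy mean is.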
Model Evaluation Metrics
The model is evaluated using the dice coefficient, Jaccard coefficient, accuracy, and F1-score.17, 18 Accuracy is the most common way to evaluate models. However, on imbalanced datasets a model may have high accuracy and still fail to identify small, positive, and critical regions. To properly assess the model in such situations, the dice and Jaccard coefficients and the F1-score are used. A detailed comparison of the metrics was done by Vikas Thada et al.19
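The point about accuracy on imbalanced data can be made concrete with a minimal sketch that computes the four metrics from pixel-wise confusion counts. The function name and the conventions for empty masks are assumptions of this sketch, not part of the paper.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for two flattened binary masks of equal length."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1

    overlap_denominator = tp + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Assumed convention: two empty masks count as a perfect match.
        "dice": 2 * tp / (2 * tp + fp + fn) if overlap_denominator else 1.0,
        "jaccard": tp / overlap_denominator if overlap_denominator else 1.0,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
    }


# 100 pixels, 5 of them epithelium; the prediction is entirely negative.
truth = [1] * 5 + [0] * 95
pred = [0] * 100
print(segmentation_metrics(pred, truth))
```

In this constructed example the accuracy is 0.95 even though no positive pixel is found, while the dice coefficient, Jaccard coefficient, and F1-score are all 0.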
Data
Epithelium is a tissue that lines the outer surfaces of organs in the body. A digitized biopsy image (whole slide image, WSI) generally contains regions such as stroma and tubules along with epithelium. Performing epithelium segmentation on a WSI with a CNN requires sample images together with expert-annotated masks that mark the regions containing epithelium tissue. The epithelium region appears as the dark purplish region in Figure 3.
The experimental dataset consists of 42 epithelium tissue images of 1000x1000 pixels with their respective mask images. To fit the CNN model, each image is resized and divided into patches of 320x320 pixels, yielding 378 image and mask pairs.
Data augmentation is the process of applying transformations to existing data to produce new, additional data samples. Data augmentation techniques are used to increase the size of the training data and obtain a better fit of the model. A few sample images, masks, and their respective augmentations are shown in Figure 4. The images are augmented using random horizontal and vertical flips, random crops, and rotations before being passed into the model. Perez et al demonstrate the effectiveness of data augmentation in improving deep learning models.20 A SmallNet was used on the Cats vs Dogs dataset, where data augmentation improved validation accuracy from 0.705 to 0.775. On the dogs vs goldfish dataset, validation accuracy improved from 0.855 to 0.890 using traditional data augmentation and to 0.915 using neural data augmentation.
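The patch count can be checked with a short sketch. It assumes that each image is resized to 960x960 pixels and cut into non-overlapping tiles, which is consistent with 9 patches per image (42 x 9 = 378) but is not stated explicitly in the text; the helper names are hypothetical.

```python
def patch_boxes(height, width, patch=320):
    """Return (top, left, bottom, right) boxes of non-overlapping tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        (top, left, top + patch, left + patch)
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def crop(grid, box):
    """Cut one box out of an image or mask stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]


boxes = patch_boxes(960, 960)  # 960 is an assumed resize target
print(len(boxes), 42 * len(boxes))
```

Applying the same boxes to an image and to its mask keeps each patch aligned with its ground truth.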
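For segmentation, each geometric transformation has to be applied identically to the image and to its mask so that the two stay aligned. The sketch below illustrates this with flips and rotations on images stored as lists of rows; restricting rotations to multiples of 90 degrees, the 0.5 flip probability, and the function names are assumptions of the sketch, and random crops are omitted for brevity.

```python
import random


def hflip(grid):
    """Mirror left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Mirror top to bottom."""
    return [row[:] for row in grid[::-1]]


def rot90(grid):
    """Rotate by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment_pair(image, mask, rng):
    """Apply the same randomly chosen transformations to image and mask."""
    ops = []
    if rng.random() < 0.5:
        ops.append(hflip)
    if rng.random() < 0.5:
        ops.append(vflip)
    ops.extend([rot90] * rng.randrange(4))
    for op in ops:
        image, mask = op(image), op(mask)
    return image, mask


image = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 0]]  # marks the pixel whose value is 1
print(augment_pair(image, mask, random.Random(0)))
```

Passing an explicit `random.Random` instance makes the augmentation reproducible, which helps when comparing training runs.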
Pre and Post processing Methods
Generally, there are a few white stroma regions within the epithelium that the mask annotations mark as epithelium, as shown in Figure 5. Similarly, there are small purplish epithelium regions outside the main epithelium chunks that are not marked as epithelium, as shown in Figure 6. These inconsistencies confuse the network during training and lead to noisy predictions. The pre- and post-processing techniques discussed below mitigate these issues.
The annotated masks in the given dataset are slightly inconsistent, marking even white stroma regions as epithelium in a few places. Binary thresholding is applied to remove these mis-markings. First, the original image is converted to grayscale, where the value of each pixel corresponds to the darkness at that point. A binary threshold of 0.5 is then applied to these images: all positive pixels with a value less than the threshold are marked as negative in the mask, as shown in Figure 7. This creates negative spaces in the mask wherever there are white regions in the original image and reduces the number of wrongly marked stroma regions.
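The mask refinement described above can be sketched as follows. The grayscale conversion with the common 0.299/0.587/0.114 luminance weights, the 8-bit RGB input, and the function names are assumptions of this sketch; only the 0.5 threshold on darkness comes from the text.

```python
def darkness(rgb):
    """Darkness in [0, 1] of an 8-bit RGB pixel (1.0 is black)."""
    r, g, b = rgb
    luminance = (0.299 * r + 0.587 * g + 0.114 * b) / 255.0
    return 1.0 - luminance


def refine_mask(image, mask, threshold=0.5):
    """Clear positive mask pixels whose image pixel is lighter than threshold."""
    return [
        [1 if m and darkness(px) >= threshold else 0
         for px, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


white = (255, 255, 255)   # stroma-like background
purple = (90, 40, 120)    # dark stained tissue, an illustrative value
image = [[white, purple], [white, purple]]
mask = [[1, 1], [0, 0]]
print(refine_mask(image, mask))
```

Only pixels that were already positive can change, so the step removes mis-marked light regions without adding new positive pixels.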
Post-processing Methods
Connected components are groups of connected pixels belonging to the same class in the segmented image. Removing small connected components is an effective way to de-noise a binary image. Coupier et al and Chan et al describe how removing connected components can assist in denoising an image.21, 22 From the generated mask, white components that contain fewer than 300 pixels are removed. This removes most of the white noise in the image and reduces the number of falsely marked non-epithelial regions. The image is then slightly dilated to close tiny gaps between epithelial regions. This process is illustrated in Figure 8.
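A minimal sketch of these two post-processing steps is given below in plain Python. The use of 4-connectivity, a single 3x3 dilation, and the function names are assumptions of the sketch; the 300-pixel limit is the value given in the text.

```python
from collections import deque


def remove_small_components(mask, min_pixels=300):
    """Keep only white components with at least `min_pixels` pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects one 4-connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_pixels:
                for y, x in component:
                    cleaned[y][x] = 1
    return cleaned


def dilate(mask):
    """One pass of binary dilation with a 3x3 square neighborhood."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = r + dy, c + dx
                        if 0 <= ny < height and 0 <= nx < width:
                            out[ny][nx] = 1
    return out


# Tiny demonstration with a lowered limit so the effect is visible.
noisy = [[1, 1, 0, 0, 0],
         [1, 0, 0, 0, 1],
         [0, 0, 0, 0, 0]]
print(dilate(remove_small_components(noisy, min_pixels=3)))
```

Removing small components before dilating matters: dilating first would enlarge isolated noise pixels and could merge them with genuine regions.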
Results
The automatic system achieved an accuracy of 0.932 on 40 images. Its Sørensen–Dice coefficient, a statistic used to gauge the similarity of two samples, was 0.7, as shown in Table 1. The system achieved a precision of 0.94 and an F1 score (also called F-score or F-measure), a measure of a test's accuracy in a binary classification problem, of 0.86, which is considerably better than the previous results, as shown in Table 2.
Discussion
This paper discussed an automatic image segmentation method that segments epithelium tissue from WSIs of breast cancer using a CNN called U-Net. The method used a combination of binary cross-entropy and dice loss functions to counter imbalanced training, and the Adam optimizer was used for training. Data augmentation techniques such as rotation, cropping, and horizontal and vertical flips were used to artificially increase the dataset size. The paper also discussed information loss issues in the dataset, such as the presence of false positives and false negatives, and pre-processing techniques such as binary thresholding to avoid that information loss. It further discussed postprocessing techniques such as blurring and removing small connected components to generate cleaner and more interpretable images.
The main disadvantage of this method is that noise persists despite the pre- and post-processing techniques. Because of the few false positives and false negatives in the data and the generalized boundaries, noise remains in the output. Postprocessing and dilating the output image help to clean up the output and make it accessible to a human viewer, but some unnecessary regions still get through; these can be avoided by applying further denoising techniques to the output image.