Enhancing Local Context of Histology Features in Vision Transformers
Wood R., Sirinukunwattana K., Domingo E., Sauer A., Lafarge MW., Koelzer VH., Maughan TS., Rittscher J.
Predicting complete response to radiotherapy in rectal cancer patients with deep learning, using morphological features extracted from histology biopsies, provides a quick, low-cost and effective way to assist clinical decision-making. We propose adjustments to the Vision Transformer (ViT) network to improve its use of the contextual information present in whole slide images (WSIs). Firstly, our position restoration embedding (PRE) preserves the spatial relationships between tissue patches by using their original positions on a WSI. Secondly, a clustering analysis of extracted tissue features explores morphological motifs that capture fundamental biological processes found in the tumour micro-environment. The clustering result is introduced into the ViT network in the form of a cluster label token, helping the model to differentiate between tissue types. The proposed methods are demonstrated on two large independent rectal cancer datasets of patients selectively treated with radiotherapy and capecitabine in two UK clinical trials. Experiments demonstrate that both models, PREViT and ClusterViT, improve prediction over baseline models.
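The two ideas in the abstract, position information taken from a patch's original location on the WSI and a cluster label attached to each patch, can be illustrated with a minimal sketch. The abstract does not specify the encoding or how the cluster label enters the network, so everything below is an assumption made for illustration: a standard 2D sinusoidal encoding stands in for the PRE, nearest-centroid assignment stands in for the clustering analysis, and a per-cluster vector added to each patch token stands in for the cluster label token. All function names (`sincos_1d`, `position_embedding`, `nearest_cluster`, `build_tokens`) and the toy values are hypothetical.

```python
import math


def sincos_1d(pos, dim):
    """Sinusoidal encoding of a scalar position into `dim` values (dim must be even)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return out


def position_embedding(row, col, dim):
    """Encode a patch's (row, col) grid position on the slide.

    Half of the channels encode the row, the other half the column,
    so `dim` must be divisible by 4.
    """
    half = dim // 2
    return sincos_1d(row, half) + sincos_1d(col, half)


def nearest_cluster(feature, centroids):
    """Return the index of the closest centroid (squared Euclidean distance)."""
    dists = [
        sum((f - c) ** 2 for f, c in zip(feature, centroid))
        for centroid in centroids
    ]
    return dists.index(min(dists))


def build_tokens(features, grid_positions, centroids, cluster_table):
    """Combine patch feature, slide-position encoding and cluster vector per patch."""
    tokens = []
    for feat, (row, col) in zip(features, grid_positions):
        pos = position_embedding(row, col, len(feat))
        clus = cluster_table[nearest_cluster(feat, centroids)]
        tokens.append([f + p + c for f, p, c in zip(feat, pos, clus)])
    return tokens


# Toy example: two 8-dimensional patch features at different slide positions.
features = [[0.0] * 8, [1.0] * 8]
grid_positions = [(0, 0), (3, 5)]          # (row, col) of each patch on the WSI grid
centroids = [[0.0] * 8, [1.0] * 8]         # assumed to come from a prior clustering step
cluster_table = [[0.1] * 8, [0.2] * 8]     # one illustrative vector per cluster
tokens = build_tokens(features, grid_positions, centroids, cluster_table)
```

In this sketch the position encoding depends only on where the patch sits on the slide, not on the order in which patches are sampled, which is the property the PRE is described as preserving. In a trained model the cluster vectors would be learned parameters rather than fixed values.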