The strongest indicator of a cancer patient's prognosis is the number of mitotic bodies that a pathologist manually counts from high-resolution whole-slide histopathology images. Manual counting is clearly inefficient, yet automating mitosis detection remains challenging because of limited training datasets and the intensive computation required for model training and inference.
This presentation introduces a large-scale deep learning approach that trains a two-stage CNN-based model to detect mitosis locations directly from high-resolution whole-slide images with high accuracy. In detail, we first train a nuclei detection model to remove background regions from the raw whole-slide histopathology images. Second, a customized ResNet-50 model is trained on the dataset cleaned in the first step. The first step reduces training time while improving model performance in the second step.
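The two-stage idea can be sketched in a few lines. The sketch below is illustrative only: a mean-intensity test stands in for the nuclei detection model, and a placeholder score stands in for the customized ResNet-50; all function names and thresholds are assumptions, not the talk's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def nuclei_filter(patches, threshold=0.5):
    # Stage 1: a cheap screen that discards mostly-background patches.
    # A mean-intensity test stands in here for the nuclei detection CNN.
    return [p for p in patches if p.mean() > threshold]

def mitosis_scores(patches):
    # Stage 2: score each surviving candidate patch.
    # A real pipeline would run the customized ResNet-50 here;
    # patch standard deviation is only a placeholder.
    return [float(p.std()) for p in patches]

# Synthetic 32x32 "patches": bright ones mimic tissue, dark ones background.
tissue = [rng.uniform(0.6, 1.0, (32, 32)) for _ in range(8)]
background = [rng.uniform(0.0, 0.2, (32, 32)) for _ in range(8)]

candidates = nuclei_filter(tissue + background)
scores = mitosis_scores(candidates)
print(len(candidates))  # only the tissue-like patches survive stage 1
```

Because stage 1 discards background patches up front, stage 2 runs the expensive classifier on far fewer inputs, which is what saves training and inference time.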
A false-positive oversampling approach further improves model performance. With these models, inference is conducted in parallel to detect mitosis locations across the large volume of histopathology images. Meanwhile, the whole pipeline, including data preprocessing, model training, hyperparameter tuning, and inference, is parallelized using distributed TensorFlow, Apache Spark, and HDFS. The experiences and techniques from this project can be applied to other large-scale deep learning problems as well.
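One plausible reading of false-positive oversampling is a hard-negative mining loop: patches the current model wrongly flags as mitosis are duplicated and fed back into the training set so the next round sees them more often. The function name and `factor` parameter below are hypothetical, a minimal sketch of that idea rather than the talk's actual procedure.

```python
import random

def oversample_false_positives(train_set, false_positives, factor=3, seed=42):
    # Append each hard negative `factor` times, then shuffle so the
    # duplicates are spread across training batches.
    augmented = list(train_set)
    for fp in false_positives:
        augmented.extend([fp] * factor)
    random.Random(seed).shuffle(augmented)
    return augmented

# Toy usage: 10 ordinary training patches, 2 hard negatives.
train = [f"patch_{i}" for i in range(10)]
hard_negatives = ["fp_a", "fp_b"]
augmented = oversample_false_positives(train, hard_negatives)
print(len(augmented))  # 10 + 2 * 3 = 16
```

The design choice is the usual trade-off of hard-negative mining: oversampling sharpens the decision boundary around confusing non-mitotic patches, at the cost of some class-balance skew that the `factor` parameter controls.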
I am a software engineer at the Center for Open-Source Data and AI Technologies, IBM. My work focuses on computing- and data-intensive problems in large-scale distributed model training.