In recent years, machine learning (ML) competitions in health have attracted ML scientists to work together to solve challenging clinical problems. These competitions provide access to relevant data and well-defined problems where experienced data scientists come to compete for solutions and learn new methods. However, a fundamental difficulty in organizing such challenges is obtaining and curating high-quality datasets for model development and independent datasets for model evaluation. Importantly, to reduce the risk of bias and to ensure broad applicability, the generalizability of the resulting algorithms should ideally be evaluated on multiple independent datasets by an independent group of scientists.
One clinical problem that has attracted substantial ML research is prostate cancer, a condition that 1 in 9 men develop in their lifetime. A prostate cancer diagnosis requires pathologists to examine biological tissue samples under a microscope to identify cancer and grade the cancer for signs of aggressive growth patterns in the cells. However, this cancer grading task (called Gleason grading) is difficult and subjective due to the need for visual assessment of cell differentiation and Gleason pattern predominance. Building a large dataset of samples with expert annotations can help with the development of ML systems to aid in prostate cancer grading.
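For context, a Gleason grade combines the most common (primary) and second most common (secondary) growth patterns seen in the tissue, and pattern pairs are commonly mapped to an ISUP grade group from 1 to 5, the target used in PANDA-style grading. A minimal sketch of that standard mapping follows; the function name and input validation are ours, not taken from the challenge code:

```python
def isup_grade_group(primary, secondary):
    """Map primary and secondary Gleason patterns (each 3-5) to an
    ISUP grade group (1-5) using the standard grouping:
      3+3 -> 1, 3+4 -> 2, 4+3 -> 3, score 8 -> 4, score 9-10 -> 5.
    """
    if primary not in (3, 4, 5) or secondary not in (3, 4, 5):
        raise ValueError("Gleason patterns must be 3, 4, or 5")
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        # Order matters: 3+4 is less aggressive than 4+3.
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5  # score 9 or 10
```

Note that the mapping is asymmetric for score 7, which is exactly why assessing pattern predominance matters so much for grading.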
To help accelerate and enable more research in this area, Google Health, Radboud University Medical Center and Karolinska Institutet joined forces to organize a global competition, the Prostate cANcer graDe Assessment (PANDA) Challenge, on the open Kaggle platform. In “Artificial Intelligence for Diagnosis and Gleason Grading of Prostate Cancer: the PANDA challenge”, published in Nature Medicine, we present the results of the challenge. The PANDA challenge provided the largest publicly available whole-slide image dataset and was open to participants from April 21st until July 23rd, 2020. The development datasets remain available for further research. In this effort, we compiled and publicly released a European cohort of prostate cancer cases for algorithm development and pioneered a standardized evaluation setup for digital pathology that enabled independent, blinded external validation of the algorithms on data from both the United States and the EU.
Design of the PANDA Challenge
The challenge had two phases: a development phase (i.e., the Kaggle competition) and a validation phase. During the competition, 1,290 developers from 65 countries competed to build the best performing Gleason grading algorithm, with full access to a development set for algorithm training. Throughout the competition, teams submitted algorithms that were evaluated on a hidden tuning set.
In the validation phase, a selection of top-performing algorithms was independently evaluated on internal and external validation datasets with high-quality reference grades from panels of expert prostate pathologists. In addition, a group of general pathologists graded a subset of the same cases to put the difficulty of the task and dataset in context. The algorithms submitted by the teams were then compared with the grades assigned by groups of international and US general pathologists on these subsets.
Research Velocity During the Challenge
We found that a group of Gleason grading ML algorithms developed during a global competition could achieve pathologist-level performance and generalize well to intercontinental and multinational cohorts. On all external validation sets, these algorithms achieved high agreement with urologic pathologists (prostate specialists) and high sensitivity for detecting tumor in biopsies. The Kaggle platform enabled the tracking of teams’ performance throughout the competition. Impressively, the first submission to reach agreement with the prostate pathologists above 0.90 (quadratically weighted Cohen’s kappa) on the internal validation set arrived within the first 10 days of the competition. By the 33rd day, the median performance of all teams exceeded a score of 0.85.
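Quadratically weighted Cohen’s kappa is a natural metric for ordinal labels like ISUP grades, because it penalizes disagreements by the squared distance between the two grades rather than treating all mistakes equally. A minimal pure-Python sketch of the metric (function name ours; production code would typically use `sklearn.metrics.cohen_kappa_score` with `weights='quadratic'`):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Quadratically weighted Cohen's kappa between two equal-length
    lists of integer labels in [0, num_classes)."""
    n = len(rater_a)
    # Observed confusion matrix of joint label counts.
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Expected matrix under chance agreement: outer product of the
    # two raters' marginal label frequencies, scaled to total n.
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    expected = [[hist_a[i] * hist_b[j] / n for j in range(num_classes)]
                for i in range(num_classes)]
    # Quadratic disagreement weights: 0 on the diagonal, growing with
    # the squared distance between grades.
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * expected[i][j]
    return 1.0 - num / den
```

A kappa of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and mixing up adjacent grades costs far less than confusing a benign case with a high-grade one.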
Progression of algorithms’ performances throughout the competition, as shown by the highest score on the tuning and internal validation sets among all participating teams. During the competition, teams could submit their algorithm for evaluation on the tuning set, after which they received their score. At the same time, algorithms were evaluated on the internal validation set, without these results being disclosed to the participating teams. The trajectory of the top score obtained by any team shows the rapid improvement of the algorithms.
Learning from the Challenge
By moderating the discussion forum on the Kaggle platform, we learned that the teams’ openness in sharing code via Colab notebooks led to rapid improvement across the board, a promising sign for future public challenges, and a clear indication of the power of sharing knowledge on a common platform.
Organizing a public challenge that evaluates algorithm generalization across independent cohorts using high-quality reference standard panels presents substantial logistical difficulties. Assembling a dataset of this size across countries and organizations was a massive undertaking. This work benefited from an amazing collaboration between the three organizing institutions, all of which have contributed respective publications in this space, two in Lancet Oncology and one in JAMA Oncology. Combining these efforts provided a high-quality foundation on which this competition could be based. With the publication, the Radboud and Karolinska research groups are also open sourcing the PANDA challenge development datasets to facilitate the further improvement of prostate Gleason grading algorithms. We look forward to seeing many more advancements in this field, and more challenges that can catalyze extensive international knowledge sharing and collaborative research.
Key contributors to this project at Google include Po-Hsuan Cameron Chen, Kunal Nagpal, Yuannan Cai, David F. Steiner, Maggie Demkin, Sohier Dane, Fraser Tan, Greg S. Corrado, Lily Peng, and Craig H. Mermel. Collaborators on this project include Wouter Bulten, Kimmo Kartasalo, Peter Ström, Hans Pinckaers, Hester van Boven, Robert Vink, Christina Hulsbergen-van de Kaa, Jeroen van der Laak, Mahul B. Amin, Andrew J. Evans, Theodorus van der Kwast, Robert Allan, Peter A. Humphrey, Henrik Grönberg, Hemamali Samaratunga, Brett Delahunt, Toyonori Tsuzuki, Tomi Häkkinen, Lars Egevad, Masi Valkonen, Pekka Ruusuvuori, Geert Litjens, Martin Eklund, and the PANDA Challenge consortium. We thank Ellery Wulczyn, Annisah Um’rani, Yun Liu, and Dale Webster for their feedback on the manuscript and guidance on the project. We thank our collaborators at NMCSD, particularly Niels Olson, for internal re-use of de-identified data which contributed to the US external validation set. Sincere appreciation also goes to Sami Lachgar, Ashley Zlatinov, and Lauren Winer for their feedback on the blog post.