From data to diagnosis: skin cancer image datasets for artificial intelligence.
Wen D., Soltan A., Trucco E., Matin RN.
Artificial intelligence (AI) solutions for skin cancer diagnosis continue to gain momentum, edging closer towards broad clinical use. These AI models, particularly deep-learning architectures, require large digital image datasets for development. This review provides an overview of the datasets used to develop AI algorithms and highlights the importance of dataset transparency for the evaluation of algorithm generalizability across varying populations and settings. Current challenges for curation of clinically valuable datasets are detailed, which include dataset shifts arising from demographic variations and differences in data collection methodologies, along with inconsistencies in labelling. These shifts can lead to differential algorithm performance, compromise of clinical utility, and the propagation of discriminatory biases when developed algorithms are implemented in mismatched populations. Limited representation of rare skin cancers and minoritized groups in existing datasets are highlighted, which can further skew algorithm performance. Strategies to address these challenges are presented, which include improving transparency, representation and interoperability. Federated learning and generative methods, which may improve dataset size and diversity without compromising privacy, are also examined. Lastly, we discuss model-level techniques that may address biases entrained through the use of datasets derived from routine clinical care. As the role of AI in skin cancer diagnosis becomes more prominent, ensuring the robustness of underlying datasets is increasingly important.