Cookies on this website

We use cookies to ensure that we give you the best experience on our website. If you click 'Accept all cookies' we'll assume that you are happy to receive all cookies and you won't see this message again. If you click 'Reject all non-essential cookies' only necessary cookies providing core functionality such as security, network management, and accessibility will be enabled. Click 'Find out more' for information on how to change your cookie settings.

Artificial Intelligence (AI) solutions for skin cancer diagnosis continue to gain momentum, edging closer towards broad clinical use. These AI models, particularly deep learning architectures, require large digital image datasets for development. This review provides an overview of the datasets used to develop AI algorithms and highlights the importance of dataset transparency for evaluation of algorithm generalisability across varying populations and settings. Current challenges for curation of clinically valuable datasets are detailed, which include dataset shifts arising from demographic variations and differences in data collection methodologies, along with inconsistencies in labelling. These shifts can lead to differential algorithm performance, compromise of clinical utility, and the propagation of discriminatory biases when developed algorithms are implemented in mismatched populations. Limited representation of rare skin cancers and minoritised groups in existing datasets are highlighted which can further skew algorithm performance. Strategies to address these challenges are presented, which include improving transparency, representation and interoperability. Federated learning and generative methods, which may improve dataset size and diversity without compromising privacy, are also examined. Lastly, we discuss model-level techniques which may address biases entrained through the use of datasets derived from routine clinical care. As the role of AI in skin cancer diagnosis becomes more prominent, ensuring the robustness of underlying datasets is increasingly important.

Original publication




Journal article


Clin Exp Dermatol

Publication Date