Datasets & Software


NuCLS datasets (Amgad et al., 2021)

The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from The Cancer Genome Atlas (TCGA). These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students using the Digital Slide Archive. The data can be used to develop and validate algorithms for nuclear detection, classification, and segmentation, or as a resource to develop and evaluate methods for interrater analysis. Data from both single-rater and multi-rater studies are provided. For more details, consult our preprint:

Amgad M, Atteya LA, ..., Cooper LAD. NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation. arXiv preprint arXiv:2102.09099. 2021 Feb 18.

Breast cancer semantic segmentation (Amgad et al., 2019)

This dataset contains over 20,000 segmentation annotations of tissue regions from 150 breast cancer patients from TCGA. This large-scale dataset was annotated through the collaborative effort of pathologists, pathology residents, and medical students using the Digital Slide Archive. It supports the training of accurate machine-learning models for tissue segmentation.

Use this repo to download all elements of the dataset described in:

Amgad M, Elfandy H, ..., Gutman DA, Cooper LAD. Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics. 2019.

Adult Rhabdomyosarcoma (Elsebaie et al., 2018)

Use this link to download the dataset used in:

Elsebaie M, Amgad M, …, Elsayed Z. Management of low and intermediate risk adult rhabdomyosarcoma: A pooled survival analysis of 553 patients. Scientific Reports. 2018 Jun 19;8(1):9337.

This dataset contains retrospective individual patient data from ~550 patients with adult rhabdomyosarcoma, collected from published case series and reports. Original authors were contacted for complete records.


HistomicsTK / Digital Slide Archive

I am an active contributor to the open-source software package HistomicsTK (see this talk), a Python toolkit for organizing, annotating, and analyzing whole-slide image (WSI) data. My contributions include workflows for handling annotations and segmentation masks, color normalization and augmentation, and image processing workflows for efficient detection of tissue region boundaries. Additionally, I develop workflows that use the Girder RESTful API to visualize and interact with data.

HistomicsTK is built and maintained by the company Kitware.


HistomicsML, where "ML" stands for machine learning, is a software tool developed by Michael Nalisnik, PhD and Sanghoon Lee, PhD for the interactive learning of histological patterns by biologists and physicians. I was involved in the validation of both iterations of the software. HistomicsML enables rapid training of machine learning models (e.g., to identify vascular endothelial cells in glioma) in a few learning cycles, by focusing the user's attention on regions with high model uncertainty.
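The idea of directing attention to high-uncertainty regions can be illustrated with a generic uncertainty-sampling step. This is a minimal sketch, not HistomicsML's actual code or selection criterion: the function name `most_uncertain` is hypothetical, and it assumes a binary classifier that outputs one positive-class probability per candidate region.

```python
import numpy as np

def most_uncertain(probabilities, k):
    """Return indices of the k samples whose probability is closest to 0.5.

    Hypothetical helper: assumes binary predictions in [0, 1].
    """
    probabilities = np.asarray(probabilities, dtype=float)
    # Distance from the decision boundary; 0 means maximally uncertain.
    margin = np.abs(probabilities - 0.5)
    # Stable sort keeps the original order among ties.
    return np.argsort(margin, kind="stable")[:k]

# Example: pick the two regions the model is least sure about,
# which would then be shown to the user for labeling.
probs = [0.95, 0.52, 0.10, 0.45, 0.70]
to_label = most_uncertain(probs, k=2)
```

In each learning cycle, the user would label only the selected regions and the model would be retrained, which is why a few cycles can be enough.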

Software download and usage instructions can be found below:

> Segmentation-free system (general approach): Version 2.0.

> Using segmentation boundaries (image analysis expertise required): Version 1.0.

A demo of the HistomicsML v2.0 software.

Ripley's K for image clustering analysis

This is a MATLAB tool for biologists to calculate Ripley's K function for grayscale images. It can be downloaded here. The detailed methodology and validation are described in:

Amgad M, Itoh A, Tsui MM. Extending Ripley’s K-function to quantify aggregation in 2-D grayscale images. PLoS One. 2015;10(12):e0144404.

Sample use: quantifying the aggregation of proteins in fluorescence microscopy images.
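For readers unfamiliar with the underlying statistic, the classical Ripley's K for a 2-D point pattern can be sketched in a few lines. This is not the grayscale extension described in the paper, nor a port of the MATLAB tool: it is a naive Python estimate without edge correction, and the function name `ripley_k` and its parameters are assumptions made for illustration.

```python
import numpy as np

def ripley_k(points, radii, area):
    """Naive Ripley's K for a 2-D point pattern (no edge correction).

    K(r) = area / (n * (n - 1)) * number of ordered pairs (i != j)
    whose distance is <= r.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude self-pairs (i == j) from the counts.
    np.fill_diagonal(dist, np.inf)
    return np.array(
        [area * (dist <= r).sum() / (n * (n - 1)) for r in radii]
    )

# Example: four points on the corners of a unit square,
# observed in a window of area 4.
k_values = ripley_k([(0, 0), (1, 0), (0, 1), (1, 1)], [0.5, 1.0, 1.5], area=4.0)
```

Under complete spatial randomness, K(r) is approximately πr², so values well above that suggest aggregation at scale r.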