We have recently released H-optimus-1, a new foundation model (FM) for pathology that reaches state-of-the-art performance¹ on a large variety of downstream tasks, including the HEST benchmark.
H-optimus-1 is a 1.1-billion-parameter vision transformer trained with self-supervised learning on an extensive proprietary dataset of billions of histology images, sampled from over 1 million slides from more than 800,000 patients.
The model can be accessed for academic research purposes here.
H-optimus-1 pre-training dataset
A crucial component in developing a strong FM is the quality and diversity of the dataset used for training the model.
H-optimus-1 was trained on an extensive collection of over 1 million H&E-stained histology slides of more than 50 organs digitized with 3 scanner types across more than 4,000 clinical centers.
Importantly, the dataset used to train H-optimus-1 is, to the best of our knowledge, the most patient-diverse dataset ever used to train a pathology FM, including histology slides from more than 800,000 patients² with various diseases. This patient diversity exposes the model to a wide range of histology patterns and diseases during training, ultimately resulting in rich and generalizable features that are useful for solving complex tasks.
Model evaluation
Results
H-optimus-1 was benchmarked on 13 downstream tasks encompassing 15 datasets at both the slide level and tile level, including the HEST benchmark [Jaume et al. 2025].
HEST
This task consists of predicting gene expression from histology images in nine different organs. More details about this benchmark can be found here.
The metric used is Pearson’s correlation coefficient (higher is better). The models are ordered by decreasing average performance. Standard deviations are reported in parentheses. Bold indicates the highest score in a column.
Table 1

| Model | Average | IDC | PRAD | PAAD | SKCM | COAD | READ | SCCRCC | LUAD | LYMPH-IDC |
|---|---|---|---|---|---|---|---|---|---|---|
| H-optimus-1 | **0.422 (0.019)** | **0.602 (0.081)** | 0.378 (0.012) | 0.496 (0.051) | **0.659 (0.048)** | **0.32 (0.016)** | **0.242 (0.015)** | 0.245 (0.125) | **0.578 (0.012)** | **0.277 (0.039)** |
| H-optimus-0 | 0.413 (0.021) | 0.598 (0.085) | **0.385 (0.0)** | 0.491 (0.04) | 0.645 (0.062) | 0.309 (0.0) | 0.222 (0.048) | 0.255 (0.135) | 0.559 (0.032) | 0.259 (0.04) |
| UNI2-h | 0.413 (0.02) | 0.59 (0.081) | 0.357 (0.049) | **0.50 (0.04)** | **0.659 (0.017)** | 0.301 (0.004) | 0.223 (0.038) | **0.261 (0.132)** | 0.558 (0.014) | 0.272 (0.04) |
| Virchow2 | 0.396 (0.02) | 0.592 (0.08) | 0.348 (0.031) | 0.472 (0.065) | 0.619 (0.028) | 0.259 (0.016) | 0.209 (0.05) | 0.257 (0.123) | 0.553 (0.017) | 0.255 (0.026) |
| Prov-GigaPath | 0.386 (0.02) | 0.551 (0.073) | 0.37 (0.022) | 0.475 (0.048) | 0.562 (0.061) | 0.299 (0.021) | 0.196 (0.062) | 0.232 (0.115) | 0.541 (0.036) | 0.25 (0.05) |
| UNI | 0.385 (0.02) | 0.574 (0.08) | 0.294 (0.09) | 0.481 (0.07) | 0.635 (0.04) | 0.262 (0.03) | 0.184 (0.05) | 0.238 (0.12) | 0.546 (0.02) | 0.256 (0.04) |
| GPFM | 0.378 (0.024) | 0.566 (0.08) | 0.342 (0.078) | 0.46 (0.062) | 0.589 (0.048) | 0.248 (0.024) | 0.164 (0.071) | 0.253 (0.138) | 0.547 (0.014) | 0.237 (0.041) |
| Phikon-v2 | 0.373 (0.021) | 0.541 (0.077) | 0.354 (0.015) | 0.445 (0.066) | 0.555 (0.036) | 0.25 (0.018) | 0.175 (0.059) | 0.257 (0.14) | 0.542 (0.011) | 0.244 (0.046) |
| CONCH | 0.37 (0.019) | 0.537 (0.084) | 0.357 (0.004) | 0.438 (0.065) | 0.572 (0.041) | 0.27 (0.006) | 0.161 (0.055) | 0.206 (0.108) | 0.538 (0.004) | 0.254 (0.039) |
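As a concrete illustration of the metric above, here is a minimal sketch of the average per-gene Pearson correlation between predicted and measured expression. The array shapes and toy data are illustrative only; see [Jaume et al. 2025] for the exact HEST protocol.

```python
import numpy as np

def mean_pearson(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average Pearson correlation across genes (columns).

    y_true, y_pred: arrays of shape (n_spots, n_genes) holding measured
    and predicted expression values. Illustrative sketch only.
    """
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return float((num / den).mean())

# Toy check: predictions that are a linear rescaling of the truth
# correlate perfectly, giving a coefficient of 1.
rng = np.random.default_rng(0)
truth = rng.normal(size=(50, 10))
print(round(mean_pearson(truth, 2.0 * truth + 1.0), 6))  # 1.0
```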
Slide-level tasks
We have benchmarked H-optimus-1 and other leading pathology FMs on a diverse set of slide-level downstream tasks using multiple instance learning:
- META-BC: Identification of metastasis in breast cancer lymph nodes.
- MSI-GC: Prediction of the microsatellite instability (MSI) status in gastric cancer.
- MSI-CRC: Prediction of the MSI status in colorectal cancer.
- KRAS-CRC: Prediction of the KRAS mutation status in colorectal cancer.
- BRAF-CRC: Prediction of the BRAF mutation status in colorectal cancer.
- HER2-BC: Prediction of the HER2 status in breast cancer.
- ER-BC: Prediction of the ER status in breast cancer.
- PR-BC: Prediction of the PR status in breast cancer.
The metric used is the area under the ROC curve (higher is better). More details about the evaluation methodology can be found in the ‘Slide-level tasks evaluation methodology’ section below. The models are ordered by decreasing average performance and standard deviations are reported in parentheses. Bold indicates the highest score in a row.
Table 2

| Task | Dataset | H-optimus-1 | UNI2-h | Virchow2 | H-optimus-0 | Prov-GigaPath | GPFM | UNI | Phikon-v2 | CONCH |
|---|---|---|---|---|---|---|---|---|---|---|
| | Average | **0.856** | 0.851 | 0.843 | 0.835 | 0.834 | 0.824 | 0.823 | 0.813 | 0.786 |
| META-BC | CAMELYON16 Test | 0.996 (0.001) | 0.996 (0.003) | 0.976 (0.002) | **0.998 (0.001)** | 0.985 (0.001) | 0.992 (0.002) | 0.981 (0.006) | 0.985 (0.001) | 0.974 (0.002) |
| META-BC | SLN-Breast | 0.959 (0.003) | 0.984 (0.002) | **0.985 (0.002)** | 0.953 (0.002) | 0.977 (0.001) | 0.944 (0.014) | 0.963 (0.005) | 0.938 (0.008) | 0.959 (0.001) |
| MSI-GC | TCGA-STAD Test | 0.915 (0.003) | 0.903 (0.004) | **0.923 (0.014)** | 0.899 (0.006) | 0.863 (0.004) | 0.892 (0.007) | 0.907 (0.008) | 0.912 (0.003) | 0.891 (0.012) |
| MSI-CRC | PAIP2020 | 0.984 (0.003) | 0.971 (0.001) | **0.988 (0.002)** | 0.974 (0.002) | 0.970 (0.005) | 0.974 (0.001) | 0.966 (0.003) | 0.972 (0.002) | 0.894 (0.015) |
| MSI-CRC | FR-CRC-Bio | **0.917 (0.002)** | 0.894 (0.003) | 0.887 (0.003) | 0.888 (0.009) | 0.876 (0.008) | 0.837 (0.003) | 0.829 (0.005) | 0.865 (0.002) | 0.838 (0.007) |
| MSI-CRC | CPTAC-COAD | 0.957 (0.003) | 0.953 (0.003) | **0.959 (0.004)** | 0.923 (0.010) | 0.947 (0.015) | 0.913 (0.001) | 0.928 (0.004) | 0.929 (0.003) | 0.882 (0.006) |
| MSI-CRC | SURGEN | **0.914 (0.003)** | 0.899 (0.002) | 0.896 (0.003) | 0.903 (0.013) | 0.913 (0.002) | 0.865 (0.002) | 0.857 (0.006) | 0.862 (0.006) | 0.796 (0.004) |
| KRAS-CRC | CPTAC-COAD | 0.625 (0.005) | 0.649 (0.008) | **0.706 (0.006)** | 0.592 (0.010) | 0.581 (0.011) | 0.688 (0.004) | 0.687 (0.010) | 0.659 (0.015) | 0.647 (0.014) |
| KRAS-CRC | SURGEN | 0.683 (0.006) | 0.675 (0.009) | 0.654 (0.005) | 0.662 (0.002) | **0.692 (0.007)** | 0.638 (0.003) | 0.664 (0.004) | 0.612 (0.007) | 0.631 (0.011) |
| BRAF-CRC | CPTAC-COAD | 0.722 (0.006) | 0.758 (0.007) | 0.800 (0.011) | 0.693 (0.021) | 0.743 (0.007) | **0.813 (0.006)** | 0.766 (0.015) | 0.780 (0.005) | 0.740 (0.014) |
| BRAF-CRC | SURGEN | 0.823 (0.001) | **0.827 (0.01)** | 0.760 (0.023) | 0.786 (0.007) | 0.809 (0.005) | 0.780 (0.004) | 0.799 (0.013) | 0.707 (0.003) | 0.730 (0.017) |
| HER2-BC | YALE-HER2 | **0.899 (0.011)** | 0.87 (0.004) | 0.826 (0.016) | **0.899 (0.009)** | 0.849 (0.011) | 0.863 (0.012) | 0.825 (0.013) | 0.742 (0.030) | 0.801 (0.020) |
| HER2-BC | IMPRESS | **0.903 (0.018)** | 0.853 (0.008) | 0.810 (0.041) | 0.888 (0.02) | 0.860 (0.009) | 0.625 (0.034) | 0.745 (0.033) | 0.682 (0.009) | 0.608 (0.012) |
| HER2-BC | BCNB | 0.683 (0.007) | 0.674 (0.008) | **0.692 (0.004)** | 0.656 (0.005) | 0.680 (0.01) | **0.692 (0.005)** | 0.677 (0.008) | 0.673 (0.003) | 0.650 (0.015) |
| ER-BC | IMPRESS | 0.836 (0.007) | 0.835 (0.005) | 0.834 (0.008) | 0.834 (0.012) | 0.821 (0.006) | 0.839 (0.004) | 0.824 (0.008) | **0.860 (0.005)** | 0.759 (0.006) |
| ER-BC | BCNB | **0.903 (0.005)** | 0.902 (0.002) | 0.847 (0.008) | 0.854 (0.007) | 0.848 (0.005) | 0.853 (0.005) | 0.861 (0.003) | 0.835 (0.002) | 0.814 (0.002) |
| PR-BC | IMPRESS | 0.831 (0.014) | 0.830 (0.005) | 0.834 (0.009) | **0.867 (0.004)** | 0.814 (0.014) | 0.813 (0.003) | 0.761 (0.034) | 0.821 (0.006) | 0.767 (0.007) |
| PR-BC | BCNB | **0.854 (0.008)** | 0.853 (0.002) | 0.803 (0.01) | 0.769 (0.022) | 0.793 (0.005) | 0.814 (0.008) | 0.777 (0.029) | 0.805 (0.004) | 0.764 (0.006) |
Tile-level tasks
We have also benchmarked the different pathology FMs on tile-level tasks using linear probing. These tasks are:
- MHIST: classification of colorectal polyps as hyperplastic polyp or sessile serrated adenoma.
- TCGA-UNIFORM: pan-cancer tumor tissue classification task (32 cancer types).
- CAM17-WILDS: identification of tumor on histology patches of lymph nodes of patients diagnosed with breast cancer.
- CRC-NO-NORM: classification of colorectal cancer histology images as one of nine tissue types.
The metric used is the top-1 accuracy (higher is better). More details about the evaluation methodology can be found in the ‘Tile-level tasks evaluation methodology’ section below. The models are ordered by decreasing average performance and standard deviations are reported in parentheses. Bold indicates the highest score in a column.
Table 3

| Model | Average | MHIST | TCGA-UNIFORM | CAM17-WILDS | CRC-NO-NORM |
|---|---|---|---|---|---|
| H-optimus-1 | **0.908** | 0.835 (0.001) | **0.851 (0.000)** | **0.991 (0.000)** | 0.956 (0.002) |
| UNI2-h | 0.904 | 0.826 (0.001) | 0.831 (0.000) | 0.988 (0.000) | **0.969 (0.001)** |
| H-optimus-0 | 0.904 | 0.848 (0.001) | 0.835 (0.001) | 0.986 (0.001) | 0.945 (0.012) |
| Virchow2 | 0.9 | **0.851 (0.001)** | 0.830 (0.000) | 0.986 (0.001) | 0.933 (0.011) |
| GPFM | 0.895 | 0.824 (0.002) | 0.827 (0.001) | 0.972 (0.004) | 0.955 (0.004) |
| Prov-GigaPath | 0.887 | 0.831 (0.003) | 0.804 (0.000) | 0.968 (0.003) | 0.945 (0.003) |
| UNI | 0.883 | 0.840 (0.002) | 0.805 (0.001) | 0.980 (0.001) | 0.906 (0.015) |
| Phikon-v2 | 0.877 | 0.797 (0.001) | 0.794 (0.000) | 0.972 (0.001) | 0.946 (0.002) |
| CONCH | 0.844 | 0.783 (0.003) | 0.679 (0.000) | 0.972 (0.000) | 0.940 (0.000) |
Additional information
Models benchmarked
We list in the table below the characteristics of the models benchmarked. For each model, the [CLS] token embedding was used for the downstream evaluations.
Table 4

| Model | Authors | Model architecture (number of parameters) | Number of histology slides used for pre-training |
|---|---|---|---|
| H-optimus-1 | Bioptimus | ViT-g/14 (1.1B) | 1M+ |
| UNI2-h | Mahmood Lab | Modified ViT-H/14 (681M) | 350k+ |
| UNI | Mahmood Lab [Chen et al. 2024] | ViT-L/16 (307M) | 100k |
| H-optimus-0 | Bioptimus [Saillard et al. 2024] | ViT-g/14 (1.1B) | 500k+ |
| Virchow2 | Paige / Microsoft Research | ViT-H/14 (632M) | 3.1M |
| GPFM | Hong Kong University of Science and Technology | ViT-L/14 (307M) | 86k |
| Prov-GigaPath | Microsoft Research [Xu et al. 2024] | ViT-g/16 (1.1B) | 171k |
| Phikon-v2 | Owkin [Filiot et al. 2024] | ViT-L/16 (307M) | 58k |
| CONCH | Mahmood Lab [Lu et al. 2024] | Modified ViT-B/16 (90M) | 21k slides & 1.2M image-text pairs |
Slide-level evaluation tasks
We list in the table below the different tasks defined for the slide-level evaluation benchmark, and the datasets used to define these tasks.
FR-CRC-Bio is an internal dataset consisting of 727 CRC biopsies from multiple French hospitals. TCGA datasets were retrieved from https://portal.gdc.cancer.gov/.
Table 5

| Task | Dataset | Split | Classes | Number of slides (patients) per category |
|---|---|---|---|---|
| META-BC: Identification of metastasis in breast cancer lymph nodes | CAMELYON16 [Bejnordi et al. 2017] | Train | With tumor / without tumor | 111 (111) / 159 (159) |
| META-BC | CAMELYON16 [Bejnordi et al. 2017] | Test | With tumor / without tumor | 49 (49) / 80 (80) |
| META-BC | SLN-Breast [Campanella et al. 2019] | Test | With tumor / without tumor | 36 (NA) / 94 (NA) |
| MSI-GC: Prediction of MSI status in gastric cancer | TCGA-STAD Train | Train | MSI-H / non MSI-H | 47 (47) / 243 (219) |
| MSI-GC | TCGA-STAD Test | Test | MSI-H / non MSI-H | 10 (10) / 51 (51) |
| MSI-CRC: Prediction of MSI status in colorectal cancer | TCGA-CRC | Train | MSI-H / non MSI-H | 62 (61) / 371 (365) |
| MSI-CRC | PAIP2020 [Kim et al. 2023] | Test | MSI-H / non MSI-H | 12 (12) / 35 (35) |
| MSI-CRC | FR-CRC-Bio | Test | MSI-H / non MSI-H | 257 (257) / 470 (470) |
| MSI-CRC | SURGEN [Myles et al. 2025] | Test | MSI-H / non MSI-H | 100 (76) / 891 (746) |
| MSI-CRC | CPTAC-COAD | Test | MSI-H / non MSI-H | 53 (24) / 168 (81) |
| KRAS-CRC: Prediction of KRAS mutation in colorectal cancer | TCGA-CRC | Train | KRAS mutant / KRAS wild-type | 208 (206) / 299 (294) |
| KRAS-CRC | SURGEN [Myles et al. 2025] | Test | KRAS mutant / KRAS wild-type | 406 (324) / 590 (502) |
| KRAS-CRC | CPTAC-COAD | Test | KRAS mutant / KRAS wild-type | 72 (35) / 150 (70) |
| BRAF-CRC: Prediction of BRAF mutation in colorectal cancer | TCGA-CRC | Train | BRAF mutant / BRAF wild-type | 60 (58) / 447 (442) |
| BRAF-CRC | SURGEN [Myles et al. 2025] | Test | BRAF mutant / BRAF wild-type | 131 (104) / 769 (657) |
| BRAF-CRC | CPTAC-COAD | Test | BRAF mutant / BRAF wild-type | 41 (16) / 181 (89) |
| HER2-BC: Prediction of HER2 status in breast cancer | TCGA-BRCA | Train | HER2 positive / HER2 negative | 170 (162) / 917 (855) |
| HER2-BC | YALE-HER2 [Farahmand et al. 2022] | Test | HER2 positive / HER2 negative | 93 (93) / 97 (97) |
| HER2-BC | IMPRESS [Huang et al. 2023] | Test | HER2 positive / HER2 negative | 53 (53) / 73 (73) |
| HER2-BC | BCNB [Xu et al. 2021] | Test | HER2 positive / HER2 negative | 274 (274) / 759 (759) |
| ER-BC: Prediction of ER status in breast cancer | TCGA-BRCA | Train | ER positive / ER negative | 830 (771) / 229 (223) |
| ER-BC | IMPRESS [Huang et al. 2023] | Test | ER positive / ER negative | 30 (30) / 96 (96) |
| ER-BC | BCNB [Xu et al. 2021] | Test | ER positive / ER negative | 808 (808) / 225 (225) |
| PR-BC: Prediction of PR status in breast cancer | TCGA-BRCA | Train | PR positive / PR negative | 719 (666) / 337 (325) |
| PR-BC | IMPRESS [Huang et al. 2023] | Test | PR positive / PR negative | 19 (19) / 107 (107) |
| PR-BC | BCNB [Xu et al. 2021] | Test | PR positive / PR negative | 768 (768) / 265 (265) |
Tile-level evaluation tasks
We list in the table below the different tasks used for the tile-level evaluation benchmark and their corresponding datasets. For MHIST, CAM17-WILDS and CRC-NO-NORM/CRC-VAL-HE-7K, we used the official train/test splits. For TCGA-UNIFORM, we designed a train/test split stratified according to the label categories, as no official split is available.
Table 6

| Task | Dataset | Split | Number of images |
|---|---|---|---|
| Classification of colorectal polyps as hyperplastic polyp or sessile serrated adenoma | MHIST [Wei et al. 2021] | Train | 2,175 |
| Classification of colorectal polyps as hyperplastic polyp or sessile serrated adenoma | MHIST [Wei et al. 2021] | Test | 977 |
| Pan-cancer tumor tissue classification task | TCGA-UNIFORM [Komura et al. 2020] | Train | 217,360 |
| Pan-cancer tumor tissue classification task | TCGA-UNIFORM [Komura et al. 2020] | Test | 54,350 |
| Identification of tumor on lymph nodes histology images of breast cancer patients | CAM17-WILDS [Koh et al. 2020] | Train | 370,900 |
| Identification of tumor on lymph nodes histology images of breast cancer patients | CAM17-WILDS [Koh et al. 2020] | Test | 85,054 |
| Tissue type classification of colorectal cancer histology images | CRC-NO-NORM [Kather et al. 2018] | Train | 100,000 |
| Tissue type classification of colorectal cancer histology images | CRC-VAL-HE-7K [Kather et al. 2018] | Test | 7,180 |
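The TCGA-UNIFORM split mentioned above was stratified by label category. One simple way to build such a split is sketched below; this is an illustrative NumPy version, not the authors' exact procedure, and the 4:1 ratio only roughly matches the 217,360/54,350 split.

```python
import numpy as np

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with each class represented at
    roughly `test_frac` in the test set. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                      # shuffle within the class
        n_test = int(round(test_frac * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)

labels = np.repeat(np.arange(32), 100)        # e.g. 32 cancer types
tr, te = stratified_split(labels)
# Every class contributes exactly 20 of its 100 tiles to the test set.
print(len(te), np.bincount(labels[te]).min(), np.bincount(labels[te]).max())
# prints "640 20 20"
```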
HEST evaluation methodology
We used the exact same procedure as [Jaume et al. 2025]; we refer to their paper for the training details.
Slide-level tasks evaluation methodology
For each task, we train 10 ABMIL models [Ilse et al. 2018] by minimizing the binary cross-entropy loss with Adam [Kingma et al. 2014], using a batch size of 32 and a constant learning rate of 0.0001.
We select the number of training steps by 5-fold cross-validation, minimizing the binary cross-entropy. The maximum number of training steps is 1,000 for all models except CONCH, where it is set to 4,000 steps to ensure convergence.
For robustness, the above procedure is repeated 5 times with different PyTorch seeds. The values reported in the table are the average metrics of the 5×10=50 ABMIL models. Standard deviations are computed across the 5 seeds on the per-seed average metrics.
For the sake of speed, a random subset of 3,000 tiles per slide is selected during training. For inference, a subset of 8,000 tiles is randomly selected.
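The ABMIL aggregation described above can be sketched as attention-weighted pooling of tile embeddings followed by a linear classification head. The NumPy forward pass below is a simplified illustration, not the actual training code; the dimensions and random weights are placeholders.

```python
import numpy as np

def abmil_forward(tiles, V, w, W_out, b_out):
    """Attention-based MIL pooling [Ilse et al. 2018], simplified sketch.

    tiles: (n_tiles, d) tile embeddings from a frozen foundation model.
    V: (d, h) attention projection, w: (h,) attention vector,
    W_out: (d,), b_out: float -- a binary classification head.
    Returns the slide-level probability.
    """
    scores = np.tanh(tiles @ V) @ w              # un-normalized score per tile
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                           # softmax over tiles
    slide_emb = attn @ tiles                     # (d,) attention-weighted average
    logit = slide_emb @ W_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))          # sigmoid

rng = np.random.default_rng(0)
d, h = 16, 8
tiles = rng.normal(size=(3000, d))               # e.g. the 3,000 sampled tiles
p = abmil_forward(tiles, rng.normal(size=(d, h)), rng.normal(size=h),
                  rng.normal(size=d), 0.0)
print(0.0 < p < 1.0)                             # a valid probability
```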
Tile-level tasks evaluation methodology
For each task, we learn a linear classifier by minimizing the cross-entropy loss with SGD, using a constant learning rate and a batch size of 256. We select the following hyperparameters by 5-fold cross-validation, minimizing the cross-entropy:
- Learning rate in {1e-2, 5e-3, 2e-3, 1e-3, 1e-4}
- Number of training steps in [100, 200, …, 12500]
To ensure convergence, a different set of learning rates is used for CONCH: {5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 1e-3}, while keeping the same number of training steps.
For robustness, the above procedure is repeated 3 times with different PyTorch seeds. The values reported in the table are the average metrics of the 3×5=15 linear classifiers. Standard deviations are computed across the 3 seeds on the per-seed average metrics.
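As an illustration of the probing setup, here is a minimal softmax-regression probe trained with plain SGD on the cross-entropy loss. It is a toy stand-in on synthetic data, without the cross-validated hyperparameter search described above.

```python
import numpy as np

def linear_probe(X, y, n_classes, lr=1e-2, steps=500, batch=256, seed=0):
    """Train a linear classifier on frozen embeddings with constant-rate SGD
    on the cross-entropy loss. Minimal sketch, not the actual pipeline."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch)
        xb, yb = X[idx], y[idx]
        logits = xb @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
        p[np.arange(batch), yb] -= 1.0               # dL/dlogits for CE loss
        W -= lr * xb.T @ p / batch
        b -= lr * p.mean(axis=0)
    return W, b

# Toy sanity check on two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (200, 4)), rng.normal(2, 1, (200, 4))])
y = np.repeat([0, 1], 200)
W, b = linear_probe(X, y, n_classes=2)
acc = ((X @ W + b).argmax(axis=1) == y).mean()       # top-1 accuracy
print(f"probe accuracy: {acc:.2f}")
```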
Acknowledgments
This project was partially supported by computational and storage resources from GENCI at IDRIS, through grant 2024-GC011015442, on the H100 partition of the Jean Zay supercomputer.
The results published here are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Part of the data used in this report was generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC):
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Colon Adenocarcinoma Collection (CPTAC-COAD) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.YZWQ-ZZ63
The following datasets from TCIA were used in the benchmarks:
- Campanella, G., Hanna, M. G., Brogi, E., & Fuchs, T. J. (2019). Breast Metastases to Axillary Lymph Nodes [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/tcia.2019.3xbn2jcc
- Farahmand, Saman, Fernandez, Aileen I, Ahmed, Fahad Shabbir, Rimm, David L., Chuang, Jeffrey H., Reisenbichler, Emily, & Zarringhalam, Kourosh. (2022). HER2 and trastuzumab treatment response H&E slides with tumor ROI annotations (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E65C-AM96
Regarding the PAIP2020 dataset: De-identified pathology images and annotations used in this research were prepared and provided by the Seoul National University Hospital by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).
¹ On average, benchmarked against all other leading foundation models available at the time of writing this blog post.
² For comparison, the number of patients in the training sets of other models: UNI2-h: <350k, Virchow2: 225k, Hibou: 306k, ATLAS: 490k, Phikon-v2: <58k.