Blog
March 27, 2025

Bioptimus launches H-optimus-1: a state-of-the-art foundation model for pathology

We have recently released H-optimus-1, a new foundation model (FM) for pathology that reaches state-of-the-art performance on a wide variety of downstream tasks, including the HEST benchmark¹.

H-optimus-1 is a 1.1-billion-parameter vision transformer trained with self-supervised learning on an extensive proprietary dataset. This dataset consists of billions of histology images sampled from over 1 million slides from more than 800,000 patients.

The model can be accessed for academic research purposes here.
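
For readers who want to try the model, below is a minimal sketch of extracting a tile-level [CLS] embedding. It assumes the weights are hosted on the Hugging Face Hub under a repository id like bioptimus/H-optimus-1 and load through timm, as was the case for H-optimus-0, and it reuses the H-optimus-0 normalization statistics; the exact repository id, access conditions, and preprocessing should be taken from the official model card.

```python
# Minimal sketch (not the official loading code): extracting a tile-level
# [CLS] embedding. The repository id "bioptimus/H-optimus-1" and the
# normalization statistics (reused from the H-optimus-0 release) are
# assumptions; check the official model card before use.
import numpy as np
import timm
import torch
from PIL import Image
from torchvision import transforms

model = timm.create_model(
    "hf-hub:bioptimus/H-optimus-1",  # assumed Hugging Face Hub id
    pretrained=True,
    init_values=1e-5,
    dynamic_img_size=False,
)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(  # H-optimus-0 statistics, assumed to carry over
        mean=(0.707223, 0.578729, 0.703617),
        std=(0.211883, 0.230117, 0.177517),
    ),
])

# A 224x224 H&E tile; a random image is used here as a stand-in.
tile = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
with torch.inference_mode():
    features = model(preprocess(tile).unsqueeze(0))  # pooled [CLS] embedding
print(features.shape)  # e.g. (1, 1536) for a ViT-g/14 backbone
```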

H-optimus-1 pre-training dataset

A crucial component in developing a strong FM is the quality and diversity of the dataset used for training the model. 

H-optimus-1 was trained on an extensive collection of over 1 million H&E-stained histology slides of more than 50 organs digitized with 3 scanner types across more than 4,000 clinical centers.

Importantly, the dataset used to train H-optimus-1 is, to the best of our knowledge, the most patient-diverse dataset ever used to train a pathology FM, including histology slides from more than 800,000 patients² with various diseases. This patient diversity exposes the model to a wide range of histology patterns and diseases during training, ultimately resulting in rich and generalizable features that are useful for solving complex tasks.

Model evaluation

Results

H-optimus-1 was benchmarked on 13 downstream tasks encompassing 15 datasets at both the slide level and tile level, including the HEST benchmark [Jaume et al. 2025].

HEST

This task consists of predicting gene expression from histology images in nine different organs. More details about this benchmark can be found here.

The metric used is Pearson’s correlation coefficient (higher is better). The models are ordered by decreasing average performance. Standard deviations are reported in parentheses. Bold indicates the highest score in a column.
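
To make the metric concrete, the sketch below fits a simple ridge-regression probe from frozen tile embeddings to gene-expression targets and reports the Pearson correlation computed per gene and averaged, which is how each number in Table 1 should be read. The synthetic data, feature dimension, and regularization strength are illustrative assumptions; the exact protocol is that of [Jaume et al. 2025].

```python
# Illustrative sketch of a HEST-style probe: ridge regression from frozen tile
# embeddings to gene-expression targets, scored by per-gene Pearson correlation
# averaged over genes. All data and hyperparameters here are synthetic stand-ins.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(800, 1536)), rng.normal(size=(200, 1536))  # tile embeddings
Y_train, Y_test = rng.normal(size=(800, 50)), rng.normal(size=(200, 50))      # 50 target genes

probe = Ridge(alpha=1.0).fit(X_train, Y_train)
Y_pred = probe.predict(X_test)

# Pearson correlation per gene, then averaged (one such number per organ).
per_gene = [pearsonr(Y_test[:, g], Y_pred[:, g])[0] for g in range(Y_test.shape[1])]
print(f"mean Pearson r: {np.mean(per_gene):.3f}")
```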

Table 1

| Model | Average | IDC | PRAD | PAAD | SKCM | COAD | READ | SCCRCC | LUAD | LYMPH-IDC |
|---|---|---|---|---|---|---|---|---|---|---|
| H-optimus-1 | 0.422 (0.019) | 0.602 (0.081) | 0.378 (0.012) | 0.496 (0.051) | 0.659 (0.048) | 0.32 (0.016) | 0.242 (0.015) | 0.245 (0.125) | 0.578 (0.012) | 0.277 (0.039) |
| H-optimus-0 | 0.413 (0.021) | 0.598 (0.085) | 0.385 (0.0) | 0.491 (0.04) | 0.645 (0.062) | 0.309 (0.0) | 0.222 (0.048) | 0.255 (0.135) | 0.559 (0.032) | 0.259 (0.04) |
| UNI2-h | 0.413 (0.02) | 0.59 (0.081) | 0.357 (0.049) | 0.50 (0.04) | 0.659 (0.017) | 0.301 (0.004) | 0.223 (0.038) | 0.261 (0.132) | 0.558 (0.014) | 0.272 (0.04) |
| Virchow2 | 0.396 (0.02) | 0.592 (0.08) | 0.348 (0.031) | 0.472 (0.065) | 0.619 (0.028) | 0.259 (0.016) | 0.209 (0.05) | 0.257 (0.123) | 0.553 (0.017) | 0.255 (0.026) |
| Prov-GigaPath | 0.386 (0.02) | 0.551 (0.073) | 0.37 (0.022) | 0.475 (0.048) | 0.562 (0.061) | 0.299 (0.021) | 0.196 (0.062) | 0.232 (0.115) | 0.541 (0.036) | 0.25 (0.05) |
| UNI | 0.385 (0.02) | 0.574 (0.08) | 0.294 (0.09) | 0.481 (0.07) | 0.635 (0.04) | 0.262 (0.03) | 0.184 (0.05) | 0.238 (0.12) | 0.546 (0.02) | 0.256 (0.04) |
| GPFM | 0.378 (0.024) | 0.566 (0.08) | 0.342 (0.078) | 0.46 (0.062) | 0.589 (0.048) | 0.248 (0.024) | 0.164 (0.071) | 0.253 (0.138) | 0.547 (0.014) | 0.237 (0.041) |
| Phikon-v2 | 0.373 (0.021) | 0.541 (0.077) | 0.354 (0.015) | 0.445 (0.066) | 0.555 (0.036) | 0.25 (0.018) | 0.175 (0.059) | 0.257 (0.14) | 0.542 (0.011) | 0.244 (0.046) |
| CONCH | 0.37 (0.019) | 0.537 (0.084) | 0.357 (0.004) | 0.438 (0.065) | 0.572 (0.041) | 0.27 (0.006) | 0.161 (0.055) | 0.206 (0.108) | 0.538 (0.004) | 0.254 (0.039) |
Slide-level tasks

We have benchmarked H-optimus-1 and other leading pathology FMs on a diverse set of slide-level downstream tasks using multiple instance learning:

  • META-BC: Identification of metastasis in breast cancer lymph nodes.
  • MSI-GC: Prediction of the microsatellite instability (MSI) status in gastric cancer.
  • MSI-CRC: Prediction of the MSI status in colorectal cancer.
  • KRAS-CRC: Prediction of the KRAS mutation status in colorectal cancer.
  • BRAF-CRC: Prediction of the BRAF mutation status in colorectal cancer.
  • HER2-BC: Prediction of the HER2 status in breast cancer.
  • ER-BC: Prediction of the ER status in breast cancer.
  • PR-BC: Prediction of the PR status in breast cancer.

The metric used is the area under the ROC curve (higher is better). More details about the evaluation methodology can be found in the ‘Slide-level tasks evaluation methodology’ section below. The models are ordered by decreasing average performance and standard deviations are reported in parentheses. Bold indicates the highest score in a row.

Table 2
| Task | Dataset | H-optimus-1 | UNI2-h | Virchow2 | H-optimus-0 | Prov-GigaPath | GPFM | UNI | Phikon-v2 | CONCH |
|---|---|---|---|---|---|---|---|---|---|---|
| Average | | 0.856 | 0.851 | 0.843 | 0.835 | 0.834 | 0.824 | 0.823 | 0.813 | 0.786 |
| META-BC | CAMELYON16 Test | 0.996 (0.001) | 0.996 (0.003) | 0.976 (0.002) | 0.998 (0.001) | 0.985 (0.001) | 0.992 (0.002) | 0.981 (0.006) | 0.985 (0.001) | 0.974 (0.002) |
| META-BC | SLN-Breast | 0.959 (0.003) | 0.984 (0.002) | 0.985 (0.002) | 0.953 (0.002) | 0.977 (0.001) | 0.944 (0.014) | 0.963 (0.005) | 0.938 (0.008) | 0.959 (0.001) |
| MSI-GC | TCGA-STAD Test | 0.915 (0.003) | 0.903 (0.004) | 0.923 (0.014) | 0.899 (0.006) | 0.863 (0.004) | 0.892 (0.007) | 0.907 (0.008) | 0.912 (0.003) | 0.891 (0.012) |
| MSI-CRC | PAIP2020 | 0.984 (0.003) | 0.971 (0.001) | 0.988 (0.002) | 0.974 (0.002) | 0.970 (0.005) | 0.974 (0.001) | 0.966 (0.003) | 0.972 (0.002) | 0.894 (0.015) |
| MSI-CRC | FR-CRC-Bio | 0.917 (0.002) | 0.894 (0.003) | 0.887 (0.003) | 0.888 (0.009) | 0.876 (0.008) | 0.837 (0.003) | 0.829 (0.005) | 0.865 (0.002) | 0.838 (0.007) |
| MSI-CRC | CPTAC-COAD | 0.957 (0.003) | 0.953 (0.003) | 0.959 (0.004) | 0.923 (0.010) | 0.947 (0.015) | 0.913 (0.001) | 0.928 (0.004) | 0.929 (0.003) | 0.882 (0.006) |
| MSI-CRC | SURGEN | 0.914 (0.003) | 0.899 (0.002) | 0.896 (0.003) | 0.903 (0.013) | 0.913 (0.002) | 0.865 (0.002) | 0.857 (0.006) | 0.862 (0.006) | 0.796 (0.004) |
| KRAS-CRC | CPTAC-COAD | 0.625 (0.005) | 0.649 (0.008) | 0.706 (0.006) | 0.592 (0.010) | 0.581 (0.011) | 0.688 (0.004) | 0.687 (0.010) | 0.659 (0.015) | 0.647 (0.014) |
| KRAS-CRC | SURGEN | 0.683 (0.006) | 0.675 (0.009) | 0.654 (0.005) | 0.662 (0.002) | 0.692 (0.007) | 0.638 (0.003) | 0.664 (0.004) | 0.612 (0.007) | 0.631 (0.011) |
| BRAF-CRC | CPTAC-COAD | 0.722 (0.006) | 0.758 (0.007) | 0.800 (0.011) | 0.693 (0.021) | 0.743 (0.007) | 0.813 (0.006) | 0.766 (0.015) | 0.780 (0.005) | 0.740 (0.014) |
| BRAF-CRC | SURGEN | 0.823 (0.001) | 0.827 (0.01) | 0.760 (0.023) | 0.786 (0.007) | 0.809 (0.005) | 0.780 (0.004) | 0.799 (0.013) | 0.707 (0.003) | 0.730 (0.017) |
| HER2-BC | YALE-HER2 | 0.899 (0.011) | 0.87 (0.004) | 0.826 (0.016) | 0.899 (0.009) | 0.849 (0.011) | 0.863 (0.012) | 0.825 (0.013) | 0.742 (0.030) | 0.801 (0.020) |
| HER2-BC | IMPRESS | 0.903 (0.018) | 0.853 (0.008) | 0.810 (0.041) | 0.888 (0.02) | 0.860 (0.009) | 0.625 (0.034) | 0.745 (0.033) | 0.682 (0.009) | 0.608 (0.012) |
| HER2-BC | BCNB | 0.683 (0.007) | 0.674 (0.008) | 0.692 (0.004) | 0.656 (0.005) | 0.680 (0.01) | 0.692 (0.005) | 0.677 (0.008) | 0.673 (0.003) | 0.650 (0.015) |
| ER-BC | IMPRESS | 0.836 (0.007) | 0.835 (0.005) | 0.834 (0.008) | 0.834 (0.012) | 0.821 (0.006) | 0.839 (0.004) | 0.824 (0.008) | 0.860 (0.005) | 0.759 (0.006) |
| ER-BC | BCNB | 0.903 (0.005) | 0.902 (0.002) | 0.847 (0.008) | 0.854 (0.007) | 0.848 (0.005) | 0.853 (0.005) | 0.861 (0.003) | 0.835 (0.002) | 0.814 (0.002) |
| PR-BC | IMPRESS | 0.831 (0.014) | 0.830 (0.005) | 0.834 (0.009) | 0.867 (0.004) | 0.814 (0.014) | 0.813 (0.003) | 0.761 (0.034) | 0.821 (0.006) | 0.767 (0.007) |
| PR-BC | BCNB | 0.854 (0.008) | 0.853 (0.002) | 0.803 (0.01) | 0.769 (0.022) | 0.793 (0.005) | 0.814 (0.008) | 0.777 (0.029) | 0.805 (0.004) | 0.764 (0.006) |
Tile-level tasks

We have also benchmarked the different pathology FMs on tile-level tasks using linear probing. These tasks are:

  • MHIST: Classification of colorectal polyps as hyperplastic polyps or sessile serrated adenomas.
  • TCGA-UNIFORM: Pan-cancer tumor tissue classification (32 cancer types).
  • CAM17-WILDS: Identification of tumor in histology patches of lymph nodes from patients diagnosed with breast cancer.
  • CRC-NO-NORM: Classification of colorectal cancer histology images into one of nine tissue types.

The metric used is the top-1 accuracy (higher is better). More details about the evaluation methodology can be found in the ‘Tile-level tasks evaluation methodology’ section below. The models are ordered by decreasing average performance and standard deviations are reported in parentheses. Bold indicates the highest score in a column.

Table 3
| Model | Average | MHIST | TCGA-UNIFORM | CAM17-WILDS | CRC-NO-NORM |
|---|---|---|---|---|---|
| H-optimus-1 | 0.908 | 0.835 (0.001) | 0.851 (0.000) | 0.991 (0.000) | 0.956 (0.002) |
| UNI2-h | 0.904 | 0.826 (0.001) | 0.831 (0.000) | 0.988 (0.000) | 0.969 (0.001) |
| H-optimus-0 | 0.904 | 0.848 (0.001) | 0.835 (0.001) | 0.986 (0.001) | 0.945 (0.012) |
| Virchow2 | 0.900 | 0.851 (0.001) | 0.830 (0.000) | 0.986 (0.001) | 0.933 (0.011) |
| GPFM | 0.895 | 0.824 (0.002) | 0.827 (0.001) | 0.972 (0.004) | 0.955 (0.004) |
| Prov-GigaPath | 0.887 | 0.831 (0.003) | 0.804 (0.000) | 0.968 (0.003) | 0.945 (0.003) |
| UNI | 0.883 | 0.840 (0.002) | 0.805 (0.001) | 0.980 (0.001) | 0.906 (0.015) |
| Phikon-v2 | 0.877 | 0.797 (0.001) | 0.794 (0.000) | 0.972 (0.001) | 0.946 (0.002) |
| CONCH | 0.844 | 0.783 (0.003) | 0.679 (0.000) | 0.972 (0.000) | 0.940 (0.000) |
Additional information
Models benchmarked

We list in the table below the characteristics of the models benchmarked.  For each model, the [CLS] token embedding was used for the downstream evaluations.

Table 4
| Model | Authors | Model architecture (number of parameters) | Number of histology slides used for pre-training |
|---|---|---|---|
| H-optimus-1 | Bioptimus | ViT-g/14 (1.1B) | 1M+ |
| UNI2-h | Mahmood Lab | Modified ViT-H/14 (681M) | 350k+ |
| UNI | Mahmood Lab [Chen et al. 2024] | ViT-L/16 (307M) | 100k |
| H-optimus-0 | Bioptimus [Saillard et al. 2024] | ViT-g/14 (1.1B) | 500k+ |
| Virchow2 | Paige / Microsoft Research | ViT-H/14 (632M) | 3.1M |
| GPFM | Hong Kong University of Science and Technology | ViT-L/14 (307M) | 86k |
| Prov-GigaPath | Microsoft Research [Xu et al. 2024] | ViT-g/16 (1.1B) | 171k |
| Phikon-v2 | Owkin [Filiot et al. 2024] | ViT-L/16 (307M) | 58k |
| CONCH | Mahmood Lab [Lu et al. 2024] | Modified ViT-B/16 (90M) | 21k slides & 1.2M image-text pairs |
Slide-level evaluation tasks 

We list in the table below the different tasks defined for the slide-level evaluation benchmark and the datasets used to define these tasks.

FR-CRC-Bio is an internal dataset consisting of 727 CRC biopsies from multiple French hospitals. TCGA datasets were retrieved from https://portal.gdc.cancer.gov/

Table 5
| Task | Dataset | Split | Classes | Number of slides (patients) per category |
|---|---|---|---|---|
| META-BC: Identification of metastasis in breast cancer lymph nodes | CAMELYON16 [Bejnordi et al. 2017] | Train | With tumor / without tumor | 111 (111) / 159 (159) |
| META-BC: Identification of metastasis in breast cancer lymph nodes | CAMELYON16 [Bejnordi et al. 2017] | Test | With tumor / without tumor | 49 (49) / 80 (80) |
| META-BC: Identification of metastasis in breast cancer lymph nodes | SLN-Breast [Campanella et al. 2019] | Test | With tumor / without tumor | 36 (NA) / 94 (NA) |
| MSI-GC: Prediction of MSI status in gastric cancer | TCGA-STAD | Train | MSI-H / non MSI-H | 47 (47) / 243 (219) |
| MSI-GC: Prediction of MSI status in gastric cancer | TCGA-STAD | Test | MSI-H / non MSI-H | 10 (10) / 51 (51) |
| MSI-CRC: Prediction of MSI status in colorectal cancer | TCGA-CRC | Train | MSI-H / non MSI-H | 62 (61) / 371 (365) |
| MSI-CRC: Prediction of MSI status in colorectal cancer | PAIP2020 [Kim et al. 2023] | Test | MSI-H / non MSI-H | 12 (12) / 35 (35) |
| MSI-CRC: Prediction of MSI status in colorectal cancer | FR-CRC-Bio | Test | MSI-H / non MSI-H | 257 (257) / 470 (470) |
| MSI-CRC: Prediction of MSI status in colorectal cancer | SURGEN [Myles et al. 2025] | Test | MSI-H / non MSI-H | 100 (76) / 891 (746) |
| MSI-CRC: Prediction of MSI status in colorectal cancer | CPTAC-COAD | Test | MSI-H / non MSI-H | 53 (24) / 168 (81) |
| KRAS-CRC: Prediction of KRAS mutation in colorectal cancer | TCGA-CRC | Train | KRAS mutant / KRAS wild-type | 208 (206) / 299 (294) |
| KRAS-CRC: Prediction of KRAS mutation in colorectal cancer | SURGEN [Myles et al. 2025] | Test | KRAS mutant / KRAS wild-type | 406 (324) / 590 (502) |
| KRAS-CRC: Prediction of KRAS mutation in colorectal cancer | CPTAC-COAD | Test | KRAS mutant / KRAS wild-type | 72 (35) / 150 (70) |
| BRAF-CRC: Prediction of BRAF mutation in colorectal cancer | TCGA-CRC | Train | BRAF mutant / BRAF wild-type | 60 (58) / 447 (442) |
| BRAF-CRC: Prediction of BRAF mutation in colorectal cancer | SURGEN [Myles et al. 2025] | Test | BRAF mutant / BRAF wild-type | 131 (104) / 769 (657) |
| BRAF-CRC: Prediction of BRAF mutation in colorectal cancer | CPTAC-COAD | Test | BRAF mutant / BRAF wild-type | 41 (16) / 181 (89) |
| HER2-BC: Prediction of HER2 status in breast cancer | TCGA-BRCA | Train | HER2 positive / HER2 negative | 170 (162) / 917 (855) |
| HER2-BC: Prediction of HER2 status in breast cancer | YALE-HER2 [Farahmand et al. 2022] | Test | HER2 positive / HER2 negative | 93 (93) / 97 (97) |
| HER2-BC: Prediction of HER2 status in breast cancer | IMPRESS [Huang et al. 2023] | Test | HER2 positive / HER2 negative | 53 (53) / 73 (73) |
| HER2-BC: Prediction of HER2 status in breast cancer | BCNB [Xu et al. 2021] | Test | HER2 positive / HER2 negative | 274 (274) / 759 (759) |
| ER-BC: Prediction of ER status in breast cancer | TCGA-BRCA | Train | ER positive / ER negative | 830 (771) / 229 (223) |
| ER-BC: Prediction of ER status in breast cancer | IMPRESS [Huang et al. 2023] | Test | ER positive / ER negative | 30 (30) / 96 (96) |
| ER-BC: Prediction of ER status in breast cancer | BCNB [Xu et al. 2021] | Test | ER positive / ER negative | 808 (808) / 225 (225) |
| PR-BC: Prediction of PR status in breast cancer | TCGA-BRCA | Train | PR positive / PR negative | 719 (666) / 337 (325) |
| PR-BC: Prediction of PR status in breast cancer | IMPRESS [Huang et al. 2023] | Test | PR positive / PR negative | 19 (19) / 107 (107) |
| PR-BC: Prediction of PR status in breast cancer | BCNB [Xu et al. 2021] | Test | PR positive / PR negative | 768 (768) / 265 (265) |
Tile-level evaluation tasks

We list in the table below the different tasks used for the tile-level evaluation benchmark and their corresponding datasets. For MHIST, CAM17-WILDS and CRC-NO-NORM/CRC-VAL-HE-7K, we used the official train/test splits. For TCGA-UNIFORM, we designed a train/test split stratified according to the label categories, as no official split is available (see the sketch below).
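
For illustration, a label-stratified split of this kind can be produced as sketched below; the variable names are placeholders and the 20% test fraction is inferred from the image counts in Table 6, not taken from the exact code used.

```python
# Illustrative sketch of a label-stratified train/test split, in the spirit of
# the TCGA-UNIFORM split described above. Data here are synthetic stand-ins.
from sklearn.model_selection import train_test_split

# One entry per tile, with its cancer-type label (32 classes, as in TCGA-UNIFORM).
tile_ids = [f"tile_{i}.png" for i in range(1000)]
labels = [f"cancer_type_{i % 32}" for i in range(1000)]

train_ids, test_ids, y_train, y_test = train_test_split(
    tile_ids,
    labels,
    test_size=0.2,      # ratio inferred from the image counts in Table 6
    stratify=labels,    # preserve the per-class proportions in both splits
    random_state=0,
)
```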

Table 6

| Task | Dataset | Split | Number of images |
|---|---|---|---|
| Classification of colorectal polyps as hyperplastic polyp or sessile serrated adenoma | MHIST [Wei et al. 2021] | Train | 2,175 |
| Classification of colorectal polyps as hyperplastic polyp or sessile serrated adenoma | MHIST [Wei et al. 2021] | Test | 977 |
| Pan-cancer tumor tissue classification | TCGA-UNIFORM [Komura et al. 2020] | Train | 217,360 |
| Pan-cancer tumor tissue classification | TCGA-UNIFORM [Komura et al. 2020] | Test | 54,350 |
| Identification of tumor in lymph node histology images of breast cancer patients | CAM17-WILDS [Koh et al. 2020] | Train | 370,900 |
| Identification of tumor in lymph node histology images of breast cancer patients | CAM17-WILDS [Koh et al. 2020] | Test | 85,054 |
| Tissue type classification of colorectal cancer histology images | CRC-NO-NORM [Kather et al. 2018] | Train | 100,000 |
| Tissue type classification of colorectal cancer histology images | CRC-VAL-HE-7K [Kather et al. 2018] | Test | 7,180 |
HEST evaluation methodology

We used the exact same procedure as [Jaume et al. 2025]; we refer to their paper for the training details.

Slide-level tasks evaluation methodology

For each task, we train 10 ABMIL models [Ilse et al. 2018] by minimizing the binary cross-entropy loss with Adam [Kingma et al. 2014], using a batch size of 32 and a constant learning rate of 0.0001.

We select the number of training steps by 5-fold cross-validation while minimizing the binary cross-entropy. The maximum number of training steps is 1,000 for all models, except for CONCH, where it is set to 4,000 steps to ensure convergence.

For the sake of robustness, the above procedure is repeated 5 times with different PyTorch seeds. The values reported in the table are the metrics averaged over the 5×10 = 50 ABMIL models. Standard deviations are computed across the five per-seed average metrics.

For the sake of speed, a random subset of 3,000 tiles per slide is selected during training. For inference, a subset of 8,000 tiles is randomly selected.
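
To make this setup concrete, the sketch below runs one training step of a simplified (ungated) ABMIL head over frozen tile embeddings with the hyperparameters listed above (Adam, learning rate 0.0001, batch size 32, binary cross-entropy, 3,000 sampled tiles per slide). The hidden dimension, feature dimension, and synthetic data are illustrative assumptions rather than the exact configuration used.

```python
# Minimal sketch of the slide-level setup: a simplified ABMIL head
# [Ilse et al. 2018] over frozen tile embeddings, trained with Adam on the
# binary cross-entropy loss. Dimensions and data are illustrative stand-ins.
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    def __init__(self, in_dim=1536, hidden_dim=128):
        super().__init__()
        # Attention network producing one score per tile.
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(in_dim, 1)

    def forward(self, bags):                                  # bags: (batch, n_tiles, in_dim)
        weights = torch.softmax(self.attention(bags), dim=1)  # (batch, n_tiles, 1)
        slide_embedding = (weights * bags).sum(dim=1)         # attention-weighted pooling
        return self.classifier(slide_embedding).squeeze(-1)   # one logit per slide

model = ABMIL()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

# Synthetic batch: 32 slides, 3,000 randomly subsampled tiles each, 1,536-d features.
features = torch.randn(32, 3000, 1536)
targets = torch.randint(0, 2, (32,)).float()

logits = model(features)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```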

Tile-level tasks evaluation methodology

For each task, we learn a linear classifier by minimizing the cross-entropy loss with SGD, using a constant learning rate and a batch size of 256. We select the following hyperparameters by 5-fold cross-validation while minimizing the cross-entropy:

  • Learning rate in {1e-2, 5e-3, 2e-3, 1e-3, 1e-4}
  • Number of training steps in [100, 200, …, 12500] 

To ensure convergence, a different set of learning rates is used for CONCH: {5e-1, 2e-1, 1e-1, 5e-2, 2e-2, 1e-2, 1e-3}, while keeping the same number of training steps.

For the sake of robustness, the above procedure is repeated 3 times with different PyTorch seeds. The values reported in the table are the metrics averaged over the 3×5 = 15 linear classifiers. Standard deviations are computed across the three per-seed average metrics.
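
As a concrete reference, the sketch below trains one such linear probe over frozen tile embeddings with SGD at a constant learning rate and a batch size of 256; the synthetic data and the fixed learning rate and step count stand in for the values selected by cross-validation.

```python
# Minimal sketch of the tile-level linear probing: a single linear layer over
# frozen embeddings, trained with SGD at a constant learning rate on the
# cross-entropy loss. Data and hyperparameters here are illustrative stand-ins.
import torch
import torch.nn as nn

num_classes, feat_dim = 9, 1536            # e.g. CRC-NO-NORM with ViT-g features
features = torch.randn(10_000, feat_dim)   # frozen tile embeddings
labels = torch.randint(0, num_classes, (10_000,))

probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.SGD(probe.parameters(), lr=1e-3)  # constant learning rate
criterion = nn.CrossEntropyLoss()

for step in range(1000):
    idx = torch.randint(0, features.shape[0], (256,))     # batch size 256
    loss = criterion(probe(features[idx]), labels[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (probe(features).argmax(dim=1) == labels).float().mean()
print(f"train top-1 accuracy: {accuracy:.3f}")
```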

Acknowledgments

This project was partially supported by computational and storage resources from GENCI at IDRIS, granted under allocation 2024-GC011015442 on the H100 partition of the Jean Zay supercomputer.

The results published here are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

Part of the data used in this report was generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC):

National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Colon Adenocarcinoma Collection (CPTAC-COAD) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.YZWQ-ZZ63

The following datasets from TCIA were used in the benchmarks:

  • Campanella, G., Hanna, M. G., Brogi, E., & Fuchs, T. J. (2019). Breast Metastases to Axillary Lymph Nodes [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/tcia.2019.3xbn2jcc
  • Farahmand, Saman, Fernandez, Aileen I, Ahmed, Fahad Shabbir, Rimm, David L., Chuang, Jeffrey H., Reisenbichler, Emily, & Zarringhalam, Kourosh. (2022). HER2 and trastuzumab treatment response H&E slides with tumor ROI annotations (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E65C-AM96

Regarding the PAIP2020 dataset: De-identified pathology images and annotations used in this research were prepared and provided by the Seoul National University Hospital by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).

¹ On average, benchmarked against all other leading foundation models that were available at the time of the writing of this blog post.

² Number of patients in the training set: UNI2-h: <350k, Virchow2: 225k, Hibou: 306k, ATLAS: 490k, Phikon-v2: <58k.

Author
The Bioptimus Team