Machine learning for predicting the survival in osteosarcoma patients: Analysis based on American and Hebei Province cohort

Osteosarcoma, a rare malignant tumor, has a poor prognosis. This study aimed to find the best prognostic model for osteosarcoma. There were 2912 patients included from the Surveillance, Epidemiology, and End Results (SEER) database and 225 patients from Hebei Province. Patients from the SEER database (2008–2015) were included in the development dataset. Patients from the SEER database (2004–2007) and Hebei Province cohort were included in the external test datasets. The Cox model and three tree-based machine learning algorithms (survival tree [ST], random survival forest [RSF], and gradient boosting machine [GBM]) were used to develop the prognostic models by 10-fold cross-validation with 200 iterations. Additionally, performance of models in the multivariable group was compared with the TNM group. The 3-year and 5-year cancer-specific survival (CSS) were 72.71% and 65.92% in the development dataset, respectively. The predictive ability in the multivariable group was superior to that in the TNM group. The calibration curves and consistency in the multivariable group were superior to those in the TNM group. The Cox and RSF models performed better than the ST and GBM models. A nomogram was constructed to predict the 3-year and 5-year CSS of osteosarcoma patients. The RSF model can be used as a nonparametric alternative to the Cox model. The constructed nomogram based on the Cox model can provide reference for clinicians to formulate specific therapeutic decisions both in America and China.


Introduction
Osteosarcoma, a rare malignant tumor, accounts for approximately 35% of primary malignant bone and joint tumors [1,2]. Osteosarcoma often occurs in children and adolescents, invades the metaphysis of appendicular skeleton, and has a poor prognosis [3][4][5]. Although the 5-year survival rate of osteosarcoma in the United States has increased from 51.0% (in the 1980s) to 60.5% (in the 2010s), it is still the lowest among bone sarcomas [1]. It is an imperative to develop an accurate survival prediction model that influences the decisions of clinicians, patients, and their families [6]. At present, most models for predicting osteosarcoma are Cox proportional hazard regression models combined with nomograms [3,[7][8][9][10][11][12][13]. The Cox models need to meet the proportional hazards assumption, thus the overall quality of these models may have not reached the optimal state.
Machine learning is a new type of artificial intelligence, which has been widely used in medical data analysis and is a powerful tool for improving clinical strategies [14][15][16]. Some tree-based machine learning methods (such as survival tree [ST], random survival forest [RSF], and gradient boosting machine [GBM]) can account for interaction and effect modification between variables and have been applied in some prognosis studies [17][18][19][20][21][22][23][24][25][26][27]. In many studies, in which the categorical variable was the dependent variable, the prediction performance of machine learning is better than that of traditional models [6,14,28,29]. However, there is little research comparing survival prediction models for osteosarcoma, in which the dependent variable includes survival status and survival time. It is unclear whether the performance of the new machine learning model is superior to the traditional Cox regression model for survival prediction.
In this study, we selected patients with osteosarcoma from the National Cancer Institute Surveillance, Epidemiology, and End Results (SEER) database (from 2008 to 2015) to develop four prognosis models (Cox, ST, RSF, and GBM). Patients with osteosarcoma from the SEER database (from 2004 to 2007) and Hebei Province cancer registry in North China were used as external test dataset 1 and external test dataset 2, respectively. In addition, we compared the multivariable group with the TNM group (included only 7th American Joint Committee on Cancer [AJCC] T, N, and M category).
This study aimed to find the optimal survival prediction model for osteosarcoma, in order to help clinicians make more reasonable therapeutic decisions both in America and China.

Materials and methods
Data source and study population Data of SEER cohort were obtained from the "Incidence -SEER Research Plus Data, 17 Registries, Nov2021 Sub (2000-2019)" of the SEER database, a population-based cancer registry of the National Cancer Institute in the United States (https://seer. cancer.gov/). A total of 2912 osteosarcoma patients from the SEER database were included in this study. The inclusion criteria for the patient were the following: (1) Year of diagnosis: "2004-2015"; (2) Site recode ICD-O-3/WHO 2008: "Bones and Joints," and behavior code ICD-O-3: "Malignant"; (3) AYA site recode 2020 Revision: "Osteosarcoma"; and (4) First malignant primary indicator. The exclusion criteria were the following: (1) Missing survival information (diagnostic confirmation was "autopsy", "death certificate only cases", or "unknown"); and (2) Survival months less than one month or unknown.
The Hebei Province cohort was recruited from Hebei Province cancer registry in North China. We included 225 patients who were first diagnosed with osteosarcoma between 1 January 2008 and 31 December 2021. Patients were followed-up by passive and active methods. Passive follow-up information was collected through readmission information, outpatient records, and all cause of death database in Hebei Province. Active follow-up was conducted trimonthly by professionally trained personnel. The end date for follow-up was 20 January 2022. Second, we excluded those with incomplete survival information (missing survival information) or survival months less than one month or unknown.

Predictors and outcome
We included 11 predictors (sex, age, marital status, site, T category, N category, M category, grade, surgery, radiation, and chemotherapy) into the ST, RSF, and GBM models of the multivariable group. As the Cox proportional hazard model needs to meet the proportional hazards assumption, Cox model included only the variables without overlapping in Kaplan-Meier curves and those having statistically significant difference in log-rank test. In addition, we compared the multivariable group with the TNM group (only included 7th AJCC T category, N category, and M category). To avoid bias caused by different variables, we set the same variables group (four models included only variables that met the proportional hazards assumption). Survival months were calculated from the date of diagnosis to the date of death due to osteosarcoma or the end of follow-up. The outcome variables were survival months and 3-year and 5-year cancer-specific survival (CSS).

Model development and testing
In this study, patients in the SEER cohort from 2008 to 2015 were assigned to the development dataset. Patients from 2004 to 2007 comprised the external test dataset 1. Patients from Hebei Province cohort formed the external test dataset 2. Cox, ST, RSF, and GBM were used to develop the prediction models. All models were built on the development dataset by 10-fold cross-validation with 200 iterations. Discrimination of the survival prediction models was quantified by concordance index (C-index) and area under the receiver operating characteristic curve (AUC). Calibration curves were used to evaluate the consistency between the predicted and observed values of CSS of the survival prediction models.
Cox is a popular semi-parametric model for survival analysis, which can be defined by formula: where t represents the survival time and β p is the coefficient of covariate X p and measures the impact of the covariate. h(t) is the hazard function of individuals with covariates X 1 , X 2 , . . . , X p at time t. h 0 (t) is the baseline hazard.
The CART algorithm was used to construct the tree structure [30,31]. The procedure consisted of three steps. First, we examined all allowable splits on each predictor variable and choose a split point that maximizes the survival differences between children nodes based on the log-rank test. Second, we repeated the procedure to split the children nodes until the tree met the stopping criteria (all terminal nodes containing only the minimum number of unique events). Third, the split-complexity measure was used in the pruning step.
RSF is an ensemble tree method for the analysis of right-censored survival data [32]. The algorithm of RSF was the following: First, B bootstrap samples were drawn from the original data. Each bootstrap sample, including two-thirds of the original data, was used to train data. Second, in order to grow a ST for each bootstrap sample, we randomly selected p candidate variables at each node of the tree. The criterion of growing a tree was the maximization of the survival difference between each branch. Third, the cumulative hazard function (CHF) of each tree was averaged to achieve an ensemble CHF. Finally, the prediction error for the ensemble CHF was calculated, by using out-of-bag data (OOB, the rest one-third of the original data) to avoid overfitting.
GBM is a gradient-descent-based formulation of boosting methods [33]. The basic idea is to train new base learners according to the negative gradient information of the loss function of the current model. Then, it combines the trained base learners with the existing model in the form of accumulation. This process aimed to continuously reduce the loss function and deviation.

Ethical statement
The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The authors obtained authorization to access the SEER Research Data supported by the National Cancer Institute with approval number 11241-Nov2021. Because public and anonymous data from the SEER database were used, informed patient consent was not required. The Ethics Committee of the Fourth Hospital of Hebei Medical University/The Tumor Hospital of Hebei Province has confirmed that no ethical approval was required. Informed consent was obtained from all individual participants included in the study.

Statistical analysis
The analyses were conducted using the statistical software R (version 4.1.2). R package "randomForestSRC" was used for missing data imputation. R packages "survival," "rpart," "ran-domForestSRC," and "gbm" were used to develop the models. The value of P < 0.05 was considered statistically significant.  Table 1, and Table S1).

Model performance
According to the Kaplan-Meier curves and log-rank tests for different variables (Figure 2), the variables which met the proportional hazards assumption were sex, age, marital status, site, T category, N category, M category, grade, surgery, and radiation. Those variables were included in the Cox model of the multivariable group. As the ST, RSF, and GBM models did not need to meet the proportional hazards assumption, the included variables of the multivariable group were sex, age, marital status, site, T category, N category, M category, grade, surgery, radiation, and chemotherapy.
To compare the performance of the four models (Cox, ST, RSF, and GBM), we computed the C-index and plotted the receiver operating characteristic curves (ROC) and calibration curves in 3-year survival cohort and 5-year survival cohort of different datasets. Figure 3 and Table 2     of the Cox, RSF, and GBM models were higher than 0.70 in the train dataset and external test dataset 2. The C-indices for various models of every dataset in the multivariable group were higher than that in the TNM group. The C-indices of other datasets or models were not higher than 0.70. The C-indices of the Cox, RSF, and GBM models were higher than that of the ST model in each dataset both in the multivariable group and in TNM group. Figure S1 and Table S2 show the C-indices for      respectively. Except ST in the external test dataset 2, the AUCs of multivariable group were higher than of the TNM group in every model of different datasets. In every dataset, the AUC of the ST model was lower than that of the other three models. Figure S2 shows AUC by picturing ROC for various models for predicting 5-year CSS of osteosarcoma in the multivariable group and TNM group. The results of the 5-year CSS cohort were similar to those of the 3-year CSS cohort.
Calibration curve was used to visualize the consistency between the predicted and observed values of CSS of the survival prediction models. The 45-degree gray straight line represents the perfect match between the observed (y-axis) and predicted (x-axis) survival probabilities. A smaller distance between model and gray straight line indicates higher accuracy. Figure 5 shows the calibration curves for various models for predicting 3-year CSS of osteosarcoma in the multivariable group and TNM group. In every dataset, the consistency of the multivariable group was superior to the TNM group. In the multivariable group, the consistency of the Cox and RSF models was superior to that of the ST and GBM models. The predicted value of the GBM model was lower than the observed value, and the predicted value of the ST model was higher than the actual value. Figure S3 depicts the calibration curves for  various models for predicting 5-year CSS of osteosarcoma in the multivariable group and TNM group. The results of the 5-year CSS cohort were similar to those of the 3-year CSS cohort.
Outcome of the Cox model Figure 6 depicts the factors influencing the prognosis of osteosarcoma patients based on the Cox proportional hazard model in the development dataset of the multivariable group. According to the forest plot, poorer CSS was associated with older age (20- Based on the result of the Cox model in the development dataset of the multivariable group, a nomogram was established to predict the 3-year and 5-year CSS of osteosarcoma patients. By bringing the patient's variable into the nomogram, the scores of each variable can be obtained and, finally, the scores can be added to obtain the patient's 3-year and 5-year CSS. For example, we included each variable of the patient ID 771469 in the external test set 2 into the nomogram and obtained the score

Sensitivity analysis
To avoid bias caused by different variables, we set the same variables group. In the group, the included variables in the ST, RSF, and GBM models were same as in the Cox model, which included sex, age, marital status, site, T category, N category, M category, grade, surgery, and radiation. The C-index, AUC, and calibration curves of same variables group were similar to the multivariable group. (Table S3 and Figures S4 and S5)

Discussion
Machine learning can improve the efficiency of survival prediction models, which is crucial for disease prognosis and nursing plan. This study developed and validated four models using the Cox and three machine learning algorithms (ST, RSF, and GBM) to predict the survival of osteosarcoma patients. In every model of different dataset, the predictive ability of the multivariable group was superior to the TNM group. In every dataset, the C-index and AUC of the ST were lower than that of the other three models. The calibration curves of the Cox and RSF models performed better than that of the ST and GBM. In addition, the nomogram was established to predict the 3-year and 5-year CSS of osteosarcoma patients.
Previous studies [34][35][36] showed that there was no difference in the prognosis of osteosarcoma among different ethnic groups. Therefore, it is feasible for us to choose the Hebei Province cohort as the external test dataset 2. Compared with American cohorts (the 3-year and 5-year CSS were 72.71% and 65.92% in the development data, and 70.25% and 63.46% in the external test dataset 1, respectively), the Hebei Province cohort (the 3-year and 5-year CSS were 55.25% and 52.49% in the external test dataset 2, respectively) had a poorer survival. This may be due to the fact that the proportion of pelvis and pine osteosarcoma patients in the Hebei Province cohort (44.2%) is much higher than that in the United States (9.4%). Previous studies [35] have shown that the prognosis of pelvis and pine osteosarcoma patients is worse than that of appendicular skeleton osteosarcoma patients, which is consistent with the Cox regression model in this study.
The C-index and AUC of the Cox model were similar to that of the RSF and GBM models for predicting the survival of osteosarcoma patients. This may be due to the small number of predictors. Cox is a semi-parametric model, which is valid only when the number of predictors is less than the number of events. The advantages of machine learning are more easily seen with large number of predictors and a relatively small sample size [37]. ST model performed worst, in terms of C-index, AUC, or calibration curve. ST divides data by maximizing the difference between nodes, but the prediction error is large, leading to low prediction accuracy of the model. RSF and GBM models can compensate for the error [37]. The performance of the GBM model is similar to the Cox and RSF models for C-index and AUC, but is poor for calibration curve. RSF model can be used as a non-parametric alternative to the Cox model. The selection of methods should be based on a combination of many factors, such as the type of data collected, data size, calculation strength, model implementation skills, and software availability [38].
The Kaplan-Meier curves showed that there were differences in survival between different marital statuses. However, the results of multivariate Cox analysis showed that marital status is not a factor affecting survival. The possible reason is that the proportion of young osteosarcoma patients is higher, and the survival rate of young patients is higher than that of middle-aged and elderly patients. Most young patients are unmarried. This may be the reason why was the survival of unmarried patients higher than that of married patients in the Kaplan-Meier curve. According to the Cox proportional hazard model, poorer CSS was associated with older age compared with 0-19 age group; having a pelvis and spine tumor site compared with having a appendicular skeleton tumor site; T2 or T3 compared with T1; N1 compared with N0; M1a or M1b compared with M0; poorly differentiated or undifferentiated compared with well differentiated or moderately differentiated; and radiation compared with no radiation. Improved CSS was associated with female sex compared with male sex and surgery compared with no surgery. The influencing factors are consistent with previous studies [7,34,37,[39][40][41].
The advantages of this study mainly include the following aspects. First, in order to verify the reliability of the model, all models were built on the development dataset by 10-fold cross-validation with 200 iterations. Previous studies established nomograms based on the Cox regression models using data from a country or region and that have not been validated in the external test dataset [3,[6][7][8][42][43][44]. Our model was selected as the optimal model after comparing various models. In addition, we used not only data in the SEER dataset of 2004-2007 as time cohort external test dataset, but also used Chinese data as regional cohort external test dataset. The external test demonstrated the reliability of our model in different time periods and regions. The nomogram based on the Cox model has a wider application; it is applicable not only to osteosarcoma patients in the United States but also appropriate for Chinese population. Second, to our knowledge, this is the first study to compare multivariable groups of multiple models with the TNM groups for osteosarcoma. Gao et al. [42] established a nomogram compared with AJCC stage to predict the overall survival for postoperative osteosarcoma patients, which was limited to patients with osteosarcoma after surgery and only combined the Cox model and nomogram. In our study, we used the TNM group consisting of three variables (T category, N category, and M category) which is more detailed than the AJCC stage. Third, we did a sensitivity analysis by setting same variables group. In the group, the included variables of the ST, RSF, and GBM models were same as in the Cox model. It had a similar result to multivariable group, which avoided the bias caused by different variables in different models. There are still some limitations in this study. First, the variables included in this study are limited. There are some studies that demonstrated that some variables (such as gene-based signature, radiomics, pathological fracture, etc.) could also influence the survival in osteosarcoma [43][44][45][46][47][48]. In future research, we will try to incorporate these factors into the models. Second, the consistency of calibration curves was not high, which may be due to the insufficient sample size of osteosarcoma as a rare cancer. Efforts will be made to solve the problem of low consistency. Third, we included only three machine learning methods to compare with the Cox model. Currently, there are some machine learning or deep learning models based on biomedical images that have been applied in the diagosis of osteosarcoma [46][47][48]. In the future, we will explore whether these models and biomedical image variables can be used to develop survival prognostic models for osteosarcoma. In addition, we will try to compare more models (such as support vector machine, artificial neural network, Xgboost, deep learning, etc.) to develop a more comprehensive and better prediction model.

Conclusion
The multivariable group was superior to the TNM group. Among the four prognosis models, the Cox and RSF models performed better than the ST and GBM models. RSF model can be used as a non-parametric alternative to the Cox model. The nomogram based on the Cox model in development dataset of the multivariable group can provide reference for clinicians to formulate specific therapeutic decisions and allocate health resources reasonably for osteosarcoma patients both in America and in China.