
Early COVID-19 respiratory risk stratification using machine learning
Molly J Douglas1,2, Brian W Bell2, Adrienne Kinney2, Sarah A Pungitore2, Brian P Toner2

1Department of Surgery, University of Arizona, Tucson, Arizona, USA
2Program in Applied Mathematics, University of Arizona, Tucson, Arizona, USA

Correspondence to Dr Molly J Douglas; mjdouglas{at}



  • The COVID-19 pandemic has strained healthcare resources and highlighted the importance of appropriate triage to allocate resources most efficiently.


  • This retrospective modeling study derives a six-variable model for predicting the risk of respiratory failure requiring intubation, in any 48-hour period, for patients with COVID-19, with an area under the receiver operating characteristic curve of 0.8.


  • This streamlined model allows non-experts to assist in accurate triage to an appropriate level of care and can aid in system-level planning for bed and staffing needs.


The COVID-19 global pandemic has caused unprecedented levels of population illness and healthcare resource utilization.1–4 Infection with the causative agent of COVID-19, SARS-CoV-2, can range from asymptomatic5 6 to life-threatening,4 7 8 and illness requiring mechanical ventilation carries a high mortality rate of 25% to 60%.7 9 10

The combination of heavy illness burden and finite resources has made triage a necessity in many health systems, with a particular strain on intensive care units (ICUs).1 3 11 Patients with acute respiratory failure may require endotracheal intubation and placement on a ventilator for respiratory support, interventions which are only performed in an ICU setting. Appropriate triage can reduce unnecessary ICU admissions and promote allocation of resources to the sickest patients. Factors shown to be associated with severe COVID-19 include advanced age,12 13 cardiovascular disease, chronic kidney disease, diabetes, and laboratory findings such as lymphopenia, thrombocytopenia, and elevated inflammatory markers.14–18

Machine learning has been used to further the understanding of COVID-19, including for disease diagnosis19–26 and transmission.27–30 Further, an April 2020 systematic review by Wynants et al31 discussed 50 published models for predicting disease progression or severity, but recommended none for clinical practice due to methodological limitations, including small sample sizes, inadequate separation of training and testing cohorts, and other factors leading to high risk of bias or limited external validity. A January 2021 review of artificial intelligence (AI) applications for COVID-19 by Tayarani et al19 covered 14 additional studies of machine learning for predicting COVID-19 severity and found promise in works using demographics, laboratory values, and other electronic health record (EHR) data. Online calculators have been published with some studies.15 16

However, there remains a lack of standardization on how to predict an individual’s disease trajectory and risk of severe illness. Thus, assessing the relative weight of risk factors in any particular patient’s case has remained largely a provider-level task. Our goal in this work is to develop a tool to aid in risk assessment for progression to severe disease. Specifically, we aimed to analyze demographic and clinical data with statistical and machine learning techniques, and to develop a prediction score, usable at the bedside by non-experts, to stratify the risk of progression to intubation within the next 48 hours for patients hospitalized with COVID-19.


Methods

Methods and results are reported in accordance with the 2015 statement for Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.32

Data source

De-identified patient-level data were provided via a hospital-affiliated clinical data warehouse. Patients testing positive for SARS-CoV-2 at three academic medical centers in Arizona between January and April 2020 were included. Extracted variables included age, sex, vital signs, laboratory values (including blood counts, electrolytes, blood gas results, and inflammatory markers), oxygen requirements, and timing of intubation.

Patient comorbidities were extracted to describe the study cohort. However, knowledge of comorbidities is dependent on prior interaction with the healthcare system and on patient reporting or availability of medical records. This information may be unavailable at the urgent point of care. Accordingly, comorbidity information was omitted from model training to build a score robust to the incomplete data that may be available in times of health system crisis. Further, data on self-identification of race and ethnicity were not reliably available within the electronic medical record (EMR), so race and ethnicity were not considered in modeling.

Data preprocessing and missing data

Data were reformatted into 4-hour time blocks (rows). A 4-hour interval was chosen to match the frequency of routine vital sign checks in non-ICU units, representing the highest data sampling rate that was likely to be available across the population. Vital signs were then summarized as mean, minimum, and maximum for each block, as well as the initial value recorded on presentation for each patient. Laboratory values, measured less frequently, were represented as current and initial values. Respiratory support other than intubation was quantified by fraction of inspired oxygen (FiO2) and oxygen delivery device (ie, nasal cannula, face mask, high-flow humidified cannula, etc). Where necessary, FiO2 was estimated as 0.21 (room air) plus an additional 0.04 for every 1 L/min increase in oxygen flow rate.33 Each “row” (4-hour block) was labeled with whether the patient required intubation within the subsequent 48 hours, as well as the number of hours from the end of that time block until the time of their intubation. Where values were missing, the last measured vital signs were carried forward for up to 12 hours and the laboratory values for up to 72 hours. Otherwise, missing fields were left blank. Rows with greater than 85% missing values were excluded. Parameters were excluded from modeling if they were populated in fewer than 15% of rows. This left 67 parameters for use in model training, including the initial and summary values as separate model inputs. Bivariate comparisons between the intubated and non-intubated groups were done using the χ2 test for categorical data and the Mann-Whitney U test for continuous data. A complete list of the parameters initially considered in modeling, prior to elimination of those with low prevalence in the data set, is available in the online supplemental information. Finally, the data were randomly split into 80% training and 20% testing sets.
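To make this pipeline concrete, the following is a minimal pandas sketch of the block construction, carry-forward, and row-exclusion rules. The raw-table schema and all column and identifier names (charttime, encounter_id, and the vital and laboratory fields) are hypothetical, and the initial-value features and bivariate tests are omitted for brevity.

```python
import pandas as pd

# Hypothetical schema: one raw observation per row, with an encounter
# identifier and a timestamp. Column names are illustrative only.
VITALS = ["heart_rate", "resp_rate", "temp", "spo2"]
LABS = ["lymph_count", "rbc", "platelets"]

def estimate_fio2(flow_lpm):
    """Estimate FiO2 as room air (0.21) plus 0.04 per 1 L/min of
    supplemental oxygen flow, per the convention cited in the text."""
    return 0.21 + 0.04 * flow_lpm

def to_blocks(df):
    """Collapse raw observations into 4-hour blocks per encounter and
    apply the carry-forward and exclusion rules described above."""
    df = df.set_index("charttime").sort_index()
    grouped = df.groupby("encounter_id").resample("4H")
    vitals = grouped[VITALS].agg(["mean", "min", "max"])
    vitals.columns = ["_".join(col) for col in vitals.columns]
    labs = grouped[LABS].last()          # "current" lab value per block
    out = vitals.join(labs)
    # Carry vitals forward up to 12 h (3 blocks) and labs up to 72 h
    # (18 blocks); anything still missing stays blank.
    for col in vitals.columns:
        out[col] = out.groupby(level="encounter_id")[col].ffill(limit=3)
    for col in LABS:
        out[col] = out.groupby(level="encounter_id")[col].ffill(limit=18)
    # Exclude rows with more than 85% of fields missing.
    return out[out.isna().mean(axis=1) <= 0.85]
```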

Supplemental material


The primary outcome used in model development was whether or not the patient was intubated within 48 hours of the end of each 4-hour time block. A patient’s physiological state during each time block was considered as a separate model input, such that each “row” formed an independent training example. Model performance was assessed by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and qualitatively for ease of application in clinical practice.
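A minimal sketch of this labeling step, continuing the hypothetical names from the preprocessing sketch above:

```python
import pandas as pd

def label_blocks(blocks, intubation_time):
    """Attach the 48-hour outcome to each 4-hour block. `intubation_time`
    maps encounter_id to the intubation timestamp (NaT if never
    intubated); all names here are hypothetical."""
    start = blocks.index.get_level_values("charttime")
    end = start + pd.Timedelta(hours=4)           # end of each block
    tub = blocks.index.get_level_values("encounter_id").map(intubation_time)
    hours_out = (pd.DatetimeIndex(tub) - end) / pd.Timedelta(hours=1)
    blocks = blocks.copy()
    blocks["hours_to_intubation"] = hours_out
    # Label positive if intubation falls within 48 h of the block's end;
    # never-intubated encounters yield NaN comparisons, i.e., False.
    blocks["intubated_48h"] = (hours_out >= 0) & (hours_out <= 48)
    return blocks
```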

LASSO regression

A least absolute shrinkage and selection operator (LASSO) regularized linear regression model was trained.34 The regularization parameter α had little impact on AUC, but affected the number of non-zero weights (sparsity) and specificity of the model. We noted a sharp drop in specificity as α approached 1, so α=0.1 was selected to minimize the number of non-zero weights without sacrificing specificity. This resulted in the inclusion of 10 to 15 predictors, depending on the training and testing data split. Feature importance was then explored by rerunning the model across 100 randomizations of the training and testing data split. Thirteen parameters were used in >50% of model runs, and these were then used in an elimination algorithm where model performance was tested after dropping each parameter in turn (figure 1). Features with minimal (<0.002) reduction in AUC or with high potential for clinical redundancy (such as current temperature and maximum temperature) were removed, leaving only seven predictors: fraction of inspired oxygen (FiO2), initial red blood cell count (RBC_initial), maximum oxygen saturation for the 4-hour block (SpO2-max), lymphocyte count (lymph#), initial modified Sequential Organ Failure Assessment score (mSOFA_initial), current temperature (temp), and body weight (weight). LASSO was run again with just these seven predictors, and all predictors except mSOFA_initial had non-zero coefficient values, resulting in a model with just six predictors.
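The feature-stability procedure might look like the following sketch, assuming a numeric feature matrix X and 48-hour labels y. Because scikit-learn's Lasso does not accept missing values, the median imputer and standardization step are assumptions added here, not details taken from the text.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def feature_stability(X, y, n_runs=100, alpha=0.1):
    """Fraction of random 80/20 splits in which each feature receives a
    non-zero LASSO weight; features kept if selected in >50% of runs."""
    counts = np.zeros(X.shape[1])
    for seed in range(n_runs):
        X_tr, _, y_tr, _ = train_test_split(
            X, y, train_size=0.8, random_state=seed)
        model = make_pipeline(
            SimpleImputer(strategy="median"),  # assumption: Lasso cannot
            StandardScaler(),                  # accept blank fields
            Lasso(alpha=alpha),                # alpha=0.1, as in the text
        )
        model.fit(X_tr, y_tr)
        counts += model.named_steps["lasso"].coef_ != 0
    return counts / n_runs
```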

Figure 1

Impact of the top 13 parameters on LASSO model performance during training. Model AUC is plotted after dropping each of the top 13 parameters in turn. A lower postelimination AUC indicates the feature is more important in the model. Parameters yielding minimal reduction (<0.002) or an increase in AUC on elimination were removed from the final model. The dotted line, “baseline test AUC”, shows the AUC of the model with all 13 parameters included. AUC, area under the receiver operating characteristic curve; FiO2, fraction of inspired oxygen; LASSO, least absolute shrinkage and selection operator; lymph#, lymphocyte count; mSOFA_initial, initial modified Sequential Organ Failure Assessment score; RBC_initial, initial red blood cell count; SpO2-max, maximum oxygen saturation for the 4-hour block; temp, temperature; weight, body weight.

Model performance was then assessed on the testing cohort. The CIs for LASSO performance were bootstrapped using the empirical bootstrap,35 where the testing set was resampled with replacement 1000 times, and 95% confidence bands were calculated and plotted using the simultaneous joint confidence regions method.36
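A sketch of the empirical bootstrap for the AUC alone, assuming y_true and y_score are NumPy arrays from the held-out test set; the simultaneous joint confidence bands for the full ROC curve are not reproduced here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def empirical_bootstrap_auc_ci(y_true, y_score, n_boot=1000,
                               level=0.95, seed=0):
    """Empirical bootstrap CI for the AUC: resample the test set with
    replacement and pivot the distribution of (AUC* - AUC) around the
    point estimate."""
    rng = np.random.default_rng(seed)
    auc_hat = roc_auc_score(y_true, y_score)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_boot:
        idx = rng.integers(0, n, n)
        if np.unique(y_true[idx]).size < 2:  # need both classes to score
            continue
        deltas.append(roc_auc_score(y_true[idx], y_score[idx]) - auc_hat)
    tail = (1 - level) / 2 * 100
    lo = auc_hat - np.percentile(deltas, 100 - tail)
    hi = auc_hat - np.percentile(deltas, tail)
    return lo, hi
```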


XGBoost

An eXtreme Gradient Boosting (XGBoost) model was trained.37 Model tuning initially focused on sensitivity and sparsity. Bracketing algorithms were used to select optimal values for scale_pos_weight (to more heavily weight cases of intubation, given the preponderance of negative examples in the data set) and maximum tree depth (to optimize model complexity), as well as the regularization parameter γ. Given the goal of a bedside-usable prediction score, we initially focused on building a single-tree model. However, this yielded performance inferior to LASSO regression, with an AUC of 0.74, sensitivity of 0.88, and specificity of 0.60, so a more complex model using 100 trees was tuned. Tuning this model for maximum sensitivity rather than sparsity (γ=0) yielded improved performance, as described in the Results section. Feature importance was explored by gain in model performance.
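Using the xgboost Python package, this configuration might be sketched as follows; X_train, y_train, and X_test are assumed from the preprocessing above, and max_depth=4 is illustrative, since the selected depth is not stated in the text.

```python
import xgboost as xgb

# Class imbalance: far more non-intubated blocks than intubated ones.
n_neg = int((y_train == 0).sum())
n_pos = int((y_train == 1).sum())

model = xgb.XGBClassifier(
    n_estimators=100,                # final model: 100 trees
    scale_pos_weight=n_neg / n_pos,  # upweight intubation examples
    max_depth=4,                     # illustrative; chosen by bracketing
    gamma=0,                         # tuned for sensitivity, not sparsity
    eval_metric="auc",
)
model.fit(X_train, y_train)          # XGBoost handles missing values natively
y_prob = model.predict_proba(X_test)[:, 1]
```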



Results

There were 3447 patient encounters meeting the inclusion criteria, of which 20.7% required intubation. The baseline cohort characteristics regarding comorbidities and all parameters used in model training are presented in table 1. After data preprocessing as discussed in the Methods section, the average missing data rate was 57% across the 4-hour time blocks, with an SD of 30%. We considered all patients who did not have a documented intubation to be in the non-intubated cohort, so there were no unclassified patients with respect to intubation.

Table 1

Cohort initial characteristics

LASSO regression

LASSO modeling, optimized for sensitivity and sparsity (ie, minimization of the number of inputs required), yielded an AUC of 0.798 (95% CI 0.785 to 0.812) (figure 2). At the 90% sensitivity operating point, we observed a specificity of 0.617 (95% CI 0.524 to 0.710), NPV of 0.997 (95% CI 0.996 to 0.998), and PPV of 0.040 (95% CI 0.033 to 0.047). The six parameters included in the final LASSO model were FiO2, RBC_initial, SpO2-max, current lymph#, current temperature (temp), and body weight (weight). The relative weights of each predictor are shown in figure 3. FiO2 was the most significant predictor, followed by maximum oxygen saturation (SpO2-max). The score is calculated by summing the value of each predictor multiplied by its coefficient and adding the constant (C0). If necessary, FiO2 is estimated as 0.21 (room air) plus an additional 0.04 for each 1 L/min increase in oxygen flow rate.33 Positive values predict intubation within the next 48 hours, and negative values predict no intubation within the next 48 hours. The greater the magnitude of the score, the greater the certainty of the prediction.

Figure 2

ROC for the final LASSO model (ECoRRS score). The model predicts intubation within the subsequent 48 hours based on six clinical parameters. The AUC is 0.798 (95% CI 0.785 to 0.812). 95% confidence bands (dotted curves) are shown, calculated via the simultaneous joint confidence regions method. AUC, area under the receiver operating characteristic curve; ECoRRS, Early COVID-19 Respiratory Risk Stratification; LASSO, least absolute shrinkage and selection operator; ROC, receiver operating characteristic curve.

Figure 3

Parameter weights for the final LASSO model (ECoRRS score). The six parameters included in the final LASSO model are shown versus their model weights: FiO2, RBC_initial, SpO2-max, lymph#, temp, and weight. The model is applied by summing the value of each parameter multiplied by its coefficient and adding the constant 0.08. Positive values predict intubation within the next 48 hours, and negative values predict no intubation within the next 48 hours. FiO2 was the most significant predictor of intubation. ECoRRS, Early COVID-19 Respiratory Risk Stratification; FiO2, fraction of inspired oxygen; LASSO, least absolute shrinkage and selection operator; lymph#, current lymphocyte count; RBC_initial, initial red blood cell count at admission; SpO2-max, maximum oxygen saturation for the 4-hour block; temp, current temperature; weight, body weight.


XGBoost

XGBoost classification tree modeling, optimized for sensitivity and trained on all parameters in the data set, yielded an AUC of 0.86, with a sensitivity of 0.99 at a specificity of 0.74. The NPV was 0.999 and the PPV was 0.082. Of the parameters, FiO2 was consistently the most important by gain in model performance. The final model used 100 unique trees which combine to produce the prediction. A subsection of one of these tree diagrams is shown in figure 4.

Figure 4

Representative portion of a single tree from the XGBoost model. The final model contains 100 unique trees which combine to yield the model prediction. Terminal node (“leaf”) values represent the log odds of the probability of intubation. To arrive at the predicted probability, the values of the appropriate leaves of each tree in the model are summed and transformed into a probability using the logistic function. HCT, hematocrit; XGBoost, eXtreme Gradient Boosting.
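To make the leaf arithmetic concrete, a minimal sketch follows; the base margin is the model's constant offset, and xgboost applies this transform internally.

```python
import numpy as np

def xgb_probability(leaf_values, base_margin=0.0):
    """Sum one leaf value (a log-odds contribution) per tree, add the
    model's base margin, and map the total to a probability with the
    logistic function. Sketch only; xgboost does this internally."""
    margin = base_margin + np.sum(leaf_values)  # sum over the 100 trees
    return 1.0 / (1.0 + np.exp(-margin))
```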

Early COVID-19 Respiratory Risk Stratification prediction score

Both models were highly unlikely to undertriage patients, with NPV of 99.7% (LASSO) and 99.9% (XGBoost). The XGBoost model, however, achieved approximately double the PPV of LASSO and thus is less likely to overtriage patients (ie, indicate a need for intubation when the patient will not be intubated within the specified time frame). Given its complexity, the XGBoost model would require clinicians to enter a large number of variables into a specialized software program to see a prediction; this presents a significant barrier to rapid deployment for emergency triage. In contrast, the LASSO model, with only six parameters, can be used by any practitioner with a simple calculator or spreadsheet program. Thus, we present the LASSO model as the Early COVID-19 Respiratory Risk Stratification (ECoRRS) score. The coefficients and constant to calculate the ECoRRS score are shown in table 2. Positive results predict the need for intubation within 48 hours, and negative results predict no intubation within that time frame. The greater the magnitude of the score, the greater the certainty of the prediction.

Table 2

Coefficients and constant to calculate the ECoRRS score
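For illustration, applying the score in code might look like the sketch below. Only the constant (0.08, from figure 3) is taken from the article; the coefficient values are placeholders standing in for the entries of table 2, which are not reproduced here.

```python
# The constant 0.08 is stated in figure 3; the six coefficients appear
# in table 2, so the zeros below are placeholders rather than the
# published weights.
C0 = 0.08
COEFFICIENTS = {
    "FiO2": 0.0,         # placeholder; see table 2
    "RBC_initial": 0.0,  # placeholder
    "SpO2_max": 0.0,     # placeholder
    "lymph": 0.0,        # placeholder
    "temp": 0.0,         # placeholder
    "weight": 0.0,       # placeholder
}

def ecorrs_score(values):
    """Sum each parameter times its coefficient and add the constant.
    A positive score predicts intubation within the next 48 hours."""
    return C0 + sum(COEFFICIENTS[k] * values[k] for k in COEFFICIENTS)
```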


Discussion

We analyzed EHR data with two methods, LASSO regularized linear regression and XGBoost classification trees, to predict intubation within the next 48 hours for patients hospitalized with COVID-19. Both models achieved high sensitivity and very low rates of undertriage. XGBoost performed as well as or better than LASSO on all metrics. However, given the marked simplicity and sparsity of LASSO relative to XGBoost, the LASSO model, which uses six objective inputs, is presented as the ECoRRS score.

The ECoRRS score can be used to predict intubation and forecast resource utilization up to 48 hours in advance, which has implications for both individual patient care and for system-wide planning and staffing. The score tolerates overtriage to maximize sensitivity, identifying a subpopulation “at risk” of intubation. At the system level, however, hospitals can multiply the number of patients scoring positive on ECoRRS by the model’s PPV and arrive at a relatively precise estimate of the number of inpatients likely to newly require a ventilator within the next 48 hours. This can facilitate timely redistribution of staff and resources to the areas of greatest need.
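For illustration (using the LASSO operating point reported above): if 200 inpatients score positive on ECoRRS and the PPV is 0.040, the system would anticipate roughly 200 × 0.040 = 8 patients newly requiring a ventilator within the next 48 hours.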

With regard to individual patient care, our framework relies on objective measurements and not patient history or comorbidities, which may be unavailable at the urgent point of care. Additionally, relying on objective measures, rather than subjective assessments by healthcare providers, supports the utility of ECoRRS as a triage tool for use by personnel with minimal healthcare training when systems are overburdened.

Multiple other investigators have sought to develop predictive algorithms for COVID-19 disease severity. Notably, Marcos and colleagues16 developed an open-source online calculator using just nine variables to classify patients as high or low risk for severe disease, using a methodology similar to that presented here. Our model differs in that it provides prediction of intubation specifically within a 48-hour window and does not rely on knowledge of comorbidities to predict disease trajectory.

This study has multiple limitations. First, the indications for intubation were not protocolized and the decision to intubate was at the treating clinician’s discretion. Thus, differences in individual practice may have impacted the study’s results. Further, COVID-19 treatment has evolved since our data collection period (January–April 2020). Prone positioning, which has historically been used as an adjunct for intubated patients with severe acute respiratory distress syndrome,38 39 came into practice to improve oxygenation in non-intubated patients with COVID-19. Proning increased in popularity during our study period, but data on the precise rate and intensity of proning in our cohort were not available. Studies have shown that prone positioning improves oxygenation and possibly reduces mortality in COVID-19, but it is not clearly associated with a reduced need for intubation.40 41 As the most powerful predictor of need for intubation in our cohort was FiO2, it is likely that the benefits of proning would be reflected in FiO2 requirements, allowing the score to remain useful with increased utilization of prone positioning.

Additionally, remdesivir was introduced for COVID-19 under emergency use authorization in May 2020 and full US Food and Drug Administration approval followed in October 2020.42 43 However, subsequent studies have shown minimal impact of this drug on disease trajectory,44 and we suspect remdesivir’s introduction to have little impact on the ECoRRS score’s generalizability. Convalescent plasma was also introduced in Spring 2020,45 46 with significant hopes for modifying disease progression, although large trials subsequently found this treatment too was ineffective.47 48 In contrast, glucocorticoids in patients requiring supplemental oxygen became standard of care during our study period, after the RECOVERY trial.49 The impact of this major therapeutic is likely captured only in the latter half of our cohort.

Further, our data source is linked to both strengths and significant limitations. With assistance from a hospital-affiliated clinical data warehouse, we extracted real-world EHR data. Such data are notoriously challenging and often include high rates of missing or incorrect values.50 Our average missing data rate of 57% is similar to rates reported in previous studies, including an evaluation of blood pressure documentation in the EHR, in which missing rates varied from 0.1% to 52%.50 These missing data may have led to bias in our conclusions and model performance. However, they may also reflect the incomplete information that healthcare workers operate with on a regular basis.

Finally, our 3447 patients were from three academic hospitals located within the same state. Validation studies in a wider multicenter cohort are needed to better assess the external validity of the ECoRRS score. The authors plan to undertake this using data from geographically diverse and non-academic hospitals within the same health network, which spans 6 states and 30 facilities.

The contrast between the user-friendliness of the LASSO model and the accuracy of the XGBoost model highlights an active challenge in healthcare machine learning and informatics. Although numerous algorithms have been developed for healthcare, few have been deployed in the clinical setting, leading some to question the hopes for AI in medicine.51–53 As long as EHR systems remain closed environments, the use of novel algorithms will require clinicians to manually enter data into a secondary system or calculator, which creates a substantial barrier both to algorithm deployment and to building the infrastructure for ongoing model evaluation in new populations. A future with enhanced collaboration between EHR developers, researchers, and regulatory organizations54 could facilitate more comprehensive model training, testing, and validation. Such collaboration could also allow algorithms processing large numbers of data inputs, such as our XGBoost model, to find utility in clinical practice.


Conclusion

The ECoRRS score enables non-specialists to identify patients with COVID-19 at risk of intubation within 48 hours with minimal undertriage, and enables health systems to forecast new COVID-19 ventilator needs up to 48 hours in advance.

Data availability statement

Data may be obtained from a third party and are not publicly available. The data set generated and analyzed in the current study is protected by a Banner Health data use agreement (DUA), which prohibits placing the data in a public repository. The institution requires approval of a new DUA with any individual wishing to access the data. Requests for data access may be directed to the corresponding author, who will facilitate the request for a DUA through Banner Health. New requests typically require 3 to 6 months to process.

Ethics statements

Patient consent for publication

Ethics approval

This study involves human participants, but the University of Arizona Institutional Review Board exempted it (protocol number 2004546291). This was a retrospective study using chart review only; there was no direct interaction with, or impact on the care received by, the study participants. The study was deemed "non-human subjects research" by the University of Arizona Institutional Review Board.


Acknowledgments

This research made use of community-developed core Python and Julia packages, including IPython (Pérez and Granger), scikit-learn (Pedregosa et al), SciPy (Jones et al), and pandas (McKinney). This work would not have been possible without the team at the University of Arizona’s Clinical Research Data Warehouse, who worked closely with us on clinical data extraction and de-identification.


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Contributors MJD, BPT: development of the research question, data acquisition and cleaning, data analysis, article preparation. BWB: development of the research question, data acquisition and cleaning, data analysis. AK: data acquisition and cleaning, data analysis. SAP: data acquisition and cleaning, data analysis, article preparation. Author guarantor: MJD

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.