Mortality risks of pneumonia patients exhibit surprising statistical patterns
To investigate and categorize the real-world impact of clinical decision-making thresholds, we analyzed mortality risk in real-world patient outcomes to identify surprising statistical patterns. We conducted this exploratory data analysis by training generalized additive models (GAMs)3 to estimate mortality risk as the additive effect of individual risk factors. The GAMs demonstrated high predictive accuracy, comparable to fully-connected neural network and XGBoost baselines (Supplementary Table 2), and provided high-resolution, nonlinear component functions that can be analyzed individually. Our analysis used the 1989 MedisGroups Comparative Hospital Database (MCHD) pneumonia dataset4. We observe two recurring statistical patterns that routinely indicate underlying treatment effects: (1) discontinuities in risk, and (2) counter-causal non-monotonicities in risk.
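The additive decomposition that makes this analysis possible can be sketched in a few lines. The shape functions, intercept, and feature values below are illustrative stand-ins, not the components learned from the MCHD data; a trained GAM estimates each shape function from patient outcomes.

```python
import math

# Hypothetical shape functions f_j mapping each risk factor to a
# contribution on the logit scale. A trained GAM learns these from data;
# the linear forms here are purely illustrative.
def f_bun(bun):
    # illustrative: contribution rises with blood urea nitrogen (mg/dL)
    return 0.04 * (bun - 20)

def f_age(age):
    # illustrative: contribution rises with patient age (years)
    return 0.03 * (age - 50)

def predicted_mortality_risk(patient):
    # GAM: logit(p) = intercept + sum of per-feature contributions
    logit = -3.0 + f_bun(patient["bun"]) + f_age(patient["age"])
    return 1.0 / (1.0 + math.exp(-logit))

# Unlike a neural network, each component can be inspected in isolation,
# which is what lets the shape of each risk curve be analyzed.
patient = {"bun": 45, "age": 70}
contributions = {"bun": f_bun(patient["bun"]), "age": f_age(patient["age"])}
risk = predicted_mortality_risk(patient)
```

Because the model is a sum of one-dimensional functions, plotting any single component against its feature (as in Fig. 3) shows that feature's full contribution to predicted risk.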
Discontinuities in risk often reveal threshold-based treatment effects in clinical data. Blood urea nitrogen (BUN) exhibits a discontinuous association with mortality risk (Fig. 3a), with a non-monotone structure. Intrinsic (i.e., untreated) risk is likely to increase smoothly with increasing BUN, so this discontinuous population risk suggests that different treatments are being applied at different BUN levels. Lower levels of BUN are associated with survival; however, there is a rapid rise in mortality risk from 30–40 mg/dL followed by a plateau in risk from 40–80 mg/dL. The pattern of risk at BUN of 40 mg/dL is stronger for male patients than female patients, with the risk for female patients rising continuously from 30–50 mg/dL, suggesting a stronger adherence to threshold-based decisions for male patients than for female patients. Similarly, systolic blood pressure shows a discontinuous association with mortality risk (Fig. 3b). There is a sharp discontinuous rise in risk as systolic blood pressure decreases below 80 mmHg, which suggests adherence to a threshold-based protocol.
In all plots, we show the effect of each variable on patient mortality risk (with 95% confidence intervals shaded) after correcting for all other observable risk factors using a generalized additive model. Each tick on the horizontal axis denotes ten observed patients. Discontinuities arise from behavior influenced by discrete round-number thresholds, including (a) a flattening in risk at a BUN of 35 mg/dL and (b) a discontinuous drop in risk at a systolic blood pressure of 80 mmHg. Counter-causal trends occur when intrinsically high-risk features are predicted as low-risk due to effective treatment, including (c) serum creatinine above 5 mg/dL and (d) chronic comorbidities, where patients often receive more aggressive care. Differences by sex are shown in (c), illustrating population stratification in risk profiles.
Counter-causal non-monotonicities reveal that aggressive treatment can have a stronger positive effect than the deleterious effect of intrinsic biological risk. For example, elevated serum creatinine above 5 mg/dL (Fig. 3c) is associated with increased survival (lower mortality risk) of pneumonia patients. This effect is measured after correcting for all other patient risk factors recorded in the dataset and contradicts medical causality, as serum creatinine above 5 mg/dL indicates advanced renal dysfunction5. The correspondence between this counter-causal effect and a round-number threshold suggests that intervening factors, including aggressive treatment, are influenced by this threshold. Unfortunately, purely data-driven AI protocols could de-prioritize care for these patients who were effectively treated and might even use this extreme region as a therapeutic target (i.e., withhold treatment to keep serum creatinine high, or even worse, suggest treatments to increase serum creatinine).
Chronic comorbidities also produce paradoxical statistical effects. For example, the odds ratio of mortality in pneumonia patients is reduced by more than 30% with a history of chest pain, 20% with a history of asthma, and 18% with chronic lung disease (Fig. 3d). These counter-causal effects are robustly estimated and not sensitive to the model class or algorithm used for estimation, suggesting that risk models trained on these real-world datasets could systematically underestimate mortality risk (effective risk < underlying risk) for patients with prior comorbidities. In real-world data, observed effective risk implicitly combines underlying risk with the impact of interventions (including unobserved quantities such as perceived urgency by both the patient and the clinician). As a result, the effective risk for patients with chronic comorbidities is often lower than the underlying risk for the same patients; hence, data-driven AI models can systematically underestimate the underlying risk associated with chronic comorbidities.
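For readers converting between scales: a GAM component's contribution on the logit scale maps to an odds ratio by exponentiation. The contribution below is a hypothetical value chosen to reproduce a 30% reduction in odds, not a number estimated from the data.

```python
import math

# On the logit scale, a GAM component's contribution converts directly to
# an odds ratio: OR = exp(contribution). A contribution of log(0.7) is a
# hypothetical value corresponding to the reported ~30% odds reduction.
chest_pain_logit_contribution = math.log(0.7)
odds_ratio = math.exp(chest_pain_logit_contribution)
reduction = 1.0 - odds_ratio  # fraction by which the odds of mortality fall
```

This is why a negative component value in Fig. 3d reads directly as a multiplicative reduction in the odds of mortality.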
A model of how threshold-guided interventions generate statistical artifacts in risks
To systematically identify and understand the potential impacts of confounding from threshold-guided clinical practice, we first build a simplified model of how risk curves are generated from simulated treatment benefits and protocols in a controlled setting (Fig. 4 and Supplementary Fig. 2). We study eight simulation cases; for each simulation case, we select one of four treatment benefit curves: flattens risk, limits risk, reduces biomarker, or has constant benefit (the left column in Fig. 4 and Supplementary Fig. 2) and one of two treatment probability distributions: strict adherence to a biomarker treatment threshold, or loose guidance based on a biomarker threshold (the middle column in Fig. 4 and Supplementary Fig. 2). These underlying risk and treatment curves combine to produce a population risk—the right column in Fig. 4 and Supplementary Fig. 2. The shape of the observed population risk (right column) is influenced by the shape of the intrinsic risk prior to treatment (left column) and the treatment decisions. From this simple causal explanation, we are able to generate many statistical artifacts in observed population risk curves, all of which can be categorized into: (1) discontinuities and (2) non-monotonicities in risk.
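The mixing of treated and untreated subpopulations can be sketched directly. The specific risk curve, treatment plateau, and threshold below are illustrative stand-ins, not the exact parameterization used in Fig. 4; they instantiate the "flattens risk" benefit curve with strict versus loose threshold adherence.

```python
import math

def untreated_risk(biomarker):
    # intrinsic risk rises smoothly with the biomarker (illustrative)
    return 1.0 / (1.0 + math.exp(-(biomarker - 50) / 10))

def treated_risk(biomarker):
    # one of the four benefit curves: treatment "flattens risk",
    # capping it at an illustrative plateau of 0.5
    return min(untreated_risk(biomarker), 0.5)

def p_treat_strict(biomarker, threshold=60):
    # strict adherence: treat if and only if biomarker >= threshold
    return 1.0 if biomarker >= threshold else 0.0

def p_treat_loose(biomarker, threshold=60, scale=5.0):
    # loose guidance: treatment probability rises smoothly near threshold
    return 1.0 / (1.0 + math.exp(-(biomarker - threshold) / scale))

def observed_risk(biomarker, p_treat):
    # observed population risk mixes treated and untreated subpopulations
    p = p_treat(biomarker)
    return p * treated_risk(biomarker) + (1.0 - p) * untreated_risk(biomarker)

# Strict adherence produces a discontinuous drop at the threshold ...
jump = observed_risk(59.9, p_treat_strict) - observed_risk(60.1, p_treat_strict)
# ... while loose adherence smooths the same drop into a non-monotone bump.
smoothed = abs(observed_risk(59.9, p_treat_loose) - observed_risk(60.1, p_treat_loose))
```

Varying the benefit curve and the adherence function in this sketch reproduces the taxonomy of artifacts: strict adherence yields discontinuities, loose adherence yields non-monotonicities.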
Based on simulations, we depict intrinsic risk for treated and untreated patients (left), adherence of treatment decisions to a threshold (middle), and observed risk (right). The optimal treatment threshold is where untreated risk crosses above treated risk; when protocols are misaligned, observed risk includes excess risk. The flat “with treatment” curve assumes a simplified, short-term view of treatment effects, where treatment applied too early may increase risk, and treatment applied at or after the optimal threshold can flatten high-risk regions if highly effective. For alternative assumptions of treatment benefits and resulting artifacts, see Supplementary Fig. 2.
Threshold-based decisions cause sharp cutoffs in real-world risk. In some cases, these cutoffs produce discontinuous risk profiles, suggesting that the intervention could be more precisely distributed or dosed to achieve statistical optimality. For example, strict adherence to a threshold that is too high (Fig. 4, yellow) results in observed population risk that includes excess risk below the treatment threshold, where treatment should be given but is not, with a sharp, discontinuous drop in risk at the honored treatment threshold.
When the threshold is only loosely adhered to (possibly in a multi-factor decision list), the discontinuity turns into a non-monotonicity. As medicine becomes increasingly precise, multi-faceted, AI-supported, and personalized, we expect to see risk curves smooth out round-number biases and flatten observed risks—we later discuss an example (Fig. 6d) where recent improvements in clinical decision-making have already smoothed and flattened risk curves.
Using glass-box ML to systematically find statistical artifacts in risks
Based on the morphology of these two consistent types of artifacts in population risk (discontinuities and non-monotonicities), we develop a glass-box ML approach to systematically search for these two characteristic surprises. Our approach first uses GAMs to decompose risk into additive component functions of individual risk factors and then finds regions of discontinuities and/or non-monotonicities in each component.
To automate the search for discontinuities and non-monotonicities within each component, we define two statistical tests on the likelihood of the shape of a component (Fig. 5). We use boosted decision trees to train the GAM models because these have been shown to have excellent accuracy and high resolution in the learned feature functions. Moreover, the ability of trees to learn “jumps” in feature functions aids detection of discontinuities and non-monotonicities. For each component function f_r estimated by a tree-based GAM, we have a set of thresholds t_r and associated probability densities y_r. To test for a discontinuity at threshold t_r, we compare the probability of observed outcomes under the discontinuous threshold-based component function f_r against the probability of observed outcomes under a linearized version of f_r (Fig. 5b). To test for a non-monotonicity at threshold t_r, we apply changepoint detection6 to the sign of non-zero slopes of the component function (Fig. 5c). For details of these tests, see “Methods”.
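The two shape tests can be sketched as follows. These are simplified stand-ins, assuming a component given as sorted thresholds ts and values ys: a jump-size heuristic replaces the likelihood comparison against the linearized component, and a direct sign-change scan replaces full changepoint detection.

```python
import statistics

def find_discontinuities(ys, factor=5.0):
    """Flag thresholds whose jump dwarfs the typical step size.

    Simplified proxy: the actual test compares the probability of observed
    outcomes under the stepped component function against a locally
    linearized version of it.
    """
    steps = [abs(b - a) for a, b in zip(ys, ys[1:])]
    typical = statistics.median(steps) or 1e-9  # guard against all-zero steps
    return [i for i, s in enumerate(steps) if s > factor * typical]

def has_nonmonotonicity(ts, ys):
    """Detect a sign change in the non-zero slopes of the component.

    Minimal stand-in for applying changepoint detection to the sign
    sequence of the slopes.
    """
    signs = []
    for (t0, y0), (t1, y1) in zip(zip(ts, ys), zip(ts[1:], ys[1:])):
        slope = (y1 - y0) / (t1 - t0)
        if slope != 0:
            signs.append(1 if slope > 0 else -1)
    return any(a != b for a, b in zip(signs, signs[1:]))

# A component with one abrupt jump amid small steps triggers the first test;
# a rise-then-fall component (like the creatinine curve) triggers the second.
jumps = find_discontinuities([0.0, 0.05, 0.1, 0.9, 0.95, 1.0])
bump = has_nonmonotonicity([0, 1, 2, 3, 4], [0.0, 0.2, 0.5, 0.3, 0.1])
```

Running both tests over every component of the fitted GAM yields a ranked list of candidate artifacts for clinical review.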
(a) Sketch of the tree-based GAM trained by InterpretML, which models the logit probability of outcome Y as an additive function of individual features of X. This model uses round-robin training to split information between individual component functions and bootstrap aggregation for confidence intervals. (b) For each component function and each threshold therein, we evaluate the likelihood of a discontinuity by comparing the probability of the observed data under the discontinuous component function against a locally-linear version of the component function. (c) For each component function and each threshold therein, we evaluate the likelihood of a non-monotonicity by applying changepoint detection to the signs of the nonzero slopes of the component function.
Threshold-based artifacts reveal how clinical practice has improved over years of refinement
To study how clinical practice and threshold-based bias effects have changed over years of protocol refinement, we compare the statistical artifacts of clinical practice in three versions of the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) dataset7,8,9. These datasets were collected over three decades of management of intensive care units (ICU) at a single hospital, providing high-quality snapshots of medical practice and outcomes over many years. We find several likely examples in which the harmful impact of round-number threshold effects has been ameliorated over the years of refinement in practice. However, counter-causal and threshold-driven effects are still strong and robustly estimated by data-driven risk models. Finally, we validate the hypothesis of treatment effects driving these statistical artifacts by identifying treatments recorded in MIMIC-IV9 that correspond to counterfactual changes in risk and de-confound the effects of treatment from the underlying biological risk. Overall, we find that the statistical artifacts of clinical practice are associated with observable treatment patterns, typically based on clinically-meaningful and round-number thresholds, and that while these effects have diminished over time, they are still routinely and robustly identified to impact patient mortality risk. Here, we study four examples of these effects.
As a first example, we see that harmful effects associated with blood urea nitrogen (BUN) have been ameliorated. In the oldest version of MIMIC, BUN strongly influenced mortality risk (Fig. 6a), with a sharp rise in risk from 20–35 mg/dL followed by a plateau from 35–100 mg/dL. In more recent datasets, this sharp rise has diminished, likely due to advancements in treatment protocols and the incorporation of BUN into multi-factor risk measures10.
The three datasets (MIMIC-II, MIMIC-III, and MIMIC-IV) span three decades of intensive care at a single hospital system, allowing comparisons of the effects of clinical practice over time. The top two rows show mortality risk associated with each risk factor (after correcting for all other observable factors of patient risk, 95% confidence intervals shaded). The bottom row shows treatment likelihood associated with each risk factor in MIMIC-IV (with 95% confidence intervals shaded). Each tick along the horizontal axes denotes ten patients. (a) An example of successful improvement in clinical practice: the harmful mortality risk attributable to elevated blood urea nitrogen has been ameliorated over time, reducing from a large impact in the early 2000s (blue) to a small impact in the 2010s (black). (b) A second example of successful improvement in clinical practice: the mortality risk attributable to patient age has been smoothed. (c, d) An example of a persistent counter-causal effect. In all three datasets, the mortality risk attributable to serum sodium (c) decreases at 150 mEq/L. This counter-causal change in mortality risk is associated with an increase in water and electrolyte replacement (Dextrose 5%, Free Water, Potassium Phosphate, Sterile Water, Sodium Chloride 0.45%) (d). (e, f) As treatment effectiveness has improved over time, some risks have become even more counter-causal. For example, in the most recent version of MIMIC, the mortality risk associated with serum creatinine is maximized for moderately elevated creatinine and decreases sharply for severely elevated creatinine (e). This decrease in risk is associated with increased use of continuous renal replacement therapy (f), suggesting potential for improvements in clinical practice to spread the benefits of this care to a broader set of patients with lower creatinine levels.
As a second example, we see that discontinuous effects of age have been smoothed. Patient age exhibits a discontinuous impact on mortality rates (Fig. 6b), but these effects have smoothed over time. In particular, discontinuous rises in risk at 50, 55, and 60 years of age have been smoothed. While the impact of these round-number ages has reduced, age 80 continues to confer increased risk in all three datasets, suggesting that changes to the treatment of elderly patients could reduce this risk and flatten the discontinuity at 80.
As a third example, we see that counter-causal effects of serum sodium have endured. As expected from medical intuition regarding hydration, the healthy range of 135–145 mEq/L serum sodium is associated with low mortality risk in all three datasets (Fig. 6c). On either side of this healthy range, the probability of mortality increases, rising rapidly for hyponatremia below 135 mEq/L and hypernatremia above 145 mEq/L. However, in all three datasets, this medical intuition is broken at 150 mEq/L: the probability of mortality declines for extremely elevated serum sodium. This threshold of 150 mEq/L is an indicator of moderate hypernatremia11 and is associated with increased water and electrolyte repletion (Fig. 6d). As a result of aggressive treatment of severe hypernatremia and under-treatment for mild hypernatremia12, the probability of mortality is lower for patients with serum sodium above 155 mEq/L than for patients with serum sodium of 145 mEq/L. This effect is robustly supported by statistical evidence, persistent over decades of care, and clinically meaningful—any accurate model trained on these datasets has learned that severe hypernatremia is lower risk than moderate hypernatremia, so applying these data-driven models in clinical practice could divert treatments away from the group of patients who would benefit most from treatment.
Finally, as a fourth example, we see that deleterious counter-causal effects of serum creatinine have strengthened. Elevated levels of serum creatinine indicate kidney dysfunction13; however, this biomarker is not strongly associated with mortality at extremely high values (Fig. 6e). Instead, in MIMIC-IV, the highest risk is observed for patients with moderate serum creatinine in the range 3–5 mg/dL. This counter-causal effect has strengthened between MIMIC-III and MIMIC-IV. This surprisingly non-monotone risk is linked to strong changes in treatment: the use of continuous renal replacement therapy (CRRT) sharply increases at both 3 mg/dL and 6 mg/dL (Fig. 6f). These patterns suggest that such extra monitoring for kidney dysfunction and corresponding interventions like CRRT have successfully reduced the mortality risk of patients with severely-elevated creatinine below the risk of patients with healthy levels of creatinine, leaving an opportunity to improve the outcomes of more patients by more widespread use of these interventions.