From Real‐World Patient Data to Individualized Treatment Effects Using Machine Learning: Current and Future Methods to Address Underlying Challenges

Clinical decision making needs to be supported by evidence that treatments are beneficial to individual patients. Although randomized controlled trials (RCTs) are the gold standard for testing and introducing new drugs, because they focus on specific questions about establishing efficacy and safety vs. standard treatment, they do not provide a full characterization of the heterogeneity in the final intended treatment population. Conversely, real-world observational data, such as electronic health records (EHRs), contain large amounts of clinical information about heterogeneous patients and their responses to treatments. In this paper, we introduce the main opportunities and challenges in using observational data for training machine learning methods to estimate individualized treatment effects and make treatment recommendations. We describe the modeling choices of state-of-the-art machine learning methods for causal inference, developed for estimating treatment effects in both the cross-sectional and longitudinal settings. Additionally, we highlight future research directions that could lead to achieving the full potential of leveraging EHRs and machine learning for making individualized treatment recommendations. We also discuss how experimental data from RCTs and pharmacometric and quantitative systems pharmacology approaches can be used not only to improve machine learning methods, but also to provide ways of validating them. These future research directions will require collaboration across scientific disciplines to incorporate models based on RCTs and known disease processes, physiology, and pharmacology into machine learning models based on EHRs, to fully optimize the opportunity these data present.

FROM RANDOMIZED TRIALS TO PATIENT OBSERVATIONAL DATA
Clinical pharmacology as a discipline has been at the forefront of the science of identifying the optimal dose and the patient populations requiring dose adjustments. These approaches have largely depended on statistical and longitudinal approaches utilizing data from randomized controlled trials (RCTs). Regulators currently make drug approval decisions based on RCTs, and prescribing physicians use them to determine treatment approaches. 1,2 Through randomization, well-designed RCTs evenly distribute the known and unknown factors about patients into interventional and control groups, thereby reducing the potential bias due to confounding variables. However, external validity is an issue for RCTs, as findings sometimes fail to generalize beyond the study population. [3][4][5] This may be due to the narrow inclusion criteria in RCTs compared with the real world: historically, restrictions on enrollment have meant that patients with severe disease or comorbidities, elderly patients, and ethnic minorities are under-represented. 6 Although there is increasing awareness of this issue and global regulatory authorities are encouraging wider inclusion criteria in clinical trials, it remains an issue that is unlikely to be solved by RCTs and associated integrated and model-based analyses alone. 7 Nevertheless, once drugs are US Food and Drug Administration (FDA)-approved after the clinical trials stage, they start being administered to a much larger and more varied population of patients. Each patient's unique medical characteristics, including comorbidities, laboratory results, and prior treatments received, as well as the various adverse or beneficial effects observed, are stored in electronic health records (EHRs). EHRs have already been widely adopted in US hospitals. 8 Although data from EHRs have less regulatory acceptance than data from RCTs, and suffer from weaknesses such as informative missingness and observational biases, they provide large quantities of accessible, heterogeneous observational data for a wide variety of diseases and patients. We refer the readers to Beaulieu-Jones et al. 9 for a detailed review of the advantages and limitations of using real-world evidence, such as EHRs, to infer treatment effects. Estimating individualized treatment effects from EHR data represents a thriving area of research, in which machine learning methods are primed to take center stage.
Data from EHRs have already been successfully leveraged in several machine learning applications, such as modeling disease progression, 10 predicting disease deterioration 11 and risk factors, 12,13 and predicting treatment responses. 14 However, it is important not only to make predictions from EHRs, but also to understand how best to act and assign treatments. As also described by Beaulieu-Jones et al. 9 and Cave et al., 15 this requires understanding causality and being able to quantify causal relationships using the available observational data. In this context, machine learning methods for causal inference can be used to learn the effects of treatments from observational data and subsequently provide clinicians with actionable intelligence for making treatment decisions. We illustrate in Figure 1 how learning individualized treatment effects allows us to identify which patients benefit the most and least from certain treatments.
In this tutorial, we survey the recent progress that has been made in using observational data to build data-driven methods for making treatment recommendations. We refer readers to the recent tutorial on machine learning methods for background on the general approaches, 16 or to elsewhere as appropriate. Our focus is on causal inference methods for both the static (cross-sectional) setting and the longitudinal setting (where patient history and treatment timing are taken into account). We describe the challenges associated with learning from observational data, such as confounding bias, as well as the modeling choices used by machine learning methods to handle them in both settings. We illustrate the benefits of various methods and discuss the limitations and challenges of different approaches. We also note that this research is still in its early stages and that more methods are needed to achieve the full potential of learning from observational data, as well as to validate machine learning methods for causal inference.
Moreover, we note that relatively little work has been done at the interface of combining experimental data from RCTs with observational data from EHRs, or of using existing pharmacometric and quantitative systems pharmacology (QSP) approaches to provide priors for machine learning models for causal inference (or machine learning driving pharmacology). Developing methods capable of integrating data from RCTs, EHRs, and pharmacokinetic/pharmacodynamic (PK/PD) models provides the most exciting opportunity for real progress. The benefits of such interplay between machine learning and pharmacometric and QSP approaches have also been highlighted by Ribba et al. 17 in the context of combining reinforcement learning methods for precision dosing with PK/PD models. 18,19 This is an opportunity that requires collaboration among statisticians, clinical pharmacologists, and machine learning researchers to fully deliver on the promise that access to extensive data sources provides.

INDIVIDUALIZED TREATMENT RECOMMENDATIONS IN THE STATIC SETTING
What can be done using machine learning and what kind of patient observational data is needed?
We first consider estimating individualized treatment effects in the cross-sectional or static setting (i.e., when a single treatment decision needs to be made based on a one-time measurement of the patient's current features). The information about patients collected in EHRs can be used by data-driven methods to inform clinical decision making. However, building machine learning methods to extract actionable intelligence from observational patient data involves causal inference, which goes beyond standard supervised learning methods for prediction. In this section, we describe what is possible to achieve using current state-of-the-art machine learning methods for causal inference. In the static (cross-sectional) setting, the aim is to make a one-time treatment decision. We will use observational data containing clinical information, such as age, sex, comorbidities, symptoms, and laboratory measurements, collected before the treatment decision is made. Conditioned on this patient data, machine learning methods can be used to estimate the effects of single or multiple treatments on various patient outcomes. In each setting, the observational data needs to contain the corresponding information about the type of treatments and patient outcomes to be able to train the data-driven models.
The standard scenario involves deciding whether or not to treat a patient. In this case, we need observational data about patients who have received the treatment, as well as patients who have not, together with their corresponding outcomes. The outcome of interest can be, for instance, 1-year mortality (binary outcome), a specific disease (categorical outcome), or the value of a specific laboratory measurement (continuous outcome) that is observed after the treatment decision is made. Using such observational data, machine learning methods can be trained to estimate a patient's outcome both when they receive the treatment and when they do not receive the treatment.

Figure 1: By using causal inference methods to estimate individualized treatment effects, we can understand which patients benefit from certain medications and which patients do not, thus enabling us to make personalized treatment recommendations. We describe machine learning methods for causal inference, which are capable of estimating individualized treatment effects.
By comparing these two outcomes, physicians can then decide whether the treatment will benefit the patient or not. It is also possible to estimate multiple outcomes, such as patient recovery and associated side effects of treatments-in which case, the treatment decision can be a trade-off between the different outcomes.
Additional realistic treatment scenarios involve choosing one treatment out of multiple treatment options or deciding on the optimal combination of treatments. In the case of multiple treatments, interactions between treatments can also be considered. Moreover, treatments often have an associated dosage, which is critical for assessing the therapeutic effect of the treatment on a patient. Causal inference methods can also be used to estimate individualized dose-response curves for patients and thus lead to decisions about optimal treatment dosage. To summarize, we provide in Table 1 some examples of treatment decisions that machine learning methods can be used for, with references to machine learning methods that can handle the different types of treatments present in observational data. Although we note that these methods are general and can be applied to various types of patient characteristics in observational data, in Table 1 we also provide some examples of patient clinical data and patient outcomes that can be used for training these machine learning methods. Figure 2 provides a theoretical example, which illustrates how causal inference methods can be used to inform treatment decisions for patients with breast cancer. The EHR for a patient with breast cancer will contain, at diagnosis, data about the tumor, such as tumor size, number of positive nodes, estrogen receptor status, and genetic information, as well as clinical features describing the patient's overall health state, including comorbidities and laboratory measurements. Using this diagnosis information, we can estimate the causal effect of various treatment alternatives (postsurgery, e.g., chemotherapy and radiotherapy) on a patient's outcome of interest. Some possible patient outcomes are 5- or 10-year survival, probability of relapse, side effects, or some other measurable feature of interest associated with the patient's health.
Based on the estimated individualized treatment responses and individualized dose-response curves, physicians can then choose the best treatment and the optimal dosage for each patient.
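As a toy illustration of this selection step, choosing a recommendation amounts to an argmax over estimated outcomes for each candidate treatment. The estimator functions and feature names below are made up for the example, standing in for trained causal inference models, not taken from any of the methods discussed:

```python
# Toy sketch: each (hypothetical) estimator maps patient features to an
# estimated outcome under that treatment, e.g., 5-year survival probability.
estimated_outcome = {
    "chemo":       lambda x: 0.70 + 0.10 * x["er_positive"] - 0.02 * x["n_comorbid"],
    "radio":       lambda x: 0.65 + 0.05 * x["er_positive"],
    "chemo+radio": lambda x: 0.75 - 0.05 * x["n_comorbid"],
}

def recommend(patient):
    """Pick the treatment with the best estimated outcome for this patient."""
    return max(estimated_outcome, key=lambda t: estimated_outcome[t](patient))

frail = {"er_positive": 1, "n_comorbid": 4}   # many comorbidities
fit   = {"er_positive": 0, "n_comorbid": 0}   # no comorbidities
```

Under these made-up estimators, the frail patient is steered away from the combination therapy while the fit patient is not, which is exactly the kind of individualized trade-off the estimated responses enable.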
Challenges in using observational data for estimating individualized treatment effects

To estimate individualized treatment effects, machine learning methods need access to observational data containing information about how patients with different characteristics have responded to treatments. By learning from how other patients have responded to the treatment, these methods can be trained to estimate potential outcomes (i.e., patient outcomes under different treatment options) 26-28 conditioned on the patient's clinical state. Trained models can then be applied to estimate potential outcomes for new patients. We provide a table of definitions in Table S1.
Nevertheless, learning from observational data poses several challenges and requires strong assumptions that need to be rigorously assessed before performing causal inference. Whereas in RCTs the treatments are assigned randomly, in observational data, the treatment is often highly dependent on the patient's clinical characteristics. For instance, patients with cancer with more aggressive tumors are more likely to receive more extensive treatments but are also more prone to experience worse outcomes despite the treatment intervention. 29 This induces treatment selection bias in the observational data. Without adjusting for the selection bias, we can incorrectly conclude that the extensive treatment is harmful to the patients with aggressive tumors. Additionally, the selection bias also creates covariate shift in the distribution of patients receiving different treatments. For instance, when sicker patients receive more extensive treatments, the distribution of patients receiving extensive treatments will be different from the distribution of patients receiving nonextensive treatments. Standard supervised learning methods in this context will be biased toward the treatment assignments present in observational data and will not be able to generalize well in estimating patient outcomes for other possible treatment options.
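A small simulation (hypothetical numbers, not from any study) makes this bias concrete: when sicker patients are more likely to be treated, a naive comparison of group means can completely mask a genuinely beneficial treatment, whereas adjusting for the confounder recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: disease "severity" drives both treatment and outcome.
severity = rng.uniform(0, 1, n)
# Sicker patients are more likely to receive the (extensive) treatment...
treated = rng.uniform(0, 1, n) < severity
# ...and have worse outcomes; the treatment itself adds a +1.0 benefit.
outcome = 1.0 * treated - 3.0 * severity + rng.normal(0, 0.1, n)

# Naive comparison of group means is badly biased (here it is close to zero,
# wrongly suggesting the treatment does nothing).
naive = outcome[treated].mean() - outcome[~treated].mean()

# Stratifying on the confounder (a crude adjustment) recovers the true effect.
bins = np.digitize(severity, np.linspace(0, 1, 11)[1:-1])
effects = [outcome[treated & (bins == b)].mean()
           - outcome[~treated & (bins == b)].mean() for b in range(10)]
adjusted = np.mean(effects)
```

With this data-generating process, the naive estimate is near 0 while the stratified estimate is near the true effect of 1.0, illustrating why adjustment for confounders is essential before interpreting observational comparisons.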
Moreover, to be able to perform causal inference using observational patient data, the following assumptions need to hold: overlap (a.k.a. positivity), as well as no hidden confounders (a.k.a. unconfoundedness). 30,31 Overlap means that there is common support between the treatment groups (i.e., that each patient has a non-zero probability of receiving each treatment option). If we do not observe treatment alternatives in similar contexts, it is not possible to make reliable treatment recommendations. For instance, if patients with diabetes always receive the same treatment, we cannot estimate the effects of other possible treatments on these patients. In an observational dataset, overlap can be evaluated by computing treatment probabilities for each patient. If these are bounded away from 0 and 1 for each patient, then there is overlap between the treatment groups.

Table 1 (excerpt): Patient outcomes: single outcome (e.g., survival probability); multiple outcomes (e.g., probability of recovery and associated side effects of treatment). Treatment decisions: optimal single treatment, 21,23 optimal combinations of treatments, 24 optimal dosage by trading off added benefits of treatments and side effects. 25 We also included references to specific machine learning methods that can handle the different types of treatments present in observational data. BMI, body mass index.

Overlap becomes more problematic when we have high-dimensional
features for the patients. Several methods have been developed to identify regions in observational studies where all treatment groups are well-represented. [32][33][34] The current state-of-the-art method for assessing overlap, developed by Johansson et al., 34 identifies overlap regions by finding minimum volume sets 35 of the observational distribution that are subject to coverage constraints by estimating treatment assignment propensities. The proposed algorithm, OverRule, 34 outputs a rule-based characterization of overlap between treated populations where causal conclusions from observational data are valid, and is applied on a dataset of postsurgical patients containing information about the effect of opioid prescriptions on future misuse. 36 The unconfoundedness assumption ensures that we observe all the confounders (i.e., all the variables affecting the treatment assignment and the outcome). This is needed to be able to correctly adjust for the confounders when estimating individualized treatment effects. Nevertheless, having no hidden confounders is a strong assumption, and it cannot be tested in practice: because counterfactuals are never observed, it is not possible to test for the existence of hidden confounders that might affect them. In this context, domain knowledge becomes vital to assess the validity of this assumption. In particular, elicitation from doctors or key opinion leaders may be needed to check whether all confounding variables are included in the observational data. 9 Additionally, prior knowledge of causal links among patient characteristics, treatments, and outcomes can also be leveraged to identify potential unobserved confounders. 37,38 Furthermore, sensitivity analyses can be used to evaluate the potential impact that an unmeasured confounder could have on the estimation of treatment effects. [39][40][41] For instance, Franks et al. 41 propose factorizing the joint distribution of factual (observed) and counterfactual (unobserved) patient outcomes into factors that are nonparametrically identified from the data and factors that cannot be uniquely identified from the data. This factorization enables quantifying the dependence among the partially observed patient outcomes, treatments, and patient features. Thus, sensitivity analyses can be used to determine the suitability of using treatment effect methods for an observational dataset of interest. Alternatively, several methods have been proposed to make use of proxy variables [42][43][44][45][46] or the presence of multiple treatments [47][48][49][50] to infer latent variables that can act as substitutes for the hidden confounders. Wang and Blei 47 observed that the dependencies in the assignment of multiple treatments can be used to infer latent variables that render the treatments independent and can thus be used as substitutes for potential hidden confounders present in the observational dataset. Zhang et al. 50 extended the work of Wang and Blei 47 and developed the Medical Deconfounder, a method for reliably estimating the individualized effects of multiple treatment assignments on patients' laboratory tests. In this case, latent variables are used to correct for unobserved confounders, such as body mass index, that affect the patient's medication and clinical outcome of interest. The Medical Deconfounder was applied to a cohort of patients with type 2 diabetes mellitus with the aim of determining causality and the magnitude of the effect of their medications on their hemoglobin A1c (HbA1c) levels. It was similarly applied to a cohort of patients with potassium disorders to determine the effect of their medications on potassium levels. Zhang et al. 50 illustrate that, without adjusting for hidden confounders in EHRs, standard methods would spuriously identify noncausal treatments (e.g., noncausal treatments for managing HbA1c levels in patients with diabetes).
In summary, whereas machine learning methodology has been developed to handle the covariate shift between different treatment groups, it is crucial to assess the validity of the overlap and unconfoundedness assumptions before training causal inference methods on the observational data. If these assumptions fail to hold, then the observational data would not be suitable for making personalized treatment recommendations.
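The overlap check described earlier, computing treatment probabilities and verifying that they are bounded away from 0 and 1, can be sketched on simulated data. The single age covariate, the cutoff at which treatment becomes near-deterministic, and the 5-year strata are all illustrative assumptions, not taken from any study:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical scenario: age is the only covariate, and patients over ~70
# almost always receive the treatment, so overlap fails in that region.
age = rng.uniform(20, 90, n)
p_treat = np.clip((age - 20) / 50, 0.02, 0.99)
p_treat = np.where(age > 70, 0.995, p_treat)
treated = rng.uniform(0, 1, n) < p_treat

# Crude propensity estimate: treated fraction within 5-year age strata.
edges = np.arange(20, 91, 5)
bin_idx = np.digitize(age, edges[1:-1])
propensity = np.array([treated[bin_idx == b].mean()
                       for b in range(len(edges) - 1)])

# Overlap holds only where estimated propensities are bounded away from 0 and 1.
eps = 0.03
overlap_ok = (propensity > eps) & (propensity < 1 - eps)
```

Here the strata below age 70 pass the check, while those above fail it, so treatment-effect estimates for the elderly patients in this dataset would not be trustworthy; rule-based methods such as OverRule automate this kind of characterization in high dimensions.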

Modeling choices and methods for individualized treatment effects
For simplicity, consider the problem of binary treatment assignment, which is the most studied in the causal inference literature. In this case, the observational data consists of patient features X_i, assigned treatment W_i ∈ {0, 1}, and observed outcome Y_i for each patient i. The potential outcomes framework 26,27 allows us to define two potential outcomes for the patient, Y_i(0) and Y_i(1), representing the patient's outcomes when they have not received the treatment (W_i = 0) and when they have received the treatment (W_i = 1), respectively.

Figure 2: Causal inference methods can be trained on observational patient data from electronic health records and then used to estimate individualized patient outcomes under possible treatment options. We illustrate how these estimates can be used to decide, at diagnosis time, whether chemotherapy is beneficial and what the optimal dosage is for a patient with breast cancer.

The factual patient outcome, Y_i = Y_i(W_i), is the one recorded in the observational dataset; the counterfactual outcome is unobserved, and represents what would have happened to the patient under a different treatment option than the one assigned in the observational dataset. The individualized treatment effect is defined as the difference in potential outcomes, ITE(x) = E[Y(1) − Y(0) | X = x]. The fundamental challenge of causal inference is that the counterfactuals are never observed. This makes causal inference significantly different from standard supervised learning. In supervised learning, all data points are labeled, and a machine learning model can be trained to map the input features to the label. In causal inference, the label needed for assessing the benefit of treatment is given by the difference in potential outcomes, Y(1) − Y(0). However, we only have access to part of this label, namely the factual outcome of the patient; therefore, the counterfactual outcome needed for computing the individualized treatment effect cannot be observed.
As illustrated in Figure 3, the observational (training) dataset only contains information about the factual treatment and factual patient outcome. However, computing the ground-truth causal effects requires access to both factual and counterfactual outcomes. To solve this problem, machine learning methods are trained to estimate both response surfaces, f_0(x) = E[Y(0) | X = x] and f_1(x) = E[Y(1) | X = x], using the observational data. By learning f_0 and f_1, these data-driven methods can estimate both potential outcomes conditioned on the patient's features, and thus can be used to compute the individualized treatment effect. Several modeling alternatives exist for estimating the treatment response functions. Moreover, as previously described, the treatment assignment mechanism creates selection bias in the observational dataset, which induces a covariate shift between the treated and untreated patient distributions. An additional modeling choice comes from handling this distributional mismatch between the training and testing distributions. Thus, machine learning methods for estimating individualized treatment effects are characterized in terms of how they model the treatments (in terms of the response surfaces), as well as how they handle selection bias. We describe each in turn.
There are two options for using the observational data in a supervised learning setting to estimate the treatment response surfaces. One option is to fit a single regression model in which the treatment assignment variable is added to the feature space: f_0(x) = f(x, 0) and f_1(x) = f(x, 1). Such a supervised model shares all the training samples, but it is less flexible in terms of modeling the response surfaces. Another option fits two separate regression models for the treated and untreated patients: f_0(x) and f_1(x). Although this provides more flexibility in modeling the response surfaces, it has the disadvantage of splitting the training samples between the two supervised regression models. A third option, adopted by many of the recent machine learning methods, involves having a shared structure between the two response functions, while at the same time fitting separate outcome models for the different treatments. The latter two options allow for model structures with "outcome-specific" hyperparameters. We illustrate these approaches in Figure 4; note that in this figure the second and third options mentioned in this paragraph fit under the approach on the right.
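The first two options can be sketched on simulated data. The linear outcome model, its coefficients, and the randomized treatment (chosen so selection bias does not obscure the comparison) are illustrative assumptions; with enough data, both options recover the same heterogeneous effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical simulation: three patient features, randomized treatment, and
# a treatment effect that varies with the first feature (2.0 + 1.5 * x0).
x = rng.normal(size=(n, 3))
w = (rng.uniform(size=n) < 0.5).astype(float)
y = x @ np.array([1.0, -0.5, 0.2]) + w * (2.0 + 1.5 * x[:, 0]) \
    + rng.normal(0, 0.1, n)

def fit_linear(features, target):
    """Least-squares fit with an intercept; returns a prediction function."""
    a = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(a, target, rcond=None)
    return lambda f: np.column_stack([np.ones(len(f)), f]) @ coef

# Option 1: a single model with treatment (plus an interaction) as features.
single = fit_linear(np.column_stack([x, w, w * x[:, 0]]), y)

# Option 2: two separate models for the treated and untreated groups.
f0 = fit_linear(x[w == 0], y[w == 0])
f1 = fit_linear(x[w == 1], y[w == 1])

xq = np.array([[1.0, 0.0, 0.0]])       # query patient with x0 = 1
ite_two = (f1(xq) - f0(xq))[0]         # true effect here is 3.5
ite_one = (single(np.column_stack([xq, [[1.0]], [[1.0]]]))
           - single(np.column_stack([xq, [[0.0]], [[0.0]]])))[0]
```

Note that the single-model option only captures effect heterogeneity here because the interaction term was added by hand; the two-model option gets that flexibility for free, which is the trade-off the text describes.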
In terms of handling the treatment selection bias, a common approach is to use the propensity score (i.e., the probability that a patient receives a treatment) to perform inverse probability of treatment weighting of the loss in the prediction models, to perform propensity matching, or as part of regularization schemes (e.g., propensity-dropout). A different approach involves building a "balancing representation" that makes the treated and control distributions more similar. Balancing representations, also known as treatment invariant representations, are representations of the patient covariates that have equal probabilities under the possible treatment options. Conditioning on the balancing representations to estimate the patient outcomes effectively reduces the selection bias present in observational datasets. 51 Many methods have been developed for estimating individualized treatment effects, each using different ways of modeling response surfaces and handling selection bias. Bayesian Additive Regression Trees, 52,53 Causal Forest, 54,55 and Bayesian Causal Forest 56 use tree-based estimators. Balancing Neural Networks, 20 Treatment-Agnostic Representation Network, 51 Counterfactual Regression, 51 and Similarity Preserved Individual Treatment Effect 57 are based on neural networks and build balancing representations of the patients' features. Deep Counterfactual Networks with Propensity Dropout 58 are also based on neural networks and use a regularization scheme based on the propensity score. MTDL-KNN 24 is a multitask deep learning model that builds patient representations predictive of their assigned treatment and uses matching based on K-Nearest Neighbors to estimate patient outcomes. Causal Multi-task Gaussian Processes (CMGP) 21 and Non-stationary Gaussian Processes (NSGP) 22 model the data distribution through Gaussian Processes. 
Conversely, GANITE 23 uses the generative adversarial framework 59 to model the observational data distribution and estimate counterfactual outcomes. Table 2 provides a summary of the current state-of-the-art methods for estimating treatment effects and highlights their design choices with respect to using outcome-specific hyperparameters, handling the selection bias, and the type of machine learning model used.
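Of the bias-handling strategies mentioned above, inverse probability of treatment weighting is the simplest to sketch. The simulation below is illustrative and, for clarity, weights by the true propensity score; in practice the propensity must itself be estimated, which is where the listed methods differ:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical confounded data: sicker patients are more likely to be treated.
severity = rng.uniform(0.05, 0.95, n)
e = severity                              # propensity score P(W = 1 | X)
w = rng.uniform(size=n) < e
y = 1.0 * w - 3.0 * severity + rng.normal(0, 0.1, n)   # true effect = +1.0

# Hajek-style IPW estimate of the average treatment effect: weight each
# patient by the inverse probability of the treatment they actually received.
wt_t = w / e
wt_c = (~w) / (1 - e)
ate_ipw = (wt_t * y).sum() / wt_t.sum() - (wt_c * y).sum() / wt_c.sum()

naive = y[w].mean() - y[~w].mean()        # biased comparison, for contrast
```

Keeping the propensity bounded away from 0 and 1 (here within [0.05, 0.95]) is what keeps the weights, and hence the variance of the estimate, under control; this is the overlap assumption resurfacing in the estimator itself.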
Several of these methods allow for the computation of confidence intervals for the potential outcome estimates: CMGP, NSGP, Causal Forest, and GANITE. In addition, the Causal Forests, CMGP, and NSGP models have built-in variable selection, which allows us to understand which patient features are relevant for the patient's response to treatment. The methods described, except for GANITE, were designed for binary treatment assignments. Nevertheless, by comparing treatments in pairs, such methods can also be adapted to the case of multiple treatment options. GANITE, on the other hand, can be directly used to estimate potential outcomes for multiple treatments.
Alaa and van der Schaar 22,60 developed a unifying theory highlighting the fundamental limits of learning individualized treatment effects from observational data, as well as how they can be achieved. In light of the theory, ref. 22 provides several guidelines for the algorithmic design choice depending on the observational data available. Thus, ref. 22 shows that when limited amounts of observational data are available (small sample regime), it is vital to handle the selection bias, and it is better to share training data between the response surfaces, either by fitting a single regression model for the two response surfaces with the treatment as an input feature, or by using multitask learning to estimate the response surfaces. The causal inference methods that are most appropriate in this case are CMGP, NSGP, Counterfactual Regression, and Similarity Preserved Individual Treatment Effect. Conversely, when the observational dataset has enough samples (large sample regime), selection bias no longer represents a serious problem and the model structure plays a more crucial role, that is, the way in which outcomes are modeled and how hyperparameter tuning is performed. Thus, using outcome-specific hyperparameters becomes more important. The methods that are more appropriate in this case are Treatment-Agnostic Representation Network and GANITE.

Figure 4: Different approaches for estimating individualized patient outcomes using machine learning. Methods that use the approach on the left are less flexible, as they include the treatment as an input feature and thus have the same outcome model for both treated and untreated patients. On the other hand, methods using the approach on the right fit different models for the different treatments and thus provide more flexibility in estimating patient outcomes. Such an approach allows for model structures with "outcome-specific" hyperparameters.

These theoretical results are essential for understanding which algorithms to use depending on the patient observational data available for training. These methods for causal inference in the static setting have been designed to be general and applicable to various types of observational datasets. They are usually evaluated on semisynthetic datasets with real patient features and simulated outcomes. The outcomes need to be simulated in order to have access to the counterfactuals. Alaa and van der Schaar 21 evaluate the CMGP model on estimating the survival benefits of using left ventricular assist devices in patients awaiting a heart transplant. Moreover, Wendling et al. 61 evaluate Bayesian Additive Regression Trees 52,53 and Causal Forest 54,55 on estimating the impact of the volume of transcatheter aortic valve replacement on 30-day hospital readmission, 62 as well as the effect of endovascular and surgical revascularization therapy for critical limb ischemia on 30-day hospital readmission. 63 Alternatively, ref. 24 also evaluates the ability of their method to match the factual treatment recommendations present in the dataset. For this purpose, they use a heart failure dataset and assess the effect of the following medications on 1-year hospital readmission: ACEI/ARBs, beta-blockers, and aldosterone antagonists.

Individualized dose-response estimation
Most of the methods developed for estimating individualized treatment effects from observational data focus on the binary or categorical treatment settings, and very few methods consider the impact of the associated treatment dosage on the patient's response. Nevertheless, when making treatment recommendations, the specific dosage chosen for each treatment also has a critical role in the decision. Observational patient data about how different patients have responded to different dosages can also be used to train machine learning methods for estimating individualized dose-response curves. Naturally, being able to estimate such curves for multiple treatments can aid us in determining the best treatment-dosage pair for each patient, as well as to obtain therapeutic indices of drugs tailored to the heterogeneous patient features. This becomes increasingly important when the relationships among drug toxicity, efficacy, and heterogeneous patient features are complex.
In this setting, we need observational data consisting of patient features X_i, assigned treatment W_i, assigned dosage D_i, and observed outcome Y_i for each patient i. The potential outcomes framework 26,27 can also be extended to dose-response estimation. Thus, for each treatment-dosage pair (w, d), let Y_i^(w,d) (e.g., 1-year survival probability) be the potential outcome for patient i. The aim of machine learning methods is to derive unbiased estimates of the potential outcomes μ(w, d, x) = E[Y^(w,d) | X = x], for each treatment-dosage pair (w, d) and for any patient covariates X = x, where μ(w, d, x) represents the individualized dose-response function.
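As a toy illustration of these quantities, the following Python sketch fits an estimate of μ(w, d, x) per treatment arm on simulated data (the response surfaces and covariates are invented purely for the demonstration) and then selects the best treatment-dosage pair for a new patient by maximizing the fitted surfaces over a dosage grid.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated observational data; the response surfaces below are invented
# to illustrate the notation, not taken from any cited method.
n = 5000
x = rng.uniform(0, 1, size=n)                      # patient covariate
w = rng.binomial(1, 0.5, size=n)                   # two treatments (unconfounded here)
d = rng.uniform(0, 1, size=n)                      # assigned dosage

def mu_true(w, d, x):
    # Treatment 1 peaks at a patient-dependent dosage d = x; treatment 0 at d = 0.5.
    return np.where(w == 1, 1.0 - (d - x) ** 2, 0.5 - 0.5 * (d - 0.5) ** 2)

y = mu_true(w, d, x) + rng.normal(scale=0.05, size=n)

def features(d, x):
    return np.column_stack([np.ones(len(d)), d, d ** 2, x, x ** 2, d * x])

# One regression per treatment arm: an estimate of mu(w, d, x).
betas = {}
for arm in (0, 1):
    m = w == arm
    betas[arm], *_ = np.linalg.lstsq(features(d[m], x[m]), y[m], rcond=None)

def best_pair(x_new, dosages=np.linspace(0, 1, 101)):
    """Return the (treatment, dosage) pair maximizing the fitted mu(w, d, x_new)."""
    xs = np.full_like(dosages, x_new)
    preds = {arm: features(dosages, xs) @ betas[arm] for arm in (0, 1)}
    arm = max(preds, key=lambda a: preds[a].max())
    return arm, float(dosages[preds[arm].argmax()])

# Recommendation for a new patient with covariate x = 0.7
# (the simulated optimum is treatment 1 at dosage 0.7).
rec_arm, rec_dose = best_pair(0.7)
```

In this idealized setup the dosage is assigned at random, so a plain regression suffices; the next paragraphs describe why real observational data, with treatment and dosage selection bias, make the problem much harder.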
As before, we again face the problem of only observing the factual treatment type W_i and dosage D_i, as well as the factual patient outcome Y_i = Y_i^(W_i, D_i) for patient i. However, in the dose-response setting, this problem is exacerbated by the fact that the number of counterfactual outcomes is potentially infinite. Thus, it is no longer possible to estimate a separate response surface for each possible treatment-dosage pair. Moreover, observational data will present not only treatment assignment bias, but also dosage assignment bias. Due to the continuous nature of the dosage parameter, standard methods for adjusting for the treatment selection bias in the categorical treatment setting no longer apply, or at least need to be significantly adjusted to also handle the bias in the dosage assignment.
Several methods developed for estimating individualized dose-response curves make use of the generalized propensity score, [64][65][66] which represents the probability of the treatment-dosage assignment given the patient's characteristics. The propensity score is used to weight individual samples in the regression method for estimating the outcome. Such methods require correct specification of the generalized propensity score model and can also be numerically unstable due to extreme propensity weights. More recently, Schwab et al. 67 developed a deep learning model for this problem, named the Dose-Response Network (DRNet). DRNet subdivides the dosage interval for each treatment into equally sized subintervals and then uses multitask learning to estimate patient outcomes for each subinterval. However, this architecture restricts the shape of the dose-response function; moreover, whereas DRNet can be augmented to handle the treatment selection bias, it does not handle the dosage selection bias. Bica et al. 25 propose SCIGAN, a model that uses generative adversarial networks to learn the data distribution of the counterfactual outcomes and thus generate individualized dose-response curves. SCIGAN does not place any restrictions on the form of the treatment-dose response functions and is capable of estimating patient outcomes for multiple treatments, each with an associated dosage parameter. These methods have also been evaluated on semisynthetic medical datasets with simulated outcomes and have considered the effect of antibiotic dosage on patient outcomes 25 or the effect of dosage in cancer treatments, such as chemotherapy, radiotherapy, and immunotherapy, on patient relapse. 25,67 We note that, despite its importance, the dose-response problem has been less studied in the causal inference literature.
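The generalized propensity score idea can be sketched in a few lines of Python on an invented linear-Gaussian data-generating process: model the conditional density of the dosage given the covariates, and use its inverse (stabilized by the marginal dosage density) to weight an outcome regression. This is an illustration of the weighting scheme only, not an implementation of any cited method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented example: the dosage d depends on the covariate x (dosage selection
# bias), and the outcome depends on both; the true marginal dose slope is 2.0.
n = 10000
x = rng.normal(size=n)
d = 0.5 * x + rng.normal(size=n)                       # biased dosage assignment
y = 1.0 + x + 2.0 * d + rng.normal(scale=0.1, size=n)

# Generalized propensity score: the density of d given x, here modeled as a
# normal distribution whose mean is fit by least squares (correctly specified
# by construction; misspecifying this model would bias the estimate).
A = np.column_stack([np.ones(n), x])
alpha, *_ = np.linalg.lstsq(A, d, rcond=None)
resid = d - A @ alpha
sigma = resid.std()
gps = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Stabilized weights: marginal dosage density over the conditional (GPS) one.
s = d.std()
marginal = np.exp(-0.5 * ((d - d.mean()) / s) ** 2) / (s * np.sqrt(2 * np.pi))
weights = marginal / gps

# Weighted regression of the outcome on the dosage alone recovers the
# population dose-response E[Y(d)] = 1 + 2d despite the biased assignment.
B = np.column_stack([np.ones(n), d])
sw = np.sqrt(weights)
beta, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
```

Note that patients whose GPS is small receive very large weights, which is exactly the source of the numerical instability mentioned above.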
To be able to achieve the full potential of machine learning methods for learning individualized dose-response curves, additional theory is needed to understand which model structures learn more efficiently, how to best handle treatment and dosage selection bias, and how to best select models in this context. These challenges are recognized in the static setting, but the dose-response problem using longitudinal data is even more complex.

INDIVIDUALIZED EFFECTS OF TIME-DEPENDENT TREATMENTS
How can machine learning help to make treatment decisions in the longitudinal setting?
EHRs also contain information about time-dependent patient covariates and treatment strategies, including how treatment plans are changed over time due to adverse effects or drug resistance. Methods that learn treatment effects in the cross-sectional setting cannot be used to understand the optimal time to administer a treatment, when a treatment regime needs to be stopped, or the order in which treatments should be given to obtain the best patient response. Answering these questions involves modeling a patient's history and how the patient's features are affected by treatments in a longitudinal setting. By estimating counterfactual trajectories under different future treatment plans, we can then choose the sequence and timing of treatments that lead to the best patient outcome. Machine learning methods for estimating treatment effects over time can, therefore, be used to support decision making in the temporal setting.
Besides baseline patient features, such as sex and genetic information, longitudinal information about the patient is collected and added to their EHRs for each follow-up event. This information may include laboratory tests, specific treatments administered, the patient's performance status while undergoing treatment, etc. While the patient is receiving particular treatments, it is possible that the treatment's toxicity becomes too high (for instance, in the case of chemotherapy) or the patient starts developing resistance to the treatment. In this case, machine learning methods can be used to estimate what would happen to the patient if the treatment strategy were changed. The counterfactual outcomes that depend on the patient's history can be used to decide what is the best future treatment plan. The counterfactual predictions can be made in a dynamic fashion, by modeling and updating the patient's state over time to consider all past medical events. At each time point, we can assess whether the treatment strategy chosen for the patient is beneficial or if it needs to be modified to prevent adverse outcomes.
In addition to estimating single or multiple patient outcomes, in the longitudinal setting, an outcome of interest can either be observed immediately after receiving the treatment (at the next timestep) or after a planned sequence of treatments. In terms of treatment options, we can model time-dependent binary treatment applications (i.e., treating or not treating at a specific time), multiple combinations of treatments, as well as treatment dosages. Machine learning methods for estimating individualized treatment effects in these scenarios can be used not only for choosing the best sequences of treatments, but also for choosing their best timing. In Figure 5 we illustrate such an example for a patient with breast cancer.

Challenges in using observational data for treatment decisions in the time-series setting
Estimating treatment effects in the longitudinal setting is more challenging than in the cross-sectional setting due to the presence of time-dependent confounders: patient covariates that are affected by past treatments and that then influence future treatments and outcomes. For instance, consider patients with type II diabetes for whom we want to estimate the effect of an antiglycemic drug on the risk of experiencing a cardiac event. Sustained high blood glucose levels for a patient will result in an escalation of the treatment, while also leading to an increased risk of cardiac events. Thus, blood glucose levels are confounders that are part of the causal path to the risk of the cardiac event. Without adjusting for the bias introduced by these confounders, we may incorrectly conclude that the antiglycemic drug is harmful. Because blood glucose levels vary over time and are affected by past treatments (e.g., an effective antiglycemic drug will lower the blood glucose levels), they are considered time-varying confounders.
In the longitudinal setting, we once again only observe factual patient trajectories, but we need to estimate counterfactuals in order to make treatment decisions. To be able to identify the counterfactual patient outcomes from observational data, we also need the overlap and unconfoundedness assumptions to hold. In this case, overlap means that, at each timestep, each treatment option has a non-zero probability of being administered. The overlap assumption can be assessed by computing the probability of a patient receiving each possible treatment at each time point. Moreover, the unconfoundedness assumption in the sequential setting means that at each timestep we observe all variables affecting the patient's treatment and outcome. Validating this assumption requires domain expert knowledge, performing sensitivity analysis, [39][40][41] or estimating substitutes for the hidden confounders. 49 For instance, Bica et al. 49 propose taking advantage of the sequential assignment of multiple treatments in order to estimate the effect of time-varying treatments in the presence of hidden confounders. They apply their model to improve the estimation of the effect of antibiotics, mechanical ventilation, and vasopressors on the patient's white blood cell count, blood pressure, and oxygen saturation.

Figure 5 Causal inference methods for estimating individualized effects of time-dependent treatments need to model the patient's history in order to make counterfactual predictions of potential outcomes for possible sequences of future treatments. We illustrate an example of how such predictions can be used to decide the best timing and order of chemotherapy and radiotherapy treatments for a patient with breast cancer.

Modeling choices and methods for estimating individualized effects of time-dependent treatments
In the setting of estimating the effects of time-varying treatments, we consider that for each patient i we observe time-dependent patient features X_{i,t}, treatments administered W_{i,t}, and outcomes Y_{i,t+1} for T_i timesteps. The dependence on time is indicated by the subscript t. Note that although we provide notation for the discrete-time setting, this framework can be easily extended to continuous time. Moreover, the patient can also have baseline covariates V_i, such as genetic information, age, and sex.
The potential outcomes framework proposed by refs. 26, 27 for the cross-sectional setting has been extended by ref. 28 to account for time-varying treatments. Thus, let Y(w) be the potential outcomes, either factual or counterfactual, for each possible course of treatment w. In this setting, w can represent either an individual treatment application or a possible sequence of future treatments. Let H_{i,t} denote the patient history, which contains the baseline variables, the evolution of the time-dependent features, and the treatment assignments until time t. We are interested in characterizing the distribution of potential outcomes for any patient history H_t: E[Y(w_{t,t+τ-1}) | H_t], where w_{t,t+τ-1} = (w_t, ..., w_{t+τ-1}) represents a possible sequence of treatments from timestep t until just before the potential outcome Y_{t+τ} is observed.
Modeling choices in the estimation of treatment effects over time involve handling the time-dependent confounding bias, handling discrete-time or continuous-time interventions, and modeling the progression of one or multiple patient covariates of interest.
The main methods for handling the time-dependent confounding bias were originally proposed in the epidemiology literature and include the g-computation formula, g-estimation of structural nested mean models, and inverse probability of treatment weighting estimation of marginal structural models. 28,[68][69][70] These methods are based on strong theoretical foundations for removing the time-dependent confounding bias. However, they were originally developed with prediction models based on linear/logistic regression, which make strong assumptions about the underlying data structure (i.e., that the outcome is a linear function of the history of patient covariates and treatments). Additionally, to handle the changes in the patient covariates and treatments over time, these prediction models also make strong assumptions about which of the previous timesteps affect the current one, and require different sets of coefficients to be estimated at each timestep. Consequently, such methods are not capable of handling complex nonlinear relationships and long-term dependencies among the history of patient covariates, treatments, and potential outcomes.
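The inverse probability of treatment weighting idea behind marginal structural models can be illustrated with a two-timestep toy simulation (all coefficients invented): a covariate L is affected by the first treatment and drives both the second treatment and the outcome, so a naive regression is biased, while the weighted regression recovers the causal effects.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-timestep toy model of time-dependent confounding (all coefficients
# invented): the covariate L is affected by the first treatment A1 and then
# drives both the second treatment A2 and the outcome Y.
n = 100_000
L1 = rng.normal(size=n)                                  # confounder at t = 1
p1 = 1 / (1 + np.exp(-0.5 * L1))
A1 = rng.binomial(1, p1)                                 # treatment at t = 1
L2 = L1 - A1 + rng.normal(scale=0.5, size=n)             # lowered by treatment
p2 = 1 / (1 + np.exp(-0.5 * L2))
A2 = rng.binomial(1, p2)                                 # treatment at t = 2
Y = L2 - A2 + rng.normal(scale=0.5, size=n)              # true E[Y(a1, a2)] = -a1 - a2

# Inverse probability of treatment weights. The true propensities are known in
# this simulation; in practice they must be estimated (e.g., RMSNs, discussed
# below, use recurrent neural networks for this step).
iptw = 1.0 / (np.where(A1 == 1, p1, 1 - p1) * np.where(A2 == 1, p2, 1 - p2))

X = np.column_stack([np.ones(n), A1, A2])

# Naive regression of Y on the treatments is biased by the confounders ...
naive, *_ = np.linalg.lstsq(X, Y, rcond=None)

# ... whereas the weighted regression recovers the causal effects (-1, -1),
# because weighting creates a pseudo-population in which, at each timestep,
# treatment no longer depends on the time-varying confounder.
sw = np.sqrt(iptw)
msm, *_ = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)
```

The same sketch also shows the weakness noted in the text: when a propensity is close to 0 or 1, its inverse produces extreme weights and numerically unstable estimates.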
Several methods based on Bayesian nonparametrics have been proposed to estimate the effects of longitudinal treatment assignments in discrete-time 71 and continuous-time settings. [72][73][74] These models, based on Gaussian processes, also make strong assumptions about how the treatments and patient features interact. As a result, the heterogeneous treatment effects resulting from baseline variables (e.g., genetic and demographic information) are either omitted 73 or incorporated as a linear component. 71 Lim et al. 75 propose Recurrent Marginal Structural Networks (RMSNs), which can estimate treatment effects in the discrete-time setting and can incorporate baseline patient features as well as model the progression of multiple clinical covariates and treatments of interest. RMSNs improve on standard marginal structural models by using recurrent neural networks to estimate the inverse probability of treatment weights and the counterfactual treatment outcomes. Through inverse probability of treatment weighting, these models create a pseudo-population where, at each timestep, the probability of treatment does not depend on the time-varying confounders. However, RMSNs are not robust to model misspecification in computing the sequential treatment probabilities and can also be numerically unstable due to extreme weights. Bica et al. 76 introduce the Counterfactual Recurrent Network (CRN), which builds balancing representations to remove the time-dependent confounding bias, thereby overcoming the problems of inverse probability of treatment weighting. CRN estimates individualized treatment effects in the discrete-time setting and allows for flexible modeling of the patient's baseline information and of the progression of multiple clinical variables and treatments of interest. We provide in Table 3 a comparison of the causal inference methods for estimating the effects of time-varying treatments.
Although these methods have been developed to be general and applicable to various types of observational datasets from EHRs, we describe several example applications for which these methods have been used. To begin with, Xu et al. 71 consider patients with sepsis in the intensive care unit and evaluate the ability of their model to estimate the individualized effect of intermittent hemodialysis and continuous renal replacement therapy on creatinine levels, a measure of kidney deterioration. Schulam and Saria 73 consider the same clinical scenario, and estimate the effect of intermittent hemodialysis, continuous venovenous hemofiltration, and continuous venovenous hemodialysis on creatinine levels. Alternatively, Bica et al. 76 also use observational data from the intensive care unit and evaluate the ability of their proposed CRN model to estimate the personalized effect of antibiotics on the white blood cell count. Note that these methods are usually evaluated on their ability to estimate the factual patient outcomes. However, as we discuss in the section on the role of RCTs and pharmacometric and QSP approaches in machine learning methods for causal inference, this is not enough for thorough validation of these methods.
These models for performing counterfactual estimation of future treatment outcomes can be used to answer critical medical questions, such as deciding when to give treatments, when to start and stop treatment regimes, as well as how to select from multiple treatments over time. However, the estimation of treatment effects over time also requires additional theoretical understanding and more principled methods of performing model selection.

EVALUATION OF CAUSAL INFERENCE METHODS
Evaluating causal inference methods and deploying them in practice still represents a challenge. In standard supervised learning, one can choose the best performing method by computing the prediction error on a validation dataset and selecting the model with the lowest error. Because we do not observe counterfactuals, for causal inference models it is not possible to compute the error in estimating the ground-truth causal effects on real data. An initial step for validating machine learning methods for causal inference involves using synthetic data. To create a synthetic observational dataset, parametric functions, such as polynomials or exponentials, are usually used to generate patient outcomes under different treatment options. 21,25 Confounding bias is introduced by assigning treatments according to Bernoulli or categorical random variables, with parameters depending on the patient characteristics. 23,25 In this synthetic setup, the causal inference methods can be evaluated based on their error in estimating all potential outcomes and the ground-truth causal effects. The sensitivity of the causal inference methods to various degrees of confounding bias can also be assessed by changing the parameters of the probability distributions used to assign treatments for the patients in the observational data generated for training. 25 Whereas this allows us to perform a preliminary evaluation and assess the robustness of the machine learning methods for estimating individualized treatment effects, synthetic data generated in this way will not capture the complexities of real-world data. We describe in the section on the role of RCTs and pharmacometric and QSP approaches in machine learning methods for causal inference how PK/PD models can be used to improve the synthetic evaluation of causal inference methods.
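The synthetic evaluation recipe described above can be sketched as follows (functional forms and parameters invented for illustration): generate both potential outcomes from polynomials, assign treatments with covariate-dependent Bernoulli probabilities to create confounding, and score a candidate estimator against the ground-truth individualized effects via a root-mean-squared error (often called PEHE in this literature).

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented synthetic benchmark: polynomial potential outcomes and a
# covariate-dependent Bernoulli treatment assignment (confounding bias).
n = 4000
x = rng.uniform(-1, 1, size=n)
y0 = x ** 2                                        # potential outcome without treatment
y1 = x ** 2 + 1.0 + 0.5 * x                        # potential outcome with treatment
p = 1 / (1 + np.exp(-2 * x))                       # assignment depends on x
w = rng.binomial(1, p)
y = np.where(w == 1, y1, y0) + rng.normal(scale=0.05, size=n)

# Candidate estimator: one quadratic response surface per treatment arm.
def fit_quadratic(xa, ya):
    X = np.column_stack([np.ones(len(xa)), xa, xa ** 2])
    beta, *_ = np.linalg.lstsq(X, ya, rcond=None)
    return beta

b0 = fit_quadratic(x[w == 0], y[w == 0])
b1 = fit_quadratic(x[w == 1], y[w == 1])

# Because the outcomes are simulated, the ground-truth individualized effect
# (here 1 + 0.5 x) is known, so the estimation error can be computed directly.
grid = np.linspace(-1, 1, 201)
G = np.column_stack([np.ones(grid.size), grid, grid ** 2])
ite_hat = G @ b1 - G @ b0
ite_true = 1.0 + 0.5 * grid
pehe = float(np.sqrt(np.mean((ite_hat - ite_true) ** 2)))
```

Varying the slope inside the assignment probability (here 2) changes the degree of confounding and so probes an estimator's robustness, as described above; none of this, however, substitutes for evaluation on data that reflects real-world complexity.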
Furthermore, for each observational dataset, several causal inference methods may be appropriate to use and, in fact, no model will perform best on all datasets. 73 Moreover, as discussed in Modeling choices and methods for individualized treatment effects, different modeling choices for the causal inference methods will be important depending on the size of the observational dataset available for training. 22 The theoretical guidelines provided by Alaa and van der Schaar 22 and validation through synthetic data enable us to narrow down the choice of suitable causal inference methods for the observational data available. Nevertheless, data-driven validation procedures, similar to cross-validation for standard machine learning tasks, 15 are needed to find the most suitable causal inference method for a given real-world study. We discuss further possible ways of performing internal model evaluation (selecting the causal inference model that performs best on a held-out test set of the available observational data) and external model evaluation (assessing causal inference methods once they have been deployed in practice).
For internal model evaluation, Alaa and van der Schaar 77 propose a method for assessing causal inference models by using influence functions (a technique from robust statistics and efficiency theory 78,79 ) to estimate the loss of machine learning models for causal inference without requiring counterfactual data. As illustrated in Figure S1, model selection is an essential first step before deploying any causal inference method in practice. The method proposed by Alaa and van der Schaar 77 allows us to approximate the error of causal inference methods on a held-out test set and subsequently enables us to select the best performing model for an observational dataset of interest. Although this method has been developed for the cross-sectional setting, future work could extend it to the setting of time-varying treatments.
Once a model has been deployed in practice, external model evaluation is needed. In this context, the machine learning method for causal inference can be evaluated by measuring whether the patient outcomes have been improved when assigning treatments according to the recommendation of the causal inference model.

ROLE OF RCTS AND PHARMACOMETRIC AND QSP APPROACHES IN MACHINE LEARNING METHODS FOR CAUSAL INFERENCE
Although relatively little work has been done in this direction, combining experimental data from RCTs with observational data from EHRs, as well as integrating pharmacometric and QSP approaches into machine learning models, can potentially lead to even better methods for estimating individualized treatment effects. For estimating individualized dose responses, experimental data from clinical trials in which a wider range of possible dosages is tested could be particularly important for improving causal inference methods. In this setting, Silva 80 proposed leveraging both interventional data from RCTs and observational data from EHRs to build a method based on Gaussian processes that can estimate individualized dose responses. Alternatively, Kallus et al. 81 propose using data from RCTs to adjust for the potential bias from hidden confounders that can be present in observational data.
The interplay between machine learning and pharmacometric and QSP approaches represents an important research frontier, which has also been explored by Ribba et al. 16 in the context of integrating PK/PD modeling into reinforcement learning algorithms for precision dosing. In addition, Koch et al. 82 illustrate the benefits of incorporating pharmacometrics into machine learning for classification of clinically relevant patient outcomes.
In the context of causal inference, we believe that a promising avenue for future work involves using PK/PD modeling to enforce constraints and priors on the machine learning methods. For instance, knowledge of the form of dose-response curves can be used to improve the design of the neural network architectures in refs. 25 and 67. This would be especially useful in the small-sample-size setting, where PK/PD modeling could be leveraged to enable faster learning. Furthermore, integrating PK/PD models into machine learning methods could also lead to more interpretable causal inference models, which can consequently allow for faster validation and adoption in practice.
Moreover, PK/PD modeling can also be used to evaluate and select causal inference methods. For instance, Lim et al. 75 proposed using a PK/PD model for non-small cell lung cancer, which models the evolution of tumor growth under the combined effects of chemotherapy and radiotherapy. 83 This PK/PD model has been used to obtain counterfactual outcomes and thus evaluate causal inference methods for estimating treatment effects over time.

LIMITATIONS AND FUTURE RESEARCH DIRECTIONS
We have yet to harness the full potential of observational data for clinical decision support. 15 Although the simple case of static, binary treatments has been well studied in the causal inference literature, EHRs contain information about significantly more complex treatment scenarios. In order to design more robust and effective machine learning methods for personalized treatment recommendations, it is vital that we gain a deeper theoretical understanding of the challenges and limitations of modeling multiple treatment options, combinations, and treatment dosages from observational data, in both the cross-sectional and time-series settings. In addition, developing machine learning methods that utilize both RCT and EHR data and integrating pharmacometric and QSP approaches into machine learning methods represent important research frontiers. Leveraging different data sources, such as RCTs and EHRs, to train machine learning methods provides more evidence for the effectiveness of treatments. 84,85 PK/PD approaches may allow us to introduce priors into machine learning methods and improve their interpretability and performance. 9,17,82 The potential to use PK/PD models to drive machine learning models or to support their validation are important aspects that require further collaboration across the quantitative sciences.
In practice, incorporating machine learning models into clinical decision support systems entails challenges of its own. Causal inference methods for estimating treatment effects from observational data require thorough medical validation before real-world deployment. We refer the readers to the work of Eichler et al. 86 for a more thorough discussion of the requirements for making machine learning methods acceptable to decision makers and regulators. In particular, domain knowledge is critical for providing assurance that the unconfoundedness assumption actually holds in the observational dataset of interest, and the overlap assumption also requires testing. At the end of the day, because we never observe any counterfactuals, expert knowledge is always needed to validate the model's estimates of the treatment effects. 9 Furthermore, we must ensure that the patient population used to train the model is representative of the patient population on which the model is to be deployed. If this is not the case, additional methods are needed to ensure that causal inference models are robust to distributional shifts in the patient population.
Equally important for clinical adoption is understanding which patient features are the strongest drivers of treatment response, particularly when patient data are high-dimensional. Developing more sophisticated and accessible tools for this can lead to a better understanding of diseases, or even discoveries about better treatments. In addition, finding relevant patient features for estimating treatment effects can also lead to more transparent, interpretable models. Finally, uncertainty estimates are also beneficial for assessing the model's confidence in its predictions. After all, if the patient outcomes estimated by a model are highly variable, it becomes difficult to trust the model safely and consistently for treatment decisions.
Future work addressing these aforementioned research directions will lead to better and more reliable machine learning methods capable of estimating individualized patient responses for complex treatment scenarios. It is our goal that these methods be safely and effectively integrated into the clinical decision support process. It is only by integrating data generated in RCTs and EHRs and integrating the corresponding models that society will fully benefit from the opportunity that presents itself to us. We can only achieve that by collaborating across the quantitative disciplines.

SUPPORTING INFORMATION
Supplementary information accompanies this paper on the Clinical Pharmacology & Therapeutics website (www.cpt-journal.com).

FUNDING
The work in this paper was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and the National Science Foundation (NSF) grant 1722516.