Assessment of an In Silico Mechanistic Model for Proarrhythmia Risk Prediction Under the CiPA Initiative

The International Council for Harmonisation (ICH) S7B and E14 regulatory guidelines are sensitive but not specific for predicting which drugs are pro-arrhythmic. In response, the Comprehensive In Vitro Proarrhythmia Assay (CiPA) was proposed, which integrates multi-ion channel pharmacology data in vitro into a human cardiomyocyte model in silico for proarrhythmia risk assessment. Previously, we reported the model optimization and proarrhythmia metric selection based on CiPA training drugs. In this study, we report the application of the prespecified model and metric to independent CiPA validation drugs. Over two validation datasets, the CiPA model meets all pre-specified performance measures for ranking and classifying validation drugs, and outperforms alternatives, despite some in vitro data differences between the two datasets due to different experimental conditions and quality control procedures. This suggests that the current CiPA model/metric may be fit for regulatory use, and that standardization of experimental protocols and quality control criteria could increase prediction accuracy even further.


Background
Under the Comprehensive In Vitro Proarrhythmia Assay (CiPA) Initiative, the candidate model (CiPAORdv1.0) (1) and metric (qNet/Torsade Metric Score) established from training data were used to predict the TdP risk levels of the 16 validation drugs, and the prediction performance was assessed according to the CiPA In Silico Model Validation Strategy (hereinafter referred to as Validation Strategy), which was prespecified prior to validation. This report contains the technical details and comprehensive results of this validation study.

Model, Metric, and Validation Datasets
The model (CiPAORdv1.0) and metric (qNet/Torsade Metric Score) were all "frozen" in previous publications based on training data (1)(2)(3), and documented in the Validation Strategy prior to validation. As with the training data, two semi-independent validation datasets, one manual and one hybrid, were generated for the same set of 16 CiPA drugs. The two thresholds that classify drugs into three TdP risk categories were also calculated (see section 6 of this document for the calculation method) and "frozen" based on the training data. For the manual dataset, Threshold 1 (separating Low from Intermediate/High Risk) has a value of 0.0689 µC/µF and Threshold 2 (separating High from Intermediate/Low Risk) has a value of 0.0579 µC/µF. For the hybrid dataset, the two thresholds were calculated as 0.0671 and 0.0581 µC/µF, respectively.
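As a minimal illustration (not part of the frozen R implementation), the decision rule implied by these two thresholds can be sketched as follows; lower TMS corresponds to higher predicted risk, and the default threshold values are the manual-dataset values quoted above.

```python
def classify_tms(tms, threshold1=0.0689, threshold2=0.0579):
    """Assign a TdP risk category from a torsade metric score (TMS).

    Lower TMS means higher predicted risk: values below Threshold 2 are
    High Risk, values above Threshold 1 are Low Risk, and values in
    between are Intermediate Risk. Defaults are the manual-dataset
    thresholds (in uC/uF) quoted in the text.
    """
    if tms < threshold2:
        return "High"
    elif tms > threshold1:
        return "Low"
    else:
        return "Intermediate"
```

For example, a drug with a TMS of 0.065 µC/µF would fall between the two manual-dataset thresholds and be classified as Intermediate Risk.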
The different experimental conditions for the manual and hybrid datasets are listed below in Table 1. The detailed experimental procedures used to generate these two datasets are provided in section 3 of this document. For whole-cell voltage clamp recordings, the command potential values were corrected for the 15 mV liquid junction potential that resulted from using the above internal solution. For protocols involving step voltage waveforms, signals were filtered at 2-3 kHz, digitized using a Digidata 1550A1 interface (Molecular Devices, CA) at 5-10 kHz, and transferred to a computer using pClamp10 software (Molecular Devices, CA). Seal resistance ranged from 2-10 GΩ, and series resistance was electronically compensated at 80%. hERG current was evoked with a depolarizing voltage step from -80 mV to 0 mV for 10 s, applied every 25 s. To evaluate the effects of drugs on hERG current, baseline hERG current stability was established prior to drug application for every cell. Each cell was exposed to only one concentration of drug.
Nav1.5 (Nav1.5-CHO; SCN5A; Charles River Laboratories (Wilmington, MA); Catalog #CT6007) ion channel cell lines were constructed as described previously (6). Currents were recorded on an IonWorks Barracuda (IWB; Molecular Devices, CA) in Population Patch-Clamp™ (PPC) mode as previously described (7). The extracellular solution was loaded into the PPC plate wells (11 µL/well) and a cell suspension was added into the wells (9 µL/well). After establishment of a whole-cell configuration (7-min perforation), membrane currents were recorded by the IWB on-board patch clamp amplifiers.

Uncertainty Quantification of Experimental Data
The manual hERG dynamic data were preprocessed as previously described (3); see Table 2 for an explanation.

TdP Risk Levels and Clinical Exposure of Validation Drugs (Cmax)
The known TdP categories and Cmax (maximum free therapeutic concentration) for validation drugs are listed in Table 5 below. For droperidol, only the total therapeutic Cmax was reported in the literature, with a value of 60 ng/ml (10). No protein binding assays were reported for this drug, but one study reported the human serum albumin association constant (Ka) for a series of psychoactive butyrophenones including droperidol (11).

Notes to Table 5:
**Consistent with (1), we used 100 nM instead of 140 nM as ibutilide's Cmax in calculating its qNet and torsade metric score. This is not projected to change the prediction results.
***The Redfern paper only listed the total Cmax of droperidol as 60 ng/ml. We calculated the (free) Cmax here based on this total Cmax and one study reporting the binding association constant between droperidol and human serum albumin (11). See main text above for details.
****Metoprolol has a maximum total Cmax of 221 nM at a dose of 50 mg (14) and low plasma protein binding, i.e. a free fraction of ~90% (13). According to the FDA label (https://www.accessdata.fda.gov/drugsatfda_docs/label/2008/017963s062,018704s021lbl.pdf), the maximum daily dose is 450 mg and there is a linear relationship between oral dose and plasma concentration up to 400 mg. A free Cmax of ~1800 nM was calculated assuming a linear PK profile up to a single oral dose of 450 mg.
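The metoprolol extrapolation is simple arithmetic and can be reproduced as follows (a sketch using the values quoted in the note; the ~90% free fraction is what the stated total Cmax and the ~1800 nM free Cmax jointly imply):

```python
# Back-of-the-envelope extrapolation of metoprolol's free Cmax,
# reproducing the arithmetic described in the Table 5 note.
total_cmax_50mg_nM = 221.0   # maximum total Cmax at a 50 mg dose (14)
free_fraction = 0.90         # metoprolol is only weakly protein bound (13)
max_dose_mg = 450.0          # maximum labeled daily dose
ref_dose_mg = 50.0           # dose at which total Cmax was measured

# Assume a linear PK relationship between oral dose and plasma concentration.
free_cmax_nM = total_cmax_50mg_nM * free_fraction * (max_dose_mg / ref_dose_mg)
print(round(free_cmax_nM))  # prints 1790, i.e. the ~1800 nM used in the text
```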

Modeling Procedures

Human Ventricular Model and TdP Risk Metric
The numerical methods and simulation procedures were published in detail previously (1).
Briefly, the optimized IKr-dynamic ORd model, which includes the dynamic hERG model as a component (3) and was developed using training data (2), was frozen as the CiPAORdv1.0 model for validation in this study. Model equations were written in C and compiled for use with R (https://www.r-project.org) and the deSolve package (8). Equations were integrated using the lsoda solver with relative and absolute error tolerances both set at 10^-6.
To calculate the torsade metric score (TMS), the CiPAORdv1.0 model was used to simulate action potentials (APs) at a cycle length of 2 s (to mimic bradycardia) for 1000 beats, starting from control conditions achieved by pacing without drug 1000 times to reach steady state (1). For each drug, 4 concentrations (1-4x Cmax) were used to calculate the metrics, which were then averaged across concentrations to obtain a mean metric value for each drug. For qNet, this mean value is the torsade metric score (TMS). Other metrics (such as APD90) were calculated in a similar way, averaging across 1-4x Cmax.
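The averaging step can be sketched as follows (illustrative Python; the actual pipeline is the R/C implementation described above, and the per-concentration values in the example are hypothetical):

```python
def torsade_metric_score(qnet_by_conc):
    """Average a per-concentration metric (e.g. qNet) over the simulated
    concentrations (1-4x Cmax) to obtain the torsade metric score (TMS).

    `qnet_by_conc` maps the concentration (as a multiple of Cmax) to the
    metric value computed from the simulated beat at that concentration.
    """
    values = [qnet_by_conc[mult] for mult in (1, 2, 3, 4)]
    return sum(values) / len(values)

# Hypothetical per-concentration qNet values (uC/uF) for one drug:
tms = torsade_metric_score({1: 0.070, 2: 0.062, 3: 0.055, 4: 0.049})
```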
The qNet metric (2) was computed by integrating the sum of six major currents (IKr, ICaL, INaL, Ito, IKs, and IK1) over one of the last 250 beats, namely the one with the steepest reactivation of the membrane potential during the plateau phase (as this is the "worst case scenario" in terms of EAD generation) (1). APD90 and APD50 were defined as the times it takes an AP to repolarize by 90% and 50%, respectively, between the peak of the AP and the resting membrane potential.
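The integration step behind qNet can be sketched as below (a Python sketch of trapezoidal integration over one beat; in the study itself the current traces come from the CiPAORdv1.0 simulation in R/C, so this is illustrative only):

```python
def qnet(time_ms, inet):
    """Integrate the net current Inet = IKr + ICaL + INaL + Ito + IKs + IK1
    over one beat by the trapezoidal rule, giving a charge per unit
    capacitance (time in ms, current in uA/uF -> qNet in uC/uF).
    """
    total = 0.0
    for i in range(1, len(time_ms)):
        dt = time_ms[i] - time_ms[i - 1]
        total += 0.5 * (inet[i] + inet[i - 1]) * dt
    return total

# Toy trace: a triangular net-current pulse over a 2 ms window.
q = qnet([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])  # area of the triangle = 1.0
```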

Calculating Classification Thresholds from Training Data
To classify drugs into distinct TdP risk categories, a classifier was trained on the TMS distributions of all 12 training drugs using ordinal logistic regression, implemented in the R package rms (https://CRAN.R-project.org/package=rms). For how the inter-correlation between samples of the same drug was handled, please refer to the prespecified Validation Strategy.
The classifier reports the probability of each sample belonging to each of the three risk categories, and each sample was assigned to the category with the highest probability. To convert this decision rule into classification thresholds for visualization, the following procedure was used.
The ordinal logistic regression can be written as

P(Y >= j | TMS) = 1 / (1 + exp(-(alpha_j + beta * TMS))),

where the risk categories are ordered High < Intermediate < Low, alpha_1 is the intercept for Y >= Intermediate, alpha_2 is the intercept for Y >= Low, and beta is the slope. Setting P(Low) = P(Intermediate) and solving for TMS gives

Threshold 1 = log(exp(-alpha_2) - 2*exp(-alpha_1)) / beta,

and the corresponding Threshold 2 can be calculated by setting P(High) = P(Intermediate):

Threshold 2 = log(exp(-alpha_1)*exp(-alpha_2) / (exp(-alpha_2) - 2*exp(-alpha_1))) / beta.

After this calculation, a drug sample with its TMS to the right of (greater than) Threshold 1 has its highest probability in the Low Risk category, while a drug sample whose TMS is to the left of (less than) Threshold 2 has its highest probability in the High Risk category. A drug sample whose TMS is between Thresholds 1 and 2 has its highest probability in the Intermediate Risk category.
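The threshold calculation can be sketched in Python; this sketch assumes the cumulative-logit convention P(Y >= j | TMS) = 1/(1 + exp(-(alpha_j + beta*TMS))) with categories ordered High < Intermediate < Low, and the parameter values in the usage example are hypothetical, not the fitted CiPA coefficients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def thresholds(alpha1, alpha2, beta):
    """Closed-form decision boundaries of the ordinal logistic model
    P(Y >= j | TMS) = sigmoid(alpha_j + beta*TMS), categories ordered
    High < Intermediate < Low (alpha1: Y >= Intermediate, alpha2: Y >= Low).
    """
    a, b = math.exp(-alpha1), math.exp(-alpha2)
    t1 = math.log(b - 2.0 * a) / beta            # Low vs Intermediate boundary
    t2 = math.log(a * b / (b - 2.0 * a)) / beta  # High vs Intermediate boundary
    return t1, t2

def category_probs(tms, alpha1, alpha2, beta):
    """Per-category probabilities implied by the cumulative logits."""
    p_ge_int = sigmoid(alpha1 + beta * tms)
    p_ge_low = sigmoid(alpha2 + beta * tms)
    return {"High": 1.0 - p_ge_int,
            "Intermediate": p_ge_int - p_ge_low,
            "Low": p_ge_low}

# Hypothetical coefficients: at each threshold the two adjacent
# category probabilities are equal, by construction.
t1, t2 = thresholds(3.0, 0.5, 10.0)
```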

Performance Measures
The model performance measures on the validation drugs were pre-specified in Validation Strategy, where the rationale and analytical methods for these measures are also provided.

Receiver Operating Characteristic as a Rank Performance Measure
The Receiver Operating Characteristic (ROC) curve was constructed by taking all possible TMS cutoffs (thresholds) for the 16 validation drugs and plotting sensitivity vs. 1-specificity. For ROC1, sensitivity is the proportion of true High-or-Intermediate Risk drugs predicted to be so, while specificity is the proportion of true Low Risk drugs predicted to be so. For ROC2, sensitivity is the proportion of true High Risk drugs predicted to be so, while specificity is the proportion of true Intermediate-or-Low Risk drugs predicted to be so. The area under the curve (AUC) of each ROC was calculated with the R package ROCR (15). The representative ROC curves, as well as the distribution of AUCs across 10000 curves for the ROC1 analysis, are shown for the manual (Fig 1A) and hybrid (Fig 1B) validation datasets respectively. The results suggest that the probability of ranking a High-or-Intermediate Risk drug above (torsade metric score lower than) a Low Risk drug using CiPAORdv1.0 is 0.89 (95% CI 0.84-0.95) and 0.98 (0.93-1) for the manual and hybrid validation datasets, respectively. Similarly, for the ROC2 analysis (Fig 2A&B), the probability of ranking a High Risk drug above (torsade metric score lower than) an Intermediate-or-Low Risk drug is 1 (0.92-1) and 0.94 (0.88-0.98) for the manual and hybrid validation datasets, respectively.
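The AUC quoted here has the usual rank interpretation. A minimal sketch (pure Python, using the Mann-Whitney identity rather than the ROCR package used in the study) of how AUC reduces to the probability that a riskier drug has the lower TMS:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive (higher-risk) drug has a
    lower TMS than a negative (lower-risk) drug, ties counting 1/2.
    By the Mann-Whitney identity this equals the area under the ROC
    curve traced out by sweeping all possible TMS cutoffs.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p < n:          # riskier drug correctly ranked above
                wins += 1.0
            elif p == n:       # tie
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical TMS values: positives (High/Intermediate) vs negatives (Low).
auc = auc_from_scores([0.01, 0.02], [0.05, 0.06])  # perfectly separated -> 1.0
```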
The median values of all these AUCs exceed or are very close to the pre-defined "excellent" ranking performance level (AUC > ~0.9) (Table 1 of Validation Strategy). The results of the pairwise comparison analysis are shown for the manual (Fig 3A) and hybrid (Fig 3B) validation datasets respectively. This analysis indicates that the probability of correctly ranking a drug through pairwise comparison against the CiPA reference drugs is 0.95 (95% CI 0.92-0.98) and 0.96 (0.92-0.99) for the manual and hybrid datasets respectively, well above the prespecified "excellent" performance level (> ~0.9). In addition, the Validation Strategy specifies a measure where each validation drug's predicted TdP risk rank among all 28 CiPA reference drugs is compared to the true rank.
The data are shown in Table 6 and show a high level of consistency and low variability (relatively narrow 95% CIs). For each validation drug, the known CiPA TdP risk rank against all 28 reference drugs is compared to the predicted risk rank for the manual and hybrid datasets. The most dangerous drug has a rank of 1, while the safest drug has a rank of 28. Note that because within-category ranking is unknown, each drug has a range of possible risk ranks (column 2): High Risk drugs have a range of 1-8, while Intermediate and Low Risk drugs have ranges of 9-19 and 20-28 respectively. The medians and 95% CIs of the predicted ranks for the manual and hybrid validation datasets are shown in columns 3&4. The 95% CIs are based on the 10000 subsets sampled from the probability distribution of TMS.

Likelihood Ratio as a Classification Performance Measure
The likelihood ratio for positive results (LR+) was calculated as sensitivity divided by (1 - specificity), while the likelihood ratio for negative results (LR-) was (1 - sensitivity) divided by specificity. To prevent the denominator from becoming 0, a very small number, randomly sampled from a normal distribution (mean 1x10^-6 and standard deviation 1x10^-12), was added to both the numerator and the denominator. In the calculation of LR+ and LR- for Threshold 1, the definitions of sensitivity and specificity are the same as in ROC1, while for Threshold 2 they are the same as in ROC2. Note, though, that unlike the ROC analysis, where many thresholds were tried to calculate a series of sensitivity and specificity values, for the LR analysis sensitivity and specificity were based on the specific thresholds (Thresholds 1&2 as defined in section 2 of this document) pre-defined by training data.
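A minimal sketch of this calculation, including the tiny random offset that guards against division by zero (illustrative Python; the study's computation was done in R):

```python
import random

def likelihood_ratios(sensitivity, specificity, rng=random.Random(0)):
    """LR+ = sensitivity / (1 - specificity) and
    LR- = (1 - sensitivity) / specificity, with a tiny random offset
    (mean 1e-6, sd 1e-12) added to numerator and denominator so that
    perfect sensitivity or specificity does not cause division by zero.
    """
    def eps():
        return rng.gauss(1e-6, 1e-12)
    lr_pos = (sensitivity + eps()) / ((1.0 - specificity) + eps())
    lr_neg = ((1.0 - sensitivity) + eps()) / (specificity + eps())
    return lr_pos, lr_neg

# With sensitivity 0.9 and specificity 0.8: LR+ ~ 4.5, LR- ~ 0.125.
lr_pos, lr_neg = likelihood_ratios(0.9, 0.8)
```

With perfect specificity the offset keeps LR+ finite but very large, which is how values like 8x10^5 arise in Table 7.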
The resulting values (Table 7) suggest that, using Threshold 1 as a cutoff, a High-or-Intermediate Risk drug is 4.5 and 8x10^5 times (median of LR+) more likely to be classified into the High-or-Intermediate category, but 8.8 and 5.5 times (median of 1/LR-) less likely to be classified into the Low Risk category, compared to a Low Risk drug, for the manual and hybrid validation datasets, respectively. Similarly, using Threshold 2, a High Risk drug is 12 and 6 times more likely to be classified into the High category, but 9x10^5 and 3.7 times less likely to be classified into the Intermediate-or-Low Risk category, compared to an Intermediate-or-Low Risk drug, for the manual and hybrid datasets, respectively. These measures exceed or are very close to the "good" classification performance levels pre-defined by the Validation Strategy (LR+ and 1/LR- > ~10 for excellent and > ~5 for good performance). Medians and 95% CIs of the LR+ and LR- values are shown in Table 7. LR+ indicates how much more likely a higher-risk drug is to be classified into the higher-risk category compared to a lower-risk drug, while the inverse of LR- indicates how much less likely a higher-risk drug is to be classified into the lower-risk category compared to a lower-risk drug.

Mean Classification Error as a Classification Performance Measure
For the calculation of mean classification error, each drug's TMS samples were divided into three fractions by the frozen classification thresholds: one fraction classified as High Risk (left of, i.e. less than, Threshold 2), one fraction classified as Intermediate Risk (between Thresholds 1 and 2), and the last fraction classified as Low Risk (right of, i.e. greater than, Threshold 1). The distribution using the manual validation dataset is shown in column 3, while that for the hybrid dataset is shown in column 4.
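The split of a drug's TMS samples into per-category fractions can be sketched as follows (illustrative Python; default thresholds are the manual-dataset values, and the sample values in the usage example are hypothetical):

```python
def category_fractions(tms_samples, threshold1=0.0689, threshold2=0.0579):
    """Split a drug's TMS samples into the fractions falling into each
    risk category: High (less than Threshold 2), Intermediate (between
    the two thresholds), and Low (greater than Threshold 1).
    """
    n = len(tms_samples)
    high = sum(1 for t in tms_samples if t < threshold2) / n
    low = sum(1 for t in tms_samples if t > threshold1) / n
    return {"High": high, "Intermediate": 1.0 - high - low, "Low": low}

# Hypothetical TMS samples for one drug (uC/uF):
fractions = category_fractions([0.05, 0.06, 0.06, 0.08])
```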

Comparing To Alternative Metrics
The Leave-One-Out Cross Validation (LOOCV) used to compare different metrics was done similarly to before (1). For each metric, a probability distribution characterized by 2000 sample values per drug was estimated using the UQ method we developed for qNet/TMS above. An ordinal logistic regression classifier was trained on all samples from each metric using all but one left-out drug, and was then used to predict the TdP risk category of all 2000 samples of that left-out drug. As is typical for LOOCV, both ranking and classification were performed based on the predicted risk category probabilities. Of note, this is different from the validation of the TMS metric in section 7, where ranking is based on the continuous TMS values directly, without introducing any statistical model to predict probabilities.
For ROC1, the ranking was based on the predicted probabilities of the High and Intermediate Risk categories (P(High)+P(Intermediate) as the basis for ranking drugs), while ROC2 was based on the predicted probability of the High category (P(High) as the basis for ranking drugs). For the pairwise comparison, the predicted ranking was compared to the true ranking of risk categories between the two drugs in a pair. The predicted pairwise ranking was determined by the following rules: if the two drugs are predicted into different categories (the highest probabilities of the two drugs are in two different categories), then the ranking was based on the predicted categories; if two drugs are both predicted into the High category, then the predicted ranking was based on their probabilities in the High Risk category; if two drugs are both predicted into the Intermediate or the Low category, then the predicted ranking was based on their probabilities in the Low Risk category.
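These pairwise-ranking rules can be made concrete with a short sketch (illustrative Python; probability dictionaries in the examples are hypothetical):

```python
def predicted_category(probs):
    """Category with the highest predicted probability."""
    return max(probs, key=probs.get)

def ranks_higher_risk(probs_a, probs_b):
    """Apply the pairwise-ranking rules from the text to decide whether
    drug A is predicted riskier than drug B, given each drug's category
    probabilities as a dict with keys 'High', 'Intermediate', 'Low'.
    """
    order = {"High": 0, "Intermediate": 1, "Low": 2}
    cat_a, cat_b = predicted_category(probs_a), predicted_category(probs_b)
    if cat_a != cat_b:
        # Rule 1: different predicted categories -> rank by category.
        return order[cat_a] < order[cat_b]
    if cat_a == "High":
        # Rule 2: both High -> larger P(High) is riskier.
        return probs_a["High"] > probs_b["High"]
    # Rule 3: both Intermediate or both Low -> smaller P(Low) is riskier.
    return probs_a["Low"] < probs_b["Low"]
```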
The likelihood ratios (LR) for LOOCV were calculated in the following manner. For Threshold 1, positive drugs are those whose highest predicted probability is in the High or Intermediate category (P(High) > P(Low) or P(Intermediate) > P(Low)), while negative drugs are those predicted as Low Risk (P(Low) > P(High) and P(Low) > P(Intermediate)). For Threshold 2, positive drugs are those whose highest predicted probability is in the High Risk category (P(High) > P(Low) and P(High) > P(Intermediate)), while negative drugs are those predicted as Intermediate or Low Risk (P(Low) > P(High) or P(Intermediate) > P(High)). The mean classification error of LOOCV for the 28 drugs was calculated the same way as in the validation study using 16 drugs.
The comparison of qNet/torsade metric score to various alternative metrics is given in Tables 9-11 below. In Table 9, three metrics are compared: qNet/torsade metric score, APD90 (16), and APD50 & diastolic Ca concentration (17). Note that the latter two were originally developed using different experimental protocols and simulation/statistical models, whereas here all three are computed by CiPAORdv1.0 with drug binding dynamic parameters on hERG and block potency parameters on non-hERG currents. In Table 10, the same three metrics are compared, but computed by the CiPAORdv1.0 model using drug block potency parameters (IC50s and Hill coefficients) for all 4 essential currents. For all the simulations in Tables 9&10, 1-4x Cmax was used for metric calculations. For Table 11, two simple statistical models were used without introducing any physiological model: the MICE model used IC50s for ICaL and IKr/hERG (12), while the Bnet metric used IC50s and Hill coefficients to calculate the percentage of block of both depolarizing and repolarizing currents at 1x therapeutic free concentration (1x Cmax) (18). For the calculation of the metrics in Tables 10-11, the non-hERG block potency parameters (IC50s and Hill coefficients) for INaL, INa, and ICaL were listed in Tables 2&4 before. hERG block potency parameters were not listed before because dynamic hERG binding parameters were used instead (Table 3), so they are listed in Table 12. The probabilities of being classified into each of the three risk categories for all 28 drugs based on LOOCV are listed for the various dataset and metric combinations in Tables 13-20. Overall, the CiPA metric (torsade metric score) calculated by the CiPAORdv1.0 model (column 1 of Table 9) outperforms all other metrics (all other columns in Tables 9-11), especially on the measures that evaluate across all three categories (Pairwise Comparison Correct Rate for ranking, and Mean Classification Error for classification).
Table 10 shows the same comparison as Table 9 except that for hERG block, drug potency parameters (IC50s and Hill coefficients) were used in place of binding dynamic parameters. The hERG block potency parameters can be found in Table 12, and the non-hERG block potency parameters were listed in Tables 2&4 before. For the manual dataset, the hERG potency parameters were calculated from Crumb et al. (5). For the HTS dataset, the hERG potency parameters were calculated from Site 6 high-throughput data. Note that, other than the hERG data used being different, the manual and HTS datasets in this table are the same as the manual and hybrid datasets in Table 9.

Conclusions
Over two pre-specified validation datasets, the candidate CiPA model and metric predefined by training data meet all pre-specified performance measures and outperform all alternative metrics. This suggests that the CiPAORdv1.0 model and the qNet/torsade metric score proarrhythmia metric are fit for regulatory use under the new CiPA cardiac safety assessment paradigm.