De Econometrist takes a statistical look at the world.
Credit risk modeling has been widely adopted as an essential part of modern risk management. Financial institutions are obligated by the European Central Bank (ECB) to model the risk of default of their accounts. They are allowed to develop and implement credit risk models independently, but otherwise have to adhere to the models prescribed by the Basel frameworks. If calibrated well, credit risk models provide a useful instrument to guide financial institutions in their risk assessment of accounts. Inadequate performance of such models, however, poses a risk in itself, as it may lead to misjudgment of potential accounts and thereby damage the business.
In light of the increasing use of such models to facilitate data-driven decision making, model testing has become critical. In our previous article we discussed the general facets of model testing. In this article we will focus specifically on the testing of credit risk models. To assess the performance of a credit risk model, multiple facets are evaluated (see Figure 1).
Figure 1: Model testing
Each of these facets will be elaborated on in the following chapters. Since data quality plays an important role in model input, output backtesting and portfolio testing, we dedicate extra attention to it in a separate chapter.
Evaluation of data quality lies at the basis of model testing, since low-quality data can diminish the validity of all test outcomes. Quality here does not refer to having the ‘right’ data – which in itself is naturally crucial – but rather to the condition of the data at hand. Assessment of data quality aims to validate the appropriateness, accuracy, completeness, timeliness, uniqueness and consistency of the data.
Figure 2: Data quality
Appropriateness is tested to verify that the same data types (e.g. continuous versus categorical) are used throughout model development and testing. Accuracy concerns the plausibility of the gathered data. Completeness comprises assessing the quantity and consequences of missing data, while uniqueness assesses the number of duplicate records caused by human or computing error. Finally, consistency concerns the internal integrity of the data and the processes involved.
Model input concerns all explanatory variables (i.e. risk drivers) that are fed into the model to predict the outcome in question. Model input testing aims to validate the appropriateness and predictive value of these risk drivers. Assessment of the risk drivers comprises evaluating their stability, concentration and interdependency.
Figure 3: Model input testing metrics
Stability tests assess the difference in the distribution of model input parameters between different time periods. To do so, the present data is commonly compared to the data used during development or in previous rounds of model testing. This enables one to identify changes in the distribution of input parameters to which the model may need to be adjusted. Importantly, a lack of stability may cause future malperformance or explain existing malperformance.
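One widely used stability metric is the Population Stability Index (PSI), which compares the bucketed distribution of a risk driver between two periods. The sketch below is a minimal Python implementation; the synthetic samples stand in for a development sample and a monitoring sample, and the rule-of-thumb thresholds (0.1, 0.25) are conventions rather than regulatory requirements.

```python
import numpy as np

def population_stability_index(expected, actual, n_buckets=10):
    """Bucket both samples on the deciles of the development (expected)
    sample and sum (a - e) * ln(a / e) over the bucket shares."""
    edges = np.percentile(expected, np.linspace(0, 100, n_buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero in sparse buckets
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

rng = np.random.default_rng(0)
dev = rng.normal(0, 1, 10_000)        # development sample
stable = rng.normal(0, 1, 10_000)     # same distribution
shifted = rng.normal(0.5, 1, 10_000)  # shifted distribution

print(round(population_stability_index(dev, stable), 3))   # close to zero
print(round(population_stability_index(dev, shifted), 3))  # substantially larger
```

A PSI near zero indicates a stable risk driver; larger values flag a distributional shift worth investigating before it translates into malperformance.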
Concentration tests assess whether unwarranted concentration is present in each risk driver. This is achieved by first clustering the data into buckets (e.g. using deciles), setting value thresholds within each individual input feature, and then checking whether observations pile up in a small set of buckets. High concentration in input variables reduces the discriminatory power of a model as a consequence of the low contrast between data points.
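One simple way to quantify such concentration – a sketch, not a prescribed test – is to bucket a risk driver into deciles and compute a Herfindahl-type index over the bucket shares. The "lumpy" variable below is synthetic, constructed so that most observations share a single value.

```python
import numpy as np

def herfindahl_index(values, n_buckets=10):
    """Bucket a risk driver on its deciles and measure concentration as
    the sum of squared bucket shares (1/n_buckets = evenly spread,
    1 = all observations in one bucket)."""
    edges = np.unique(np.percentile(values, np.linspace(0, 100, n_buckets + 1)))
    counts = np.histogram(values, bins=edges)[0]
    shares = counts / counts.sum()
    return np.sum(shares ** 2)

rng = np.random.default_rng(1)
spread = rng.normal(size=5_000)  # well spread out over the buckets
# 90% of observations take the same value: heavy concentration
lumpy = np.where(rng.random(5_000) < 0.9, 0.0, rng.normal(size=5_000))

print(herfindahl_index(spread))  # close to 1/10: no concentration
print(herfindahl_index(lumpy))   # much higher: mass piles up in one bucket
```

Values well above 1/n_buckets signal low contrast between data points and hence a risk driver with limited discriminatory contribution.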
Correlation tests assess the pairwise correlation between risk drivers to gain insight into their possible reinforcing or negating effects. In addition to correlation tests, interaction tests can be performed to examine the interdependency of risk drivers. This helps in identifying interaction effects between input factors that could otherwise be overlooked, especially in the case of linear models.
In addition to the above, trend graphs provide a good tool to perform a visual check for any trends potentially developing in a risk driver. Even if not of a quantitative nature, trend graphs tend to be a valuable tool to spot developments that might not show up in other test results.
Figure 4: Portfolio testing metrics
Portfolio testing, which is in many ways similar to model input testing, is performed to evaluate the stability and concentration of the output data as well as developing trends in the output. The same tests used for input testing are performed for portfolio testing, but this time to identify and assess significant fluctuations in the model output. Even in scenarios where input testing shows adequate stability of the input parameters, the model output may still fluctuate: small changes in input parameters may lead to large changes in model output due to the non-linearity of a model. Unwarranted changes in model output may explain present malperformance or highlight the risk of future malperformance.
Figure 5: Model output testing metrics
Model output testing concerns the evaluation of model performance, which is generally validated by comparing the predicted output with the historically realized output. Model output testing aims to validate, first, the similarity of the predicted output to the realized output and, second, the predictive value of the modeled output.
Similarity of the predicted output to the realized output is, particularly in the case of credit risk models, assessed by a backtest in which accuracy measures are computed. These accuracy metrics quantify the overlap between the predicted and realized output values. Note, however, that a model with high accuracy does not necessarily hold any predictive power. Particularly for outcome variables with low contrast – that is, when the vast majority of observations share the same outcome – a model biased to always predict that outcome will have relatively high accuracy, but no predictive value whatsoever.
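This accuracy paradox is easy to demonstrate on synthetic data: with a 2% default rate (a made-up figure for illustration), a model that always predicts "no default" scores high on accuracy while catching no defaults at all.

```python
import numpy as np

rng = np.random.default_rng(3)
# Low-contrast portfolio: only ~2% of accounts default
y_true = (rng.random(10_000) < 0.02).astype(int)
y_naive = np.zeros_like(y_true)  # always predict "no default"

accuracy = np.mean(y_true == y_naive)
sensitivity = y_naive[y_true == 1].mean()  # fraction of defaults caught

print(f"accuracy:    {accuracy:.1%}")    # high, looks excellent
print(f"sensitivity: {sensitivity:.1%}")  # 0.0%: no predictive value
```

This is why accuracy measures are always complemented by discriminatory power tests such as the ones discussed next.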
To assess the predictive value of a credit risk model, discriminatory power tests are performed to assess the model’s ability to distinguish between high and low risk clients. Various options exist to test discriminatory power.
A tandem of tests that efficiently covers this topic is the pair of sensitivity and specificity tests. The sensitivity test expresses the proportion of all positives for which the model correctly predicted their positive state, i.e. the proportion of all high-risk accounts that are correctly identified by the model. The specificity test complements the sensitivity test by indicating the proportion of all low-risk accounts that are correctly predicted to be low risk by the model. An ideal model would score close to 100% on both tests, that is to say, the model would be able to distinguish low-risk accounts from high-risk accounts. The added value of combining these tests is that together they are less biased by false positives and false negatives.
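Both measures follow directly from the confusion matrix. A minimal sketch on a toy set of predictions (where 1 marks a high-risk account):

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens)  # 0.75: 3 of 4 high-risk accounts identified
print(spec)  # ~0.67: 4 of 6 low-risk accounts identified
```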
The sensitivity and specificity can be combined to plot a receiver operating characteristic (ROC) curve for different discrimination thresholds, in which the area under the curve indicates the discriminatory power of a model. The larger the area under the curve, the better the discriminatory power of the model.
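The area under the ROC curve can also be computed directly as a rank statistic: the probability that a randomly chosen high-risk account receives a higher score than a randomly chosen low-risk account. A sketch on synthetic scores, where the amount of signal in the score is an assumption of the example:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the rank statistic: the probability that a random
    defaulter scores higher than a random non-defaulter (ties
    count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(4)
labels = (rng.random(2_000) < 0.1).astype(int)
# A score with some signal: defaulters tend to score higher
scores = rng.normal(0, 1, 2_000) + 1.5 * labels

auc_model = roc_auc(scores, labels)
auc_random = roc_auc(rng.normal(size=2_000), labels)
print(round(auc_model, 3))   # well above 0.5: discriminatory power
print(round(auc_random, 3))  # around 0.5: no discriminatory power
```

An AUC of 0.5 corresponds to random ranking; the closer to 1, the better the model separates high-risk from low-risk accounts.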
Calibration of the model is tested to validate whether the model makes unbiased predictions of the probability of default. Calibration testing is not necessarily focused on the correspondence of individual predicted events to observed events, but rather on the similarity between the predicted and observed event rates across different subgroups of the population.
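A simple calibration check per subgroup can be sketched as follows; the three-grade portfolio, its account counts and its assigned PDs are all hypothetical, and the test uses a normal approximation to the binomial rather than an exact test.

```python
import numpy as np

def calibration_z(n_accounts, n_defaults, predicted_pd):
    """Per rating grade, compare the observed default rate with the
    predicted PD using a normal approximation to the binomial."""
    observed = n_defaults / n_accounts
    se = np.sqrt(predicted_pd * (1 - predicted_pd) / n_accounts)
    return (observed - predicted_pd) / se

# Hypothetical portfolio: three rating grades with assigned PDs
grades = ["A", "B", "C"]
n = np.array([4_000, 2_500, 500])
defaults = np.array([22, 48, 41])
pd_pred = np.array([0.005, 0.02, 0.06])

z = calibration_z(n, defaults, pd_pred)
for g, zi in zip(grades, z):
    flag = "biased?" if abs(zi) > 1.96 else "ok"
    print(f"grade {g}: z = {zi:+.2f} ({flag})")
```

In this made-up example the lower grades are well calibrated, while the observed default rate of grade C deviates significantly from its assigned PD, which would warrant a recalibration review for that subgroup.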
Model testing as discussed in the chapters above largely focuses on quantifiable data and model characteristics that are considered when developing and using a model. Model performance is, however, also dependent on the largely immeasurable context in which a model operates, referred to here as the model environment. The model environment can be broken down into factors pertaining to a changing environment, the processes around the use of the model, and the soundness of the model. This area of tests is of a qualitative nature but should not be overlooked.
The macro environment of credit risk models is sensitive to change. Regulatory adjustments, changes in customer behavior, market shifts and strategy readjustments of the financial institutions running the model may alter the frame in which a model is judged. Keeping track of such macro changes is important to signal whether the model and its performance are indirectly affected.
Model use is steered by the model's predefined purpose, coverage and support throughout its lifecycle. Changes in model use are, however, not uncommon and may alter the criteria on which a model is evaluated. It is important to test whether the model remains appropriate for its predefined purpose. In case of changes in model use, it should be tested whether the model fits its new purpose or whether redevelopment is needed.
The last part of the model environment concerns the soundness of the model, which largely focuses on the model design. Model design is determined by the model's limitations, its key assumptions and the balance between expert-based and data-based elements, and in turn affects the extent to which the model is in line with current best practices.
The model testing results give insight into the performance of a model, or the lack thereof. These results dictate the next steps when it comes to using, (re)calibrating or (re)developing the model. If the test outcomes suffice, the model can be implemented or remain in function. If not, the model may have to be recalibrated or redeveloped, depending on the degree of malfunction. A recalibration focuses on updating the fixed modeling parameters to enhance performance. A redevelopment goes further by assessing whether the model type needs to be altered and whether parameters should be deleted or added. Before deciding on a recalibration or redevelopment, it is necessary to identify the sources of malperformance.
Poor data quality deserves particular attention, as it may diminish the performance of even a well-calibrated model. It makes little sense to proceed with (re)developing or (re)calibrating a model while data quality is inadequate; one should first focus on cleaning the data and securing its integrity.
Signs of instability or concentration in the model input testing results may explain malperformance to a degree. This is an issue that may be solved through (re)calibration of the model. However, if malperformance issues persist, one may have to look beyond the input parameters currently implemented in the model; in this case, redevelopment is advised. Moreover, if the input changes strongly but output testing is still adequate, the question arises whether one is indeed still modeling the intended output.
Portfolio testing results will show whether the model output is susceptible to fluctuation. Retracing the source of fluctuation is necessary if the results indicate instability. If the fluctuation is a product of changes in the input parameters, then the model may be adjusted to account for predictable fluctuation.
Model output testing results will provide a starting point for where performance is lacking. The sensitivity, specificity and calibration tests all indicate different sources of bias. The former pair of tests may indicate a tendency to report false positives or false negatives, while the latter can indicate in which particular subgroups predicted event rates are biased.
Considering the impact that models may have on decision making, model testing is becoming increasingly important. In the case of credit risk models, model testing should be emphasized all the more, since malperformance can have severe consequences for financial organizations, their clients and even the economy as a whole. The value of proper model testing reaches beyond meeting regulatory requirements: it allows financial institutions to better manage their capital, increases model transparency and enables better decision making. In this article we gave an overview of the testing of credit risk models and described the different facets that can be assessed in input, portfolio and output testing. By doing so we hope to facilitate and encourage the implementation of proper model testing practice.