Outliers seem to be one of the biggest issues in operational risk calculation. Including them, especially when they represent high-magnitude events, can result in the selection of a distribution with infinite mean or variance when fitting common loss severity distributions or applying EVT, leading to unrealistic estimates of the capital charge. The literature devoted to outlier detection and treatment is quite substantial, even within operational risk management; here I shall mention only Chernobai and Rachev (2005)[1]. In this post I propose a different approach that allows outliers to be detected in small and medium-sized samples.
Outlier definition
Let us consider as an outlier an observation that lies at an “abnormal distance” from the other values in a random sample from a population. In terms of the distribution function, an outlier is an observation that cannot be “reasonably” described by the selected distribution.
The origination of outliers
According to Chandola, Banerjee and Kumar[2], outliers are caused by: “
malicious activity (such as insurance or credit card or telecom fraud, a cyber intrusion, a terrorist activity),
instrumentation error (such as defects in components of machines or wear and tear),
change in the environment (such as a climate change, a new buying pattern among consumers, mutation in genes),
human error (such as an automobile accident or a data reporting error)”.
Personally, I would add to this list the intentional grouping of data: a group of events is posted as a single event (due to lack of information or for convenience), or data representing different event types and business line activities are put into a single cluster in order to overcome data scarcity issues.
The treatment of outliers
Even if errors can be corrected and environmental change can be incorporated into the model as an additional scaling parameter, the intentional grouping of data remains a real issue that may directly affect the outcome of operational risk measurement. Finding robust alternatives to classical moments is only a partial solution, as:
1) parameters are in general estimated by maximum likelihood, since moment estimators tend to perform poorly (they use only a few features of the data rather than the entire set of observations);
2) robust estimates of the parameters of many severity distribution functions are rarely available in statistical software. Try, for example, to produce robust estimates of the moments and parameters of even a simple transformed beta distribution.
The proposed solution
First let us consider loss data. One may argue that a single loss amount is always a finite figure, because the size of transactions and the value of individual assets are limited. The number of losses is also limited, so the moments of the severity and frequency distributions should be finite positive numbers. For this reason, distributions without finite moments should be disregarded as contrary to observation. This is, however, not sufficient to prevent an unrealistic assessment of operational risk.
Thus a way to overcome these difficulties is to detect the outliers and build a mixture of two distributions:
- a parametric model fitted to the data without outliers,
- a contamination component containing only the outliers,
or to calculate the VaR for the data cleaned of outliers and separately for the outliers.
In order to discover the outliers, let us recall that a traditional measure of the “outlyingness” of an observation with respect to a sample is the ratio of its distance to the sample mean over the sample standard deviation (Maronna, Martin, Yohai 2006)[3], the basis of the so-called 3-sigma rule:

$$t_i = \frac{x_i - \bar{x}}{s} \qquad (1)$$

The next step is to find a limit on $t_i$.
For any sample we have the following inequality (Samuelson's inequality):

$$\frac{|x_i - \bar{x}|}{s} \le \frac{n-1}{\sqrt{n}} \qquad (2)$$

so an observation $x_0$ is an outlier if:

$$\frac{|x_0 - \bar{x}|}{s} > \frac{n-1}{\sqrt{n}} \qquad (3)$$

where $\bar{x}$ and $s$ are computed from the remaining $n-1$ observations: by (2) no member of a sample of size $n$ can lie further from the sample mean, so such an observation cannot belong to the sample.
This solution is attractive for a small number of observations (below 38), as the quotient $(n-1)/\sqrt{n}$ then stays below 6. We may also put an upper limit on the value of this quotient. According to the Chebyshev inequality we have:

$$P(|X - \mu| \ge t\sigma) \le \frac{1}{t^2}$$
If we want to calculate the VaR at 99.9%, an initial upper bound on the value of t can be fixed at 33, since 1/33² ≈ 0.09% < 0.1%. This bound may later be reduced after the selection of the proper distribution function.
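As an illustration, here is a minimal Python sketch of one possible reading of this detection rule (the function name samuelson_outliers and the iterative removal scheme are mine; the text does not prescribe an exact algorithm). The most extreme observation is tested against the mean and standard deviation of the remaining points, using the bound from equations (2)–(3):

```python
import math
import statistics

def samuelson_outliers(data):
    """Iteratively flag outliers using the bound in equations (2)-(3).

    The most extreme observation is tested against the mean and standard
    deviation of the remaining points; it is flagged when its standardized
    distance exceeds (n - 1)/sqrt(n), the largest value t_i can take
    inside a sample of size n by Samuelson's inequality.
    """
    clean = list(data)
    outliers = []
    while len(clean) > 2:
        n = len(clean)
        center = statistics.mean(clean)
        candidate = max(clean, key=lambda x: abs(x - center))
        rest = list(clean)
        rest.remove(candidate)
        mean = statistics.mean(rest)
        s = statistics.stdev(rest)
        # Samuelson bound; the Chebyshev cap of 33 would apply at the
        # 99.9% level, but for n < 1091 the quotient below is tighter.
        bound = (n - 1) / math.sqrt(n)
        if s > 0 and abs(candidate - mean) / s > bound:
            outliers.append(candidate)
            clean = rest
        else:
            break
    return clean, outliers
```

Applied to the copper data of Appendix I, this sketch flags 28.95 and stops at 5.28, which stays just inside the bound, consistent with the conclusions below.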
After calculating the VaR for the bulk of the data (without outliers), the analyst should calculate the VaR for the outliers. Suppose we have only N outliers over a period of T years. One may argue, based on the historical evidence, that the probability of an extreme event is N/T per year and that the possible loss for any outlier can be modelled by the observed historical losses. The evaluation can be done with conditional probabilities. As the number of outliers is limited, the calculation should not be cumbersome.
Example:
Suppose we have noted, in a 10-year period, 2 outliers with losses equal to 500,000 and 2,000,000 PLN. As we suppose that the outliers are independent events (we implicitly accepted this hypothesis when the data were put together in a single cluster, before declaring them outliers), the probability of a single event in a 1-year period is 1/5, the probability of 2 events (1/5)^2, of 3 events (1/5)^3, etc. The probability of no event is therefore 1 − Σ(1/5)^k = 0.75.
If a single event occurs, the loss is 500,000 or 2,000,000, each with probability 10% (that is, 20% × 1/2).
As we are still far from the 99.9% confidence level, we shall continue.
If 2 events occur, the loss will be 1,000,000; 2,500,000 or 4,000,000. The probabilities of these losses are respectively (1/5)^2 × 1/4, (1/5)^2 × 2/4 and (1/5)^2 × 1/4; the weights 1/4, 2/4, 1/4 can be read off Pascal's triangle.
The joint probability of no event, 1 or 2 events is 75% + 20% + 4% = 99%.
If 3 events occur, the loss will be 1,500,000; 3,000,000; 4,500,000 or 6,000,000,
with probabilities (1/5)^3 × 1/8, (1/5)^3 × 3/8, (1/5)^3 × 3/8 and (1/5)^3 × 1/8 respectively.
The joint probability of no event, 1, 2 or 3 events is 99.8%.
If 4 events occur, the loss will be 2,000,000; 3,500,000; 5,000,000; 6,500,000 or 8,000,000, with probabilities:
(1/5)^4 × 1/16, (1/5)^4 × 4/16, (1/5)^4 × 6/16, (1/5)^4 × 4/16 and (1/5)^4 × 1/16 respectively.
The joint probability of no event, 1, 2, 3 or 4 events is 99.96%.
So if we disregard the remaining cases we obtain:
Probability of no loss is 75%
Probability of a loss of 500,000 PLN is 10%
Probability of a loss of 1,000,000 PLN is 1%
Probability of a loss of 1,500,000 PLN is 0.1%
Probability of a loss of 2,000,000 PLN is 10% + 0.01% = 10.01%
Probability of a loss of 2,500,000 PLN is 2%
Probability of a loss of 3,000,000 PLN is 0.3%
Probability of a loss of 3,500,000 PLN is 0.04%
Probability of a loss of 4,000,000 PLN is 1%
Probability of a loss of 4,500,000 PLN is 0.3%
Probability of a loss of 5,000,000 PLN is 0.06%
Probability of a loss of 6,000,000 PLN is 0.1%
Probability of a loss of 6,500,000 PLN is 0.04%
Probability of a loss of 8,000,000 PLN is 0.01%
Summing these probabilities in ascending order of loss, the cumulative probability reaches 99.91% at 6,000,000 PLN, so the VaR for the outliers can be estimated conservatively at 6,000,000 PLN.
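A short Python sketch of this enumeration (a minimal reconstruction of the worked example above; the truncation at 4 events follows the text):

```python
from fractions import Fraction
from itertools import product

# Frequency model from the example: p = N/T = 2/10 = 1/5;
# P(k events in a year) = (1/5)**k for k >= 1, and P(no event) = 3/4.
# Each event's loss is drawn uniformly from the observed outlier amounts.
p = Fraction(1, 5)
amounts = [500_000, 2_000_000]

dist = {0: Fraction(3, 4)}
for k in range(1, 5):                         # truncate at 4 events, as in the text
    for combo in product(amounts, repeat=k):  # 2**k equally likely loss combinations
        total = sum(combo)
        dist[total] = dist.get(total, Fraction(0)) + p**k / 2**k

cum = Fraction(0)
for loss in sorted(dist):
    cum += dist[loss]
    if cum >= Fraction(999, 1000):
        print(f"99.9% VaR for the outliers: {loss:,} PLN")  # prints 6,000,000 PLN
        break
```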
Comparing the approach with a robust version of the t test
I have compared the approach with the robust version of t_i, denoted t'_i, proposed by Maronna, Martin and Yohai[3]. As data I used those provided by the authors in their examples 1.1 and 1.2.
For example 1.1, I found that the last (24th) observation is an outlier, while the 23rd is close to the limit. The three authors mentioned above also consider the last observation an outlier. They did not pronounce on whether the 23rd was an outlier; however, their test flags it as one, its t'_i value exceeding the critical value of 3.
For example 1.2, both approaches show that the first two observations are outliers.
My test seems to have a little less power, but t'_i merely flags “suspicious” values at a given confidence level (here 99.9%), whereas by equation (3) the probability that the observation belongs to the population[4] described by the remaining data is 0 once the upper or lower limit is exceeded.
Details of the data and calculations are given in Appendix I.
Appendix I
First set of data: copper content in wholemeal flour (in parts per million), sorted in ascending order (Analytical Methods Committee, 1989)[5].
2.2;2.2;2.4;2.4;2.5;2.7;2.8;2.9;3.03;3.03;3.1;3.37;3.4;3.4;3.4;3.5;3.6;3.7;3.7;3.7;3.7;3.77;5.28;28.95.
t'_i for the last 2 values are 3.60 and 48.57; the proposed critical value for t'_i is equal to 3. With my approach the critical values (lower and upper bounds) are 0.68 and 5.54. That means that the last observation is certainly an outlier, while the previous one is close to the bound and can be considered a dubious case.
Second set of data: 20 determinations of the time, in microseconds, needed for light to travel a distance of 7,442 meters. The figures (sorted in ascending order) provided by Stigler[6] are the figures below multiplied by 0.001, plus 24.8.
-44; -2; 16; 19; 21; 22; 23; 24; 24; 25; 26; 27; 28; 29; 29; 30; 31; 33; 34; 40
t'_i for the first 2 observations are −11.73 and −4.64, indicating the presence of outliers. In my approach the lower bound for an outlier is 2.8, also indicating that they are outliers.
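The bounds above can be reproduced with a short Python check (a sketch assuming my reading of the method: the mean and standard deviation come from the sample with the suspected outliers held out, and n in the bound (n−1)/√n counts those clean points plus the one candidate being tested):

```python
import math
import statistics

def samuelson_limits(clean):
    """Limits outside which a tested observation is declared an outlier."""
    n = len(clean) + 1  # clean points plus the candidate under test
    half_width = (n - 1) / math.sqrt(n) * statistics.stdev(clean)
    center = statistics.mean(clean)
    return center - half_width, center + half_width

# Copper data with the two suspected values (5.28, 28.95) held out:
copper = [2.2, 2.2, 2.4, 2.4, 2.5, 2.7, 2.8, 2.9, 3.03, 3.03, 3.1, 3.37,
          3.4, 3.4, 3.4, 3.5, 3.6, 3.7, 3.7, 3.7, 3.7, 3.77]
print(samuelson_limits(copper))  # approx. (0.68, 5.54)

# Light-speed data with the two suspected values (-44, -2) held out:
light = [16, 19, 21, 22, 23, 24, 24, 25, 26, 27, 28, 29, 29, 30, 31, 33, 34, 40]
print(samuelson_limits(light))   # lower limit approx. 2.8
```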
[1] Chernobai, A., Rachev, S.T. (2005) “Applying Robust Methods to Operational Risk Modeling”.
[2] Chandola, V., Banerjee, A., Kumar, V. (2007) “Outlier Detection: A Survey”.
[3] Maronna, R.A., Martin, R.D., Yohai, V.J. (2006) “Robust Statistics: Theory and Methods”, Chichester, UK: John Wiley & Sons Ltd.
[4] I am considering the original data and not the distribution supposed to describe the data(!).
[5] Analytical Methods Committee (1989) “Robust Statistics – How Not to Reject Outliers”, Analyst, 114, pp. 1693–1702.
[6] Stigler, S.M. (1977) “Do Robust Estimators Work with Real Data?”, The Annals of Statistics, 5, pp. 1055–1098.