Why do we need so many performance metrics in hypothesis testing?
Hypothesis testing is one of the most important exercises in any data-driven or evidence-based research. In a common scenario we have two alternatives: one is called the ‘null hypothesis’ and the other the ‘alternative hypothesis’. For example, the null hypothesis may be ‘tomorrow is a sunny day’ and the alternative ‘tomorrow is a rainy day’. There are many such examples: the alternative hypothesis could be that an email is spam, a transaction is fraudulent, or a person is COVID positive.
In order to find out which of the hypotheses is supported, we use statistical methods and data coming from experiments or observations.
Before we make a decision about the outcome of a test, we must fix the values of a set of decision parameters, which often take the form of thresholds. Let us take the example of the Fasting Plasma Glucose (FPG) test, in which we measure a patient's plasma glucose (mg/dL) and, if it is found to be higher than 99, we declare the patient diabetic (note that this is just for illustration, and the actual clinical test is very different from what I am discussing here).
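As a minimal sketch of such a decision rule (the function and the threshold of 99 are purely illustrative, following the example above, and not real medical guidance):

```python
# Illustrative decision rule for the FPG example above: declare a patient
# diabetic when the measured fasting plasma glucose (mg/dL) exceeds a
# chosen threshold. The value 99 is the hypothetical cut-off from the text.

FPG_THRESHOLD = 99  # a decision parameter we must choose ourselves

def classify_fpg(glucose_mg_dl: float, threshold: float = FPG_THRESHOLD) -> str:
    """Return the test decision for a single FPG measurement."""
    return "diabetic" if glucose_mg_dl > threshold else "normal"

print(classify_fpg(92))   # normal
print(classify_fpg(105))  # diabetic
```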
Numbers like the 99 used above are not god-given (like the fundamental constants of nature); we must decide their values ourselves. Disagreements about these values are more common than agreements. In any case, these parameters are important, and they control the performance measures in hypothesis testing.
When we first encounter the performance measures of hypothesis testing, such as accuracy, precision, recall, specificity, power, F1 score, the ROC curve, etc., the first question that may come to mind is: why can we not have just one performance measure like accuracy? Why do we need so many?
I have gone through many blogs and articles and found that although these measures are explained very well, not much emphasis is given to explaining why we need the different measures, and that is exactly the goal of this article. In it I will explain why we need different performance metrics, while also giving their definitions and some examples.
Hypothesis tests are not very different from medical diagnostic tests; in fact, they are motivated by them. In a diagnostic test we measure some parameter and from it estimate the probability of someone being normal (or, alternatively, having some health issue), as discussed above for the FPG test.
Since the diagnostic parameters we measure are random variables, their values follow some probability distribution, quite often close to Gaussian. This means that although most of the data points will fall within a narrow range, there will also be a few very high and very low values. For example, in the case of the FPG test we may find normal persons with values as high as 105. It is up to us to decide how wide a range we consider for the normal case, that is, for the null hypothesis. A larger range will lead to some unhealthy persons being declared healthy, and a smaller one will lead to some healthy persons being declared unhealthy.
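As a minimal simulation of this trade-off (the two Gaussian populations below are invented for illustration and do not represent real clinical data), we can see how moving the threshold trades false positives against false negatives:

```python
# Simulate how the choice of threshold trades false positives for false
# negatives. Both populations are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
healthy = rng.normal(90, 8, 100_000)     # hypothetical healthy glucose values
diabetic = rng.normal(115, 10, 100_000)  # hypothetical diabetic glucose values

for threshold in (95, 99, 105):
    fp_rate = np.mean(healthy > threshold)    # healthy flagged as diabetic
    fn_rate = np.mean(diabetic <= threshold)  # diabetic flagged as healthy
    print(f"threshold {threshold}: FP rate {fp_rate:.2%}, FN rate {fn_rate:.2%}")
```

Raising the threshold lowers the false positive rate but raises the false negative rate, which is exactly the dilemma described above.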
One of the most common quantities used to make this decision is the p-value, which represents the probability of getting a value at least as extreme as the one observed, assuming that the null hypothesis is true. There is generally a threshold, called alpha, against which the p-value is compared. We can also say that the p-value tells us how significant the results of our hypothesis testing exercise are. The smaller the p-value, the stronger the evidence against the null hypothesis, and so the lower the probability of a false positive.
In Figure 1 below, three vertical lines represent three threshold values used to reject the null hypothesis, which is represented by the Gaussian probability distribution shown in the figure. If we choose the threshold too small (1 sigma), there is a very high probability of rejecting the null hypothesis when it is in fact true. In other words, we are accepting the alternative hypothesis on the basis of very weak evidence.
A lower threshold also means too many false alarms (type I errors), as represented by the area under the probability curve outside the threshold lines. If we choose a large threshold (3 sigma), the probability of a false alarm is small, but there is a danger of a high probability of false negatives. It is up to us to decide which of these we can live with. It is a common practice in research to require a p-value of 0.05 or less, i.e., a probability of false positives of 5% or less.
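As a minimal sketch of this decision rule, assuming a two-sided z-test whose statistic is standard normal under the null hypothesis, we can compute the p-values at the 1-, 2- and 3-sigma thresholds of Figure 1 and compare them with alpha:

```python
# Compare p-values at the 1-, 2- and 3-sigma thresholds with alpha = 0.05,
# assuming the test statistic is standard normal under the null hypothesis.
from scipy.stats import norm

ALPHA = 0.05  # conventional significance threshold

def two_sided_p_value(z: float) -> float:
    """Probability of a value at least this extreme under the null."""
    return 2 * norm.sf(abs(z))  # sf(x) = 1 - cdf(x), the upper tail

for z in (1.0, 2.0, 3.0):
    p = two_sided_p_value(z)
    decision = "reject the null" if p < ALPHA else "fail to reject the null"
    print(f"{z:.0f} sigma: p = {p:.4f} -> {decision}")
```

At 1 sigma the p-value is about 0.32, far too weak to reject the null hypothesis; at 3 sigma it is about 0.003, comfortably below the 0.05 convention.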
PERFORMANCE METRICS
After this background, let us discuss the performance metrics and the motivations behind them. In order to do that we need the following set of variables.
1. True Negative (TN)
2. False Positive (FP)
3. True Positive (TP)
4. False Negative (FN)
The meaning of these variables is clear from Figure 2 below.
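As a minimal sketch of how these four counts are obtained in practice, we can compare ground-truth labels with test decisions (the labels below are made-up toy data):

```python
# Count TP, TN, FP and FN by comparing true labels with test decisions
# (1 = positive, 0 = negative). The labels are toy data for illustration.

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # test decisions

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(TP, TN, FP, FN)  # 3 3 1 1
```

On the basis of these four variables we can define the following set of performance metrics.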
1) Accuracy: This is the most common measure and is defined in the following way:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
This measure is good enough when there is no difference between the risk associated with a false positive and that associated with a false negative, and we just want to measure the fraction of cases that have been correctly identified: negatives identified as negative and positives as positive. Apart from that, we also assume that the processes leading to false positives and false negatives are the same, which may not be the case in many situations. (A short code sketch after these definitions shows how this and the other measures below are computed from the four counts.)
2) Precision: It is defined in the following way:
Precision = TP / (TP + FP)
This measure focuses on TP and can be maximised by keeping FP (type I errors) low. A high threshold, i.e., a small alpha, can help with that. Note that high precision means we have few false positives, but it says nothing about false negatives. For example, if our alternative hypothesis is that ‘an email is spam’, then high precision means the probability of a genuine email being declared spam is low, and that is desirable in this case: an important email being declared spam would be a disaster.
3) Recall / Sensitivity: It is defined in the following way:
Recall = TP / (TP + FN)
In clinical tests, 100% sensitivity means that we correctly identify all patients who have the disease. In this case we maximise recall by keeping false negatives low. Sensitivity, or recall, can be quite important in medical diagnostics, where we do not want a person carrying a deadly virus like COVID-19 to be identified as negative. Here we can tolerate a healthy person being declared unhealthy and receiving treatment (which is unlikely to go far, since further testing will clear the person), but we cannot tolerate an unhealthy person being declared healthy and not getting an opportunity for treatment.
Note that in many discussions recall is also called sensitivity, hit rate, or true positive rate.
4) Specificity: It is defined in the following way:
Specificity = TN / (TN + FP)
In any clinical test, 100% specificity means that we correctly identify all patients who do not have the disease.
5) F1 Score:
As we can see from the formulas above, a high precision alone or a high recall alone is not good enough, since each can be increased at the expense of the other, so a combined measure called the F1 score, the harmonic mean of precision and recall, is defined as follows:
F1 Score = 2 * Precision * Recall / (Precision + Recall)
6) Likelihood Ratio:
There are many practical situations in which we use the likelihood ratio (more precisely, the positive likelihood ratio), which is defined in the following way in terms of sensitivity and specificity:
Likelihood Ratio = Sensitivity / (1 - Specificity)
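To make the formulas above concrete, here is a minimal sketch computing each of the measures from the four counts (reusing the toy counts from the earlier sketch; a real application would plug in its own counts):

```python
# Compute every measure defined above from the four confusion counts.
# The counts come from the toy example earlier in the article.

TP, TN, FP, FN = 3, 3, 1, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)                    # also called sensitivity
specificity = TN / (TN + FP)
f1_score = 2 * precision * recall / (precision + recall)
likelihood_ratio = recall / (1 - specificity)  # positive likelihood ratio

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity),
                    ("F1 score", f1_score), ("likelihood ratio", likelihood_ratio)]:
    print(f"{name}: {value:.3f}")
```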
RATES
Some of the measures mentioned above can also be expressed in terms of the following four rates (a short sketch after this list illustrates them).
1) True Positive Rate (TPR)
TPR = True Positive / Total Actual Positive = TP / (TP + FN)
Sometimes this is also called sensitivity, recall, or hit rate.
2) True Negative Rate (TNR)
TNR = True Negative / Total Actual Negative = TN / (TN + FP)
Sometimes this is also called specificity or selectivity.
3) False Discovery Rate (FDR)
FDR = False Positive / Total Predicted Positive = FP / (FP + TP)
4) False Negative (Miss) Rate (FNR)
FNR = False Negative / Total Actual Positive = FN / (FN + TP)
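Here is the corresponding minimal sketch for the four rates, again with the toy counts from above; note that TPR and FNR are complements, as are TNR and the false positive rate FP / (FP + TN):

```python
# Compute the four rates from the same toy confusion counts as above.

TP, TN, FP, FN = 3, 3, 1, 1

TPR = TP / (TP + FN)  # true positive rate: sensitivity / recall / hit rate
TNR = TN / (TN + FP)  # true negative rate: specificity / selectivity
FDR = FP / (FP + TP)  # false discovery rate: wrong fraction of positive calls
FNR = FN / (FN + TP)  # false negative (miss) rate: complement of TPR

print(f"TPR = {TPR:.3f}, TNR = {TNR:.3f}, FDR = {FDR:.3f}, FNR = {FNR:.3f}")
print(f"TPR + FNR = {TPR + FNR:.1f}")  # always 1
```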
There are more measures than the ones I have discussed here, and my aim was not to give a complete reference for them but to emphasise that a performance measure which is relevant and effective in one case may not be so in others. That is why we need different performance measures.
In case you find this article useful, please like and share it, and if you have comments, let me know as well.