Making sense of randomised trials

This article originally appeared in HIV Treatment Update, a newsletter published by NAM between 1992 and 2013.

Want to know more about cutting-edge treatment news but feeling blinded by science? Caroline Sabin, Professor of Medical Statistics and Epidemiology at University College London, describes how clinical trials are constructed and analysed to ensure they provide a fair and unbiased result, and offers some guidance on how to interpret the statistics.

Glossary

endpoint

In a clinical trial, a clearly defined outcome which is used to evaluate whether a treatment is working or not. Trials usually have a single primary endpoint (e.g. having an undetectable viral load) as well as a few secondary endpoints, covering other aspects of treatment safety, tolerability and efficacy.

treatment effect

In clinical trials that compare treatments, the treatment effect is the additional benefit provided by the new treatment, over and above that which would have been expected by chance or using standard care.

randomised controlled trial (RCT)

The most reliable type of clinical trial. In a trial comparing drug A with drug B, patients are split into two groups, with one group receiving drug A and the other drug B. After a number of weeks or months, the outcomes of each group are compared.

p-value

The result of a statistical test which tells us how likely it is that the results of a study could have arisen by chance alone. All p-values are between 0 and 1; the closer a p-value is to 0, the stronger the evidence that the finding is not due to chance. A p-value of 0.001 means that, if there were truly no difference, results like these would arise by chance only 1 time in 1000. A p-value of 0.05 means they would arise by chance 1 time in 20. When a p-value is 0.05 or below, the result is considered to be ‘statistically significant’. Confidence intervals give similar information to p-values but are easier to interpret. 

statistical significance

Statistical tests are used to judge whether the results of a study could be due to chance and would not be confirmed if the study was repeated. If a result is unlikely to be due to chance alone, it is described as ‘statistically significant’. 

Randomised controlled trials (RCTs) are of fundamental importance when judging the value of a new treatment or strategy.

There are two main reasons for this. Firstly, trials include a control group (or ‘arm’). This group is made up of people who will take the standard treatment, or none if there is no standard care, instead of the new treatment. This means the investigators can evaluate any additional health gains associated with the new treatment, over and above any that would have been expected anyway.

Secondly, treatments in an RCT are allocated to individuals in a random manner. This means that the characteristics of the people receiving each treatment should be similar at the start of the trial, so if there are any differences in outcomes at the end of the trial, it can be assumed that these are due to the treatment itself.

Often a placebo (dummy) treatment is used to ‘blind’ the trial. The aim is that the patient (in a ‘single-blinded’ trial) or, as is more often the case, both the patient and the clinical team (in a ‘double-blinded’ trial) do not know which treatment is being taken. This is important so that patients are not treated in a different way, and do not report symptoms selectively, because they know which treatment they are on.

To really benefit from these features, data from RCTs has to be examined carefully to distinguish genuine findings from ones that mean nothing, and to identify possible causes of bias.

Randomisation

Randomisation means that a particular treatment is allocated to a participant in a trial on the basis of chance alone. As a result, the ‘baseline characteristics’ of people in each arm of the trial should be broadly similar.
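
As a minimal sketch of the idea (the participant IDs and the simple 50:50 scheme below are purely illustrative; real trials usually use more sophisticated schemes, such as permuted blocks run by a central randomisation service), treatment allocation might look like this in Python:

```python
import random

random.seed(42)  # fixed seed so the allocation list can be reproduced

# Twenty hypothetical participant IDs (invented for this example)
participants = [f"P{i:03d}" for i in range(1, 21)]

# Simple (unrestricted) randomisation: each participant independently
# has a 50:50 chance of being allocated to regimen A or regimen B.
allocation = {person: random.choice(["A", "B"]) for person in participants}

for person, arm in allocation.items():
    print(person, "-> regimen", arm)
```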

Thus the first important analysis to be performed in any trial is a simple comparison of the key demographic and clinical characteristics of the individuals in each arm of the trial at baseline (before starting the new treatment or strategy).

We sometimes perform hypothesis tests to investigate whether any of the differences between the groups are statistically significant or not. These tests usually start with an assumption that any difference between the groups is due to chance alone: this is referred to as the null hypothesis. If the differences witnessed are then bigger than would be expected by chance, the differences are said to be statistically significant.

Hypothesis testing of the people taking part in the trial (the ‘trial population’) is vital in non-randomised trials, such as an observational survey of patient outcomes in a ‘real-world’ setting. In an RCT, however, we know that randomisation should mean that any differences seen are due to chance only. The statistical methods used to analyse the outcomes of RCTs make an allowance for these chance differences, and so in general we don’t need to worry about them. However, if there are any substantial differences in the baseline characteristics of those receiving the different treatments, we may have to allow for this in any analysis.
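
To make this concrete, here is a small sketch (with made-up baseline numbers) of the kind of hypothesis test involved: a chi-squared test of whether the sex balance differs between arms more than chance alone would explain.

```python
from scipy.stats import chi2_contingency

# Made-up baseline data: numbers of women and men in each arm
#                women  men
baseline = [[150, 263],   # arm A (413 participants)
            [158, 263]]   # arm B (421 participants)

chi2, p_value, dof, expected = chi2_contingency(baseline)
print(f"p = {p_value:.2f}")  # a large p-value is consistent with the
                             # arms being balanced, as randomisation intends
```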

Endpoints

Trial endpoints are outcomes that capture the effects of the new treatment. It is generally recommended that investigators should identify a single primary endpoint which best reflects how well the treatment works. Any major decisions about how good it is will be based on this endpoint.

The choice of primary endpoint also determines how many participants need to be randomised to each arm: for instance, you’ll need fewer participants if your endpoint is the number who achieve an undetectable viral load (a common event) than if it is the number who die (hopefully a rare one).
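
As a rough illustration of how such a calculation might look (the response rates, power and significance level below are assumptions chosen for the example, not figures taken from this article):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed response rates (illustrative only): 77% viral suppression on
# standard care, 85% hoped for on the new regimen.
effect = proportion_effectsize(0.85, 0.77)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,            # the usual significance threshold
    power=0.8,             # an 80% chance of detecting a real difference
    alternative='two-sided',
)
print(round(n_per_arm))    # roughly 374 per arm under these assumptions
```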

The trial will also have several secondary endpoints, chosen to capture other important aspects of treatment effect and safety, which will be used to provide supportive evidence when decisions are made about the future use of the new treatment.

In the HIV setting, the choice of a trial endpoint is not always straightforward. Endpoints may be binary (i.e. ones with a yes/no response, such as whether the viral load is above or below 50 copies/ml at week 48), continuous, such as the CD4 count, or time-to-event (endpoints that measure the time taken for an event – such as viral load suppression – to occur).

However, as well as potent antiviral and immunological activity, HIV drugs should ideally also have minimal potential for the development of resistance, be associated with as few toxicities as possible and should have minimal impact on quality of life. Choosing a single endpoint to capture these aspects is often problematic.

For this reason, there has been a move towards the use of composite endpoints in trials. These consider a range of outcomes, with an individual meeting the criteria for the composite endpoint as soon as s/he meets the criteria for any one component of the endpoint. For example, the time-to-loss-of-virological-response (TLOVR) algorithm generates a composite endpoint, which incorporates components relating to confirmed virological failure, early drop-out from the trial, switching to a new treatment and the development or worsening of illness.

Analysing a composite endpoint takes a lot of care. Take a situation, for instance, where overall there is no difference in TLOVR between arms but, while the new treatment is more effective than the old one, it also causes more drop-outs due to mild but bothersome side-effects. How much weight do you give to each component?
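
To illustrate the general idea – this is a deliberately simplified sketch, not the actual TLOVR algorithm – a composite time-to-event endpoint can be derived as the earliest time at which any component event occurs:

```python
# Simplified sketch of a composite time-to-event endpoint: a participant
# reaches the endpoint at the earliest of several component events.
# Times are in weeks; None means the event never happened during follow-up.

def composite_event_time(virological_failure, dropout, treatment_switch, new_illness):
    """Return the earliest component event time, or None if no event occurred."""
    times = [t for t in (virological_failure, dropout, treatment_switch, new_illness)
             if t is not None]
    return min(times) if times else None

# A hypothetical participant who switched treatment at week 24 and
# never experienced virological failure:
print(composite_event_time(virological_failure=None, dropout=None,
                           treatment_switch=24, new_illness=None))  # -> 24
```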

Analysing the endpoints: how to read trial results

Assuming that there are no large imbalances in baseline characteristics, endpoints can be compared across the treatment arms using simple tests such as the Chi-squared test or a t-test. The outcome of these methods is a value known as the p-value, which allows investigators to judge whether their findings are likely to be real or due to chance.

However, in an RCT, this information on its own is not sufficient - we also need to estimate the size of the treatment effect (how much additional benefit has been gained through the use of the new treatment?) and calculate its confidence interval (how precise is our estimate of the effect?).

As an example, look at the table below. We have a trial with two treatment arms – regimen A (the investigational regimen) and regimen B (standard care). Our primary endpoint is the proportion of patients with a viral load below 50 copies/ml at 48 weeks. The table shows the outcome in each arm of the trial: 85% with a viral load under 50 copies/ml in arm A and 77.4% in arm B.

The p-value in this table (which equals 0.007) is less than 0.05 (the usual threshold that we use to indicate statistical significance). This means that, if there were truly no difference between the regimens, a difference of this size or larger would have arisen by chance in only 0.7% of trials – good evidence that the effect is real.

The treatment effect is 7.6% (85.0% minus 77.4%). This means that for every 100 patients treated with regimen A, we would expect that an additional 7.6 patients would attain virological suppression, compared to the number we would have expected had they all been treated with regimen B.

The confidence interval for this effect tells us that the true benefit could be as small as 2.3% or as large as 12.8%.
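
For readers who like to check the arithmetic, the figures in the table can be reproduced in a few lines of Python (a sketch using scipy; the simple Wald formula used here for the confidence interval is one common choice among several):

```python
from math import sqrt
from scipy.stats import chi2_contingency

n_a, resp_a = 413, 351   # regimen A: randomised, with viral load <50 copies/ml
n_b, resp_b = 421, 326   # regimen B

# Chi-squared test on the 2x2 table of responders vs non-responders
table = [[resp_a, n_a - resp_a],
         [resp_b, n_b - resp_b]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p = {p_value:.3f}")   # 0.007, as in the table

# Treatment effect and a 95% confidence interval (Wald method)
p_a, p_b = resp_a / n_a, resp_b / n_b
effect = p_a - p_b            # 0.076, i.e. 7.6 percentage points
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
print(f"effect = {effect:.1%}, "
      f"95% CI {effect - 1.96*se:.1%} to {effect + 1.96*se:.1%}")
# effect = 7.6%, 95% CI 2.3% to 12.8%
```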

We can now use this information to consider the benefits of the new regimen in light of any disadvantages (e.g. an increased cost or worse toxicity profile).

Our secondary endpoint is the CD4 count and here we would also want to see what, if any, additional improvement is provided by the new regimen over and above standard care.

In the example, it can be seen that regimen A is associated with only an additional 6 cells/mm3 gain in CD4 count over the 48-week period compared to regimen B. This is of borderline significance: the p-value is 0.05, indicating that had we conducted 100 such trials, and there really had been no difference between the drugs, then we would have seen a difference of this size or greater in five of them. You’ll notice also that the confidence interval crosses (i.e. includes) zero – the true additional ‘benefit’ provided by the new drug could be as great as plus 12.1 cells/mm3 or as small as minus 0.06 cells/mm3 (i.e. a very small detrimental effect on the CD4 count).
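
The article does not report the spread of the CD4 changes, but assuming – purely for illustration – a standard deviation of roughly 44 cells/mm3 in each arm gives results very close to those in the table:

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

# Mean CD4 changes and arm sizes are from the table; the standard
# deviation of 44 cells/mm3 is an ASSUMPTION made for this sketch.
result = ttest_ind_from_stats(mean1=63, std1=44, nobs1=413,
                              mean2=57, std2=44, nobs2=421)
print(f"p = {result.pvalue:.2f}")   # close to 0.05, borderline significance

# 95% confidence interval for the 6 cells/mm3 difference
se = 44 * sqrt(1/413 + 1/421)
print(f"95% CI: {6 - 1.96*se:+.1f} to {6 + 1.96*se:+.1f}")
# roughly 0 to +12 cells/mm3, close to the interval in the table
```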

Outcomes from an RCT in which individuals are randomised to one of two treatment arms

                                                 Regimen A          Regimen B
                                                 (investigational   (standard care)   p-value   Treatment effect (A – B);
                                                 drug)                                          95% confidence interval

Number (N) of individuals randomised             413                421

Primary endpoint:
N (%) with viral load below 50 copies/ml         351 (85.0%)        326 (77.4%)       0.007     7.6% (2.3% to 12.8%)
at 48 weeks

Secondary endpoint:
Mean change in CD4 cell count (cells/mm3)        63                 57                0.05      +6 (-0.06 to +12.1)
from baseline to week 48

Confounders

A major benefit of randomisation is that it minimises the possibility that confounding may be present. Confounding is encountered frequently in observational studies and occurs because the characteristics of individuals receiving one regimen are different to those of individuals receiving another. This means that if outcomes differ between the groups, it is difficult to know whether this is a result of the different treatments, or the different characteristics. Similarly, if outcomes turn out to be the same, we may fail to detect a true difference.

The lack of confounding in most RCTs means that it is acceptable to present simple comparisons of the outcomes in the different treatment arms. If, however, the characteristics of people recruited to the different treatment arms in a trial are substantially different, and any of these characteristics could also impact on outcomes, then investigators may need to perform an adjusted analysis. In adjusted analyses, we try to control for (weed out) the other factors. We usually use regression, a mathematical technique that allows us to gauge the degree of influence one or more factors have on an outcome.
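
As a sketch of what an adjusted analysis might look like (the dataset and column names below are invented for illustration), logistic regression on a binary endpoint could be run with statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, invented for illustration: 800 participants with a
# randomised arm, two baseline characteristics and a binary outcome.
rng = np.random.default_rng(0)
n = 800
df = pd.DataFrame({
    "arm": rng.choice(["A", "B"], size=n),
    "age": rng.normal(40, 10, size=n),
    "baseline_cd4": rng.normal(350, 100, size=n),
})
# In this simulation the outcome depends on the arm and, weakly, on
# the baseline CD4 count.
log_odds = -0.5 + 0.4 * (df["arm"] == "A") + 0.002 * df["baseline_cd4"]
df["suppressed"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# The coefficient for the arm, adjusted for age and baseline CD4 count,
# estimates the treatment effect on the log-odds scale.
model = smf.logit("suppressed ~ C(arm) + age + baseline_cd4", data=df).fit()
print(model.summary())
```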

False positives and negatives

When we perform any statistical comparison on a dataset, there is always the possibility that we may get the wrong answer. This doesn’t mean that our sums are wrong: it means that the result of the hypothesis test fails to reflect the true situation.

Statistical errors are of two types. Firstly, we may find a significant difference in outcome that is simply a chance finding (a false-positive signal): this is called a type 1 error. The threshold usually set for statistical significance means that false-positive signals will occur, on average, in one of every 20 tests performed. What seems significant may not always be so.

Secondly, we may fail to detect an important effect of the new treatment: this is a false-negative result and is called a type 2 error. The most common reason for this is that the trial did not recruit sufficient numbers of individuals. This will mean that the confidence intervals are too large for findings, even if they are real, to be statistically significant. It’s like trying to see details in an out-of-focus photograph.
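
A quick simulation makes the false-positive rate concrete: if two ‘arms’ are repeatedly drawn from exactly the same population, about 1 in 20 comparisons will nevertheless come out ‘significant’ at the 0.05 level.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_trials = 2000
false_positives = 0

for _ in range(n_trials):
    # Both 'arms' are drawn from the SAME population, so any
    # significant result is, by construction, a false positive.
    arm_a = rng.normal(0, 1, size=100)
    arm_b = rng.normal(0, 1, size=100)
    if ttest_ind(arm_a, arm_b).pvalue < 0.05:
        false_positives += 1

print(false_positives / n_trials)   # close to 0.05, i.e. about 1 in 20
```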

Dealing with missing people and data

Ideally all people recruited into an RCT will be able to continue their participation until the end of the trial, but there will nearly always be some individuals who drop out of the trial prior to the planned completion date. Such patients are said to have been ‘lost to follow-up’ (LTFU). Others may have to switch treatments halfway through the trial and others may discontinue treatment totally. The way these so-called protocol deviations are dealt with in the analysis is highly influential.

One school of thought takes the pragmatic view that in practice there will always be people who do not return for care or who have to switch their prescribed treatments. An intent-to-treat (ITT) analysis includes people who drop out or switch, as they are retained in the treatment arms to which they were randomised, regardless of their actual treatment usage.

This approach is preferred for several statistical reasons, in particular because it provides a better estimate of the treatment’s effect in a real-life setting in which some people are bound not to take the treatment as allocated.

Opponents of this approach argue that by incorporating treatment switches in this way, the measured effect of the treatment is less than its true effect because of the inclusion of people who did not take it properly. They argue that the treatment’s true potential value can only be estimated in those people who actually took the treatment as prescribed. Such an on-treatment (OT) analysis only includes individuals who continue to take the treatment that is randomly allocated to them, exactly as prescribed.

An OT analysis may certainly appear to provide a better idea of how well the treatment works under ideal conditions. But people who switch treatments are often the very ones for whom the treatment is not so suitable, because of tolerability/toxicity problems, because they find it difficult to adhere to, or simply because they don’t feel it’s working. So OT analyses generally give overly optimistic estimates of the true treatment effect.
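
In code, the difference between the two analyses comes down to who appears in the denominator. A sketch with invented data:

```python
import pandas as pd

# Invented data: eight participants, the arm they were randomised to,
# whether they stayed on that treatment, and the week-48 outcome
# (people who dropped out are counted as failures).
df = pd.DataFrame({
    "arm":          ["A", "A", "A", "A", "B", "B", "B", "B"],
    "on_treatment": [True, True, True, False, True, True, False, False],
    "suppressed":   [1, 1, 0, 0, 1, 0, 0, 0],
})

# Intent-to-treat: everyone is analysed in the arm they were randomised to.
itt = df.groupby("arm")["suppressed"].mean()

# On-treatment: only those who took the allocated treatment as prescribed.
ot = df[df["on_treatment"]].groupby("arm")["suppressed"].mean()

print(itt)   # ITT response rates per arm
print(ot)    # OT rates, which tend to look more optimistic
```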

For this reason, this approach is not recommended for superiority trials – those designed to show that one drug/regimen is better than another. Non-inferiority trials, which are designed to show that a new treatment is the same as, or not substantially worse than, standard care, require a different analytical approach; for these trials, OT analyses may also be recommended. Clinicians do want to know how well the treatment is likely to perform in the real world, but may also want to know how well they could get it to perform if factors preventing it being taken properly could be modified.

Summary

RCTs provide the most robust form of evidence when judging whether a new treatment is likely to be more effective than existing treatments. As such, they are often heavily cited and - rightly - form the basis of many treatment guidelines. However, it is important that those reviewing trial results should be aware of the specific issues that may arise with RCTs, and take these into account. Forewarned is forearmed: next time you read a report on a treatment trial, look carefully at the results.