
A CHecklist for statistical Assessment of Medical Papers (the CHAMP statement): explanation and elaboration
  1. Mohammad Ali Mansournia1,2,
  2. Gary S Collins3,4,
  3. Rasmus Oestergaard Nielsen5,6,
  4. Maryam Nazemipour7,
  5. Nicholas P Jewell8,9,
  6. Douglas G Altman3,
  7. Michael J Campbell10
  1. Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
  2. Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, Tehran, Iran
  3. Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK
  4. National Institute for Health Research Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford, UK
  5. Department of Public Health, Section for Sports Science, Aarhus University, Aarhus, Denmark
  6. Research Unit for General Practice, Aarhus, Denmark
  7. Psychosocial Health Research Institute, Iran University of Medical Sciences, Tehran, Iran
  8. Department of Medical Statistics, London School of Hygiene & Tropical Medicine, London, UK
  9. Division of Epidemiology & Biostatistics, School of Public Health, University of California, Berkeley, California, USA
  10. ScHARR, University of Sheffield, Sheffield, UK
  Correspondence to Professor Mohammad Ali Mansournia, Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, 14155-6446 Tehran, Iran; mansournia_ma@yahoo.com; Dr Maryam Nazemipour, Psychosocial Health Research Institute, Iran University of Medical Sciences, 14665-354 Tehran, Iran; nazemipour.m@iums.ac.ir

Abstract

Misuse of statistics in medical and sports science research is common and may lead to detrimental consequences for healthcare. Many authors, editors and peer reviewers of medical papers will not have expert knowledge of statistics or may be unconvinced about the importance of applying correct statistics in medical research. Although there are guidelines on reporting statistics in medical papers, a checklist on the more general and commonly seen aspects of statistics to assess when peer-reviewing an article is needed. In this article, we propose a CHecklist for statistical Assessment of Medical Papers (CHAMP) comprising 30 items related to the design and conduct, data analysis, reporting and presentation, and interpretation of a research paper. While CHAMP is primarily aimed at editors and peer reviewers during the statistical assessment of a medical paper, we believe it will serve as a useful reference to improve authors’ and readers’ practice in their use of statistics in medical research. We strongly encourage editors and peer reviewers to consult CHAMP when assessing manuscripts for potential publication. Authors may also apply CHAMP to ensure the validity of their statistical approach and reporting of medical research, and readers may consider using CHAMP to enhance their statistical assessment of a paper.

  • statistics
  • methodology


The misuse of statistics through flawed methodology in medical and sports science research can lead to unreliable or even incorrect conclusions, with undesirable consequences for public health, patient management and athlete performance.1 Unfortunately, errors in study design, statistical analysis, reporting and interpretation of results are common in medical journals2 3 and raise questions regarding the quality of medical papers.4

Sound methodology has been prioritised in the past decades, especially in high-impact factor journals. This is illustrated by the inclusion of more statistical editors and other methodologists (eg, epidemiologists) in the review process. In addition, stakeholders in research have been encouraged to intensify their investments in statistical, epidemiological and methodological education, such as training authors and reviewers, providing online resources, developing (and extending) guidelines and including methods content in regular scientific meetings.5 There has also been a stronger emphasis on adherence to reporting guidelines (eg, CONsolidated Standards Of Reporting Trials, STrengthening the Reporting of OBservational studies in Epidemiology, STAndards for the Reporting of Diagnostic accuracy studies, REporting recommendations for tumour MARKer prognostic studies, and Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis).6–10

Still, many medical and sports science journals do not involve statistical experts in the review process. This is unfortunate because basic statistical errors are more likely when authors, editors and referees do not have sufficient knowledge of statistics or, worse, are unconvinced about the importance of correct statistics in medical research. Rarely do clinical journals systematically assess the use of statistics in submitted papers.11 12 Thus, even after a paper is published in a scientific journal, it is necessary to read the content with some caution and pay careful attention to whether the statistical design and analysis were appropriate and the conclusions justified. Studies published in high-ranked journals are not immune to methodological or statistical flaws that went unidentified during peer review. Although some journals attempt to mitigate such issues by using statisticians in the review process (as statistical reviewers or statistical editors), guidelines to assess methodological or statistical content in scientific papers would be useful when expert statistical reviewers are unavailable.5 13 14

While guidelines on how to report statistics in medical papers exist,15 16 we propose a general checklist for judging the statistical aspects of a manuscript during peer review. While it is impossible to cover everything, we believe it is useful to have a basic checklist for assessing the statistical methods used more broadly within medical and sports science research papers. Based on an extensive revision of a previous checklist,17 we describe the CHecklist for statistical Assessment of Medical Papers (CHAMP; table 1), comprising 30 items covering the design, analysis, reporting and interpretation stages, to aid the peer review of a submitted paper.18

Table 1

CHecklist for statistical Assessment of Medical Papers

Development and explanation of the 30-item checklist

The 30 items in the checklist were selected based on a previous BMJ checklist,17 an extensive literature review and the authors’ collective experience in reviewing the statistical content of numerous papers submitted to a variety of medical journals. The first author produced a checklist draft, the coauthors suggested the addition or removal of items, and all authors approved the final version. Other colleagues provided extensive comments on the paper and are listed in the Acknowledgements. Our checklist is not intended to, nor can it, cover all aspects of medical statistics. Our focus is rather on key issues that commonly arise in clinical research studies. Therefore, only common statistical issues encountered during the review of research manuscripts were included in CHAMP. Using our checklist requires some basic knowledge of statistics; however, we provide a brief explanation for each item and cite the relevant references for further details. The first six items relate to the design and conduct of research, items 7–16 address data analysis, items 17–23 concern reporting and presentation, and items 24–30 pertain to interpretation.

Items 1–6: design and conduct

Item 1: clear description of the goal of research, study objective(s), study design and study population

The research goal, study objectives, study design, and study and target populations must be clearly described so that the editors of journals and readers can judge the internal and external validity (generalisability) of the study.

Being explicit about the goal of research is a prerequisite for good science regardless of the scientific discipline. For such clarification, a threefold classification of the research goal may be used: (1) to describe; (2) to predict, which is equivalent to identifying ‘who’ is at greater risk of experiencing the outcome; or (3) to draw a causal inference, which attempts to explain ‘why’ the outcome occurs (eg, investigating causal effects).5 19

The study objective refers to the rationale behind the study and points to the specific scientific question being addressed. For example, the objective of the heated water-based exercise (HEx) trial, a randomised controlled trial (RCT), was to evaluate the effect of heated water-based exercise training on 24-hour ambulatory blood pressure (BP) levels among patients with resistant hypertension.20 The study objective is usually provided in the introduction after the rationale has been established.

The study design refers to the type of the study, which is explained in the Methods section.21 Examples of common study designs include RCTs and observational studies such as cohort, case–control or cross-sectional studies.22 The study design should be described in detail. In particular, the randomisation procedure in RCTs, follow-up time for cohort studies, control selection for case–control designs and sampling procedure for cross-sectional studies should be adequately explained.6 7 As a general principle, the study design must be explained sufficiently so that another investigator would be able to repeat the study exactly.

The study population refers to the source population from which data are collected, whereas the target population refers to the population to whom we intend to generalise the study results; the relationship between these two populations may be characterised using inclusion and exclusion criteria and is crucial for assessing generalisability. Returning to the HEx trial, the study population was restricted to persons aged between 40 and 65 years with resistant hypertension for more than 5 years.20 For both trials and observational studies, it is important to know what proportion of the source population is studied and what proportion of the intended data set is used in the analysis data set. For example, the source population may include all patients admitted to a hospital with a certain condition over a certain period of time. However, the analysis data set may comprise only 50% of this population, for various reasons such as patients refusing consent, measurements not being taken and patients dropping out. In the HEx trial, for instance, the investigators screened 125 patients with hypertension to find 32 with resistant hypertension who met the inclusion criteria. This has some bearing on the generalisability of the study: to whom heated water-based exercise training can be given, and how likely it is to be relevant to practitioners.

Item 2: clear description of outcomes, exposures/treatments and covariates, and their measurement methods

All variables considered for statistical analysis should be stated clearly in the paper, including outcomes, exposures/treatments, predictors and potential confounders/mediators/effect-measure modifiers (see Box 1). The measurement method and timing of measurement for each of these variables should also be specified. If the goal of the research is to draw a causal inference (explain ‘why’ the outcome occurs) via observational studies, authors should present their causal assumptions in a causal diagram.23–25 To exemplify this concept, in a cohort study evaluating the effect of physical activity on functional performance and knee pain in patients with osteoarthritis,26 physical activity (exposure) was measured using the Physical Activity Scale for the Elderly, and functional performance and self-reported knee pain (outcomes) were measured by the Timed 20-Metre Walk Test and the Western Ontario and McMaster Universities Osteoarthritis Index, respectively. Depressive symptoms were considered a potential confounder and measured using the Center for Epidemiologic Studies Depression Scale. All variables were measured at baseline and in three annual follow-up visits, and a causal method along with a causal diagram representing the study population was used to estimate the effect of interest.26

Box 1

Glossary

Association: Statistical dependence, referring to any relationship between two variables.

Association measure: A measure of association between two variables, either in absolute or in relative terms. Absolute association measures are differences in occurrence measures, e.g., risk difference and rate difference. Relative association measures are ratios of occurrence measures, e.g., risk ratio, rate ratio, and odds ratio.

Causal diagram (Causal directed acyclic graph (DAG)): A diagram which includes nodes linked by directed arrows and has two properties: (i) the absence of an arrow between two variables implies the absence of a direct causal effect, and (ii) all shared causes of any pair of variables are included in the graph.

Collider: A variable that is a common effect of two other variables.

Effect-measure modifier: A variable that modifies the effect of the exposure on the outcome.

Confounder: A variable that is on the common cause path (confounding path) of the exposure and outcome.

Confounding: A bias created by a common cause of the exposure and outcome.

Correlation: Any monotonic (either entirely non-increasing or entirely non-decreasing) association.

Data dredging (Data fishing): The misuse of data analysis to find patterns which can be presented as statistically significant.

Design effect (in survey): The ratio of the variance of an estimator from a sampling scheme to the variance of the estimator from simple random sampling with the same sample size.

Effect (Causal effect): In the potential outcome (counterfactual) framework of causation, we say that A has a (causal) effect on B in a population of units if there is at least one unit for which changing A will change B.

Linearity assumption: An assumption, imposed by the inclusion of quantitative predictors in regression models, that the outcome (on the scale of the model) changes linearly with the predictor; it should be assessed rather than taken for granted.

Mediator: A variable that is affected by the exposure and also affects the outcome.

Null hypothesis: The hypothesis assumed true in hypothesis testing; it often corresponds to no association between two variables in the population.

Occurrence measure: A measure of disease frequency such as risk (incidence proportion), incidence rate, and prevalence.

Sparse-data bias: A bias arising from sparse data, leading to inflation of effect size estimates.

Item 3: validity of the study design

The design should be valid and match the research question without introducing bias into the study results. For example, an editor should be able to assess whether the controls in a case–control study were adequately representative of the source population of the cases. Alternatively, in a clinical trial, it should be clear whether there were one or more control groups; if so, whether patients were randomised to treatment or control; and if so, whether the randomisation method and allocation concealment were appropriate.

Item 4: clear statement and justification of sample size

The manuscript should have a section clearly justifying the sample size.27 When a sample size calculation is warranted, it should be described in enough detail to allow replication, along with a clear rationale (supported by references) for the values used in the calculation, the outcome on which the calculation is based and the minimum clinically important effect size.28 29 For example, typical sample size calculations aim to ensure sufficiently precise estimates of occurrence measures (eg, risk) or association measures (eg, risk ratio),30 31 or adequate power to detect genuine effects (eg, true differences) if they exist (statistical tests). Attrition/loss to follow-up/non-response and design effects (eg, due to clustering) should be taken into consideration. Guidance on sample size calculation for prediction model development and validation has been described previously.32–34
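To make this concrete, here is a minimal Python sketch (using statsmodels) of the kind of calculation a reviewer might try to replicate; the effect size, power and attrition figures are illustrative assumptions, not values from any study discussed here.

```python
# A hedged sketch: sample size per arm for a two-sample t-test,
# inflated for an assumed attrition rate. All inputs are illustrative.
from math import ceil

from statsmodels.stats.power import TTestIndPower

effect_size = 0.5   # assumed standardised difference (Cohen's d)
attrition = 0.15    # assumed proportion lost to follow-up

n_per_arm = TTestIndPower().solve_power(effect_size=effect_size,
                                        alpha=0.05, power=0.80,
                                        alternative='two-sided')
n_enrol = ceil(ceil(n_per_arm) / (1 - attrition))  # enrol extra to offset attrition
print(f"analysable per arm: {ceil(n_per_arm)}; enrol per arm: {n_enrol}")
```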

Item 5: clear declaration of design violations and acceptability of the design violations

Design violations frequently occur in research. Non-response in surveys, censoring (loss to follow-up or competing risks) in prospective studies35 and non-compliance with the study treatments in RCTs are examples, and they should be declared explicitly in the paper.36 37 Even when the design itself is valid, the acceptability of such violations should be assessed. For example, was an observed non-response/censoring proportion too high? What were the reasons for data loss, and is this level acceptable for achieving the scientific goals of the study?

Item 6: consistency between the paper and its previously published protocol

The reviewer should identify inconsistencies with any published protocol (and where relevant, registry information) regarding important features of the study, including sample size, primary/secondary/exploratory outcomes and statistical methods.

Items 7–16: data analysis

Item 7: correct and complete description of statistical methods

A separate part in the Methods section of the manuscript should be devoted to the description of the statistical procedures. Both descriptive and analytical statistical methods should be sufficiently described so that the methods can be assessed by a statistical reviewer to judge their suitability and completeness in addressing the study objectives.

Item 8: valid statistical methods used and assumptions outlined

The validity of statistical analyses relies on assumptions. For example, the independent t-test for the comparison of two means requires three assumptions: independence of the observations, normality and homogeneity of variance.38 As another example, all expected values for a χ2 test must be more than 1, and at most 20% of the expected values can be less than 5. These statistical assumptions should be judged in context or assessed using appropriate methods such as a normal probability plot for checking the normality assumption.39 An alternative statistical test should be applied if some assumptions are clearly violated. It should be noted that some statistical tests are robust against mild-to-moderate violations of some assumptions: for the t-test, lack of normality or of homogeneity of variance does not necessarily invalidate the results, whereas lack of independence of the observations will make them invalid.40 It has been demonstrated that the independent t-test can be valid, though suboptimal, for ordinal scaled data (eg, a variable with values 0, 1, 2, 3) even with a sample size of 20.41
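The following minimal Python sketch, on simulated data, shows how such checks might be run in practice; the groups and distributions are invented purely for illustration.

```python
# A minimal sketch of checking t-test assumptions before relying on the result;
# the data are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 40)
group_b = rng.lognormal(2.3, 0.4, 40)  # deliberately skewed

# Shapiro-Wilk tests help judge normality; in practice a normal probability
# (Q-Q) plot, e.g. stats.probplot with plot=ax, should be inspected as well.
print(stats.shapiro(group_a), stats.shapiro(group_b))

# Welch's t-test relaxes the equal-variance assumption; if normality is
# doubtful, report a rank-based alternative alongside it.
print(stats.ttest_ind(group_a, group_b, equal_var=False))
print(stats.mannwhitneyu(group_a, group_b))
```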

An important but often ignored aspect in practice is that ratio estimates such as the estimated odds ratio (OR), risk ratio and rate ratio are biased away from the null value. This bias is amplified when data are sparse, and is then known as sparse-data bias.42 A warning sign of sparse data is an unrealistically large ratio estimate or confidence limit that is simply an artefact of the sparsity. For example, an OR >10 for a non-communicable disease should be considered a warning sign for sparse-data bias. In the extreme, an empty cell leads to an absurd OR estimate of infinity, known as separation.43 Special statistical methods such as penalisation or Bayesian methods must be applied to decrease the sparse-data bias.43 44 Some other important considerations in statistical analysis are (1) accounting for correlation in the analysis of correlated data (eg, variables with repeated measurements in longitudinal studies,45 cluster randomised trials46 and complex surveys47); (2) accounting for matching in the analysis of matched case–control and cohort data48–50; (3) considering the ordering of several groups in the analysis; (4) considering censoring in the analysis of survival data; (5) adjusting for baseline values of the outcome in the analysis of randomised clinical trials28; (6) correct calculation and interpretation of the population attributable fraction51 52; (7) adjusting for overfitting using shrinkage or penalisation methods when developing a prediction model53 54; and (8) assessment of similarity and consistency assumptions in network meta-analysis.55
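A toy numerical illustration of separation, using invented counts: the empty cell makes the crude OR infinite, and a simple continuity correction (used here only as a crude stand-in for the penalised or Bayesian methods cited above) returns a finite estimate.

```python
# Separation in a 2x2 table: a zero cell yields an infinite crude OR.
# The Haldane-Anscombe 0.5 correction is shown only as a simple stand-in
# for proper penalised or Bayesian remedies.

# rows = exposed/unexposed, columns = cases/non-cases (invented counts)
a, b = 8, 12    # exposed: cases, non-cases
c, d = 0, 20    # unexposed: cases, non-cases -> empty cell

crude_or = (a * d) / (b * c) if b * c else float("inf")
print("crude OR:", crude_or)                       # inf (separation)

a2, b2, c2, d2 = (x + 0.5 for x in (a, b, c, d))   # add 0.5 to every cell
print("corrected OR:", (a2 * d2) / (b2 * c2))      # finite, more plausible
```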

Item 9: appropriate assessment of treatment effect or interaction between treatment and another covariate

Appropriate statistical tests should be used for the assessment of treatment effects and potential interactions. Assessment of overlapping treatment group-specific confidence intervals (CIs) can be misleading.56–58 Thus, the comparison of the CIs of the treatment groups should not be used as an alternative to the statistical test of the treatment effect. Moreover, comparing p values for the treatment effect at each level of a covariate (eg, in men and women) should not be used as an alternative to an interaction test between the treatment and the covariate. For example, on observing a p value <0.05 in men and a p value >0.05 in women, one might incorrectly conclude that gender was an effect modifier.59 Similarly, we cannot conclude that there is no effect modification just because the CIs of the subgroups overlap.60
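A minimal sketch of the correct approach, on simulated data: fit a single model containing a treatment-by-sex product term and read off the interaction test, rather than comparing subgroup p values. All variable names and data are invented.

```python
# Test effect modification with one interaction term, not subgroup p values.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "trt": rng.integers(0, 2, n),    # 1 = treated, 0 = control
    "male": rng.integers(0, 2, n),   # 1 = male, 0 = female
})
df["bp_change"] = -4 * df["trt"] + rng.normal(0, 8, n)  # no true interaction

fit = smf.ols("bp_change ~ trt * male", data=df).fit()
print(fit.pvalues["trt:male"])  # the single test of effect modification
```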

Item 10: correct use of correlation and associational statistical testing

The misuse of correlation and associational statistical testing is not uncommon. As an example, correlation should not be used for assessing the agreement between two methods in method-comparison studies.61 To see why, note that two measurements X and Y are perfectly correlated but in poor agreement if X is always twice Y. Similarly, a large p value from a statistical test of the means, such as the paired t-test, does not imply that two methods agree well. In fact, a high variance of the differences indicates poor agreement but also increases the chance that the paired t-test will yield a large p value, making the methods appear to agree.1
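A toy demonstration of this point: below, Y is always exactly twice X, so the correlation is perfect (r = 1) even though the two 'methods' disagree badly, as a Bland-Altman-style summary of the differences shows. The numbers are invented.

```python
# Perfect correlation does not mean agreement between two measurement methods.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # method 1 readings (invented)
y = 2 * x                                # method 2 always reads twice as high

print(np.corrcoef(x, y)[0, 1])           # 1.0: perfect correlation
diff = y - x
print(diff.mean(), diff.std(ddof=1))     # large mean difference: poor agreement
```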

Item 11: appropriate handling of continuous predictors

Reviewers should be wary of studies that have dichotomised or categorised continuous variables; this should generally be avoided.62 Dichotomising or categorising a continuous variable and using it as a categorical variable in a model can result in bias, inefficiency and residual confounding. Continuous variables should be retained as continuous and their functional form examined, as a linearity assumption may not be correct. Approaches for handling continuous predictors include fractional polynomials and regression splines.62–65
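As a hedged sketch of one such approach, the snippet below fits a B-spline model for a continuous predictor via the statsmodels formula interface and compares its fit with a model that assumes linearity; the variable names and simulated data are illustrative only.

```python
# Keep a continuous predictor continuous: model BMI with spline terms
# instead of cutting it into categories. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"bmi": rng.uniform(18, 40, 300)})
df["sbp"] = 100 + 0.05 * (df["bmi"] - 27) ** 2 + rng.normal(0, 5, 300)

spline_fit = smf.ols("sbp ~ bs(bmi, df=4)", data=df).fit()  # flexible form
linear_fit = smf.ols("sbp ~ bmi", data=df).fit()            # linearity assumed
print(spline_fit.aic, linear_fit.aic)  # compare the two functional forms
```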

Item 12: CIs do not include impossible values

A valid CI should exclude impossible values. For instance, a simple Wald CI for a proportion, p̂ ± 1.96√(p̂(1 − p̂)/n), is not valid when p is close to 0 or 1, and may yield negative values outside the possible range for a proportion (0 ≤ p ≤ 1).66 To remedy such conditions, the Wilson score or Agresti-Coull interval can be applied.6
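This is easy to check numerically; for example, with 1 event in 30 (invented counts), the Wald ('normal') interval from statsmodels dips below zero, while the Wilson and Agresti-Coull intervals stay within [0, 1].

```python
# Comparing a Wald CI with Wilson and Agresti-Coull for a small proportion.
from statsmodels.stats.proportion import proportion_confint

count, nobs = 1, 30                                   # illustrative counts
for method in ("normal", "wilson", "agresti_coull"):  # 'normal' = Wald
    print(method, proportion_confint(count, nobs, alpha=0.05, method=method))
```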

Item 13: appropriate comparison of baseline characteristics between the study arms in randomised trials

In a randomised clinical trial, any difference in baseline characteristics between groups should be due to chance (or unreported bias). Reviewers should look out for statistical testing of baseline differences, as reporting such p values does not make sense.67 The decision on which baseline characteristics (prognostic factors) to include in any adjustment should be prespecified in the protocol and based on subject-matter knowledge, not on p values. Differences between groups in baseline characteristics should be described by their size and discussed in terms of their potential implications for the interpretation of the results.

Item 14: correct assessment and adjustment of confounding

An important goal of health research is drawing a causal inference. Here, the interest is in the causal effect of an exposure on the outcome. The major source of bias threatening causality studies, including observational studies as well as randomised studies (with small-to-moderate sample size), is confounding.68–71 Confounding can be controlled in the design phase (eg, through restriction or matching) or analysis phase (eg, using regression models, standardisation or propensity score methods).72–74 Selection of confounders should be based on a priori causal knowledge, often represented in causal diagrams,23 75–77 not p values (eg, using stepwise approaches). Automated statistical procedures, such as stepwise regression, do not discriminate between confounders and other covariates like mediators or colliders which should not be adjusted for in the analysis. Moreover, stepwise regression is only based on the association between confounders and outcome, and disregards the association between the confounders and exposure. Thus, stepwise procedures should not be used for confounder selection. In practice, many confounders (and exposures and outcomes)78 79 are time-varying, and the so-called ‘causal methods’ should be applied for the appropriate adjustment of time-varying confounders.80 81 Similarly, in studies evaluating the prognostic effect of a new variable, adjustment for existing prognostic factors should be routinely performed, and variable selection of the existing factors is not generally needed.53

Item 15: avoiding model extrapolation not supported by data

The goal of interest in many health studies is predicting an outcome from one or more explanatory variables using a regression model. The model is valid only within the range of observed data on the explanatory variables, and we cannot make prediction for people outside the range. This is known as model extrapolation.82 Suppose we have found a linear relation between body mass index (BMI) and BP based on the following equation in a cohort study:

BP = A + B × BMI

Now the intercept, A, cannot be interpreted because it corresponds to the expected BP value of a person with BMI of zero! The remedy is centring BMI and including the centred variable (BMI−average BMI) in the model so that the new intercept refers to the expected BP value of a person with the average BMI in the population.

As another example, suppose the following linear relation holds in an RCT:

BP = A + B × TRT + C × BMI + D × (TRT × BMI)

where TRT denotes treatment (1: intervention, 0: placebo) and TRT × BMI is the product term (interaction term) between treatment and BMI. In this model, the parameter B cannot be interpreted on its own because it is the mean difference in BP between the two treatment groups for a person with a BMI of zero. Again, the solution is centring BMI and including centred BMI and the product term between TRT and centred BMI in the model, so that B′ (the coefficient of TRT in the new model) refers to the mean difference in BP for a person with the average BMI in the population.
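A minimal sketch of this centring fix on simulated data; the variable names and coefficients are invented for illustration.

```python
# After centring BMI, the treatment coefficient is the mean BP difference
# at the average BMI rather than at the meaningless BMI of zero.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"trt": rng.integers(0, 2, n),
                   "bmi": rng.normal(27, 4, n)})
df["bp"] = 130 - 5 * df["trt"] + 0.8 * df["bmi"] + rng.normal(0, 6, n)
df["bmi_c"] = df["bmi"] - df["bmi"].mean()   # centred BMI

fit = smf.ols("bp ~ trt * bmi_c", data=df).fit()
print(fit.params["trt"])  # treatment effect at the average BMI
```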

Item 16: adequate handling of missing data

The methods used for handling missing data should be described and justified in relation to stated assumptions about the missing data (missing completely at random, missing at random and missing not at random), and sensitivity analyses should be done where appropriate. Missing data83 can introduce selection bias and should be handled using appropriate methods such as multiple imputation84 and inverse probability weighting.85 Naïve methods such as complete-case analysis, single imputation using the mean of the observed data, last observation carried forward and the missing indicator method are statistically invalid in general and can lead to serious bias.86
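As a hedged sketch of one valid approach named above, the snippet below runs multiple imputation by chained equations with statsmodels' MICE on simulated data; the variables and missingness mechanism are invented for illustration.

```python
# Multiple imputation by chained equations (MICE) on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation.mice import MICE, MICEData

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
df["y"] = 2 * df["x"] + rng.normal(0, 1, n)
df.loc[rng.random(n) < 0.2, "x"] = np.nan   # 20% of x made missing at random

imp = MICEData(df)                          # chained-equations imputer
mice = MICE("y ~ x", sm.OLS, imp)
result = mice.fit(n_burnin=10, n_imputations=20)  # pool over 20 imputed sets
print(result.summary())
```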

Items 17–23: reporting and presentation

Item 17: adequate and correct description of the data

The mean and standard deviation (SD) provide a satisfactory summary of continuous variables that have a reasonably symmetric distribution. The standard error (SE) is not a sound substitute for the SD.87 A useful memory aid is to use the SD to Describe data and the SE to Estimate parameters.88 Moreover, ‘mean±SD’ is not a suitable presentation because it implies the range within which about 68% of the data lie, which is not the concept of interest; ‘mean (SD)’ should be reported instead.1 For highly skewed quantitative data, the median and interquartile range (IQR) are more informative summary statistics. It should be noted that a mean/SD ratio of <2 for a positive variable is a sign of skewness.89 Categorical data should be summarised as numbers and percentages.90 For cohort data, a summary of follow-up time, such as the median and IQR, should be reported.
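A small, hypothetical helper illustrating this advice, including the mean/SD < 2 warning sign for positive variables; the threshold rule is the heuristic described above, not a formal test.

```python
# Report mean (SD) for roughly symmetric data and median (IQR) when a
# positive variable looks skewed (mean/SD < 2 heuristic from the text).
import numpy as np

def describe(x):
    x = np.asarray(x, dtype=float)
    mean, sd = x.mean(), x.std(ddof=1)
    if mean / sd < 2:  # heuristic skewness flag for positive variables
        q1, med, q3 = np.percentile(x, [25, 50, 75])
        return f"median {med:.1f} (IQR {q1:.1f}-{q3:.1f})"
    return f"mean {mean:.1f} (SD {sd:.1f})"

print(describe(np.random.default_rng(5).lognormal(0, 1, 100)))  # skewed case
```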

Item 18: descriptive results provided as occurrence measures with CIs and analytical results provided as association measures and CIs along with p values

The point estimates of occurrence measures, for instance prevalence, risk and incidence rate, with 95% CIs should be reported for descriptive objectives.90 Correspondingly, the point estimates of association measures, for instance OR, risk ratio and rate ratio, with 95% CIs along with p values, should be reported for analytical objectives as part of the Results section.91 92

Item 19: CIs provided for the contrast between groups rather than for each group

For analytical studies like RCTs, the 95% CIs should be given for the contrast between groups rather than for each group.6 For the BP example mentioned above,20 the authors reported the mean BP with a 95% CI in each group, but they should have given the mean difference in 24-hour ambulatory BP levels between groups with a 95% CI, as the aim of the trial was to compare treatment with control.
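A minimal sketch of reporting the CI for the contrast, using simulated stand-in data rather than the actual trial values.

```python
# Report the CI for the between-group difference, not a CI within each group.
import numpy as np
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

rng = np.random.default_rng(6)
treated = rng.normal(128, 10, 16)   # simulated 24-hour BP, intervention arm
control = rng.normal(136, 10, 16)   # simulated 24-hour BP, control arm

cm = CompareMeans(DescrStatsW(treated), DescrStatsW(control))
low, high = cm.tconfint_diff(usevar="unequal")   # Welch-style 95% CI
print(f"mean difference 95% CI: ({low:.1f}, {high:.1f})")
```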

Item 20: avoiding selective reporting of analyses and p-hacking

All statistical analyses performed should be reported regardless of the results. P-hacking, playing with data to produce the desired p value (upwards as well as downwards), must be avoided.93–95 This is probably difficult to assess as a reader/reviewer, but usually one would be clued in if there are many more analyses than those stated in the objectives or only statistically significant comparisons are presented when a larger pool of variables were identified in the methods.

Item 21: appropriate and consistent numerical precisions for effect sizes, test statistics and p values, and reporting the p values rather than their range

P values should be reported directly, with one or two significant figures, even if they are greater than 0.05, for example, p value=0.09 or p value=0.28. One should not focus on ‘statistical significance’ or dichotomise p values (eg, p<0.05)96–98 or express them as ‘0.000’ or ‘NS’. Nonetheless, spurious precision (too many decimal places) in numerical presentation should be avoided.99 100 For example, p values less than 0.001 can typically be written as <0.001 without harm, and it does not make sense to present percentages with more than one decimal place when the sample size is much less than 100.

Item 22: providing sufficient numerical results that could be included in a subsequent meta-analysis

Meta-analyses of randomised trials and observational studies provide high levels of evidence in health research. Providing the numerical results of individual studies that could contribute to a subsequent meta-analysis is therefore of special importance. Follow-up score and change score from baseline are two possible approaches for estimating the treatment effect in RCTs.101 While a follow-up score meta-analysis requires the after-intervention mean and SD in the intervention and control groups, the mean and SD of the differences from baseline are prerequisites for a change-score meta-analysis. However, authors often report only the mean and SD before and after the intervention. The mean of the differences in each group can be calculated as the difference of the means, but calculating the SD of the differences requires a guessed group-specific correlation between baseline and follow-up scores in addition to the before-intervention and after-intervention SDs.
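The imputation just described follows the standard identity for the SD of a difference of correlated measurements; a small utility might look like this, where the correlation r is an explicit assumption (guess) the meta-analyst must supply.

```python
# SD of within-group change scores from before/after SDs and an assumed
# baseline-follow-up correlation r: SD_d = sqrt(SD1^2 + SD2^2 - 2*r*SD1*SD2).
from math import sqrt

def sd_change(sd_pre, sd_post, r):
    """SD of (post - pre) given the two SDs and their correlation r."""
    return sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)

print(sd_change(sd_pre=12.0, sd_post=11.0, r=0.7))  # r = 0.7 is a guessed value
```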

Item 23: acceptable presentation of the figures and tables

Tables and figures are effective means of data presentation and should be properly constructed.102–105 Figures should be selected based on the type of variable(s) and appropriately scaled. An error bar graph, for example, can be used to display means and CIs. It is inappropriate to give a bar chart with an SE bar superimposed instead (the so-called ‘dynamite plunger plot’105). Tables should be able to stand on their own and include sufficient detail such as labels, units and values.

Items 24–30: interpretation

Item 24: interpreting the results based on association measures and 95% CIs along with p values and correctly interpreting large p values as indecisive results, not evidence of absence of an effect

The study results should be interpreted in light of the point estimate of an appropriate association measure, such as the mean difference, and its 95% CI, as well as precise p values. When testing a null hypothesis of no treatment effect, the p value is the probability that the statistical association would be as extreme as, or more extreme than, that observed, assuming that the null hypothesis and all assumptions used for the test are correct. P values for non-null effect sizes can also be computed. The point estimate is the effect size most compatible with the data in that it has a p value of 1.00, while the 95% CI shows the range of effect sizes reasonably compatible with the data in the sense of having a p value >0.05.97 We should judge the clinical importance and statistical reliability of the results by examining the 95% CI as well as the precise p value, not just whether the p value crosses a threshold.28 106 It is incorrect to interpret a p value >0.05 as showing no treatment effect; rather, it represents an ambiguous result.107 108 It is not evidence that the effect is unimportant (‘absence of evidence is not evidence of absence’); inferring unimportance requires that every effect size inside the CI be considered unimportant.97

Item 25: using CIs rather than post hoc power analysis for interpreting the results of studies

Conceptually, it is not valid to interpret power as if it pertains to the observed study results.109–111 Rather, power should be treated as part of the study rationale and design before actual conduct begins, for example, as in sample size calculations. Power does not correctly account for the observations that follow; for example, a study could have high power and observe a high p value, yet still favour the alternative hypothesis over the null hypothesis.111 The precision of results should be gauged using CIs.

Item 26: correctly interpreting occurrence or association measures

It is crucial to interpret occurrence and association measures correctly. ORs provide a common example of misinterpretation: if the event is rare, they can approximate risk ratios, but they are not conceptually the same and will differ considerably if the event is common.112 113 In a study with a risk of 60% in the exposed group and 40% in the unexposed group, the error in interpreting the OR (2.25) as a risk ratio (1.5) is considerable. Prevalence in cross-sectional studies is another example; it has sometimes been incorrectly called ‘risk’.
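The worked arithmetic behind this example:

```python
# Odds ratio vs risk ratio when the outcome is common (risks of 60% and 40%).
risk_exposed, risk_unexposed = 0.60, 0.40

def odds(p):
    return p / (1 - p)

rr = risk_exposed / risk_unexposed               # 1.50
or_ = odds(risk_exposed) / odds(risk_unexposed)  # (0.6/0.4)/(0.4/0.6) = 2.25
print(f"risk ratio = {rr:.2f}, odds ratio = {or_:.2f}")
```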

Item 27: distinguishing causation from association and correlation

We should be cautious about the correct use of technical terms such as effect, association and correlation. Association, meaning statistical dependence (lack of independence), does not imply causation (ie, an effect). Causal effect estimation requires measurement of the exposure before the outcome (temporality) as well as adjustment for confounding. Correlation refers to a monotonic association between two variables; therefore, absence of correlation does not imply absence of association.

Item 28: results of prespecified analyses are distinguished from the results of exploratory analyses in the interpretation

The results obtained from prespecified (a priori) analyses that were designed and documented in a protocol in advance are much more reliable than results obtained by data dredging (data-derived or post hoc analyses).

Item 29: appropriate discussion of the study methodological limitations

The methodological limitations of the study design and analysis should be discussed. Ideally, probabilistic bias analysis, in which a probability distribution is assumed for the bias parameters and bias is accounted for probabilistically using Monte Carlo sensitivity analysis or Bayesian analysis, should be performed to adjust for uncontrolled confounding (eg, due to an unmeasured confounder), selection bias (eg, through missing outcome data) and measurement bias (eg, due to measurement error in the exposure).114–116 At a minimum, the authors should qualitatively discuss the main sources of bias and their impact on the study results.117 118

Item 30: drawing only conclusions supported by the statistical analysis and no generalisation of the results to subjects outside the target population

The study must be interpreted not only in terms of its results but also in light of the study population and any limitations of the design and analysis.82 For example, if the study has been done in women, it cannot necessarily be generalised to a population of men and women.

Conclusion

The important role of sound statistics and methodology in medical research cannot be overstated. We strongly encourage authors to adhere to CHAMP when carrying out and reporting medical research, and editors and reviewers to use it when evaluating manuscripts for potential publication. We have covered only some basic items, and each type of study or statistical model (eg, randomised trial, prediction model) has its own issues that ideally require statistical expertise. We appreciate that for some items in the checklist there is no unequivocal answer, and thus assessing the statistics of a paper may involve some subjectivity. Moreover, the items in the checklist are not equally important; for example, papers with serious errors in design are statistically unacceptable regardless of how the data were analysed, and aspects of presentation are clearly less important than other elements of the checklist. It is important to note that statistical review carried out by experienced statisticians is the preferred way of reviewing statistics in research papers, more than any checklist can achieve. We hope CHAMP improves authors’ practice in their use of statistics in medical research and serves as a useful, handy reference for editors and referees during the statistical assessment of medical papers.

Ethics statements

Acknowledgments

We thank Sander Greenland, Stephen Senn and Richard Riley for their valuable comments on an earlier draft of this paper.

References

Footnotes

  • Twitter @RUNSAFE_Rasmus

  • Deceased Douglas G Altman died on June 3, 2018

  • Contributors MAM, MN and DGA produced the first draft. GSC, RON, NPJ and MJC suggested revisions. All authors approved the final version.

  • Funding GSC was supported by the NIHR Biomedical Research Centre, Oxford, and Cancer Research UK (programme grant: C49297/A27294).

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.
