An interesting article in Annals of the Rheumatic Diseases on the misuse of statistical methods (Part 1)

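Many physical therapists and occupational therapists now take part in research, so I expect many of you run statistical analyses yourselves.

The paper is five years old, but this article on the misuse of statistical methods was so interesting that I want to share it on this blog.

Several things I had taken for granted turned out to be wrong, which was a real eye-opener.

It is a must-read for physical therapists and occupational therapists engaged in research.

It is fairly long, so I will cover it in two parts, a first half and a second half.

Let's start with the first half.

Purpose of the paper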

Introduction

From 2006 to 2014, I have carried out approximately 200 statistical reviews of manuscripts for ARD. Some errors and weaknesses occur more often than others. The following is a description of 14 of my comments most frequently given to authors. The first 10 points concern choosing an appropriate analysis method, points 11–12 concern avoiding superfluous analyses and points 13–14 concern reporting formats. Some statistical terms are explained in Appendix. I hope this can help authors to avoid these statistical errors and weaknesses in future manuscripts.

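From 2006 to 2014, the author carried out approximately 200 statistical reviews of manuscripts for ARD.

In those reviews, some errors and weaknesses came up more often than others.

What follows describes the 14 comments the author most frequently gave to authors.

Points 1-10 concern choosing an appropriate analysis method, points 11-12 concern avoiding superfluous analyses, and points 13-14 concern reporting formats.

The author hopes this will help others avoid these statistical errors and weaknesses in future manuscripts.

1．Handling of missing data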

1．Report how missing data were handled

Report the amount of missing data in the different variables, and how this was handled in the analysis.1 Commonly used methods are, from the less to the more complex ones, complete case analysis (disregarding cases with partially missing data), single imputation methods like expectation-maximation imputation, multiple imputation and full information maximum likelihood. Further, in longitudinal studies, mixed models analysis may be appropriate, while ‘last observation carried forward’ is not unbiased under any sensible assumptions, and should not be used.

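Report how much data is missing in each variable and how this was handled in the analysis.

Commonly used methods for missing data range from the less complex to the more complex.

They are complete case analysis (disregarding cases with partially missing data), single imputation methods such as expectation-maximisation imputation, multiple imputation, and full information maximum likelihood.

In longitudinal studies, mixed models analysis may be appropriate, whereas "last observation carried forward", which simply substitutes the last value observed before dropout, is not unbiased under any sensible assumptions and should not be used.

Handling missing data is hard, isn't it?

Does this mean that complete case analysis is the reasonable choice after all, accepting the smaller sample size and the possible bias?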
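As a minimal sketch of the first recommendation (report how much is missing in each variable), the snippet below counts missing values per variable and the number of complete cases. The dataset and the variable names `age`, `pain_vas` and `grip_kg` are hypothetical and are not taken from the article.

```python
# Hypothetical toy dataset: None marks a missing value.
records = [
    {"age": 63, "pain_vas": 42, "grip_kg": 21.5},
    {"age": 71, "pain_vas": None, "grip_kg": 18.0},
    {"age": 58, "pain_vas": 35, "grip_kg": None},
    {"age": None, "pain_vas": 50, "grip_kg": 24.1},
    {"age": 66, "pain_vas": 28, "grip_kg": 19.7},
]


def missing_summary(rows):
    """Return {variable: (n_missing, percent_missing)} for a list of dicts."""
    n = len(rows)
    summary = {}
    for var in rows[0].keys():
        n_missing = sum(1 for row in rows if row[var] is None)
        summary[var] = (n_missing, 100.0 * n_missing / n)
    return summary


def complete_cases(rows):
    """Keep only rows without any missing value (complete case analysis)."""
    return [row for row in rows if all(v is not None for v in row.values())]


summary = missing_summary(records)
kept = complete_cases(records)

for var, (n_missing, pct) in summary.items():
    print(f"{var}: {n_missing} missing ({pct:.0f}%)")
print(f"complete cases: {len(kept)} of {len(records)}")
```

Even with only one missing value per variable, complete case analysis here discards more than half of the rows, which is why the amount of missing data is worth reporting alongside the method used.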

2．Limit the number of covariates in regression analysis

2．Limit the number of covariates in regression analyses

Some authors attempt to include too many covariates compared with the number of cases in a regression model, for example, 17 covariates in a study with 64 cases. Traditional rules of thumb state that the ratio of cases per covariate ought to be in the size of order 10. Some authors recommend 15, some 20, others state that 5 is sufficient. In logistic regression and Cox regression, 10 events per variable is usually sufficient2 and in many situations 5 events per variable is sufficient.3 Note that in logistic regression this is not the total number of observations, but the smallest of the two outcome groups. Similarly, in Cox regression, only the number of events excluding censored observations is counted as cases in this context.

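Some authors include too many covariates relative to the sample size of the regression model, for example 17 covariates in a study with 64 cases.

Traditional rules of thumb say the number of cases should be around 10 times the number of covariates (independent variables).

Some researchers recommend 15 times, some 20 times, and others say 5 times is sufficient.

In logistic regression and Cox regression, 10 events per variable is usually sufficient, and in many situations 5 events per variable is sufficient.

Note that in logistic regression this count is not the total number of observations but the size of the smaller of the two outcome groups.

Likewise, in Cox regression only the number of events, excluding censored observations, counts as cases.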
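The arithmetic behind these rules of thumb can be sketched as follows. The function names are my own, and the 200-patient logistic regression example is hypothetical; only the 64 cases and 17 covariates come from the article.

```python
def events_per_variable(n_events, n_non_events, n_covariates):
    """Events per variable for logistic regression.

    The limiting count is the smaller of the two outcome groups,
    not the total number of observations. For Cox regression, use
    the number of events (excluding censored observations) instead.
    """
    return min(n_events, n_non_events) / n_covariates


def max_covariates(n_events, n_non_events, epv_target=10):
    """Largest number of covariates that still meets the target ratio."""
    return min(n_events, n_non_events) // epv_target


# Example from the article: 17 covariates in a study with 64 cases.
cases_per_covariate = 64 / 17
print(f"cases per covariate: {cases_per_covariate:.2f}")

# Hypothetical logistic regression: 200 patients, 40 of whom had the event.
epv = events_per_variable(n_events=40, n_non_events=160, n_covariates=8)
print(f"events per variable with 8 covariates: {epv:.1f}")
print(f"covariates allowed at 10 events per variable: {max_covariates(40, 160)}")
```

In the hypothetical example, 200 patients might look like plenty for 8 covariates, but the 40 events are what limits the model.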

3．Do not casually use stepwise selection of covariates

3．Do not use stepwise selection of covariates

Automated variable selection procedures like stepwise selection used to be very popular. Today an increasing number of analysts criticise such methods. For example,4 page 419 states: “There are several systematic, mechanical, and traditional algorithms for finding models (such as stepwise and best-subset regression) that lack logical and statistical justification and that perform poorly in theory, simulations and case studies … One serious problem is that the P-values and standard errors … will be downwardly biased, usually to a large degree”.

Selection of covariates should be based on the research question at hand and on substantial knowledge such as what is biologically plausible. Chapter 10 ‘Predictor selection’ in the book5 gives good guidance on this matter.

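Automated variable selection procedures such as stepwise selection used to be very popular.

Today a growing number of statisticians criticise such methods.

Systematic, mechanical and traditional algorithms for finding models (such as stepwise and best-subset regression) lack logical and statistical justification and perform poorly in theory, simulations and case studies.

One serious problem is that the p values and standard errors will be biased downwards, usually to a large degree.

Covariates should instead be selected based on the research question at hand and on subject-matter knowledge, such as what is biologically plausible.

So stepwise selection should not be used casually.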
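To get a feel for why p values after data-driven selection are too small, here is a small simulation. It is not a full stepwise procedure: it only mimics the first forward step by picking, from 20 pure-noise candidate covariates, the one most correlated with a pure-noise outcome. The sample size, the number of candidates and the hard-coded critical t value are assumptions chosen for illustration.

```python
import math
import random

N_CASES = 50  # fixed, because the critical value below assumes df = 48


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def selected_noise_significant_rate(n_candidates=20, n_sims=500, seed=1):
    """Share of simulations in which the best noise covariate has p < 0.05."""
    rng = random.Random(seed)
    # |r| that corresponds to a two-sided p of 0.05 (t approx. 2.011, df = 48).
    t_crit = 2.011
    r_crit = t_crit / math.sqrt(t_crit ** 2 + (N_CASES - 2))
    hits = 0
    for _ in range(n_sims):
        outcome = [rng.gauss(0, 1) for _ in range(N_CASES)]
        best = 0.0
        for _ in range(n_candidates):
            candidate = [rng.gauss(0, 1) for _ in range(N_CASES)]
            best = max(best, abs(pearson_r(candidate, outcome)))
        if best > r_crit:
            hits += 1
    return hits / n_sims


rate = selected_noise_significant_rate()
print(f"selected noise covariate 'significant' in {rate:.0%} of simulations")
```

No covariate is related to the outcome here, so a valid test would reject about 5% of the time. With 20 independent candidates the chance that at least one passes is roughly 1 - 0.95^20, or about 64%, which is the kind of inflation the quoted passage warns about.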

4．Use analysis of covariance to adjust for baseline values in randomised controlled trials

4．Use analysis of covariance to adjust for baseline values in randomised controlled trials

Consider a randomised controlled trial (RCT) comparing two treatments, where the outcome variable is measured before treatment and after treatment. Testing if there is a significant change (difference) from before to after treatment in each treatment arm separately is not an appropriate analysis method. One can compare the mean change between the treatment arms. But an even better approach is regression with outcome after treatment as dependent variable, and baseline value and treatment group as covariates.6 This method is often called analysis of covariance (ANCOVA).

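Consider a randomised controlled trial (RCT) comparing two treatments, where the outcome is measured before and after treatment.

Testing within each treatment arm separately whether there is a significant change (difference) from before to after treatment is not an appropriate analysis method.

One can compare the mean change between the treatment arms, but an even better approach is regression with the post-treatment outcome as the dependent variable and the baseline value and treatment group as covariates.

This method is often called analysis of covariance (ANCOVA).

Lately I feel I see quite a lot of randomised controlled trials that use the baseline value as a covariate.

Comparing change scores does seem problematic once you consider ceiling and floor effects.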
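A minimal sketch of the ANCOVA estimate on simulated trial data is shown below. For a model with one baseline covariate and a two-level group, the least-squares treatment coefficient equals the difference in post-treatment means minus the pooled within-group slope times the difference in baseline means, which is what the function computes. The simulated effect of 5, the sample size and all variable names are assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ancova_effect(baseline, post, group):
    """Treatment coefficient of the model post ~ group + baseline.

    group is coded 0 (control) or 1 (treatment).
    """
    sxy = 0.0
    sxx = 0.0
    arm_means = {}
    for g in (0, 1):
        xs = [b for b, k in zip(baseline, group) if k == g]
        ys = [p for p, k in zip(post, group) if k == g]
        mx, my = mean(xs), mean(ys)
        arm_means[g] = (mx, my)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx  # pooled within-group slope of post on baseline
    diff_post = arm_means[1][1] - arm_means[0][1]
    diff_baseline = arm_means[1][0] - arm_means[0][0]
    return diff_post - slope * diff_baseline


# Simulated RCT: 500 patients per arm, true treatment effect of 5 points.
rng = random.Random(0)
group = [0] * 500 + [1] * 500
baseline = [rng.gauss(50, 10) for _ in group]
post = [10 + 0.7 * b + 5 * g + rng.gauss(0, 2) for b, g in zip(baseline, group)]

ancova = ancova_effect(baseline, post, group)
change = [p - b for p, b in zip(post, baseline)]
change_diff = (mean([c for c, g in zip(change, group) if g == 1])
               - mean([c for c, g in zip(change, group) if g == 0]))

print(f"ANCOVA estimate:         {ancova:.2f}")
print(f"change-score difference: {change_diff:.2f}")
```

In a randomised trial both estimates target the same treatment effect. The ANCOVA estimate is usually the more precise one because it removes the chance imbalance in baseline values.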

5．Do not use analysis of covariance in observational studies

5．Do not use ANCOVA to adjust for baseline values in observational studies

In an observational study, on the other hand, use of ANCOVA cannot be generally recommended7 (page 126). In fact, ANCOVA can produce different conclusions than analysing a score difference (after score minus before score), a phenomenon also known as Lord’s paradox.8 A central issue is that in a randomised trial, the treatment is applied after measuring the baseline score. Hence the treatment cannot have affected the baseline score. In an observational study, the exposure may also have been present before the baseline score was measured. Then, ANCOVA would generally introduce bias. See also ref 9.

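In observational studies, on the other hand, the use of ANCOVA cannot be generally recommended.

In fact, ANCOVA can lead to different conclusions than analysing a score difference (after score minus before score), a phenomenon known as Lord's paradox.

In a randomised trial the treatment is applied after the baseline score has been measured, so the treatment cannot have affected the baseline score.

In an observational study the exposure may already have been present before the baseline score was measured, in which case ANCOVA would generally introduce bias.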
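Lord's paradox can be reproduced with a small constructed dataset. The sketch below builds two groups whose mean score does not change at all between the two measurements but whose baseline means differ (50 versus 60), with a within-group slope of 0.5. All numbers are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def build_group(center):
    """(before, after) pairs with mean `center` at both time points
    and a within-group slope of 0.5 of after on before."""
    pairs = []
    for d in (-2, -1, 0, 1, 2):
        for e in (-1, 1):
            pairs.append((center + d, center + 0.5 * d + e))
    return pairs


def change_score_difference(g0, g1):
    """Difference between groups in mean (after - before)."""
    return (mean([after - before for before, after in g1])
            - mean([after - before for before, after in g0]))


def ancova_difference(g0, g1):
    """Group coefficient of the model after ~ group + before."""
    sxy = 0.0
    sxx = 0.0
    for pairs in (g0, g1):
        mx = mean([before for before, _ in pairs])
        my = mean([after for _, after in pairs])
        sxy += sum((before - mx) * (after - my) for before, after in pairs)
        sxx += sum((before - mx) ** 2 for before, _ in pairs)
    slope = sxy / sxx
    diff_after = mean([a for _, a in g1]) - mean([a for _, a in g0])
    diff_before = mean([b for b, _ in g1]) - mean([b for b, _ in g0])
    return diff_after - slope * diff_before


unexposed = build_group(50)
exposed = build_group(60)

change_diff = change_score_difference(unexposed, exposed)
ancova_diff = ancova_difference(unexposed, exposed)
print(f"change-score analysis: {change_diff:.1f}")
print(f"ANCOVA:                {ancova_diff:.1f}")
```

The change-score analysis says the groups do not differ, while ANCOVA reports a difference of 5 points. The code cannot tell you which answer is right. That depends on the causal structure, which is the article's point about exposure that may precede the baseline measurement.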

6．Continuous variables should not be dichotomised

6．Dichotomising a continuous variable: a bad idea

Avoid dichotomising continuous variables if possible.10–12 Dichotomising implies loss of information and hence loss of statistical power. Moreover, dichotomizing a covariate implies that the effect of that covariate is a step-function changing only at the threshold. In reality, most effects are smooth functions of the covariate. However, sometimes it can be sensible to dichotomise according to some predefined clinical threshold. Data-driven categorisation such as above/below the median of the observations is never a good idea. The same arguments are valid for categorising into more than two categories, although the harm is then somewhat less than by dichotomising.

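Avoid dichotomising continuous variables if possible.

Dichotomising means a loss of information and hence a loss of statistical power.

Moreover, dichotomising a covariate implies that its effect is a step function that changes only at the threshold.

In reality, most effects are smooth functions of the covariate.

Sometimes, however, it can be sensible to dichotomise at a predefined clinical threshold.

Data-driven categorisation, such as splitting the observations above and below their median, is never a good idea.

The same arguments apply to categorising into more than two categories, although the harm is then somewhat less than with dichotomising.

A lot of physical therapists and occupational therapists love two-group comparisons, don't they?

Casually throwing away data to form two groups really is a no-go.

Of course, depending on the aim of the study, there will be cases where splitting into two groups at a cut-off value is reasonable.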
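The loss of information can be made visible with a quick simulation. The sketch below compares the correlation between an outcome and a continuous covariate with the correlation after a median split of that covariate. The linear data-generating model, the sample size and the seed are assumptions.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]  # smooth (linear) effect of x on y

# Data-driven median split: 1 above the median, 0 otherwise.
cut = statistics.median(x)
x_split = [1.0 if xi > cut else 0.0 for xi in x]

r_continuous = pearson_r(x, y)
r_dichotomised = pearson_r(x_split, y)
print(f"r with continuous x:   {r_continuous:.2f}")
print(f"r with median-split x: {r_dichotomised:.2f}")
```

The weaker correlation after the split means that more participants are needed to detect the same underlying effect.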

7．Student's t test is better than non-parametric tests

7．Student's t test is better than non-parametric tests

Student’s t test has major advantages over non-parametric tests such as the Wilcoxon test13: First, the method allows to compute a CI for the mean of interest, not only a p value. Second, Student’s t test is more powerful, particularly in small samples.14 A widespread misunderstanding is that Student’s t test should not be used in small samples. Third, Student’s t test is readily generalised into regression analysis and other analyses.

Student’s t test is rather robust to deviations from normality15 as long as there are no residuals extremely distant, say much more than 4–5 SDs, from zero. Visual inspection of Q-Q plots is well suited to detect such deviations. Visual inspection of P-P plots is not suited for detecting such deviations. When the data deviate substantially from the normal distribution, one can for example, use bootstrapping to obtain CIs and p values.16 Bootstrapping has been available in standard statistical software for several years, and is an underused technique in many applications of statistics.

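Student's t test has major advantages over non-parametric tests such as the Wilcoxon test.

First, the t test lets you compute a CI for the mean of interest, not only a p value.

Second, Student's t test is more powerful, particularly in small samples.

A widespread misunderstanding is that Student's t test should not be used in small samples.

That is, of course, wrong.

Third, Student's t test is readily generalised into regression analysis and other analyses.

Student's t test is rather robust to deviations from normality as long as no residuals lie extremely far from zero, say much more than 4 to 5 SDs.

Visual inspection of Q-Q plots is well suited to detecting such deviations.

Visual inspection of P-P plots is not suited to detecting them.

When the data deviate substantially from the normal distribution, you can use bootstrapping to obtain CIs and p values.

Bootstrapping has been available in standard statistical software for several years, yet it is an underused technique in many applications of statistics.

So checking only normality with the Shapiro-Wilk test and then switching to a non-parametric test is also questionable.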
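As an illustration of the bootstrapping idea, here is a minimal percentile bootstrap CI for a mean, written with the standard library only. The skewed sample, the number of resamples and the choice of the percentile method are assumptions. Statistical packages typically offer more refined variants.

```python
import random


def bootstrap_mean_ci(data, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical skewed sample with two large values.
sample = [1.2, 0.8, 1.5, 0.9, 7.4, 1.1, 2.3, 0.7, 1.9, 12.6, 1.0, 1.4]
sample_mean = sum(sample) / len(sample)
lower, upper = bootstrap_mean_ci(sample)

print(f"mean: {sample_mean:.2f}")
print(f"95% bootstrap CI: {lower:.2f} to {upper:.2f}")
```

The resulting interval is not forced to be symmetric around the mean, which is one reason the approach suits clearly non-normal data.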