Posted On: 07/12/2020 10:47:50 AM
Post# of 148903
Good morning, TechGuru. I hate to get into the weeds, but it is so fun and I can't hold back.
The patterns in fig 2 are exceptionally clear. Almost all the p-values should be <0.0000001. The exception would be the pVl fig, which should be maybe 0.003 instead of 0.01. An appropriate test would give those results.
Should? The figs are designed to show "reduction of inflammation, restoration of T cell lymphopenia, and reduced SARS-CoV-2 plasma viremia" [lines 57-58], but the statistical test they use (Kruskal-Wallis) tests for (non-ordered) differences among the timepoints rather than changes through time---a mismatch between the statistics and what they want to show. It makes a big difference statistically.
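To make the distinction concrete, here's a toy simulation (my own made-up numbers, NOT the paper's data) comparing Kruskal-Wallis, which only asks "do the timepoints differ at all, in any order?", against a simple monotone-trend test (Spearman on time vs. value, used here as a stand-in for a proper trend test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data (not the paper's): 10 patients measured at days 0, 3, 7, 14,
# with a clear downward trend through time plus noise.
days = np.array([0, 3, 7, 14])
n_patients = 10
y = np.array([10 - 0.5 * days + rng.normal(0, 1.5, size=4)
              for _ in range(n_patients)])

# Kruskal-Wallis: are the four timepoint distributions different (unordered)?
kw = stats.kruskal(*(y[:, j] for j in range(4)))

# Trend test: is there a monotone association between time and value?
time_long = np.repeat(days, n_patients)
y_long = y.T.ravel()
sp = stats.spearmanr(time_long, y_long)

print(f"Kruskal-Wallis p = {kw.pvalue:.4g}")
print(f"Spearman trend p  = {sp.pvalue:.4g}")
```

When the real pattern is a steady decline through time, a test built for ordered change uses that structure; Kruskal-Wallis throws it away.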
Common for this type of data would be to use a mixed-effects model on log(y) ~ time + patient, with patient as a random effect. The benefit of this approach would be three-fold (at least!): 1) it would better reflect the aim of the paper to show leronlimab's effect at changing the blood parameters, 2) it would give an enormous boost to the statistical power and defend against a dilettante charge of data snooping*, and 3) it would add some of the supplementary figures to the pantheon of "significant" effects (in particular lymphopenia and CRP).
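For anyone who wants to see what that model looks like in practice, here's a minimal sketch using statsmodels on simulated data (again my own invented numbers, not the paper's): fixed effect of time, random intercept per patient, roughly log(y) ~ time + (1 | patient).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated data (not the paper's): 10 patients, 4 timepoints,
# log-scale decline of 0.2 per day plus a patient-specific baseline.
days = [0, 3, 7, 14]
rows = []
for pid in range(10):
    base = rng.normal(8, 1)  # random intercept (log scale)
    for d in days:
        rows.append({"patient": pid, "day": d,
                     "logy": base - 0.2 * d + rng.normal(0, 0.5)})
df = pd.DataFrame(rows)

# Mixed-effects model: fixed effect of day, random intercept per patient.
model = smf.mixedlm("logy ~ day", df, groups=df["patient"]).fit()
print(model.summary())
print("slope p-value:", model.pvalues["day"])
```

Because each patient serves as their own baseline, the patient-to-patient variation gets absorbed by the random effect instead of inflating the error term, which is where the power boost comes from.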
* For those who are unfamiliar with the term, "data snooping" is when you run a whole bunch of statistical tests on masses of data and pick out one or two with relatively low p-values and call them "significant." The problem is that, in theory, 1 in 20 tests will have p < 0.05 strictly by chance even if the treatment has zero effect. If you snoop through fig 2 and supplementary fig 1, you see close to 20 tests, and it's reasonably likely that even if there were no effect on any of the blood parameters, you'd still get a p < 0.05 for one of them by chance. If you asked 50 people to roll a pair of dice and declared the one who happens to roll boxcars to be significantly better at rolling dice than the other 49, you'd get laughed out of the room. Likewise, you can't pull a single (or small number of) test out and call it "significant". If someone else did the same experiment, it would not be a surprise for them to also get p < 0.05 for one or two of the effects, but for different effects than the one(s) you got.
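You can put a number on how likely that chance "hit" is with a quick simulation. Here 20 null tests (no real effect anywhere) are run repeatedly, counting how often at least one comes out "significant" at p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# 20 "blood parameters", none actually affected: compare two null groups
# per parameter and count how often p < 0.05 shows up by chance alone.
n_sims = 2000
hits = 0
for _ in range(n_sims):
    pvals = [stats.mannwhitneyu(rng.normal(size=15),
                                rng.normal(size=15)).pvalue
             for _ in range(20)]
    if min(pvals) < 0.05:
        hits += 1

frac = hits / n_sims
print(f"P(at least one 'significant' result in 20 null tests) ~ {frac:.2f}")
# Theory: 1 - 0.95**20 is about 0.64
```

So with ~20 tests and zero true effects, you'd still expect to "find" something roughly two times out of three, which is exactly why snooped p-values near 0.05 impress nobody.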
An effective defense against charges of data snooping is to have very low p-values, which fig 2 could have if more appropriate, more powerful tests had been used. The patterns in fig 2 are striking and well-deserving of p < 0.0000001. I presume this issue was raised by reviewers and addressed already. If not, NEJM, unlike many journals, has a formal statistical review that follows the traditional review, and issues like this are bound to be brought up there.
A sidebar...to avoid the dangers of data snooping, clinical trials select a single, predetermined test as the "primary outcome." If you instead run 17 different tests and are allowed to pick out the one with p < 0.05 and call it "significant", my pancake recipe could well be touted by Dr Fauci as the new SOC, because it could easily come up boxcars in one of the 17 dice rolls. That's why the change in the primary outcome in the NIAID remdesivir trials is so galling.