There are rumblings in the jungle of neuroscience.
There’s been a recent spate of high-profile papers drawing attention to methodological shortcomings in neuroimaging studies (e.g., Ioannidis, 2011; Kriegeskorte et al., 2009; Nieuwenhuis et al., 2011). These are a response to published papers that regularly flout methodological standards that have been established for years. I’ve recently been reviewing the literature on brain imaging in relation to intervention for language impairments, and came across this example.
Temple et al. (2003) published an fMRI study of 20 children with dyslexia who were scanned both before and after a computerised intervention (Fast ForWord) designed to improve their language. The article in question was published in the Proceedings of the National Academy of Sciences and at the time of writing has had 270 citations. I did a spot check of fifty of those citing articles to see if any had noted problems with the paper: only one of them did so. The others repeated the authors’ conclusions, namely:
1. The training improved oral language and reading performance.
2. After training, children with dyslexia showed increased activity in multiple brain areas.
3. Brain activation in left temporo-parietal cortex and left inferior frontal gyrus became more similar to that of normal-reading children.
4. There was a correlation between increased activation in left temporo-parietal cortex and improvement in oral language ability.
But are these conclusions valid? I'd argue not, because:
- There was no dyslexic control group. See this blogpost for why this matters. The language test scores of the treated children improved from pre-test to post-test, but where properly controlled trials have been done, equivalent change has been found in untreated controls (Strong et al., 2010). Conclusion 1 is not valid.
- The authors presented uncorrected whole-brain activation data. This is not explicitly stated but can be deduced from the z-scores and p-values. Russell Poldrack, who happens to be one of the authors of this paper, has written eloquently on this subject: “…it is critical to employ accurate corrections for multiple tests, since a large number of voxels will generally be significant by chance if uncorrected statistics are used. … The problem of multiple comparisons is well known but unfortunately many journals still allow publication of results based on uncorrected whole-brain statistics.” Conclusion 2 is based on uncorrected p-values and is not valid. (A toy simulation after the figure below illustrates the scale of the problem.)
- To demonstrate that changes in activation for dyslexics made them more like typical children, one would need to demonstrate an interaction between group (dyslexic vs typical) and testing time (pre-training vs post-training); see Nieuwenhuis et al. (2011) for why this matters. Although a small group of typically-reading children was tested on two occasions, this analysis was not done. Conclusion 3 is based on images of group activations rather than statistical comparisons that take into account within-group variance. It is not valid. (A sketch of the required interaction analysis appears after the figure below.)
- There was no a priori specification of which language measures were primary outcomes, and numerous correlations with brain activation were computed, with no correction for multiple comparisons. The one correlation that the authors focus on (figure reproduced below) is (a) significant only on a one-tailed test at the .05 level, and (b) driven by two outliers (encircled), both of whom had a substantial reduction in left temporo-parietal activation associated with a lack of language improvement. Conclusion 4 is not valid. (A simple check on how fragile such a correlation is, leaving out one child at a time, is sketched after the figure below.) Incidentally, the mean activation change (Y-axis) in this scatterplot is also not significantly different from zero. I’m not sure what this means, as it’s hard to interpret the “effect size” scale, which is described as “the weighted sum of parameter estimates from the multiple regression for rhyme vs. match contrast pre- and post-training.”
Figure 2 from Temple et al. (2003): data from dyslexic children.
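For readers who want to see why uncorrected whole-brain statistics are such a problem, here is a back-of-envelope simulation. It is not based on the Temple et al. data: the voxel count and sample size are invented for illustration. The point is simply that pure noise yields thousands of “significant” voxels at an uncorrected p < .05 threshold.

```python
# Toy simulation: how many voxels pass an uncorrected p < .05 threshold
# when there is no real effect anywhere? Voxel count and sample size are
# hypothetical, chosen only to show the scale of the problem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_voxels, n_subjects = 50_000, 20

# Pure noise: per-child "activation change" at every voxel, no true signal.
data = rng.normal(size=(n_subjects, n_voxels))

# One-sample t-test at each voxel: did activation change from pre to post?
t, p = stats.ttest_1samp(data, popmean=0, axis=0)

print("Voxels 'significant' at uncorrected p < .05:", int(np.sum(p < 0.05)))
# Expect roughly 0.05 * 50,000 = 2,500 false positives.

# A Bonferroni-corrected threshold keeps the family-wise error rate at .05.
print("Voxels surviving Bonferroni correction:", int(np.sum(p < 0.05 / n_voxels)))
```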
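Similarly, here is a minimal sketch of the kind of analysis that conclusion 3 would require: a formal test of the group × time interaction, rather than a visual comparison of separate activation maps. The data and group sizes are entirely invented, and a mixed model is just one reasonable way of doing it.

```python
# Sketch of a group (dyslexic vs control) x time (pre vs post) interaction
# test on activation in a region of interest. All data are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for group, n in [("dyslexic", 20), ("control", 12)]:  # hypothetical group sizes
    for subj in range(n):
        for time in ("pre", "post"):
            rows.append({"subject": f"{group}_{subj}",
                         "group": group,
                         "time": time,
                         "activation": rng.normal()})
df = pd.DataFrame(rows)

# Mixed model with a random intercept per child; the group:time coefficient is
# the interaction term, i.e. evidence that dyslexic children changed more from
# pre to post than typical readers did.
model = smf.mixedlm("activation ~ group * time", df, groups=df["subject"]).fit()
print(model.summary())
```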
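Finally, a simple way to probe whether a brain–behaviour correlation like the one behind conclusion 4 is robust: compare the Pearson coefficient with a rank-based statistic, and recompute it with each child left out in turn. Again the data are made up; the point is the procedure, not the numbers.

```python
# Robustness checks for a correlation between activation change and language
# improvement, using invented data for 20 children.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 20
activation_change = rng.normal(size=n)
language_change = 0.2 * activation_change + rng.normal(size=n)

r, p = stats.pearsonr(activation_change, language_change)
rho, p_s = stats.spearmanr(activation_change, language_change)
print(f"Pearson r = {r:.2f} (p = {p:.3f}); Spearman rho = {rho:.2f} (p = {p_s:.3f})")

# Leave-one-out: if the correlation collapses when a single child is removed,
# it is being carried by outliers rather than by the group as a whole.
for i in range(n):
    keep = np.delete(np.arange(n), i)
    r_i, _ = stats.pearsonr(activation_change[keep], language_change[keep])
    print(f"without child {i:2d}: r = {r_i:.2f}")
```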
How is it that this paper has been so influential? I suggest that it is largely because of the image below, summarising results from the study. This was reproduced in a review paper by the senior author that appeared in Science in 2009 (Gabrieli, 2009), which has already had 42 citations. The image is so compelling that it has also been used in promotional material for a commercial training program other than the one used in the study. As McCabe and Castel (2008) have noted, a picture of a brain seems to make people suspend normal judgement.
I don’t like to single out a specific paper for criticism in this way, but feel impelled to do so because the methodological problems were so numerous and so basic. For what it’s worth, every paper I have looked at in this area has had at least some of the same failings. However, in the case of Temple et al. (2003) the problem is compounded by the declared interests of two of the authors, Merzenich and Tallal, who co-founded the firm that markets the Fast ForWord intervention. One would have expected a journal editor to subject such a paper to particularly stringent scrutiny under these circumstances.
We can also ask why those who read and cite this paper
haven’t noted the problems. One reason is that neuroimaging papers are
complicated and the methods can be difficult to understand if you don’t work in
the area.
Is there a solution? One suggestion is that reviewers and
readers would benefit from a simple cribsheet listing the main things to look
for in a methods section of a paper in this area. Is there an imaging expert
out there who could write such a document, targeted at those like me, who work in this broad area but aren’t imaging
experts? Maybe it already exists, but I couldn’t find anything like that on the
web.
Imaging studies are expensive and time-consuming to do, especially when they involve clinical child groups. I’m not one of those who think they are never worth doing: if an intervention is effective, imaging may help throw light on its mechanism of action. However, I do not think it is worthwhile to do poorly designed studies of small numbers of participants to test the mode of action of an intervention that has not been shown to be effective in properly controlled trials. It would make more sense to spend the research funds on properly controlled trials that would allow us to evaluate which interventions actually work.
References
Gabrieli, J. D. (2009). Dyslexia: a new synergy between education and cognitive neuroscience. Science, 325(5938), 280-283.
Ioannidis, J. P. A. (2011). Excess significance bias in the literature on brain volume abnormalities. Archives of General Psychiatry, 68(8), 773-780. doi: 10.1001/archgenpsychiatry.2011.28
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F., & Baker, C. I. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience, 12(5), 535-540. doi: 10.1038/nn.2303
McCabe, D., & Castel, A. (2008). Seeing is believing: The effect of brain images on judgments of scientific reasoning. Cognition, 107(1), 343-352. doi: 10.1016/j.cognition.2007.07.017
Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14(9), 1105-1107. doi: 10.1038/nn.2886
Poldrack, R. A., & Mumford, J. A. (2009). Independence in ROI analysis: where is the voodoo? Social Cognitive and Affective Neuroscience, 4(2), 208-213.
Strong, G. K., Torgerson, C. J., Torgerson, D., & Hulme, C. (2010). A systematic meta-analytic review of evidence for the effectiveness of the ‘Fast ForWord’ language intervention program. Journal of Child Psychology and Psychiatry, in press. doi: 10.1111/j.1469-7610.2010.02329.x
Temple, E., Deutsch, G. K., Poldrack, R. A., Miller, S. L., Tallal, P., Merzenich, M. M., & Gabrieli, J. D. E. (2003). Neural deficits in children with dyslexia ameliorated by behavioral remediation: Evidence from functional MRI. Proceedings of the National Academy of Sciences of the United States of America, 100(5), 2860-2865. doi: 10.1073/pnas.0030098100