Is there a significance filter in criminology?

Is there publication bias in criminology? There could be a bias towards publishing only significant results, sometimes referred to as a “significance filter”. If so, researchers have incentives to do a bit of data snooping and p-hacking to make sure they get significant results, without necessarily reporting all the massaging they have done.

Although not a study of the “significance filter”, a recent study of published experimental studies in criminology nevertheless includes some relevant information. It reports that 68% of the effect estimates were not significant. This might be interpreted as suggesting that there is not much of a significance filter in criminology. However, the finding is based on 402 effect estimates from 66 publications, which implies an average of six effect estimates per study. The probability of getting at least one out of six with p<0.05 is surely much larger than 0.05.

Perhaps it only takes at least one significant result to get published? If so, the expectation given the existence of a significance filter would be that at least 66 out of the 402 estimates (16%) were statistically significant. It would have been interesting to know how many of the studies did not find any significant findings, and how many reported significant findings only for subgroup analyses or alternative specifications – as well as how many subgroups and alternative specifications were actually tried out.
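A back-of-the-envelope sketch of these two numbers (assuming, purely for illustration, six independent tests per study and true null effects):

```python
# Probability that at least one of six independent tests comes out significant
# at p < 0.05 when every null hypothesis is true, and the minimum share of
# significant estimates if each of the 66 publications needed at least one hit.
alpha = 0.05
tests_per_study = 6

p_at_least_one = 1 - (1 - alpha) ** tests_per_study
print(f"P(at least one of {tests_per_study} significant by chance) = {p_at_least_one:.2f}")  # ~0.26

min_share_significant = 66 / 402
print(f"Minimum share significant under a one-hit filter = {min_share_significant:.2f}")  # ~0.16
```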

Maybe the cartoon XKCD got it uncomfortably right?

[xkcd comic: “P-Values”]

And perhaps also: “Hey, look at this interesting alternative outcome measure”.


P.S. I have no reason to think the significance filter is more/less prevalent in criminology than in other fields. It is probably similar to other social sciences.


The post Is there a significance filter in criminology? appeared on The Grumpy Criminologist 2016-07-25 12:38:33 by Torbjørn.

About the Weisburd paradox

The “Weisburd paradox” refers to the finding by Weisburd, Petrosino and Mason, who reviewed the literature on experimental studies in criminology and found that increasing the sample size did not lead to increased statistical power. While this paradox has perhaps not received much attention in the literature so far, the study was replicated last year by Nelson, Wooditch and Dario in the Journal of Experimental Criminology, confirming the phenomenon.
The empirical finding that a larger sample size does not increase power is based on calculating “achieved power”. This is supposed to shed light on what the present study can and cannot achieve (see e.g. here). “Achieved power” is calculated in the same way as a conventional power calculation, but instead of using an assumed effect size, one uses the effect estimated in the same study.
Statistical power is the probability of correctly rejecting the null hypothesis, given assumptions about the size of the effect (usually drawn from previous studies or other substantive reasons). Increasing the sample size makes the standard error smaller, which increases the probability of rejecting the null hypothesis if there is a true effect. Usually, power calculations are used to determine the necessary sample size, as there is no point in carrying out a study if one cannot detect anything anyway. So, one needs to ensure sufficient statistical power when planning a study.
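For instance, a conventional (a priori) power calculation might look like the following minimal sketch; the assumed effect size of Cohen’s d = 0.3 and the use of statsmodels are my own illustrative choices, not anything taken from the studies discussed here:

```python
# A priori power calculation: how many subjects per arm are needed to detect an
# assumed effect of Cohen's d = 0.3 with 80% power at alpha = 0.05 (two-sided)?
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05,
                                           power=0.80, alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 175
```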
But using the estimated effect size in the power calculations gives a slightly different interpretation. “Achieved power” is the probability of rejecting the null hypothesis under the assumption that the population effect is exactly equal to the observed sample effect. I would say this is rarely a quantity of interest, since one has already either rejected or kept the null hypothesis… Without any reference to external information about true effect sizes, post-hoc power calculations bring nothing new to the table beyond what the point estimate and standard error already provide.
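To see why it adds nothing, here is a minimal sketch for a two-sided z-test: the “achieved power” is just a one-to-one transformation of the z-statistic (estimate divided by standard error). The numbers plugged in are purely illustrative.

```python
# "Achieved power" for a two-sided z-test, plugging the observed estimate back
# in as the assumed true effect. It depends only on z = estimate / SE, so it
# carries no information beyond the point estimate and its standard error.
from scipy.stats import norm

def achieved_power(estimate, se, alpha=0.05):
    z = estimate / se
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)

# A just-significant result always has roughly 50% "achieved power"
print(achieved_power(estimate=0.20, se=0.102))  # z about 1.96 -> about 0.50
print(achieved_power(estimate=0.30, se=0.102))  # z about 2.94 -> about 0.84
```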
Larger “achieved power” implies a larger estimated effect size, so let’s talk about that. The Weisburd paradox is that smaller studies tend to have larger estimated effects than larger studies. While Nelson et al. discuss several reasons why that might be, they did not put much weight on what I would consider the prime suspect: a lot of noise combined with the “significance filter” to get published. If there is a significant effect in a small study, the point estimate needs to be large. If significant findings are easier to publish, then the published findings from small studies will be larger on average. (In addition, researchers have incentives to find significant effects to get published and might be tempted to do a bit of p-hacking – which makes things worse). So, the Weisburd paradox might be explained by exaggerated effect sizes.
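A small simulation sketch of that mechanism (all numbers purely illustrative): when estimates are noisy and only the significant ones get published, the published estimates overstate the true effect, and more so in small studies.

```python
# Significance filter in action: with a modest true effect, the estimates that
# happen to cross the significance threshold overstate the effect, and the
# exaggeration is worst when the standard error is large (i.e. in small studies).
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1

for se in (0.20, 0.05):  # large SE ~ small study, small SE ~ large study
    estimates = rng.normal(true_effect, se, size=100_000)
    significant = estimates[np.abs(estimates) > 1.96 * se]
    print(f"SE = {se}: mean of the significant estimates = {significant.mean():.2f} "
          f"(true effect = {true_effect})")
```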
But why care? First, I believe the danger is that such reasoning might mislead researchers into justifying studies that are too small, ending up chasing noise rather than making scientific progress. Second, researchers might give the impression that their findings are more reliable than they really are by showing that they have high post-hoc statistical power.
Just to be clear: I do not mind small studies as such, but I would like to see the findings from small studies replicated a few times before giving them much weight.
Mikko Aaltonen and I wrote a commentary on the paper by Nelson et al. and submitted it to the Journal of Experimental Criminology, pointing out such problems and arguing that the Weisburd paradox is not even a paradox. We were rejected. There are both good and bad reasons for this. One of the reviewers pointed out a number of points to be improved and corrected. The second reviewer was even grumpier than me and did not want to understand our points at all. When re-reading our commentary, I can see much to be improved, and I also see that we might be perceived as more confrontational than intended. (I also noticed a couple of other minor errors). Maybe we should have put more work into it. You can read our manuscript here (no corrections made). We decided not to rewrite our commentary for a more general audience, so it will not appear elsewhere.
When writing this post, I did an internet search and found this paper by Andrew Gelman prepared for the Journal of Quantitative Criminology. His commentary on the Weisburd paradox is clearly much better written than ours and more interesting for a broader audience. Less grumpy as well, but many similar substantive points. I guess Gelman’s commentary should pretty much settle this issue. Kudos to Gelman, but also to JQC for publishing it. EDIT: An updated version of Gelman’s piece is here – apparently not(!) accepted for publication yet.
The post About the Weisburd paradox appeared on The Grumpy Criminologist 2016-07-14 10:00:39 by Torbjørn.

Criminological progress!

I recently came across this article by David Greenberg in the Journal of Developmental and Life Course Criminology. I have previously seen an early draft, and I am glad to see it finally published! (It should have been published a long time ago, as the version I saw was pretty good, but I have no idea why it was not). Greenberg shows how to use standard multilevel modeling with normally distributed parameters to test typological theories. The procedure is actually not very complicated: estimate a random effects model, use empirical Bayes to get point estimates of each person’s intercept and slope(s), and explore the distributions of those point estimates using e.g. histograms. And no: those empirical Bayes estimates do not have to be normally distributed! You need to decide for yourself (preferably up front) what it takes for these distributions to support your favourite typology, so it requires a bit of thinking. This can all be done in standard statistical software, requiring only that you know a little bit about what you are doing. It would be really nice to see previous publications using group-based models reanalyzed in this way.
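A minimal sketch of that procedure on simulated data, using statsmodels’ MixedLM; the variable names (person, age, offending) and all numbers are made up purely for illustration:

```python
# Fit a random intercept + slope model, extract the empirical Bayes (BLUP)
# estimates of each person's intercept and slope, and inspect their
# distributions with histograms. Simulated data, hypothetical variable names.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_persons, n_waves = 200, 8
person = np.repeat(np.arange(n_persons), n_waves)
age = np.tile(np.arange(n_waves), n_persons)
intercepts = rng.normal(2.0, 1.0, n_persons)   # person-specific intercepts
slopes = rng.normal(-0.1, 0.2, n_persons)      # person-specific slopes
offending = intercepts[person] + slopes[person] * age + rng.normal(0, 1, len(person))
df = pd.DataFrame({"person": person, "age": age, "offending": offending})

model = sm.MixedLM.from_formula("offending ~ age", groups="person",
                                re_formula="~age", data=df)
result = model.fit()

# Empirical Bayes point estimates = fixed effect + predicted random deviation
re = pd.DataFrame(result.random_effects).T   # one row per person: intercept and slope deviations
eb_intercept = result.fe_params["Intercept"] + re.iloc[:, 0]
eb_slope = result.fe_params["age"] + re.iloc[:, 1]

fig, axes = plt.subplots(1, 2)
axes[0].hist(eb_intercept, bins=30); axes[0].set_title("EB intercepts")
axes[1].hist(eb_slope, bins=30); axes[1].set_title("EB slopes")
plt.show()
```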

The article also discusses a number of related modeling choices, and that discussion is highly informative. So far, I have only read the published version of the article very quickly, and I need to read it more carefully before I fully embrace all of its arguments, but I might very well end up embracing it all.

I have noticed claims in the literature that models assuming normally distributed random effects cannot test for the existence of subpopulations. Well, it is the other way around.

The post Criminological progress! appeared on The Grumpy Criminologist 2016-07-04 12:00:12 by Torbjørn.

Testing typological theories using GBTM?

As I mentioned in yesterday’s post, I think the debates about group-based trajectory modeling have some unresolved issues. For this reason, I submitted a commentary to the Journal of Research in Crime and Delinquency. I had two reasons for doing so. First, I think Nagin mischaracterized his critics, and I believe his essay was a willful attempt to avoid serious criticism by ignoring serious arguments. (Maybe I could have been less outspoken about that). After all, he has not addressed the actual argument I (and others) have put forward. I can only interpret this as an attempt to avoid discussing the substantive matter by keeping silent, and now subtly dismissing the whole thing. If Nagin finds it worthwhile to say that his critics have misunderstood, he should also bother to point out how. So far, he has done no such thing.

Second, I actually think there is a need to clarify whether GBTM can test for the presence of groups or not. If the advocates of GBTM had been clear about this, such a clarification would obviously not have been needed. There is no doubt that Nagin and others have been clear that GBTM can – or maybe even should – be interpreted as an approximation to a continuous distribution. There is no disagreement on that point. But they have also given the impression that one can identify meaningful, real groups in the data by way of GBTM, without being clear on what this really means or under what conditions it can be done. A clarification is in order, since findings from GBTM analyses have clearly been interpreted in the literature as giving very strong evidence for a certain typological theory (see e.g. here). I have claimed that this empirical evidence is weak and largely based on overinterpretation of empirical studies using GBTM (see here and here). It would be helpful if Nagin could clarify the strength of this evidence.

So I wrote a commentary and submitted it to The Journal of Research in Crime and Delinquency. (See the full commentary here). According to the letter from the editor, it was rejected because:

Language at the top of page 2 in your comment underscores a fundamental misunderstanding and misreading of Nagin’s work.
(See the full rejection letter here).

Well, maybe I should have put things more politely, but I still believe my arguments are right. I can understand that there might be good editorial reasons for not having yet another debate about GBTM in the journal, but I am not impressed with the reason given. My fundamental misunderstanding is apparently revealed at the top of page 2, where I point out that Nagin himself is responsible for some of the confusion regarding the interpretation of the groups. I do so with clear references, so you can decide for yourself whether these are misreadings or not.

Even in his recent essay, Nagin presents one of the main motivations for using GBTM by first arguing that other methods are not capable of testing for the presence of groups, and then suggesting that GBTM can indeed solve this problem:

To test such taxonomical theories, researchers had commonly resorted to using assignment rules based on subjective categorization criteria to construct categories of developmental trajectories. While such assignment rules are generally reasonable, there are limitations and pitfalls attendant to their use. One is that the existence of distinct developmental trajectories must be assumed a priori. Thus, the analysis cannot test for their presence, a fundamental shortcoming. (…) The trajectories reported in Figure 2 provide an example of how GBTM models have been applied to empirically test predictions stemming from Moffitt’s (1993) taxonomic theory of antisocial behavior.
(My emphasis).

That passage may not say outright whether the groups from GBTM are to be interpreted as real in this setting, nor what can be concluded from such “tests”. But given the previous debates and misconceptions, it is hardly a clarification.

My point is simply this: it has been claimed that GBTM can be used to test for the presence of distinct groups, and more generally to test typological theories. (I have discussed this in more detail here and here). However, it is hard to see how such typological theories can be tested using GBTM, and the advocates of the methodology explain it only very vaguely. I think (but I am not entirely sure) that in this context “testing a theory” only means obtaining findings that are consistent with a given theory. I think this is a generous use of the term “test”. I prefer to reserve the word “test” for situations where something is ruled out – or at least for methods that in principle would be able to rule something out. In other words: if the findings are consistent with a theory but also consistent with one or several competing (or non-competing) theories, this is at best weak evidence for either theory. (This holds regardless of the methods used). It is good that a theory is consistent with the empirical findings, but that is far from enough. I know of no published criminological study using GBTM that provides a test of typological theories in this stricter sense of the term. So far, it seems to me that the advocates of GBTM have not been clear on this issue. Some clarification would be in order.
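To illustrate why “groups emerging” from a mixture model is not, by itself, a test of a typological theory, here is a small simulation sketch. It is not GBTM proper, just a plain Gaussian mixture, and all numbers are illustrative: data drawn from a single, skewed, continuous distribution with no subpopulations at all is nevertheless best approximated by several mixture components.

```python
# Fit finite mixtures to data drawn from one skewed continuous distribution
# (no true subpopulations). BIC typically favours more than one component,
# because the components approximate the continuous density - so finding
# "groups" is not, by itself, evidence that real groups exist.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000).reshape(-1, 1)  # one continuous population

for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(x)
    print(f"{k} component(s): BIC = {gm.bic(x):.0f}")
# The lowest BIC usually occurs at k > 1 even though no subpopulations exist.
```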

The post Testing typological theories using GBTM? appeared on The Grumpy Criminologist 2016-07-01 12:00:36 by Torbjørn.