Tuesday, October 25, 2016

(Fast) Food for Thought

Often I find myself walking to a local fast food establishment for lunch. The staff there is excellent: They keep the place clean, they always greet me with a smile, and they make delicious food. A few years ago, this particular fast food chain had a string of bad press where it was discovered that a very small number of employees were doing some unsavory things to customers’ food.

I felt bad for the staff at my local restaurant. They had no association with these trouble-makers other than they happened to work for the same restaurant chain, just like thousands of other individual employees. After the news broke, some customers were worried about what was happening behind closed doors in their local restaurant (e.g., Is some bad employee doing something unsanitary to my lunch?). And the staff was probably concerned about being perceived as being one of the trouble-makers (e.g., Do my customers think that I am doing something unsanitary to their lunch?). A few bad news stories ruined the whole employee-customer relationship.
The response by my local franchisee was simple and effective: They modified the store to have an open-kitchen design (http://business.time.com/2012/08/20/nothing-to-hide-why-restaurants-embrace-the-open-kitchen/). Now, I can order my lunch and watch the employees prepare my food. I can see into the kitchen and see exactly who is handling my food and how they are handling my food. It is transparent. I suspect the staff likes the open-kitchen concept too. They know that if they are following the proper procedures, customers will not erroneously suspect them of doing something unsavory to their food. By opening up the food preparation process, the whole employee-customer relationship was improved. Now, customers can receive their lunch with the confidence that it was made properly and the staff can provide customers their lunch with the confidence that customers are not suspicious.

I also suspect the open-kitchen concept had several secondary benefits. For example, the staff probably keeps the kitchen cleaner and avoids cutting obvious corners when they know they are in plain sight of customers. When I go to a different restaurant that still has a “closed-kitchen” design, I wonder what I would see if I could peer into their kitchen. Consequently, all else being equal, I choose open-kitchen establishments over closed-kitchen establishments. Open-kitchen designs are good for the bottom line.

The parallels between the open-kitchen design and "open science" are obvious. As researchers, we produce information that other people consume and we consume information that other people produce.

Here is some (fast) food for thought. As a producer of research, would you feel comfortable allowing your consumers to transparently see your research workflow? As a consumer of research, if you were given the choice between consuming research from an open-science establishment or a closed-science establishment, which would you choose?  

Friday, September 2, 2016

Preregistration increases the informativeness of your data for theories

Theories predict observations. Observations are either consistent or inconsistent with the theory that implied the observations. Observations that are consistent with the theory are said to corroborate the theory. Observations that are inconsistent with the theory should cast doubt on the theory, or cast doubt on one of the premises of the theory.

There are a few things that affect the extent to which data inform theory. Below I argue that preregistration can strengthen the informativeness of data for a theory in a few ways.

First, data only inform theory via a chain of auxiliary assumptions. All else being equal, data that inform a theory through fewer auxiliary assumptions are more informative to that theory than data that make contact with theory through more auxiliary assumptions.

For example, data are informative to a particular theory so long as readers assume the predictor is valid, the outcome is valid, the conditions relating the predictor to the outcome have been realized, the sample was not selected based on the obtained results, the stated hypotheses were not modified to match the obtained results, etc. Sometimes these auxiliary assumptions are not accepted (by some individuals) and the theory is treated (by some individuals) as uninformed by the data.

Preregistration can essentially eliminate some assumptions that are required to interpret the data. Readers do not need to accept the assumption of which hypotheses were indeed a priori. Readers do not need to accept the assumption of how the sample was determined. Etc. There is a date-stamped document declaring all of these features. The only assumption necessary is that the preregistration is legit. Thus, all else being equal, a preregistered study has fewer links in the chain of auxiliary assumptions linking the observed data to the theory that is being tested. Thus, all else being equal, data from a preregistered study are more informative to a theory than data from a non-preregistered study.

Second, the informativeness of the data is related to the degree to which the data can be consistent or inconsistent with the theory. If a theory really sticks its neck out there, the data can more strongly corroborate or disconfirm the theory. If a theory does not stick its neck out there, the data are less informative to a theory, regardless of the specific pattern of data.

Preregistration clearly specifies which outcomes are predicted and which are not prior to the data being analyzed. Predictions that were made prior to the data being analyzed are riskier than predictions that were made after the data have been analyzed. Why? Because with preregistration there is no ambiguity in whether the predictions were actually made independently of the results of those predictions. Preregistered predictions stick their neck out. Very specific preregistered predictions really stick their neck out.

I am not saying that you must preregister your studies. I am saying that choosing not to preregister your studies is also choosing not to maximize the informativeness of your data for your theories.

Tuesday, July 5, 2016

Hostile priming effects seem to be robust

In 1979, Srull & Wyer published a study wherein participants were presented with a series of 4 words from which they had to construct grammatically correct 3-word phrases. Some phrases described aggressive behaviors (e.g., break his leg). Later, participants read a story about the day in the life of a man named Donald. In this story, Donald performed ambiguously aggressive behaviors (e.g., argued with his landlord). Finally, participants provided their judgments of Donald by rating him on a series of traits that were combined into a measure of hostility (e.g., hostile, unfriendly, dislikable, kind (r), considerate (r), and thoughtful (r)). Participants who completed more aggressive phrases in the first task subsequently rated Donald as more hostile. When exposure to hostile-relevant stimuli affects a subsequent hostile-relevant impression, the result is generically referred to as a “hostile priming” effect.*

So, is there strong evidence for the robustness of a “hostile priming” effect?** If you asked me 6 years ago I would have said “yes.” Why? First, DeCoster and Claypool (2004) performed a meta-analysis on cognitive priming effects that used an impression formation outcome variable and found that, overall, there is an effect of about 1/3 of a standard deviation in the predicted direction (i.e., k = 45, N = 4794, d = 0.35, 95% CI[0.30, 0.41]). Further, several of the studies included in the DeCoster and Claypool meta-analysis primed the construct of “hostility” and had an outcome variable that was relevant to the construct “hostility.” Second, the hostile priming effect has been demonstrated in dozens of published studies.

However, in 2016, I feel there are a few reasons to question the robustness of the hostile priming effect. First, DeCoster and Claypool didn’t investigate the presence of publication bias. That is not a knock on their excellent meta-analysis. But I believe that everybody in 2016 is more cognizant of the potential problems of publication bias than 12+ years ago. And we currently have more tools to detect publication bias than we did 12+ years ago. Second, it seems that cognitive phenomena labeled as a “priming” effect are currently viewed more skeptically. Fair or not, that is my belief about the current perceptions of cognitive priming effects. Third, many of the studies in the DeCoster and Claypool meta-analysis were authored by Diederik Stapel. Obviously, we should interpret the studies authored by Stapel differently in 2016 than in 2004. 

In addition to being interested in the hostile priming effect, I also wanted to force myself to learn some new tools. (This is probably a good place to note that this exercise was mostly a way for me to practice using some new tools, so there may be errors involved or I may describe something in a slightly incorrect way. If you find an error, help me learn.*** Please and thank you!)

I gathered what I believe to be a comprehensive list of all of the publications with (a) an assimilative hostile priming manipulation and (b) some type of a hostile-relevant impression formation task. I found 27 publications with 38 individual studies (please let me know if you are aware of studies I missed).

First, I did a p-curve analysis on all of the studies. Here is a link to the p-curve disclosure table (https://mfr.osf.io/render?url=https://osf.io/ar5cf/?action=download%26mode=render). The analysis reveals these studies contain evidential value, z = -6.02, p < .001. The p-curve analysis estimated the average power of the studies to be 65%.
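For readers wondering where a negative z like the one above comes from, here is a minimal sketch of a Stouffer-style right-skew test. It reflects my understanding of the continuous test reported by the p-curve app, not the app's actual code, and the function name and the p-values in the usage lines are hypothetical. Each significant p-value is converted to a "pp-value" (the chance of a p-value at least that small, given a true null and given that the result was significant), the pp-values are converted to z-scores, and the z-scores are combined.

```python
from math import sqrt
from statistics import NormalDist


def p_curve_right_skew(p_values, alpha=0.05):
    """Stouffer-style test for right skew among significant p-values.

    Assumption: under the null, significant p-values are uniform on
    (0, alpha), so the pp-value of each one is simply p / alpha.
    Returns (stouffer_z, one_sided_p); a strongly negative z suggests
    more very small p-values than a null effect would produce.
    """
    std_normal = NormalDist()
    significant = [p for p in p_values if 0 < p < alpha]
    if not significant:
        raise ValueError("no significant p-values to analyze")
    pp_values = [p / alpha for p in significant]
    z_scores = [std_normal.inv_cdf(pp) for pp in pp_values]
    stouffer_z = sum(z_scores) / sqrt(len(z_scores))
    return stouffer_z, std_normal.cdf(stouffer_z)


# Hypothetical p-values, for illustration only (not from the disclosure table).
z_stat, p_skew = p_curve_right_skew([0.001, 0.004, 0.01, 0.02, 0.03])
print(round(z_stat, 2), round(p_skew, 3))
```

A set of p-values piled up near zero produces a large negative z, which is the pattern the analysis above reports.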

I was pretty liberal with my inclusion criteria for this first p-curve analysis. I next winnowed the studies down to ones that I believed were most focused on the effect of interest. First, I took out the Stapel studies (the continuous p-curve analyses implied “evidentiary” value in Stapel’s studies for those who are interested, z = -3.85, p = .0001).**** Next, there were some studies that were not really interested in the effects of hostile priming on impression formation per se. Instead, some studies were interested in the relation between some construct they believed was associated with the construct of “hostility”, and these studies used an impression formation outcome variable to demonstrate this hypothesized relation. For example, DeWall and Bushman (2009) proposed that people hold mental associations between the construct of “hot temperatures” and “hostility.” This proposition was demonstrated by exposing individuals to words associated with hot temperatures prior to having them report their judgments of Donald. Thus, the emphasis in studies of this ilk is not so much on impression formation; rather, these studies use impression formation tasks as a tool to test their other hypotheses. This led me to take out studies that primed “hostility” with hot temperatures (e.g., DeWall & Bushman, 2009; McCarthy, 2014), alcohol (e.g., Bartholow & Heinz, 2006; Pederson et al., 2014), sex (Mussweiler & Damisch, 2008; Mussweiler & Forster, 2000), and aggressive sports (e.g., Wann & Branscombe, 1990). Collectively, these omitted studies did not have evidential value, z = -0.71, p = .24.

A p-curve analysis on the remaining 18 effects still revealed evidentiary value for an effect, z = -4.91, p < .001. The p-curve analysis estimated the average power of the remaining studies to be 74%.

So, this is just a first pass on examining the evidentiary value within these studies. And, based solely on these p-curve analyses, it looks like there is evidence for a hostile priming effect on subsequent impressions of hostility. I plan on working through a few other tests such as Test for Insufficient Variance, meta-analysis of effect sizes, etc. Again, this is my way of forcing myself to learn, possibly get some free feedback, identify obvious errors, etc.

* “Hostile priming” needn’t be limited to subsequent impression formation tasks. It could, for example, include a behavioral outcome measure (e.g., Carver et al., 1983). However, for the present purposes, I limit the discussion to outcome variables that broadly fall into the class of impression formation tasks. This choice is purely based on my personal interests and not on any theoretical justification.

** In this blog post I am only referring to an “assimilation effect”: That is, priming effects that cause subsequent judgments to possess more of the primed construct.  There are situations when the primed constructs cause subsequent judgments to possess less of the primed construct. These latter effects are referred to as “contrast effects” and are not discussed herein. Again, this choice is purely based on my personal interests and not on any theoretical justification.

***If you find an error, please find me at the next conference and say "hi." You are entitled to one free beer/coffee from me (either one at any time of the day/night, I don't judge). 

**** As Alanis Morissette says, “isn’t it ironic.”

Thursday, June 9, 2016

Getting a feel for equivalence hypothesis testing

A few weeks ago, Daniel Lakens posted an excellent blog about equivalence hypothesis testing (http://daniellakens.blogspot.com/2016/05/absence-of-evidence-is-not-evidence-of.html). Equivalence hypothesis testing is a method to use frequentist statistical analyses (specifically, p-values) to provide support for a null hypothesis. Briefly, Lakens describes a form of equivalence hypothesis testing wherein the "null hypothesis" is a range of effects that are considered to be smaller than the smallest effect of interest. To provide evidence for a null effect, a researcher performs 2 one-sided tests: one to determine if the effect is smaller than the upper boundary of the equivalence range and one to determine if the effect is larger than the lower boundary of the equivalence range. If the effect is both significantly less than the upper boundary and significantly greater than the lower boundary of the equivalence range, the researcher can classify the effect as too small to be of interest. Of course, because these tests all employ p-values, they are subject to known long-run error rates (the familiar Type 1 and Type 2 errors). (If this description was too brief, I recommend taking the time to read Lakens' original blog post.)
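To make the two one-sided tests concrete, here is a minimal sketch in Python. It is not Lakens' code or the R code linked at the end of this post. The function name is hypothetical, and it assumes a two-group design with equal group sizes and a large-sample normal approximation in which the standard error of d is roughly sqrt(2 / n per group).

```python
from math import sqrt
from statistics import NormalDist


def tost_from_d(d_obs, n_per_group, bound, alpha=0.05):
    """Two one-sided tests on an observed standardized mean difference.

    Assumption: SE(d) is approximated by sqrt(2 / n_per_group), which
    treats the standard deviation as known; a t-based version would be
    slightly more conservative in small samples.
    """
    std_normal = NormalDist()
    se = sqrt(2.0 / n_per_group)
    # Test 1: is the effect significantly smaller than the upper bound?
    p_upper = std_normal.cdf((d_obs - bound) / se)
    # Test 2: is the effect significantly larger than the lower bound?
    p_lower = 1.0 - std_normal.cdf((d_obs + bound) / se)
    return {
        "p_upper": p_upper,
        "p_lower": p_lower,
        # Both tests must be significant to classify the effect as "null".
        "equivalent": max(p_upper, p_lower) < alpha,
    }


# Hypothetical example: observed d = 0.05, 100 per group, bounds of +/- 0.4.
print(tost_from_d(0.05, 100, 0.4))
```

Requiring the larger of the two p-values to fall below alpha is what makes the procedure a test of the whole equivalence range rather than of a single point.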

Although I could follow the logic of this equivalence testing procedure, I didn't have an "intuitive feel" for what it means to use p-values to generate support for a null effect. This is probably due to years of learning the traditional NHST approach to hypothesis testing wherein you can only "reject" or "fail to reject" the null hypothesis.  Here is the process I went through to further my understanding of equivalence testing.

First, to get a feel for how p-values can be used to generate support for a null effect it is useful to get a feel for how p-values behave when there is an effect. Let's take a simple 2-group design where there is an effect of d = 0.3. Below is the distribution of p-values from 10,000 simulated studies with a population effect of d = 0.3 and with 50 individuals per group. You can see that 31.8% of these p-values are below .05. This figure is merely a visual representation of the statistical power of this design: Given a certain effect (e.g., d = 0.3), a certain sample size (e.g., N = 100), and an alpha level (e.g., .05), you will observe a p-value less than alpha at a known long-run rate. In this case, statistical power is 0.32.

Let's stick with the example where there is an effect of d = 0.3.  Now suppose the sample size is doubled. In 10,000 simulated studies, increasing the sample size from 50 per group to 100 per group results in more low p-values. As can be seen below, 55.5% of the p-values are below .05. In other words, increasing the sample size increases statistical power. When there is a to-be-detected effect, increasing the sample size increases your chances of correctly detecting that effect by obtaining a p-value below .05. In this case, statistical power is 0.55.
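The two power figures above can be sanity-checked without running a simulation. The sketch below uses a normal approximation to the two-sample t-test, which is an assumption on my part and not how the simulated figures were produced; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power for a two-group comparison.

    Assumption: the test statistic is treated as normal with mean
    d * sqrt(n_per_group / 2) and standard deviation 1.
    """
    std_normal = NormalDist()
    expected_z = d * sqrt(n_per_group / 2.0)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Probability of landing in either rejection region.
    return std_normal.cdf(expected_z - z_crit) + std_normal.cdf(-expected_z - z_crit)


print(round(approx_power(0.3, 50), 2))   # 50 per group
print(round(approx_power(0.3, 100), 2))  # 100 per group
```

With d = 0.3 this comes out at roughly 0.32 for 50 per group and 0.56 for 100 per group, in line with the simulated percentages.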

In comparison, let's look at a scenario where there is no effect in the population, d = 0 (which is the scenario that is most relevant to equivalence testing). With no effect in the population you can only make Type 1 errors.  In 10,000 simulated studies where there is a population effect of zero (i.e., d = 0) and a total N of 100, 5.17% of the p-values were below .05. (If you ran this simulation again you might observe slightly more or slightly fewer p-values below .05. In the long run 5% of the studies will result in p-values below .05).  The 5% of studies with low p-values are all Type 1 errors because there is no effect in the population.

Sticking with the scenario where we have a population effect of zero (i.e., d = 0), suppose we double the sample size from a total N of 100 to a total N of 200. The distribution of p-values from 10,000 simulated studies shows that 4.87% of the p-values were less than .05 (again, in the long run this will be exactly 5%). When there is no effect in the population the distribution of p-values does not change when the sample size changes.

To recap: When there is a to-be-detected effect you can only make Type 2 errors.  With all else being equal, increasing the sample size increases statistical power which, by definition, decreases the likelihood of making a Type 2 error.  When there is no effect you can only make Type 1 errors.  With all else being equal, increasing the sample size does not affect the distribution of observed p-values. In other words, when a null effect is true, your statistical power will simply be your alpha level regardless of sample size.
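The recap can be reproduced with a short simulation. This is a sketch and not the R code linked at the end of this post: to stay within Python's standard library it uses a z-test with a known population standard deviation of 1 in place of a t-test, it runs 2,000 studies per condition instead of 10,000 to keep the run time short, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_p_values(d, n_per_group, n_sims=2000, seed=1):
    """Simulate two-group studies and return their two-sided p-values.

    Assumption: both groups are normal with SD = 1, and the SD is
    treated as known (z-test), so null p-values are exactly uniform.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sqrt(2.0 / n_per_group)  # SE of the mean difference when SD = 1
    p_values = []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        z_stat = (fmean(treatment) - fmean(control)) / se
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z_stat))))
    return p_values


def share_significant(p_values, alpha=0.05):
    """Proportion of p-values below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


for n in (50, 100):
    null_rate = share_significant(simulate_p_values(0.0, n))
    effect_rate = share_significant(simulate_p_values(0.3, n))
    print(n, round(null_rate, 3), round(effect_rate, 3))
```

The share of significant results should stay near 5% at both sample sizes when d = 0 and should rise with sample size when d = 0.3.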

What does this have to do with equivalence testing?  A lot actually.  If you followed the information above, then understanding equivalence testing is just re-arranging and re-framing this already-familiar information.

For the following simulations we are assuming there is no effect in the population (i.e., d = 0). We also are assuming that you determined that an absolute effect less than d = 0.4 is either too small for you to consider meaningful or too resource-intensive for you to study. (This effect was chosen only for illustrative purposes; you can use whatever effect you want.)

To provide support for a null effect it is insufficient to merely fail to reject the null hypothesis (i.e., observe a p-value greater than your alpha level) because a non-significant effect can either indicate a null effect or a weakly powered test of a true effect that results in a Type 2 error.  And, as shown above, increasing your sample size does not increase your chances of detecting a true null effect with traditional NHST. However, increasing your sample size can increase the statistical power to detect a null effect with equivalence testing.

Let's run some simulations. We have already seen that if a null effect is true (d = 0) and your total sample size is 100, traditional NHST will result in 5% of p-values less than .05 in the long run. I now took these 10,000 simulated studies and I tested whether the effects were significantly smaller than d = 0.4 and whether the effects were significantly larger than d = -0.4. As can be seen below, in these 10,000 simulations, when d = 0 and N = 100, 63.9% of the samples resulted in an effect that was significantly smaller than d = 0.4 and 63.1% of samples resulted in an effect that was significantly larger than d = -0.4 (these percentages are not identical because of randomness in the simulation procedure; in the long run they will be equal).

Some of the samples with effects that are significantly smaller than d = 0.4 actually have effects that are much smaller than d = 0.4. These samples have effects that are significantly smaller than d = 0.4 but are not significantly larger than d = -0.4. Likewise, some of these samples with effects that are significantly larger than d = -0.4 have effects that are much larger than d = -0.4. These samples have effects that are significantly larger than d = -0.4 but are not significantly smaller than d = 0.4.

In equivalence testing, to classify an effect as "null" requires the effect to be both significantly less than the upper bound of the equivalence range and significantly higher than the lower bound of the equivalence range.  In these 10,000 samples, 27.04% of the samples would be considered "null" (i.e., d = -0.4 < observed effect < d = 0.4).

Now comes the real utility of equivalence testing. If we double the total sample size from N = 100 to N = 200 we can increase the statistical power of claiming evidence for the null hypothesis. As can be seen below, within the 10,000 simulated studies where d = 0 and N = 200, 88.3% of the studies had effects that were significantly smaller than d = 0.4 and 88% had effects that were significantly greater than d = -0.4 (again, differences in these percentages are due to randomness in the data generation process and are not meaningful). In addition, 76.4% of these studies had effects that were both significantly smaller than d = 0.4 and significantly greater than d = -0.4. Thus, increasing the total sample size from 100 to 200 increased the percentage of studies that would be classified as "null" (i.e., d = -0.4 < observed effect < d = 0.4) from 27.04% to 76.4%.
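The jump in the share of "null" classifications can also be approximated analytically. The sketch below assumes a true effect of d = 0, equal group sizes, and a normal approximation with the standard error of d taken as sqrt(2 / n per group). The function name is hypothetical, and the numbers it gives are approximations of, not replacements for, the simulated percentages.

```python
from math import sqrt
from statistics import NormalDist


def tost_power_at_zero(n_per_group, bound, alpha=0.05):
    """Approximate power of each one-sided test, and of both together,
    when the true effect is exactly zero.

    Assumption: the observed d divided by its SE is standard normal.
    """
    std_normal = NormalDist()
    se = sqrt(2.0 / n_per_group)
    # A sample passes the upper test when d_obs / se falls below this
    # cutoff, and passes the lower test when d_obs / se exceeds -cutoff.
    cutoff = bound / se - std_normal.inv_cdf(1.0 - alpha)
    one_sided = std_normal.cdf(cutoff)
    both = max(0.0, 2.0 * one_sided - 1.0)
    return one_sided, both


for n in (50, 100):  # total N of 100 and 200
    one_sided, both = tost_power_at_zero(n, 0.4)
    print(n, round(one_sided, 3), round(both, 3))
```

Under these assumptions each one-sided test is significant roughly 64% of the time at 50 per group and roughly 88% of the time at 100 per group, and both are significant together roughly 28% and 76% of the time, close to the simulated values.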

Here are the major take-home messages.  First, equivalence testing is nice because it allows you to provide evidence for a null effect by using the tools that most researchers are already familiar with (i.e., p-values).  Second, unlike traditional NHST, increasing N can increase the statistical power of detecting a null effect (defined by the equivalence range) when using equivalence testing.

These simulations are how I went about building my understanding of equivalence testing.  I hope this helps others build their understanding too.  The R-code for this post can be accessed here (https://osf.io/ey5wq/).  Feel free to use this code for whatever purposes you want and please point out any errors you find.

Monday, March 28, 2016

Measuring aggression in the lab is hard

Psychologists often study aggression in lab-based settings.  However, some people are unconvinced that commonly-used lab-based aggression paradigms actually demonstrate aggression, which, they claim, limits the evidentiary value of results from studies that use those paradigms.  Rather than dig in heels, I have tried to think of ways that researchers can frame their criticisms to make these discussions more productive.
I believe that most criticisms of lab-based aggression paradigms take on one of two flavors: The behavior was not believed to have been harmful or the behavior was not believed to have been caused by a cognitive process involving aggressive cognitions. 
The definition of aggression identifies the sources of criticisms
Aggression is a behavior that is done with the intent to harm another individual who is believed to want to avoid receiving the behavior. Thus, to demonstrate aggression in the lab requires two factors: (a) a harmful behavior and (b) that behavior must be believed to have been caused by a cognitive process that involved an intent to harm and a belief the recipient wanted to avoid experiencing the behavior (i.e., collectively referred to as “aggressive cognitions” herein). If both factors are present, aggression has occurred; if either factor is absent, aggression has not occurred. Conceptually simple, yet hard to execute.
Demonstrating harmful behaviors in the lab
Neither the IRB nor most researchers will allow participants to actually harm another person just for the sake of testing a hypothesis. So, you cannot even demonstrate an unambiguously “harmful” behavior in the lab. This is a big deal. It is like being a researcher who is interested in the phenomenon of “eating ice cream” when the IRB won’t allow participants in the lab to actually eat ice cream. For this reason, aggression researchers must use “ethically palatable” behaviors that minimally meet the criterion of being harmful, but really don’t involve people harming one another.
Some examples of previously-used lab-based behaviors include sending irritating sound blasts to another person (who typically does not exist), selecting how much hot sauce will ostensibly be served to a person who dislikes spicy foods, sticking pins into a Voodoo Doll of another person to “inflict harm,” choosing how long another person will hold an uncomfortable Yoga pose, etc. It is not that aggression researchers think these are super harmful behaviors; but these are reasonable tasks that can be considered a little harmful, are quantifiable, can be done in a lab environment, don’t put anybody in harm’s way, etc. In other words, these tasks are pragmatic, not ideal.
Some people legitimately doubt whether these behaviors meet the “harmfulness” criterion (e.g., is a sound blast really “harmful”?).  And, I would suspect that most aggression researchers would readily concede that these behaviors are artificial, contrived, and open to debate on whether they are “harmful”.  If the opinion is that these behaviors are not “harmful,” then, by definition, these behaviors cannot be considered aggressive.  I sincerely hear and understand these criticisms.  Nevertheless, researchers obviously cannot allow participants to actually harm another person within a lab environment.  
Inferring the presence of aggressive cognitions in the lab
It is insufficient merely to demonstrate that a harmful behavior has occurred; the cognitive process that causes those behaviors must involve, in some (usually undefined) capacity, aggressive cognitions.  If aggressive cognitions were not involved, then the resulting behavior is not aggression, regardless of how harmful the behavior was.
Aggression researchers attempt to create a context from which aggressive cognitions can be inferred.  For example, researchers may tell participants that a specific behavior (e.g., pressing a button) will cause a specific event (e.g., send an unpleasant noise) that has a specific effect (e.g., another person will experience the unpleasant noise).  Thus, observing the behavior allows the researcher to infer the behavior was done with a known intent and with a known consequence.  If the behavior was harmful and aggressive cognitions were assumed to be involved in the cognitive process that caused those behaviors, then the resulting behavior can be assumed to be aggressive.
Some critics point out that several cognitive processes also can produce the same behavior; thus, there is no reason to favor a cognitive process involving aggressive cognitions over these other cognitive processes. For example, participants may perceive a particular task as competitive (rather than as an opportunity to aggress), participants may engage in “mischievous responding,” or participants may intuit the study’s hypotheses and behave according to what they believe the hypotheses are. 
The argument goes like this.  A cognitive process with “aggressive cognitions” may cause a harmful behavior (if a, then b), but observing a harmful behavior does not necessarily imply the behavior was caused by a cognitive process involving “aggressive cognitions” (b, therefore a) because there are several cognitive processes (e.g., competition, mischievous responding, socially-desirable responding, etc.) that also can cause the same harmful behaviors (if x, then b or if y, then b).  Believing that the presence of a harmful behavior necessarily implies the presence of a cognitive process involving aggressive cognitions is a logical error known as affirming the consequent. 
Perhaps an unappreciated idea is that these criticisms cut both ways.  Just because it is possible that a “non-aggressive” cognitive process can cause a harmful behavior does not mean that it did.  For example, just because it is possible that some participants in some instances may think sending sound blasts to another person is competitive (and not aggressive) does not mean that any specific instance of this behavior does not meet the criteria for aggression.  It is possible those sound blasts in this instance were being sent with the intent to aggress against the recipient and, thus, the behavior would meet the criteria for aggression.  Further, if the same context (e.g., experiencing an insult) both causes a harmful behavior in the lab (e.g., sound blasts) and harmful behavior out of the lab (e.g., punching another person), this may cause one to slightly favor the cognitive process involving aggressive cognitions when observing the behavior in the lab.  Ultimately, researchers need to use their judgment on whether it is plausible to infer that a cognitive process involved aggressive cognitions.  And reasonable people will disagree on what is plausible.
Another common approach to inferring the presence of aggressive cognitions is to ask participants why they exhibited a behavior.  For example, you could ask participants to report whether they sent loud sound blasts to be “aggressive” or not.  If they say “yes,” then the resulting harmful behavior may be considered aggressive. 
As straightforward as this approach appears, it has its own limitations.  First, this approach assumes that participants have introspective access to their cognitive processes (which is not a requirement for the resulting behavior to be considered aggressive).  Second, the abovementioned criticisms of the processes causing harmful behaviors also apply to the processes causing participants’ self-reported motives.  For example, participants may report having done a behavior to be aggressive merely because they are being “mischievous,” or participants may intuit the study hypotheses and “play along” with what they believe the hypotheses are.   Simply put, there are many cognitive processes that can become expressed in a response of “I did that behavior to be aggressive”.
As with the criticisms of the behaviors typically observed in lab-based aggression paradigms, I sincerely hear and understand the critiques about whether aggressive cognitions are involved in the process causing those behaviors.  There is no avoiding the fact that inferring characteristics of cognitive processes is hard to do and that different researchers have different ideas of what would convince them to infer the presence of aggressive cognitions. 
Framing and addressing the critiques
It is hard to demonstrate aggression in a laboratory setting in a way that will result in wide-spread agreement.  But hard does not mean impossible.  And disagreements need not be permanent.  Here are things that I believe will facilitate discussions about the value of these paradigms. 
1.    For those offering critiques, be specific about the target of criticism. Do you not believe the behavior was harmful? Or do you believe there was an alternative cognitive explanation for the observed harmful behavior? Or both? Clarity in the critique offers clarity in the ways in which researchers can improve their methods. Those who are unconvinced by current methods should state what methods or evidence would be convincing. Inconvincibility is a conversation stopper.

2.    For researchers, to demonstrate aggression you need to both (a) demonstrate a harmful behavior and (b) this behavior must be assumed to have been caused by a cognitive process with “aggressive” cognitions. Thus, you need to both argue why you believe the observed behavior is harmful and you need to argue why you believe the cognitive process involved aggressive cognitions. Without both of these things, you cannot claim you have measured aggression. Keep in mind that people might argue the behavior was not harmful, people might argue there was an alternative cognitive process that caused the behavior, or both. Such critiques are OK; it’s called science. Take these criticisms seriously and use them as motivation to improve your methods.