HKU PSYC2020 and PSYC3052

When you email me - please simply call me Fili (or Gilad). No need for formalities, definitely no need for Prof. or Dr. or any other title.

Please note: As much as I’d like to remember each and every student and which course they’re from, I don’t. As much as I’d like to remember everything I wrote and where, I don’t. When contacting me please provide all background details. Who are you? What course? Which session? What document are you referring to, etc. etc. etc.

Here I'll answer some of the questions students sent me by email that are of general interest to all pre-registered replication projects.

Article analysis structure

Students asked (via TA):

Do you want them to follow the structure of the examples of your former students or do they have to answer the replication recipe items one by one in the order shown on the guidelines?


It is a combination of the two.

First, please make sure you are looking at the latest version of the guidelines (currently, v3). The article analysis should include answers to all of the items in these guidelines, so it's best to go through them one by one. It is also important to add a section at the end where you specifically answer the “replication recipe”, because this is now the accepted template in the field. Importantly, when answering the guidelines at the article analysis stage, it is fine and expected that you copy-paste between the two sections to answer repeating/similar questions. There is no need to rephrase, but it is important that you make sure you have answered all points in sufficient detail.

Therefore, in terms of the examples, see the examples section in the guidelines. The end result could/should look like a combination of the example of a full article analysis report and one of the replication recipe examples. Again, it is okay to copy-paste from one to the other; we're doing both so that we also adhere to the official template format in our field.

Another student asked:

I would like to inquire about the template of the article analysis if following the article analysis template that the guideline provided is enough or we should follow the replication article template?

To which I answered:

Since we’re doing a pre-registered replication, it needs to follow the templates for both replications (the Replication Recipe) and pre-registrations (the pre-registration template). That is why the article analysis asks you to address both, so that reviewers will be satisfied that you fully met both requirements.

Students asked:

I was wondering if you could help a little bit with the ethics part of the tutorial presentation due next week. We're working on project XYZ and we didn't know which ethics document to work on.


The guidelines for replication include much of the needed information (see links on the Moodle, Dropbox is on ).

For the ethics part, the guidelines say: See template in the Dropbox: “Dropbox\2018-HKU-Replications\Ethics applications”.

  • FSP: Should use “FSP-example-ethical-approval-application-form.doc”
  • ASP: Should use “ethical-approval-application-form.doc”

If you need any other information, you should aim to be much clearer about your question. In any case, if you have further questions it is best to contact the TA of the TA session you’re assigned to; they should be available to meet and help.

Students asked:

Regarding our assigned article… There are no power analysis and effect size calculation, thus it is not possible to directly report from the article itself. If it is required for the course, I can calculate the effect size and power calculation by other means from my statistical course but is that expected from students?

Moreover, from the replication guidelines Q3~4, it writes “The effect size of the effect I am trying to replicate is: & The confidence interval of the original effect is:”. In the original study, there is no effect size, and the confidence interval is unknown. Will we need to come up with a data which fits our participants (from PSYC2020 students) or do we just skip this question?

My answer:

Yes, it’s unfortunate that until very recently we didn’t realize we need to report effects and do power-analyses in science, so most of the old articles do not have that information, making our replication job difficult. Yes, you will need to estimate the effect as best you can given the information reported in the manuscript. Based on that effect-size, you’ll need to perform a power-analysis. This is expected and mandatory for this project. The guidelines provide links to tools and videos explaining how that can be done, and the TAs are there to help you and the groups, so please do make use of both.

The effect size calculators on MAVIS also provide confidence intervals for you to include. I’ll also add additional links to calculators that would help calculate confidence intervals given an effect size and sample size.

I’m not sure what you mean in the last question, but for the projects in PSYC2020 the target sample will be the other students in the PSYC2020 class. It is quite possible that the required sample size you’ll calculate is larger than the number of other students in PSYC2020, and this will need to be discussed in the final report. Just for your general knowledge, the PSYC3052 students are doing replications of the same articles using a much larger sample collected online, so by examining the projects from PSYC2020 and PSYC3052 we’ll be able to compare the two. But this is not relevant for your own project report.

Students further asked:

It did not mention the confidence interval of the original effect, should we simply assume it as 95%? Also, item 11 on the recipe (I know that assumptions…), how should we respond to this item?

To which I answered:

If CIs are provided/reported and the article doesn’t indicate which CIs (95%/80%), then it’s probably 95%. If CIs are not provided at all, you’ll need to calculate 95% CIs and report them together with the effect-size.

The students' question:

For the analysis, we were wondering what exactly is meant by the comparison study and what we have to refer to in regards to that section?

My answer:

A comparison study is a study with no experimental manipulation: it has neither a within-subject nor a between-subject design. In these studies, participants simply read one scenario and answer a question, often a choice between two or more options. In such cases, the way we evaluate the effect size, and whether participants’ answers are surprising, is by comparing the counts, proportions, or means of the answers to that scenario to what we would normally expect by random chance. So, for example, if we have two options and participants answer randomly, we would expect a 50%-50% ratio between the two choices. If we observe a different ratio, we compare that ratio to a 50-50 split using one of the following statistical tests:

  • If it’s counts/proportions: Chi-square or binomial z.
  • If it’s a mean: A one sample t-test.
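As a rough illustration of the comparison to chance described above, here is a minimal R sketch. All numbers are made up for illustration (40 of 60 hypothetical participants choosing option A, and ten hypothetical ratings on a 9-point scale); none of them come from the assigned articles.

```r
# Hypothetical data: 40 of 60 participants chose option A.
chose_a <- 40
n       <- 60
p0      <- 0.5                      # proportion expected by random chance

# Binomial z: observed proportion compared to the chance proportion.
p_hat <- chose_a / n
z     <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Chi-square goodness-of-fit on the counts (equals z^2 with two options).
chi <- chisq.test(c(chose_a, n - chose_a), p = c(p0, 1 - p0))

# Exact binomial test as a cross-check.
exact <- binom.test(chose_a, n, p = p0)

# Hypothetical ratings on a 1-9 scale, compared to the midpoint of 5.
ratings <- c(6, 7, 5, 8, 6, 7, 4, 6, 7, 6)
tt <- t.test(ratings, mu = 5)

print(c(z = z, chi_square = unname(chi$statistic), p_exact = exact$p.value))
print(tt)
```

With two options, the chi-square statistic and the squared z describe the same comparison, so reporting either is consistent with the other.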

As an example, you can see one of my former students’ example on the exceptionality effect, either Experiment 1 or Experiment 2, in the Dropbox under directory: “Dropbox\Replications\Examples of replications\Former students\Exceptionality effect - BEST EXAMPLE\Manuscript”.

Here's a good example from a question another student asked:

could I ask how I can compute the subject size for the Cohen's D effect size if there isn't a comparison condition in my study? There seems to be only 1 group of Stanford students doing the questionnaire. However, I am not sure if the second self-assessment they were asked to do to assess the accuracy of their previous self-ratings is considered as a comparison condition (Sheet 2-calculating cohen's D from t-tests when you don't have standard deviations or standard errors)

My answer:

I’m guessing you’re referring to Pronin et al. 2002 PSPB #2: Actor-observer bias, and in there what I wrote was:

Statistical analysis: One sample t-test (comparing to 5 the midpoint on scale)

More than that, in the article they wrote:

“Overall, participants claimed both to possess more of the positive characteristics listed (M = 6.44), t(78) = 14.19, p < .001, and less of the negative ones (M = 3.64), t(78) = 10.94, p < .001, than the average Stanford student (designated by the midpoint of 5 on the relevant 9-point scales).”

So, we’re talking about two statistical tests of two answers using a one-sample t-test. The way you do this with no comparison group is to compare the answer to the midpoint of 5.
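For a one-sample t-test like this, a common conversion is d = t / sqrt(N), with N = df + 1. The short R sketch below applies it to the two t-values from the quoted passage. The helper name `d_one_sample` is just for illustration, and the reported t-values are treated as absolute values (the second mean is below the midpoint, so its signed d would be negative).

```r
# Cohen's d for a one-sample t-test: d = t / sqrt(N), where N = df + 1.
d_one_sample <- function(t, df) {
  n <- df + 1
  t / sqrt(n)
}

# Values as reported in the quoted passage: t(78) = 14.19 and t(78) = 10.94.
d_positive <- d_one_sample(14.19, 78)
d_negative <- d_one_sample(10.94, 78)

print(round(c(positive = d_positive, negative = d_negative), 2))
```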

You can use the following online calculator to aid you in calculating the effects using these statistics:

One more question:

not too sure which values to use to calculate effect size when I only have a t-statistic DF and p-value

To which I answered:

The t-statistic together with its degrees of freedom (or the p-value together with the sample size) should be enough to deduce the effect size. If I remember correctly, Royzman & Baron's 2002 had t-statistics and p-values. There are a few options. For example: in MAVIS, the “p-values to effect size” option in the effect size calculator can convert sample size + p-values to effect size for t-tests. Another option is an Excel file in the Dropbox for cases with lacking info: “Calculating Cohen D with lacking info.xls” in Dropbox\Effect size resources\Computing effect-size
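If you prefer to see the arithmetic, here is a minimal R sketch of the same idea for two independent groups. The reported values (t(58) = 2.10, two groups of 30) are hypothetical, and the helper names are only for illustration. It assumes a two-tailed p-value.

```r
# Recover |t| from a two-tailed p-value and the degrees of freedom.
t_from_p <- function(p, df) qt(1 - p / 2, df)

# Cohen's d for two independent groups of sizes n1 and n2.
d_from_t <- function(t, n1, n2) t * sqrt(1 / n1 + 1 / n2)

# Hypothetical report: t(58) = 2.10, two groups of 30.
t_val <- 2.10
d     <- d_from_t(t_val, 30, 30)

# Round trip: the p-value implied by t, and t recovered from that p-value.
p_val  <- 2 * pt(-abs(t_val), df = 58)
t_back <- t_from_p(p_val, 58)

print(c(d = d, p = p_val, t_recovered = t_back))
```

If the article only gives an inequality such as p < .05, the recovered t is a lower bound, so the resulting d is a conservative estimate and should be described as such in the report.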

The students asked:

May I ask for the meanings of the following terms with respect to the article?

  • Comprehension checks
  • Attention checks
  • Manipulation checks

My answer:

Not all studies have these. Some only have some of those. You'll need to figure out from the article text and supplementary file (if exists) if there were any, and which ones they are. This is a basic requirement of reading and understanding an experiment.

  • Comprehension checks are questions used to test the participants' comprehension (understanding) of the scenarios or experimental manipulation.
  • Attention checks are questions used to test that participants are really reading and paying attention to the questions.
  • Manipulation checks are questions used to test that the manipulation worked in the expected way.

Suggested readings:

The students asked:

I would also like to clarify: under Effect size calculations for the article analysis, it is written “For ANOVAs report both F for the overall effect and Cohen's d for the comparison of the different conditions”. Does this mean that for effect size, with three conditions, we compare each pair for effect size? If so, do we do the same for power analyses, or do we calculate a pooled SD to input for ANOVA effect size on GPower?

I am still new to three groups ANOVA, and am therefore a little bit confused and would appreciate the help.

My answer:

Good questions, and the most common answer we love to give in such cases is… it depends. Generally, when in doubt – do as much as possible in order to show that you’ve done the best you can. Not only for this course, but generally in science, so that reviewers, editors, and the scientific community can see that you were rigorous. You would expect researchers to all have easy answers for these, but you’re not alone. This is confusing even to experienced researchers.

In this case, yes, I would ask that you please do effect size calculations for both the ANOVA and the two-group comparisons, which should be easy given that the means and the SDs are provided.

  1. So, first calculate the overall F ANOVA effect size, and yes – you’ll need to input the pooled SD in GPower for that. You can use this tool:
  2. Then, calculate the t-test effect size for the contrasts. Combine the two highest numbers.

Notice any differences? Write those down.
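To make the two steps concrete, here is a minimal R sketch with made-up means, SDs, and group sizes for three conditions. It computes Cohen's f for the overall ANOVA and Cohen's d for each pair of conditions. Using the overall pooled SD for every pair is a simplifying assumption; pooling only the two groups being compared is also common.

```r
# Hypothetical summary statistics for three conditions (equal group sizes).
means <- c(4.0, 5.0, 6.0)
sds   <- c(2.0, 2.0, 2.0)
n     <- c(30, 30, 30)

# Pooled SD across conditions (weighted by degrees of freedom).
sd_pooled <- sqrt(sum((n - 1) * sds^2) / sum(n - 1))

# Cohen's f: SD of the group means around the grand mean, divided by the pooled SD.
grand_mean <- sum(n * means) / sum(n)
sigma_m    <- sqrt(sum(n * (means - grand_mean)^2) / sum(n))
f          <- sigma_m / sd_pooled

# Cohen's d for every pair of conditions, using the same pooled SD.
pairs <- combn(length(means), 2)
d     <- apply(pairs, 2, function(ix) abs(means[ix[1]] - means[ix[2]]) / sd_pooled)
names(d) <- apply(pairs, 2, paste, collapse = "-")

print(round(c(f = f, d), 3))
```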

Other suggested readings:

A followup question:

I have another question to ask: for the t-test comparisons between each group, do I need to calculate the power of the t-test between each, or would this not be valid? Or is this no longer needed as power has already been conducted for the ANOVA.

If power analyses of the t-tests are needed, would Type I error corrections (e.g. Bonferroni correction) be needed for three groups?

To which I answered:

For highest accuracy and to make sure we’re well-powered to detect effects, yes, please calculate the effects for the contrasts between each condition and calculate required sample size.

About Bonferroni: as you pointed out, this is a correction to the Type I error rate used when judging p-values. Effect sizes do not depend on p-values at all, so there is nothing to correct there. Power calculations do take an alpha level as an input, but for this project there is no need to apply a correction to it.

Students asked:

The article reports of three F-statistics. One on the difference of three means, and two other on 2 means comparison (Pc-Pf vs. NoPc-Pf and NoPc-Pf vs. NoPc-NoPf). Do I need to conduct the power analysis for all three f-tests (or two of them since one is insignificant)?

on the Articles to Replicate doc it says, Statistical analysis: ANOVA with t-test contrasts and/or post-hoc comparisons

I answered:

Let's take one ANOVA (F-test) with three t-test comparisons. You’ll need to calculate and report the effect-size for all contrasts, but the power analysis for calculating the sample we’ll need should be about the main effect of interest to the article. In this case, hoping I recall correctly, you’ll need to do the power analysis regarding the personal force main effect (and not about physical contact). Compare that to the needed sample size from the ANOVA analysis.

Students asked:

May I ask if there is a way to know how many participants were in each condition, as (to my understanding) it would be needed for effect size and power analyses?

My answer:

About your other question. It’s unfortunate that up till recently we haven’t been reporting our data and statistics carefully, so these older articles do not do a good job and much information is missing. When information is missing we do what we can given the missing info and use estimates, but we need to carefully explain these estimates and what we did. In this case, since it’s 91 participants and we assume random assignment between three conditions, you can assume a 31-30-30 split; just be very clear in your report about what you did and why.

Students asked:

1. For the FSP ethical approval application form, under the “funding” section, which funding source should be checked?

2. Question 11 in the replication recipe “I know that assumptions (e.g.: about the meaning of the stimuli) in the original study will also hold in my replication because”, does that mean I have to explain the steps that have been taken to ensure the replication follows exactly the methods of the original study and justify any modifications made?

I answered:

1. I added an example I’m doing with another student under “Dropbox\Ethics applications\Forms\Examples from HKU students” filename “Gilad Feldman Status Quo Bias Study Ethical Approval Application Form.doc”. I’ll adjust the funding part, for now please write “No funding”.

2. Exactly, well put, thank you. Justify, discuss, or acknowledge as a limitation. I’ll add that to the guidelines: “a. Explain the steps that have been taken to ensure the replication follows exactly the methods of the original study and justify, discuss, or acknowledge as a limitation any modifications made.”.

Students asked:

3. I am responsible for the Arkes and Blumer's study on sunk cost effect. One sample proportion test was employed in experiment 1, is it correct for me to use the function “Proportions (dichotomous)” in the effect size determination program spreadsheet (found in the folder of “effect size resources” in dropbox)?

I answered:

3. There might be an easier option. To calculate the effect-size you can use For power calculations in GPower, use “z-test→Proportions, two independent groups” in which one of the groups is the constant you’re comparing to (1?).

Also, a student pointed out that there's a one-sample proportions test on: if you're only provided with a one-sample proportion compared to a constant (null hypothesis). Thanks, Denise.

Students then asked again:

- In GPower, no matter using “Exact → Proportions: Inequality, two independent groups (Fisher's exact test)” or “z-tests → Proportions: Difference between two independent proportions”, what should be input as “Proportion p1”, “Proportion p2” and “Allocation ratio N2/N1”?

The reason why we are confused about this because GPower doesn't accept any one of the proportions to be “1” but actually our null hypothesis proportion is 100%.

To which I replied:

When you have questions like that, it’s best to play with GPower a bit, look up answers on the net and try it out. In this case, the solution is rather simple. If you enter 1 in p2, and your comparison proportion in p1, you’ll be able to see an effect. The allocation ratio N2/N1 is the ratio we would expect between the two group sizes, and in this case there are no differences, so it’s 1.


One more question:

I am not sure which of the Proportions, 2 independent groups they should use to calculate sample size as there are different ones under 2 independent groups: McNemar, Fisher's exact test and unconditional.

To which I answered:

Please have a look at my previous two answers in the WIKI, a good option is actually the z-test. McNemar is for repeated, so not that one, but the Fisher test is also okay.

Students asked whether they should only conform to the text highlighted in the PDF.

My answer is that I did my best to highlight what's relevant, but I could have been wrong. If something doesn't make sense - ask. For example, after students asked me about interactions, I made it clear that I did not mean for students to reanalyze interactions. Also, if I didn't highlight something, it doesn't mean it's not relevant. Please read it all, and make a judgment of what info you need in order to complete your article analysis.

Students asked (through TA):

This interaction was significant, F (1, 244) = 70.31, p < .0001, and it emerged for all three biases: self-serving, F(1, 147) = 53.95, p < .0001, halo, F (1, 94) = 13.58, p = .0006, and fundamental attribution error, F (1, 94) = 8.07, p = .007.)

I will like to confirm whether the students need to report and analyze this interaction effect in addition to the main effect? I have also noted that in the document Articles to replicate (updated), there are no statistical analysis method provided for the paper Pronin & Kugler 2007 JESP #1 and Pronin & Kugler 2010 PNAS #1: Actor-observer bias, will you be able to add these information, so the students working on these two papers can have more guidance?

My answer:

This should have been brought to my attention directly earlier, since there was a problem with the files in the Dropbox. You and the students are correct, these were not updated, and for some reason some parts disappeared. Perhaps a technical glitch, I'm not sure.

To quickly answer your/their questions:

  1. There is no need to examine interactions in this replication. A two-way interaction goes beyond the scope of this course.
    1. I changed the highlighted section in the PDF.
    2. I updated the “articles we’ll replicate” document.
  2. I added the missing sections about the statistical analysis. Not sure how it disappeared, apologies.

Students asked:

  1. I understand that I should combine study 1 and 1a. For study 1, participants respond to questions either about (a) the self-serving bias or about (b) both the positive halo effect and the fundamental attribution error. For study 1a, participants need to answer the valuation questions about (a) the self-serving bias only. When I design the questionnaire, are the participants answering for (a) the self-serving bias needed to answer the valuation questions in study 1a as well, or all participants (whether answering for (a) self-serving bias, or (b) positive halo effect and fundamental attribution error) are needed to answer the valuation questions in study 1a?
  2. I have seen from the wiki Ans page that it may not be possible to get the original materials from the authors, so for the definitions of some of the bias, do I need to find the definitions myself?

I answered:

  1. Adding on top of a replication should be okay, so we can run the valuation question on all scenarios. That’s what I meant when I asked to combine the two.
  2. I contacted Emily Pronin already but still haven’t received the materials for that study. You can try and find some very brief and clear descriptions online (Wikipedia?), it is okay for you to reuse pre-existing materials for this, there’s no need for you to write this yourself and might help address issues with English and comprehension.

Kawai and students asked how to compute confidence intervals for Cohen's d and for the ANOVA effect size f (F and f are NOT the same: F is the test statistic and f is the effect size, and you need to input f and NOT F into GPower):

the original study did not report the sample standard deviation, only sample mean and t-statistics are reported. Is it still possible for me to calculate the confidence interval for the study?

My answer:

It’s very easy to do confidence interval calculation in R, and I found a place where you can run this online without installing R. Goto: And go to the section: “Try the MBESS package in your browser”

There, you can use the MBESS package to report the effects and confidence intervals for either Cohen’s d or eta squared. There is a very detailed explanation here:

To jump to the bottom line, this is the code:


# load the MBESS package (run install.packages("MBESS") once if it is not installed)
library(MBESS)

print("starting Cohen D analysis")

# this runs the Cohen's d confidence interval calculation
# replace ncp with your t-statistic, n.1 and n.2 with the sample size in each condition, and run.
ci.smd(ncp=2.39, n.1=100, n.2=100, conf.level=0.95)

There's no reason to be scared of this R syntax, you just need to replace the right values in the right place and run, it doesn't get much simpler than that.
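For those curious about what such a calculation involves, below is a minimal base-R sketch of a noncentral-t confidence interval for Cohen's d with two independent groups. It is an illustration under stated assumptions, not a replacement for MBESS: the function name `ci_d_from_t` is hypothetical, and I am assuming (not claiming) that this is the same general approach the package takes, so small differences from its output are possible.

```r
# Confidence interval for Cohen's d (two independent groups) from a t-statistic.
# We search for the noncentrality parameters (ncp) under which the observed t
# sits at the upper and lower tail of the noncentral t distribution.
ci_d_from_t <- function(t, n1, n2, conf.level = 0.95) {
  df    <- n1 + n2 - 2
  alpha <- 1 - conf.level
  scale <- sqrt(1 / n1 + 1 / n2)          # converts t (or ncp) to d
  find_ncp <- function(prob) {
    # Search window of t +/- 5 is an assumption; it is widened if needed.
    uniroot(function(ncp) pt(t, df, ncp) - prob,
            interval = c(t - 5, t + 5), extendInt = "yes", tol = 1e-8)$root
  }
  c(lower = find_ncp(1 - alpha / 2) * scale,
    d     = t * scale,
    upper = find_ncp(alpha / 2) * scale)
}

# Same illustrative inputs as the ci.smd() call above.
res <- ci_d_from_t(2.39, 100, 100)
print(round(res, 3))
```

The point estimate is simply the t-statistic rescaled by the group sizes; only the two limits require the search.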

Students asked:

I used the medcalc website that you sent me to calculate the effect size. However, the results only show the CI, chi-squared, DF and significant level. How can the effect size be determined?

I answered:

The chi-square statistic can be converted to an effect size. To convert chi-square to a more familiar type of effect-size, like z or Cohen’s d, you can use an online calculator, like this:
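The conversion itself is short enough to check by hand. The R sketch below uses standard conversions for a chi-square test with 1 degree of freedom only; the input values (chi-square of 4.00 with N = 100) are hypothetical and the function name is just for illustration.

```r
# Convert a 1-df chi-square statistic to phi (= r), Cohen's d, and z.
chi_to_effect <- function(chi_sq, n) {
  phi <- sqrt(chi_sq / n)               # also Cohen's w for a 1-df test
  d   <- 2 * phi / sqrt(1 - phi^2)      # r-to-d conversion
  z   <- sqrt(chi_sq)                   # for 1 df, chi-square equals z squared
  c(phi = phi, d = d, z = z)
}

# Hypothetical result: chi-square(1) = 4.00 with N = 100.
es <- chi_to_effect(4, 100)
print(round(es, 3))
```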

Students asked:

For power calculation using GPower, there are two tests for two independent groups: Fisher's exact test and unconditional. Which one should I use? I have attached the screenshot of my attempt, I am not sure whether I put the correct numbers in the programme. Also, I am not sure what the parameter that I have circled in red stands for, i.e.: “Calc P2 from/difference P1-P2/ratio P1/P2”.

I answered:

You don’t need to use the “Determine” tab, you can simply enter the proportions and do “Calculate”. There are a few options to try in GPower, another option to try which will give you a similar estimate is this one: , and it will give you the z, which is another type of effect-size.

Another group asked:

All participants were given the same question, and asked to choose A or B. The null hypothesis was that all 61 (100%) participants should choose B. But it turned out that only 28 (45.9%) participants chose B. The other 33 (54.1%) participants chose A and fell for the sunk cost effect.

I noted, from the wiki page you posted today, that I should conduct either a binomial test or a chi-square test. But G*Power does not accept our null hypothesis of 100%. After some googling, it seems to me that a Fisher's exact test should be performed instead (please see the attached screenshot). May I ask if this is correct?

My answer:

See what I posted just now here:

It includes a link to this option: Which is the binomial z test. Both types of tests should give you similar estimates. Either is okay, both are even better (to keep reviewers happy).

So, this also gives you the Z value. The Z value can be converted to other types of effects, using all kinds of calculators like:

Hope that helps. Let me know if there’s anything else that isn’t clear.

What to do when you need to calculate an ANOVA effect size and have no variance/SD/SE. There are ways to do this with R, but Kawai contributed the following tip, which works indirectly through online calculators and might be more accessible to students with no R background:

Daniel Lakens' statistical calculator: Take the converted partial eta-squared (partial η²) and put it in the file “Coster 2012 Converting effect sizes.xls” under Dropbox\Effect size resources\Converting effect size. You then get the f, which you put into GPower.

and a second way:

Use the calculator: And convert the d to the f
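For reference, the conversions behind both routes are simple formulas. Here is a minimal R sketch; the function names and the F(2, 88) = 5.00 example are hypothetical, and the d-to-f shortcut assumes exactly two groups of equal size.

```r
# Partial eta-squared from a reported F-statistic and its degrees of freedom.
eta_sq_from_F <- function(F_value, df1, df2) {
  (F_value * df1) / (F_value * df1 + df2)
}

# Cohen's f (the effect size GPower asks for) from partial eta-squared.
f_from_eta_sq <- function(eta_sq) sqrt(eta_sq / (1 - eta_sq))

# Cohen's f from Cohen's d, valid for two groups of equal size.
f_from_d <- function(d) d / 2

# Hypothetical report: F(2, 88) = 5.00.
eta_sq <- eta_sq_from_F(5, 2, 88)
f      <- f_from_eta_sq(eta_sq)

print(round(c(eta_sq = eta_sq, f = f, f_from_d_0.5 = f_from_d(0.5)), 3))
```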

Some students asked how to determine sample size from degrees of freedom (DFs) which are reported in the statistical tests in parentheses “()”.

To understand how that works, I suggest you take a look at the very friendly resource:
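As a quick reference, the usual rules can be written down in a few lines of R. This is a sketch for the simple designs discussed on this page (it does not cover factorial or repeated-measures ANOVAs), and the helper name `n_from_df` is hypothetical.

```r
# Total sample size implied by the degrees of freedom of common tests.
n_from_df <- function(test = c("one_sample_t", "paired_t",
                               "independent_t", "oneway_anova"),
                      df1, df2 = NULL) {
  test <- match.arg(test)
  switch(test,
    one_sample_t  = df1 + 1,        # t(df): N = df + 1
    paired_t      = df1 + 1,        # N here is the number of pairs
    independent_t = df1 + 2,        # t(df): N = df + 2 across both groups
    oneway_anova  = df1 + df2 + 1)  # F(df1, df2): N = df1 + df2 + 1
}

n_one   <- n_from_df("one_sample_t", 78)    # the t(78) quoted earlier on this page
n_two   <- n_from_df("independent_t", 58)   # hypothetical t(58)
n_anova <- n_from_df("oneway_anova", 2, 88) # hypothetical F(2, 88), three groups

print(c(one_sample = n_one, independent = n_two, anova = n_anova))
```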

Students asked:

Firstly, it is said on the article that there are 20 items asked in the questionnaire but there were only 19 items. This will affect how we calculate statistics. So we would like to ask, if we should go on with 19 items?

Secondly, we only have 1 set of results but there are supposedly 4 conditions. We need at least 2 means and SDs to do a t-test analysis. We are not sure how to proceed with this. But we assumed the effect size is 0.2 (as suggested in the guideline specific to our experiment) and we conducted a power analysis based on that.

My answers:

About the 19 items. What can I say? You’re right. That’s a good catch! This goes to show you how important these replications are, scholars make all sorts of mistakes in these articles, some of which no one notices until we try to replicate. Some just don’t provide much information, which leaves replicators clueless as to how to replicate and what those studies really mean.

In this case, there’s no other choice – in the analysis you’ll need to analyze their results using the 20 items, indicate that there was a missing item, and in the replication we’ll use the 19 items that were provided. Since I know Joshua Knobe quite well, I’ll also write him and see if I can find where the mysterious 20th item disappeared to. I’ll keep you posted.

About your #2 – yes, that’s why I detailed everything in the way that I did in the articles to replicate. Ka Wai brought this to my attention and I therefore clarified it all. Please follow power analyses for an effect of 0.2.

Students asked:

1. In the original experiment, they conducted an unrelated experiment (a computerised Stroop test) before the participants did the questionnaire (p. 103), so we are going to ignore this part? Do we need to acknowledge this in answering the replication recipe #11?

2. For the results section (p. 104), it reads “Agreement did not differ between the conditions with definition (.79) and without definition (.79), between actor perspective (.77) and observer perspective (.80), or between San Jose (.81) and Stanford (.78)”. But I did not quite understand this part, are the figures in parentheses referring to the correlations?

3. Again for the results section on p. 104, the authors wrote they had conducted 3 pairwise t tests - so pairwise t tests = independent samples t test?

To which I answered:

#1: You need to acknowledge that when discussing the differences between our replication and what they reported in the original article. We of course will not be administering a Stroop test to our participants.

#2: If I recall correctly, I think they’re referring to Cronbach's alpha, to check the reliability of answers within each condition. This is to show that the participants in these conditions perceive these items as related (and, one can argue, as having the same meaning).

#3: Yeah, I think so. I believe it’s a fancy way of saying they ran 20 (items) x 3 (comparisons between groups of school, perspective, and definition). You should note that we will not be administering this to two schools, but online to one type of sample, which is another difference between the original article and the replication.

I also asked the TAs to send this to those replicating Malle and Knobe 1997:

I want to clarify the following things:

  • The original design
    • Looking at two factors: 1) actor-observer and 2) with/without definition. This means that there were 4 conditions.
    • They calculated Cronbach's alpha for each condition.
    • The t-tests are contrasting either factor 1 (combining the 4 into only 2 conditions) or factor 2 (combining the 4 into only 2 conditions)
    • They did a series of 20 t-tests for each of the items in the scale.
    • There was a third factor, school, but in our design there is no need to examine differences between schools, since we have one sample, the HKU students.
  • The students should:
    • There is no need to calculate Cronbach's alpha for each condition. Please only calculate Cronbach's alpha on the overall sample.
    • There is no need to calculate correlations.
    • Instead of 20 pairs of t-test for each item in the scale, please:
      • Average all the items together
      • Conduct one t-test for the average of factor 1
      • Conduct one t-test for the average of factor 2
    • This study is confusing because the expected effect size is zero: the scholars found no differences between actor-observer and/or with/without-definition.
    • Therefore, in your effect-size calculations you should say that your hypothesis is that the effect will not be significantly different than zero (meaning, 0 is included in the confidence intervals) and/or that the effect will be lower than weak (Cohen’s d < 0.2)
  • In your power-analyses, you should calculate based on a Cohen’s d of 0.2. Because this is a weak effect requiring a very large sample that I have no resources to get, you should calculate the required sample size assuming power of 80% and not 95%.
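To see why the power level matters so much for a weak effect, here is a minimal R sketch using the base `power.t.test` function. It assumes a two-group, two-sided comparison at alpha = .05, which matches the t-test setup described above; the helper name `n_per_group` is hypothetical, and GPower may differ from this by a participant or so because of rounding.

```r
# Required sample size per group for a two-sample, two-sided t-test.
n_per_group <- function(d, power, sig.level = 0.05) {
  res <- power.t.test(delta = d, sd = 1, power = power, sig.level = sig.level,
                      type = "two.sample", alternative = "two.sided")
  ceiling(res$n)   # round up to whole participants
}

n80 <- n_per_group(0.2, 0.80)
n95 <- n_per_group(0.2, 0.95)

print(c(per_group_80 = n80, total_80 = 2 * n80,
        per_group_95 = n95, total_95 = 2 * n95))
```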

I also added this to the “articles to replicate” document on the Dropbox.

Article analysis versus pre-registration: Which is which?

The students asked:

the article report has procedures, planned sample, exclusion criteria parts. We assumed it was an article analysis and to report on the reported sample, however, according to the sample article reports, they are written as if it was their own planning and their future replication. We're a bit confused as to whether we're writing it as analysis or our plan for research or both combined?

My answer:

#3 – Yes, you’re right. Your article analysis will later be used in the pre-registration form, and the example I put together in haste for you was from a section of a previous pre-registration. The analysis should be of the original article, and the power analysis should use the effect in the original article to determine what sample size we’ll need for our future replication study. In short, there is a strong link between the article analysis and the pre-registration and final report. It needs to be an article analysis with the understanding that it is the basis for your pre-registration and final report.

Students wrote:

is there any hypothesis to fill in the ethical approval application? As regards to the article, it did not have any specific hypothesis. Is the hypothesis base on the result of experiment 1b or just ' The more personal force and personal contact, the less immoral acceptability.'?

To which I answered:

Yes, the hypothesis is about the main finding from the study you’re replicating. Your main aim is to replicate that finding, hence what they found is your hypothesis. Typically, they also hypothesized that before finding it, which is why they conducted the study to begin with. So, you’ll need to be very clear about what the hypothesis is and frame it in a scientific way that would show me you understand it (things like, is it a positive/negative relationship, think correlation/causality).

Students asked:

In our replication recipe, we are required to come up with an effect size to replicate. For our experiment, I think we need to integrate multiple experiments with our replication task.

Shall we replicate the largest f-value (effect size)? And also, should we report all of the confidence intervals across these experiments in the original study? Moreover, for our sample size, should we take the least number of required total sample size or should we report the largest one? (Attached is our table of calculation results. Do we need to screen capture every single calculation involved in this process?)

To which I answered:

You need to aim for a sample that would be able to detect all of your effects with 95% power, which means planning for the smallest/weakest effect size, which in turn means the largest required sample size.

And, yes, please report the CIs for all the effects you calculated.
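
To illustrate why the weakest effect sets the sample size, here is a minimal sketch in base R. The three effect sizes are hypothetical, and it assumes two-sided independent-samples t-tests with alpha = .05; adapt the test type to your own design.

```r
# Hypothetical effect sizes (Cohen's d) for three effects in one target study.
effects <- c(effect_a = 0.80, effect_b = 0.50, effect_c = 0.35)

# Required n per group for each effect: two-sided independent-samples t-test,
# alpha = .05, power = .95. With sd = 1, delta is expressed in Cohen's d units.
n_per_group <- sapply(effects, function(d) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05,
                       power = 0.95, type = "two.sample")$n)
})
print(n_per_group)

# The weakest effect needs the largest sample, so it drives the plan.
target_n_per_group <- max(n_per_group)
weakest_effect <- names(effects)[which.min(effects)]
cat("Plan for", target_n_per_group, "per group, set by", weakest_effect, "\n")
```

A sample planned this way has at least 95% power for every listed effect, since the stronger effects need fewer participants.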

Students spotted an error and asked:

We received a suggestion that we could use the difference from constant test as we couldn't identify a comparison/control group.

We've actually re-interpreted the study design and we wanted to confirm whether our new understanding was right. Would we be going in the right direction by doing that?

To which I answered:

This is a tricky article. Honestly, it’s hard even for experienced researchers to understand what exactly they did in this study. Going back to re-read the article, and having a look at the main “articles to replicate” file, I realized that there is probably a mistake in there. For some reason I can’t retrace, the note and the description in the Word file were copied from Study 3 to Study 2, and they might not apply. My apologies, I’m not sure how this happened. I wish you had brought this to my attention earlier, so we could have remedied this and saved time.

Your question does show that you’ve been thinking about this, and it’s great that you were able to identify that what was written in the document was the wrong analysis. Due to this oversight, if you need more time to correct this before submission, please inform your TA, and I will approve a late submission of up to 3 additional days. I realize this is Chinese New Year, but I’m hoping this will be enough.

There are several ways to approach this. My understanding is that this is on a 1-5 scale:

  1. A is very much wrong
  2. A is a little wrong
  3. Equal
  4. B is a little wrong
  5. B is very much wrong

Since there are 8 scenarios, you could calculate the average of the 8 for each participant, and then run a one-sample t-test comparing that average against the scale midpoint, in this case 3. Regardless, you’ll get feedback on your submission, and if there’s some problem, you would need to correct it before the pre-registration, which is the critical stage.
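
As a rough sketch of that analysis in base R: the ratings below are simulated purely for illustration (40 hypothetical participants), so only the steps matter, not the numbers.

```r
set.seed(1)  # simulated data only, for illustration

# Hypothetical ratings: 40 participants x 8 scenarios on the 1-5 scale.
ratings <- matrix(sample(1:5, 40 * 8, replace = TRUE), nrow = 40, ncol = 8)

# One mean score per participant across the 8 scenarios.
participant_mean <- rowMeans(ratings)

# One-sample t-test against the scale midpoint (3 = "Equal").
result <- t.test(participant_mean, mu = 3)
print(result)

# Cohen's d for a one-sample test: (mean - constant) / SD.
d <- (mean(participant_mean) - 3) / sd(participant_mean)
cat("Cohen's d =", round(d, 3), "\n")
```

With real data you would replace the simulated matrix with one row per participant and one column per scenario.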

Students asked:

According to 2.2 results, subjects rated the morally bad doctor to have ended the patient's life, which significantly more than the morally ambiguous doctor (t(298)=4.3, p<.001). However, the mean score of morally bad case(Do/Allow), M=4.55 is higher than morally ambiguous case (Do/Allow), M=2.99, which are not possible to reach the results in the t-test since with a smaller mean means that the result leans towards doing instead allowing. Are the datas in Table 1 switched between morally bad case and morally ambiguous case? Thank you for your help and wish you a happy Chinese New Year !!!!

Cedar added:

I have also noticed a problem with the graph. For the two bars under Ended/Allowed (on the left), the morally bad bar has is longer than the morally ambiguous bar yet on p.283 at the bottom it says that 1 : ended and 7 : allowed to end so shouldn't the morally bad bar be shorter as it should be closer to 1 since 1 represents the 'ended' end?

To which I answered:

Good question, these articles can get confusing, I agree. It really goes to show you that we usually don’t look at articles in depth and so don’t realize the weaknesses, but there are definitely some problems with clarity.

You’re both right. One of these columns isn’t right, and there’s a misalignment with the table. That’s a good catch, well done. I know Joshua Knobe, and previously talked to Cushman, so I’ll try and get in touch with them and ask.

For now, you can assume that the reporting of the results is as described and that the descriptions, tables and/or figures are messed up. One of the many good reasons for a replication is exactly that, checking these papers.

Student asked:

for the section of “type of study”, I would like to inquire about whether we should list more than one experimental design for each experiment or not, provided that the original study use more than one kind of design for each experiment.

Also, according to the word text about which experiment to replicate, should we just write analysis based on the experiment required to be replicated in this course or should we write analysis for all experiment in the original study.

I answered:

You should only do an analysis of the experiment that has been assigned to you, NOT all the experiments in the article. You should include all the details necessary to show that you’ve analyzed the target study in depth. The focus is the main effect that you’re replicating, so if more than one analysis is important for the main point of the article and the empirical demonstration of the phenomenon, then you’ll need to report all of them. Sometimes the authors included additional analyses that are not directly relevant to the main point, and these can be reported briefly and at least acknowledged in your report without a full analysis.

Student asked:

according to the original study, they have reported result from ANOVA while it is stated that two independent means t-test for different DV's has been conducted according to the file in Dropbox. In this case, I wonder what kind of experimental design should I list out in my article analysis?

My answer:

If you check the latest version in the Dropbox, it says:

  • Statistical analysis: Two independent means t-test for the different DVs. Although the results are reported as an ANOVA, please report a Cohen's d.
  • Note: There is no need to conduct an interaction analysis. This goes beyond the scope of this course.

They conducted a two-way ANOVA to examine the interaction. I do not expect you to do an interaction analysis, that goes beyond the scope of what’s expected from this undergraduate class, but if you wish and you think you can, you may of course add that. What I was asking for is that, although the main effects are reported as an ANOVA F-statistic, you calculate a Cohen’s d as for a two independent samples t-test, which with two groups is equivalent to a one-way ANOVA (F = t²). There are several calculators in the Dropbox you can use for that, such as the Excel file for cases with lacking info: Calculating Cohen D with lacking info.xls in Dropbox\Effect size resources\Computing effect-size.
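
If you prefer to check the calculator by hand, here is a minimal sketch of the standard conversion for a two-group comparison reported as F(1, df). The F value and group sizes are hypothetical, and the formula only applies when the F has 1 numerator degree of freedom and compares two independent groups.

```r
# Hypothetical reported statistics: F(1, 58) = 9.0 with 30 participants per group.
f_value <- 9.0
n1 <- 30
n2 <- 30

# With two groups, F = t^2, so t is the square root of F.
# The sign of t (and d) must be taken from the direction of the reported means.
t_value <- sqrt(f_value)

# Cohen's d for two independent samples from t and the group sizes.
d <- t_value * sqrt(1 / n1 + 1 / n2)
cat("t =", t_value, " d =", round(d, 4), "\n")
```

If the article does not report group sizes, you would need to state and justify an assumption (for example, equal groups) in your report.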

Students wrote:

I have realized that the original author has used the degree of freedom for the error term as 24 when they reported statistic such as following: (a) when thinking about their own futures, they chose that options more often than they chose desirable options alone (M=31%)F(1,24)= 9.74, P=0.005; (b) By contrast, they did not show those pattern of choosing the conjunction more than desirable options alone when judging a peer (M=43%), F(1,24)<1, NS.

Due to these analysis, I was thinking if I have to find all cohen's d to estimate the sample size so to make sure the smallest effect size among them has enough power to be significant, then I need to use repeated ANOVA statistic to calculate their cohen's d.

Accordingly, if i replicate the study using two-independent-sample study to compare self-other difference, then I would need another two-dependent-sample study to compare within their view of their own future and within their view of their peer's future (Or I should do a three-level ANOVA and a post-hoc?).

I replied:

Students wrote:

The reason for using SD is because that I would like to double check on the effect size using G*Power that I have calculated using the cohen's d calculator, but it seems that I could not do so since SD is not provided.

I wrote:

About double-checking: you can double-check with other calculators, there are lots of links and tools provided, and it’s good you’re trying to revalidate. If you can’t find another tool, that’s alright, it’s not necessary for the scope of this project. As long as you clearly explain what you did, with what tool, and add screen captures, it should be alright. And since we have two students on each article, plus a student group from another course, it should hopefully be enough to alert us to possible mistakes when there’s a misalignment.

Students asked:

I would like to ask about the (section C) materials used in experiment. As we have only had the description of materials from the method section of the original article, I wonder if we should include questionnaire like those in the example? If yes, how can we access to the original materials?

I answered:

Good question, that’s tricky, many of those articles do not provide their materials. When we don’t have the original materials, we need to try and build those ourselves based on the closest estimate and all the details that are provided. It’s not ideal, but this is what we usually do. Read all the sections carefully (including at the end of the article, and the supplementary materials, if there are any) and try to rebuild the questionnaire in the best way you can given the information you have.

With that said, for some of the articles I already contacted the authors and asked for the materials, but have not received them yet. If I do receive them in time for our replication, I’ll of course forward them to you. But from my previous experience with replications, it is quite rare to hear back or receive full cooperation on such matters.

Students asked:

In my study, participants were given booklets with descriptions. However, I cannot find these descriptions anywhere. Do you have the descriptions of the original study? If not, can we email the authors?

I answered:

Yeah, tricky stuff, it’s hard to reconstruct old articles, since they share so little about their methods and statistics.

With many of these articles, I already contacted the authors, but have not heard back from most.

I would generally assume with most replication projects that we will not have access to the original materials. Unfortunately, it’s very hard to get original materials, procedures, and data from authors, and that’s part of the tragedy of science, that these things are not open and available together with the article. You should go ahead and try to do the best you can given that you do not have the exact materials.

Students asked:

The article says there were a total of 91 participants, but the df report of ANOVA says (2,87) since the second dF is (N - # of groups), can I assume that 1person was excluded in the final report and that each group had 30 participants each?

I answered:

Yes, you can, it’s great you’re able to use the DFs to deduce that. The most important part about the replication is for you to explain everything very clearly and openly in your report, so do add this to your report (why 90 and not 91).
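
The deduction can be written out as simple arithmetic. This is only a sketch: the equal split across groups is an assumption that should be stated as such in the report, since the degrees of freedom reveal the total but not the per-group sizes.

```r
n_reported <- 91  # participants mentioned in the article
df_error   <- 87  # second df in the reported F(2, 87)
k_groups   <- 3   # first df + 1

# For a one-way between-subjects ANOVA, df_error = N - number of groups.
n_analyzed <- df_error + k_groups        # 90
n_excluded <- n_reported - n_analyzed    # 1

# Per-group size, ASSUMING the groups were equal in size.
n_per_group_if_equal <- n_analyzed / k_groups  # 30

cat("Analyzed:", n_analyzed, " Excluded:", n_excluded,
    " Per group (if equal):", n_per_group_if_equal, "\n")
```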

Here's what I answered:

Since you seem to be tackling R, I’ll offer you a solution using a different R package, called psych. Goto , goto the “Try the psych package in your browser” section and enter the following code (based on the stats you gave me):


The result is:

         lower    effect    upper
[1,] 0.0908631 0.7045434 1.310351

Here's what I answered:

Since you seem to be tackling R, I’ll offer you a solution using a different R package, called psych. Goto , goto the “Try the psych package in your browser” section and enter the following code (based on the stats you gave me):

library(psych)
cohen.d.ci(d = 0.2, n1 = 32, n2 = 72, alpha = .05)

The result is:

         lower effect     upper
[1,] -0.220938    0.2 0.6177689

Many students seem to have trouble computing this.

All of this can be done using the R package psych, by entering the following into the “Try the psych package in your browser” box and pressing “Run” (Ctrl-Enter).


For one sample t-test with a sample size of 43 and Cohen's d of 0.7 the command is:

library(psych)
cohen.d.ci(d = 0.7, n = 43, n2 = NULL, n1 = NULL, alpha = .05)

The result:

          lower effect    upper
[1,] 0.07941619    0.7 1.312562

For two independent samples t-test with a sample size of 43 in each condition and Cohen's d of 0.7 the command is:

library(psych)
cohen.d.ci(d = 0.7, n = NULL, n2 = 43, n1 = 43, alpha = .05)

The result:

         lower effect    upper
[1,] 0.2480199    0.7 1.144558

Can you see the difference? Although the effect size is the same, the two-sample design has twice the total sample, and therefore the confidence interval is narrower.

I wrote:

If you have the effect size, you don’t need to recalculate it using GPower (so skip the “determine” part), just enter your effect size and alpha, and run “calculate”. Something like this (assuming your calculations are correct, I didn’t verify those):

Students asked:

In my classic experiment, the researchers transformed the data because there were large differences in the variances across the conditions. Should I use the data after the logarithmic transformations to calculate the effect sizes and sample sizes? There was a test that was significant for raw data but insignificant for the transformed data.

I answered:

Ideally, I would like you to do both, but when raw values/stats are not available or there are too many analyses and calculations, then please focus on what’s reported - follow and do your analysis based on what’s provided (the transformed values).

Data collection in a different country from target article

Students asked:

1. From your blog, I learnt that MTurk allows us to specify which countries I would like to collect from. As my original experiment recruited undergraduate students in Hong Kong, is my replication also going to recruit Hong Kong people via MTurk?

2. If not Hong Kong people, does that mean I cannot replicate the survey questions in the original article that asked questions based on Hong Kong and I have to revise the survey questions to fit in the context of US?

I answered (for PSYC3052 running data collection on MTurk):

  1. Yes, indeed, we’ll be running this with American participants only. There aren’t enough HK people working on MTurk.
  2. Exactly. You’ll have to make adjustments to make this replication relevant for the American sample. It will be in English, and might require some cultural translations (like replacing “maan” dollars). Do the best you can to adjust that, and I’ll provide feedback if I think there’s a problem.
  3. As I wrote in the original “articles to replicate”:
    1. Note: This was originally administered in Chinese and using “maan” dollars. We will be trying to replicate this in English.

Students asked:

  1. In the original study, in order to minimize the possibility that prior knowledge would dilute the effect size, there was a pilot study to test the participants’ knowledge about the cost of a double decker (the numerical question for dependent measure). If I adjust the questions according to my own decision, will my replication not be trustworthy enough as I might not know how well American people know about certain facts?
  2. In original, some participants gave wrong responses but the authors did not mention whether the wrong responses mattered. Does it matter that I should try to design a comparative question that is familiar to American people and a numerical question that is not familiar?

I answered (for PSYC3052 running data collection on MTurk):

These are all good questions that should be mentioned and elaborated on in your replication report. Now you can see that replicators need to make all kinds of decisions and adjustments. It’s very hard to do a good replication.

In general, I would expect that you adjust these to a context relevant to Americans. This could be the JFK runway instead of Hong Kong. The location could be asking which city it is or isn’t located at.

Regarding familiarity with cost, you should do something to address this issue, and it could be by adding another measure to test familiarity.

We won’t be able to run a pilot study. Do the best you can, and try to use a question that participants cannot simply Google (try to Google it yourself to see if it’s too easy), but that they will be able to assess based on their intuitions. The TA and I will go over this.

What I suggest here, because these questions are VERY short, is that for the dependent variable you include both the double-decker and your other suggested DV that’s more suitable for the American context. We can then compare the two.

In any case, you’ll need to discuss and provide enough details about this in the pre-registration to explain why you did what you did and how it differs from the original study.

Great question about what to do with those who answered incorrectly. Looking at the stats it doesn’t look like they excluded those, but rather just reported this for the readers’ interpretations.

What you wrote would be a very interesting “additional analysis” that could be added to the pre-registration, to examine what happens when you exclude those.

Students asked:

1. Since data is collected through Amazon MTurk, do I have to change “Harvard students” to “University of Hong Kong students” for the questionnaire?

2. In addition to that, is it possible to find undergraduate students only as participants through Amazon MTurk, so as to follow the original sample as closely as possible?

I answered:

  1. You’re right, this was meant for the PSYC2020 course, which is running groups replicating these on HKU students. Thanks for showing me this so I can correct it. I changed it in the “articles to replicate” file to:
    1. “Note: Combine the two questions from 1 and 1a. Replace “Harvard student” with “University of Hong Kong student” (PSYC2020) or “MTurk workers” (PSYC3052).”
  2. I wish we could, but that’s going to be a bit tricky for us to do here, so the general answer is going to be a “no”. We’ll run this with the general population on MTurk, and you’ll need to acknowledge this as one of the differences between the original and the replication.

Students asked:

I recognize that we need to state our hypothesis, purpose or the variables in several documents throughout the preregistration and replication process, including in analysis report, ethic request application, preregistration etc. In this case, are we supposed to rephrase those eg. hypothesis everytime with the concern that it will be problematic if there is some softwares (eg. turnitin) detect our repetitions and claim it as a kind of plagiarism (self-plagiarism, to be specific).

My answer:

I just checked and it’s written in various places in the replication guidelines. For example - “you can copy-paste from other sections, but you will need to address this format, since this is the current academic template for explaining replications”

So, the answer is no, you don’t need to rephrase, and yes, it’s allowed to copy-paste. At the end, the pre-registration template, the replication recipe, the Qualtrics, it will all be one large submission, we just structured it in layers so you can get feedback on each stage as you make progress. I do not want you to spend valuable time having to rephrase things, it’s hard enough to think how to get it right one time. Copy paste in such documents is expected and needed. However, very important, you should only copy paste from your own work in this project, not from other projects you did, and not from others.

Students asked:

Very sorry to bother on a Friday night and over the weekend, but our group from PSYC2020 were wondering if it's at all necessary to tweak the template for the Consent form and Debriefing section of our survey to the experiment we are conducting?

Please note, this is different for PSYC2020 and PSYC3052, the following answer only holds for PSYC2020:

Good that you asked, I realize now I should add this to the guidelines and make it clearer. For PSYC2020, I’ll combine all of your experiments into one long survey. Since you’re the participants, there is no need to modify the intro/consent and debriefing.

Some students asked about conducting an effect-size calculation and power analysis for repeated-measures within-subject design:

A few options:

  1. You can also use GPower “t-test→Means: difference between dependent means”.
  2. Daniel Lakens comes to our rescue once again with a wonderful guide and very easy Excel spreadsheets you can use: “Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs”.
    1. You don’t need all the statistics, just this: “The spreadsheet that you can use to calculate effect sizes can be downloaded from:”. It was also added to the Dropbox under effect-size calculations.

To simplify your lives, I’m adding the following example, using the wonderful R package powerAnalysis.

If you go to the package’s web page and go to “Try the powerAnalysis package in your browser”, you can enter code. The code you are expected to enter for paired analyses converted from the t statistic is (let's say your t is 4.64 and your df is 28):


ES.t.paired(t = 4.64, df = 28)

The result is:

     effect size (Cohen's d) of paired two-sample t test 

              d = 0.8768776
    alternative = two.sided

NOTE: The alternative hypothesis is md != 0
small effect size:  d = 0.2
medium effect size: d = 0.5
large effect size:  d = 0.8
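
As a sanity check that needs no package, the d reported above appears to match the simple conversion d = t / sqrt(df). The following base R sketch reproduces the number; treating this as the package's exact formula is an inference from the output, not something taken from its documentation.

```r
t_value  <- 4.64
df_value <- 28

# Paired-samples conversion that reproduces the output above: d = t / sqrt(df).
d_from_t <- t_value / sqrt(df_value)
print(d_from_t)
```

Other sources define the paired d slightly differently (for example, dividing by the square root of n rather than df), so state which conversion you used in your report.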

Students asked:

For the sample size, it is recommended to have 2.5 times more than the number of participants of the original study. However, according to the result of power analysis, the sample size recommended from the smallest effect is just 84 while the original study has 50 participants. In this case, should I follow the 2.5x rules or go for whichever has the larger sample size?

I answered:

About the 2.5x rule: this is a recommendation from Simonsohn, a rule of thumb for achieving higher power based on detectable effects. Since we’re aiming for power of 95%, if you did your calculations right, the sample size from your power analysis would be good enough. Furthermore, in practice, I aim to combine several experiments from PSYC3052 together and run all of them with a larger-than-needed sample size, set by the largest sample size needed across the combined experiments. I’ll add this line to your pre-registrations before I submit those, so you don’t need to worry about that.

Both your TA and I will go over your power calculations and provide feedback, and in PSYC3052 you’ll also receive a peer review of your pre-registration from a fellow classmate. So, you’ll have at least 3 people looking at your materials to try and catch errors, and when we submit this for the pre-registration challenge the OSF people will also take a brief look to see that it generally makes sense.

Students asked:

In the original study, Malle & Knobe set the # of participants in the actor condition to be 32, and the # of participants in the observer condition to be 72, should I follow this proportion? Or I can just choose to evenly present the conditions, i.e. to be half half? (This is what I have set on Qualtrics now)

I replied:

Yeah, that’s curious, I don’t know why they did that.

Our sample size will be based on your power analysis (if I remember correctly, to address the d = 0.2 weak effect), and we will randomly assign them to one of the conditions evenly balanced.
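
Conceptually, evenly balanced random assignment looks like the base R sketch below. In practice the survey platform's randomizer does this for you; the total sample size here is hypothetical and the condition labels are only for illustration.

```r
set.seed(2020)  # only to make the illustration reproducible

n_total <- 104                        # hypothetical total sample size
conditions <- c("actor", "observer")  # illustrative condition labels

# Build an equal number of slots per condition, then shuffle their order.
assignment <- sample(rep(conditions, length.out = n_total))

# Each condition ends up with the same number of participants.
print(table(assignment))
```

Compared with the original 32/72 split, equal groups give the most power for a fixed total sample in a two-group comparison.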

Students asked:

For the working definition of intentionality, should I add a comprehension check for it to make sure participants having read the definition in the way presented in the attached? Is this considered to be “different” from the original study procedures?

Please note that my next answer is especially relevant for PSYC3052 where we run things on MTurk with online workers:

Good question, I think such adjustments are good practice when dealing with MTurk workers. I suggest you add this and then discuss it as one of the adjustments made to address our target sample and the differences between the original and the replication.

Students asked:

the questionnaire was mainly designed for students (as half of the participants were ask to think in the case of their roommates). As our participants will be recruited through M Turk, I wonder if it's possible to set requirement to target on undergraduates student only (and mention in in exclusion criteria) or I should create a questionnaire that are more applicable to general public?

I answered:

Generally, the answer is no, we will not be able to restrict the sample to undergraduates on MTurk, but I can try and ask for relatively young participants.

For example, in your Pronin & Kugler replication, you should try to adapt the questions to match the MTurk sample. They could compare themselves to other MTurkers and write about their choice of occupation, their online work on MTurk, etc. But of course it would be best to stick to the original questions as much as possible. Do the best you can, we’ll give you feedback on that afterwards.

Students asked:

there are some questions asking about the location and the setting of experiment, if it is not mentioned in the original study, is there any way that I can address this problem to answer the questions?

I answered:

It’s okay to write that you don’t know, but also add that it’s probably safe to assume that if ran on undergrads it’s usually the first author’s affiliated university. In this case for Pronin and Kugler, they wrote a bit at the end, and I think it’s clear Emily Pronin did things in Princeton in her lab. Obviously, this is different than online on MTurk.

Some of those replicating Experiment 1 and 4 were wondering whether to adhere to the following sentence at the beginning:

No subject responded to more than one question.

I'll accept both, but if you ask me what I was aiming for, it was that each participant would answer both Experiment 1 and Experiment 4, with the display order of the two experiments counterbalanced, thereby controlling for whether one had an effect on the other. To aim for the ideal, if I were to run it, I would actually randomly assign each participant to one of the following four:

  1. Experiment 1 alone
  2. Experiment 4 alone
  3. Experiment 4 then 1
  4. Experiment 1 then 4

But that's not a must for this course. If you do, that's beyond the call of duty, and will be appreciated by myself and science :)

I asked to combine 1 and 1a, students asked me through the TA whether they should perform the same self-other analysis in 1a as they are doing in 1.

My answer is yes, please do. It is the main point of the whole article, and therefore important for the replication.

I was asked:

I would like to ask if the power analysis is for calculating the required sample size only or students also need to calculate the achieved power of the main effect found in the original article?

I answered:

No need to calculate achieved power (posthoc), only need to do the a-priori power analysis for determining the target sample size.

I keep getting simple questions on doing basic things in Qualtrics. Qualtrics is one of the most intuitive, well-documented, and well-supported survey tools available, and you should be able to find solutions to everything you're asking by doing one of the following:

  • Look it up on Qualtrics support
  • Look at the Dropbox Qualtrics and MTurk folder.
  • Look at the examples provided (QSF files) from other experiments conducted by my previous students in the Dropbox. Closely examine their design, survey flow, etc.

Only once you've done your best to look up an answer and still can't figure it out, then, by order:

  • Contact your TA
  • Email me.

To be clear, this is about simple things on Qualtrics. If you have a specific advanced question about implementation for your survey, email me.

You should note, BTW, that Qualtrics has amazing support, and they are very responsive. They have never failed me.

I am trying to open up the sample Qualtrics surveys on Dropbox for reference. However, I am unable to view it because it is a .qsf file.

Could you advise me on how I can download/preview it? (Do I have to link it to my Qualtrics account in order to see it?)

I answered:

I generally expect students to be able to solve this sort of issue on their own. This is such a simple question that a 1-second Google lookup can resolve it.

Qualtrics has terrific guides:

Students asked through the TA:

In Royzman and Baron's study 3, it says that subjects could not respond “indifferent” more than once. May I ask how I can set the survey in a way that subjects can only choose an option once?

I didn't know the answer, so I contacted Qualtrics, who answered very quickly and efficiently:

Thanks for reaching out!

I've created a short video that outlines how you can ensure that respondents can only select one option one time.

If the link becomes inaccessible, I added their video to the Dropbox: Dropbox\Quatrics and MTurk\Videos help

More questions from a very proactive group of students:

The first problem: There are three scenarios mentioned in Study 3, one was illustrated as the example, the other two were mentioned at the end. So can we use the example one about the forest?

The second problem: As in P.176, “Within the three questions concerning omission bias, they subjects were not allowed to respond more favorably when the outcome was worse”, that is to say, the possible ways of answering the three questions within one condition are as follows:

So our question is, should we set constraints for this in the same way as for the indifferent option, or we can simply inform the participants about it at the beginning of the survey?

My answers:

About your first question. Great, yes, I somehow missed that when I was looking whether to replicate this or not. Please do. If they had three, please do the three. The more we can do here, the better.

About your second question, my answer is that you should try to implement that in Qualtrics. Forcing participants to respond in the right way is far better than asking them to. Participants, even our wonderful students, often fail to read and/or follow instructions.

In this case, yes, you can use the same trick as in the “indifferent” video you received from Qualtrics, plus another trick. When they answer “No” to the first question, you can skip the next two questions, because those have to be “No”. When they answer “indifferent” to the first question and “No” to the second, you can skip the third. You can achieve both in Qualtrics using question-level “Display logic”. BUT, you’ll need to note and remember that in your data analysis a “No” on the first question means you need to set questions 2 and 3 to “No” as well, etc.
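
To make the data-analysis reminder concrete, here is a minimal base R sketch of the recoding. The column names (q1 to q3), the response labels, and the four example rows are all hypothetical; skipped questions are assumed to arrive as missing values (NA) in the exported data.

```r
# Hypothetical export: NA marks a question that display logic skipped.
responses <- data.frame(
  q1 = c("Yes", "No", "Indifferent", "Yes"),
  q2 = c("Yes", NA,   "No",          "Indifferent"),
  q3 = c("No",  NA,   NA,            "No"),
  stringsAsFactors = FALSE
)

# Rule 1: "No" on question 1 means questions 2 and 3 were skipped and count as "No".
no_on_q1 <- responses$q1 %in% "No"
responses$q2[no_on_q1 & is.na(responses$q2)] <- "No"
responses$q3[no_on_q1 & is.na(responses$q3)] <- "No"

# Rule 2: "Indifferent" on question 1 and "No" on question 2 means question 3 counts as "No".
indiff_then_no <- responses$q1 %in% "Indifferent" & responses$q2 %in% "No"
responses$q3[indiff_then_no & is.na(responses$q3)] <- "No"

print(responses)
```

Using `%in%` rather than `==` keeps the comparisons from returning NA for skipped answers, which would otherwise break the assignment.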

Another question:

we are not sure how to ask the questions for the third scenario, the infected children one. Because in the example, the outcomes can be less harmful, equally harmful, more harmful based on different options, this perfectly works for the second scenario, the aircraft one too. But in the infected children case, the outcome can only be better if the participants choose to act (either indirectly or directly), so there is no more harmful case. The worst situation is the omission, in which more children than the currently infected ones will die. So we cannot compose 3 questions for each condition (indirect action and direct action), but only 2.

I answered:

We should do the best we can with what we have…

The issue here in this scenario is that we don’t know how many are affected now and we don’t know how much can be affected if we do nothing, right? So, in order to make this equivalent to scenarios 1 and 2 we need to make those odds clear in the question.

For example, a vague version would be:

  1. Would you do this if this action would cut the number of expected deaths in half? (Yes No I would be completely indifferent)
  2. Would you do this if this action would lead to the same number of expected deaths? (Yes No I would be completely indifferent)
  3. Would you do this if this action would double the number of expected deaths? (Yes No I would be completely indifferent)

A clearer one, perhaps, would be

  1. If the number of affected children is half the number of children expected to be infected if not immunized, would you take the action to immunize? (Yes No I would be completely indifferent)
  2. If the number of affected children is the same as the number of children expected to be infected if not immunized, would you take the action to immunize? (Yes No I would be completely indifferent)
  3. If the number of affected children is double the number of children expected to be infected if not immunized, would you take the action to immunize? (Yes No I would be completely indifferent)

I was trying to look up different functions in Qualtrics in order to find a way to randomly assign participants to two different questionnaires.

However, I could not find any relevant way to do so till now. In this case, is it okay to create two questionnaires separately? If I could do so, how can I confirm that participants are randomly assigned to different questionnaires, either about themselves or their best friend, to avoid systematic differences?

My answer:

Everything needs to be done within a single questionnaire. Qualtrics should be able to handle everything. Please see above, a simple Google search will reveal the way.

Ka wai and students asked me how to do the randomization in Experiment 2, showing all scenarios counterbalanced but displaying only one condition for each of the scenarios.

I quickly set up an empty Qualtrics demo, with a top-level randomizer (for the scenarios) counterbalancing 3/3 of lower-level randomizers.

See here:

Email me if you have further questions on a setup like this.
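To make the logic of that setup concrete, here is a rough simulation in Python of what the nested randomizers do for one participant. It is only a sketch of the idea, not Qualtrics itself: the scenario and condition names are placeholders, and the simple random choice below does not reproduce the even-presentation balancing that Qualtrics randomizers offer.

```python
import random

# Placeholder names; replace with the scenarios and conditions of your experiment.
SCENARIOS = ["scenario_1", "scenario_2", "scenario_3"]
CONDITIONS = ["condition_A", "condition_B"]

def assign(rng):
    """Return one participant's plan: every scenario once, one condition each."""
    order = SCENARIOS[:]
    rng.shuffle(order)  # top-level randomizer: show all 3 scenarios in random order
    # lower-level randomizer: pick exactly one condition within each scenario
    return [(scenario, rng.choice(CONDITIONS)) for scenario in order]

for scenario, condition in assign(random.Random()):
    print(scenario, condition)
```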


Should I enable the back button to allow the participants to go back to the previous questions?


In general, as a rule of thumb, you should not, to prevent one answer being contaminated by following answers/manipulations. For the participant, what's done is done. That's one major advantage that an online survey has over pencil and paper (together with many other things, like validation, forced answer, timers, etc.).

Students asked:

In the article, in study 1, one group of students are presented with bias 1 and then asked to answer questions from either the “self” category or “others category”. And another group of students are presented with both bias 2 and 3 and asked to answer questions from the “self” or “others” category.

For our survey, do we do the same as what was done in the article? Or do we present all 3 biases to each student? Alternatively, we discussed a third option which was to present each student with only 1 bias and get them to answer questions either from the “self” or “others” category.

I answered:

Good question to ask, glad you emailed in. I guess you’re referring to this: “They responded to questions either about the self-serving bias (n = 150) or about both the positive halo effect and the fundamental attribution error (n = 97).” I’m not sure why you asked this, but this is a very important question to ask. This is a case where you could make various decisions, but there is a good case to be made for showing all participants all the scenarios, to increase power (sample size) for each of the scenarios (and, not necessary for your project, comparing the effects in the different scenarios). Each participant will do all three scenarios, and in each scenario will be assigned to one of the two conditions – either the self or the other condition.

Given this, what I would recommend is to present all three biases to each of the students in randomized order (see similar example here:

You would need to explain this in detail as a difference between the original article and your replication. Please also insert a quick note that you asked me and this is what I recommended so that the TA and reviewers understand this decision was approved by a professor.

Students asked:

Malle & Knobe (1997)

When I was going through the original article again, I felt like their instructions for participants who were given the definition of intentionality, might not be super clear regarding what to do with the definition - the participants could have simply ignored it when doing the survey. Thus, I added a question for comprehension check. I am now wondering, in the instructions, if I should add further the phrase in bold, “Please look at the 20 statements below. Each statement describes you doing something. With reference to the above definition, your task is to rate whether you would do that intentionally.” (In the original questionnaire, it was written as follows: “Please look at the 20 statements below. Each statement describes you doing something. Your task is to rate whether you would do that intentionally.”

However, I am not sure whether this added instruction would deviate from the original study, because it seems manipulative and explicit on whether the participants are going to use their folk concept of intentionality or follow the definition (although it still might not affect the judgment). On the other hand, if I do not add this phrase (‘with reference to the above definition”), we will not be sure whether the participants have taken this definition into consideration (but in the end still opt for their folk concept, if there are no differences between the two groups - with/without definition).

I replied:

Yes, that makes sense, please make that modification. We aim to replicate the findings, staying as close as possible to the original methods, but sometimes we identify that something in the original methods may have contributed to an issue. Here, your slight suggested modification makes perfect sense, and helps to clarify things for the participants without any apparent implications. The manipulation is meant so that participants will take that reference into account, and that addition helps do that.

A student asked:

Sorry I am getting confused upon discussing the template for the article analysis with my friends - so for the format of the analysis we should follow the pre-registration template by van’t Veer, instead of following the guidelines on p.3 of the “guidelines for replications” (see the attached photo)? Because I thought article analysis is mainly about describing the original article, but for van’t Veer’s pre-registration template, it is for pre-registering our replication project? However, for the exceptionality effect, the student did use the pre-registration template as the format for the article analysis (however his/her article analysis mostly described what would be done in the replication study instead of describing the results of the original article).

For me, I separated the pre-registration (following van’t Veer’s template) from the article analysis (format as in the attached photo) and replication recipe (Brandt), did I do it wrongly?

I answered:

You did okay. If you hadn’t, we would have commented on that. Generally, each of the submissions in this project is supposed to be built on top of previous submissions.

In terms of content, a pre-registration is basically an article analysis + ethics request + Qualtrics + data analysis plan, so content wise the main difference is adding the planned data analysis.

In terms of format, the pre-registration of your replication project should follow the pre-registration template and explicitly include answers to the replication recipe at the end (which can include references back).

The easiest way, perhaps, would have been to follow the pre-registration template format from the beginning and make sure you address each and every bullet in the list provided in the guidelines. By following that list, you made sure that you really answered everything, and some who simply followed the template and ignored the list missed some items. All you now need to do in moving from the article analysis to the pre-registration is adjust it to the template, make sure that it’s written as a planned execution of a replication, and add the missing sections you did not address before (especially planned data analysis).

Another student asked:

I also want to clarify the difference between article analysis and pre-registration. Are these two reports exactly the same? As far as I am concerned, Boley said the preregistration report is only the combination of the article analysis, Qualtrics and replication recipe. That means we don't need to change any information in the article analysis.

I answered:

This is addressed in the replication guidelines and specifically in the WIKI. What you wrote is not complete. The pre-registration is your revised article analysis + planned data analysis + survey design + fuller replication recipe, matched to the pre-registration template format. Please see guidelines and WIKI. You need to address all the comments by the instructor and the TA (and your peer, if it makes sense and is helpful, you’ll need to decide), and it needs to be in the right format addressing all the elements in the pre-registration template (so, it needs to have things like hypotheses etc.).

Students asked:

How can we justify not doing the 20 t-tests and the correlations?

And for the replication recipe #25: I have taken the following steps to test whether the differences listed in the previous item above will influence the outcome of my replication attempt, I wrote in #24 that the difference in age in the participant populations might influence whether I can successfully replicate the results. I’ve been looking through the examples on dropbox, but I am still not sure, what procedures I could take at this stage to test whether this difference will influence the outcome of my replication attempt, do you have any suggestions?

For the hypothesis, since we need to predict the alteration in perspective/definition will affect the ratings of intentionality, do we have to state the relationship (whether it is positively/negatively influencing the DV)? Because from the original study, there wasn’t a significant effect and thus we cannot predict the direction of relationship?

I answered:

If there were no differences found across any items, there is no reason to expect differences in the mean of those items.

About #25, that’s a good question. There are all kinds of things that can be done, but these are a bit more advanced than what I typically expect from an undergraduate class. Since we are not testing these, simply indicate that we will not be testing these. About directionality, good question again.

In your case, you actually need to add two hypotheses, one for a positive effect and one for a negative effect. We will need to show that both these hypotheses are not supported, and that d < 0.2 and d > -0.2.
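One possible way to check the two bounds, sketched in Python: compute an approximate confidence interval around Cohen's d and see whether it falls entirely between -0.2 and 0.2. This is an assumption about how the check could be done, not the procedure required for the course. It uses a common large-sample approximation for the standard error of d rather than the R code referred to elsewhere in this WIKI, and the function names and example numbers are made up.

```python
from math import sqrt

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d (large-sample normal approximation)."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

def within_bounds(d, n1, n2, bound=0.2):
    """True if the whole CI lies between -bound and +bound."""
    lower, upper = d_confidence_interval(d, n1, n2)
    return -bound < lower and upper < bound

# Made-up numbers: the same small effect with a large and a small sample.
print(within_bounds(0.05, 400, 400))
print(within_bounds(0.05, 50, 50))
```

The comparison of the two calls illustrates why sample size matters here: with a small sample the interval is too wide to rule out effects beyond the bounds.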

A student asked:

I am working on Cushman et al. (2008)’s Moral Appraisals Affect Doing/Allowing Judgments which we will have to conduct 2 independent samples t-tests, but I am still confused about the IVs and DVs. Since the original authors conducted t-tests to investigate the effect of moral appraisal on 3 items (i.e. doing/allowing, how morally wrong and attitude towards euthanasia), and correlation tests between doing/allowing and attitude towards euthanasia for both conditions. Could you kindly enlighten me:

1. Is participants' attitude towards euthanasia an IV or DV given that the main-effect of interest is ratings of doing/allowing?

2. If so, then should the two t-tests be conducted on (1) ratings of doing/allowing for both conditions and (2) ratings of morally wrong for both conditions?

I responded:

My initial reaction was to respond that this is an issue you need to tackle on your own. After giving this some thought, I decided to clarify a few basic things, because these are important. However, I do expect that you will be able to tackle such issues on your own in an advanced course (PSYC3052).

Generally, this is a good exercise for you. This is part of the challenge of figuring things out in articles we read. It is a good demonstration of how complicated these articles are, articles that we typically read for class and think we understand, but when you try to analyze those in detail you realize that things are not clear and that even the definitions of IV and DV are suddenly not obvious.

What’s missing for you here is how to identify an IV and a DV. The questions you’re asking show that you’re not clear on that in experimental settings. An IV in an experiment like the one you’re doing is a predicting factor that is manipulated between conditions. A DV is the factor that is predicted, and its measure typically does not vary across the conditions. The analyses are to help assess whether changes in an IV predict (/affect) responses on a DV.

Here, the manipulation is of moral valence, bad versus ambiguous, and that is your IV. The measured predicted factors are the 3 items, and these are the DVs. Now, some of these are meant to examine whether the manipulation worked, and those are called “manipulation checks”. So if the IV manipulated valence, and the DV is for valence, then the purpose of that DV is to examine that the IV manipulation indeed manipulated what we expected it to manipulate.

Now, I’m not sure what you mean by two, and why two and not one or three. If you’re referring to what’s in the “articles to replicate” document, then first – you need to doubt everything I write in there and make up your own mind, and second – I wrote two independent samples t-test meaning two samples not two t-tests. Essentially, you’ll need to run a t-test on each of the DVs of the differences between the two independent samples (bad and ambiguous).
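As a rough illustration of "a t-test on each of the DVs", here is a small Python sketch that computes a pooled-variance independent-samples t statistic for each DV. The DV names and the ratings are made up for the example, and the sketch only returns t and the degrees of freedom; p-values and effect sizes would come from your statistics software.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance (Student) t statistic and df for two independent samples."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

# Made-up ratings: for each DV, (bad condition, ambiguous condition).
ratings = {
    "doing_allowing": ([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]),
    "morally_wrong": ([6, 5, 6, 7, 5], [3, 4, 2, 3, 4]),
    "euthanasia_attitude": ([4, 4, 5, 3, 4], [4, 5, 3, 4, 4]),
}

# One t-test per DV, always comparing the same two independent samples.
for dv, (bad, ambiguous) in ratings.items():
    t, df = independent_t(bad, ambiguous)
    print(f"{dv}: t({df}) = {t:.2f}")
```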

Adjusting analyses from article to replication data analysis

A student asked:

I met a problem with the article analysis: in the Royzman & Baron (2002) paper, the original study used 8 scenarios, therefore the degree of freedom was 7, but we will use 2 of them only, so we need a different t-value for it. I was confused about how to calculate the t-value we need for the article analysis. I asked Boley before but the problem was not solved. I am sorry that I didn't ask you in the last week's class because I was sick. I should be more responsible for the project afterwards.

Also, I want to explain the problem with the ethics form. I was the one who put PSYC2020 in the data collection but it was not the copy and paste problem. As PSYC2020 is also social psychology course, I misunderstood that they were indeed the participant pool. I am sorry that I didn't confirm this.

I replied:

First, Royzman and Baron provided us with the full materials. It’s not a must, since I did indicate 2 scenarios based on what I thought was in the article (there are actually 3 in there, I later realized), but if you want you can see whether you want to incorporate all 8. You’ll need to see how long those are, it might be a bit too long for MTurk.

In case you proceed with the 2 (which is what is expected), then yes, analysis needs to match what is analyzed. To be clear, the article analysis is about the article itself and what’s reported, you don’t need to adjust anything. For the pre-registration and your own replication, you’ll need to adjust the data analysis to whatever it is you’ll be running. Therefore, for the article analysis you’ll need to report the 8 that are in the original article, but for your own pre-registration and data analysis plan you’ll need to adjust the DFs to your analyses.

Student asked:

I am a student in PSYC3052, and am responsible for the article Pronin & Kugler 2007. I have a question about the analysis plan in pre-registration.

As I understand, Study 1a has a 2×2 experimental design and the authors conducted two separate ANOVAs, comparing introspective and behavioural information, one for the self condition and one for the others condition (indicated by blue arrows)

I understand that I should conduct two independent means t-tests for analysis. Should I conduct the t-tests comparing introspective vs behavioural information for each of the two conditions: one for self, one for others (which is similar to what the original authors did); or should I conduct the t-tests comparing self vs others for each of the two conditions: one for introspective information and one for behavioural information (as indicated by red arrows); or both?

I responded:

This is an important question.

When the original article reported an interaction between two variables, it means that we’re interested in comparing the conditions in either one of the two IVs. Therefore, I would urge you to pre-register and finally analyze both of these comparisons.

It’s not a must for you in the analysis, and I have no expectations for you to analyze the effect size and power for an interaction, but it would be great if you could also indicate in your pre-registration to examine the interaction (two-way ANOVA). Perhaps at a later stage I or someone else would be interested in following up on that and checking this interaction, and it’s good to have that preregistered.

So, for example, if you have two IVs:

  1. Self versus other
  2. Introspective information versus behavioral information

Then please do the following

  1. (Extra: Compare self versus other for average (regardless of introspective-behavior conditions). This is the main effect.)
  2. Compare self versus other for introspective.
  3. Compare self versus other for behavioral information.
  4. (Extra: Compare introspective versus behavioral for average (regardless of self-other conditions). This is the main effect.)
  5. Compare introspective versus behavioral for self.
  6. Compare introspective versus behavioral for other.
  7. (Extra: Pre-register examining the interaction)

Extra means: it would be ideal to analyze and report those as well, but these go beyond the call of duty for an undergrad class.
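For any 2×2 design, the four non-extra comparisons in the list above (items 2, 3, 5 and 6) follow the same pattern: compare the levels of one IV within each level of the other IV. A small Python sketch that lists them, with hypothetical variable names:

```python
# The two IVs and their levels, as in the example above.
IVS = {
    "target": ["self", "other"],
    "information": ["introspective", "behavioral"],
}

def simple_effect_contrasts(ivs):
    """List each comparison of one IV's levels within each level of the other IV."""
    (name_a, levels_a), (name_b, levels_b) = ivs.items()
    contrasts = []
    for fixed in levels_b:
        contrasts.append((" vs ".join(levels_a), f"{name_b} = {fixed}"))
    for fixed in levels_a:
        contrasts.append((" vs ".join(levels_b), f"{name_a} = {fixed}"))
    return contrasts

for comparison, within in simple_effect_contrasts(IVS):
    print(f"Compare {comparison} within {within}")
```

Writing the list out this way in the pre-registration makes it easy to verify that no comparison was left out.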

Differences in sample between article and replication, testing and/or excluding?

Students asked:

“I have taken the following steps to test whether the differences listed in the previous item above will influence the outcome of my replication attempt.”

The difference that i have mentioned is the demographic characteristics of the sample, the location of the survey conducted, the remuneration and the sample size (power). For these relatively general differences, is there any way that I can test these differences or they are already justified as they are general?

Second, for the exclusion criteria of the participants, although I have asked a similar questions before, I would like to ask if it is better for me to exclude participants from the age group different from the original study to ensure the replication is as close as the original. I am kind of confused about whether we should conduct replication to support the thesis as a whole that targeted at a wider population or we should conduct it in a way that is just similar to the original study let say the limited age range, occupation(student), the country (US) such and such.

I replied (to a PSYC3052 student running this on MTurk):

So, about #25, that’s a good question. There are all kinds of things that can be done, but these are a bit more advanced than what I typically expect from an undergraduate class. Since we are not testing these, simply indicate that we will not be testing these. But, since you’re asking about differences between ages below, one thing for example that you can do is split the sample into young and old (you need to state how you split those in advance), and then compare the two groups on responses to the DV to test if there are any differences.

About the target sample and the exclusion. We, unfortunately, need to adjust our target sample to MTurk. Our assumption here is that the undergraduates in the original sample are representative of the wider population. Although there are differences, exclusion here is not the answer because we already know the samples are different, and this is the sample we have. One thing you can do, which I mention above, is to test differences between ages. You can pre-register this as an additional analysis. This would go beyond my expectations of you from this course, but would show good studentship and science.
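The age split mentioned above could look roughly like this in Python. The cutoff of 35 and the data are made up; the point is only that the cutoff is fixed in advance (in the pre-registration) and that the two resulting groups are then compared on the DV.

```python
from statistics import mean

AGE_CUTOFF = 35  # hypothetical cutoff; state your own in the pre-registration

def split_by_age(rows, cutoff=AGE_CUTOFF):
    """Split (age, dv) pairs into DV lists for younger and older participants."""
    young = [dv for age, dv in rows if age < cutoff]
    old = [dv for age, dv in rows if age >= cutoff]
    return young, old

# Made-up data: (age, response on the DV).
rows = [(20, 3.0), (50, 5.0), (30, 4.0), (60, 6.0)]
young, old = split_by_age(rows)
print("Mean difference (old - young):", mean(old) - mean(young))
```

The two lists can then go into the same kind of independent-samples comparison used for the main analysis.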

A student had a question regarding a 2×2 mixed-design (in Pronin & Kugler, 2010) combining an IV that was manipulated in a between-subject design (self versus other) and one IV in a within-subject design (past versus future).

I answered:

I think there’s a bit of confusion here.

  • First, in the original article – was there an IV that was analyzed as a within-subject variable? (time, past versus future). If there was, then it needs to be specified.
  • Second, there are a few types of ANOVA, and an ANOVA can be on a between-subject variable, on a within-subject variable (called “repeated measures”), or on both (then it’s a two-way mixed design). Which one was this?
  • Third, you labelled the DFs as between, within, etc. and those might not be the right labels here in the ANOVA analysis. If you don’t know the right label, that’s also acceptable, you can simply write it as they did in their article (F(1, 48)).

May I ask what remuneration will be offered to the respondents on the MTurk? Thank you.


Can’t say at the moment, I might combine a few of these experiments and run them together. You should indicate a “typical” payment in the pre-registration.

Maintaining original article's participant ratio in the replication

Students asked:

I am responsible for the Arkes and Blumer's study on sunk cost effect. For the replication study, I was asked to combine the experiment 1 and 4 together, as participants only need to respond to one question only. The target sample size of experiment 1 and experiment 4 are different, i.e.: target sample size for experiment 1: 24, target sample size for experiment 4: 426 (213 per group).

My question is: is it necessary to follow the sampling ratio (24:213:213) when designing the Qualtrics survey?

I replied:

I’d rather run both parts on all participants, aiming for the largest sample size among the two parts, without taking the sampling ratio into consideration. No need to follow the sampling ratio. Just display part 1 to all participants and in part 2 randomize participants between the conditions (evenly presented).

I responded to students:

Do we need a separate section for addressing replication recipe items 17-29 in the pre-registration report?

[GF>] Yes, please, the replication recipe needs to be separate from the pre-registration template. You can copy-paste things from the pre-registration to the replication recipe, or include references from the replication recipe to the pre-registration (include headings or pages as references).

Do we need to include Reported Statistics and Reported sample (From our article analysis) in our pre-registration report?

[GF>] Yes, please. Everything in the article analysis that does not fit anything in the pre-registration template should be added as a supplement.

May I ask if we need to register our project in OSF, please?

[GF>] I will be pre-registering this, BUT you need to add vital information to your pre-registration for me to do that.

This is the relevant section from the replication guidelines: Pre-registration package

  1. Write the pre-registration package using templates and examples
    1. Each student creates a profile on the Open Science Framework.
    2. Each student creates a Researchgate profile.
    3. Add the student profile links from both OSF and Researchgate to your pre-registration submission.
    4. Create an edit share link to the Google Doc, add the link to the top of the Google Doc, export the Google Doc as a WORD document and submit that to the Moodle.
    5. [ASP only] Provide peer review to your assigned submission.
    6. [ASP only] Respond and revise your submission based on the peer review received
    7. Submit your [ASP: updated] pre-registration [ASP: and response to peer review] to your instructor
  2. Pre-registration [INSTRUCTOR only]:
    1. Instructor will pre-register the experiments on the Open Science Framework website.
  3. Data collection [INSTRUCTOR only]:
    1. FSP: Instructor will combine all the experiments from all groups, with random display order, and prepare a link to be distributed to the students.
    2. ASP: Instructor will collect the data using Amazon Mechanical Turk running on TurkPrime.
  4. [FSP only] Students take part in the class survey, answer all questions seriously and honestly.
  5. Each project will receive the data of the experiment

Would you explain the item B1b in the pre-registration template about the relationship between the IVs and all their levels please.

I responded:

It reads: “Independent variables with all their levels / the relationship between them (e.g., orthogonal, nested).”

It asks whether the conditions are related to one another in some way (and therefore, dependent) or completely separate (and therefore, independent). In all of our designs in this course, the conditions are independent (what here is referred to as “orthogonal”).

May I ask one more question, please? In the pre-registration report, do we need to hand in item1-item16? For item1-16, I mean items in the replication recipe.

I answered:

You’ve addressed 1-16 in the article analysis, and the pre-registration builds on and should contain all the information from the article analysis. So, yes, items 1-16 and 17-29 should all appear in a replication recipe appendix.

I would like to ask about part 1 of the replication recipe items of the pre-registration. I would like to ask whether the meaning of “instruction” is the instruction given by you to our replication study or the instruction given by us to our participant population?

The item I would like to ask is “The similarities/differences in the instructions” in the replication recipe part. Moreover, I have further question on item “The similarities/differences in the measures”, is the item asking for the statistic part of our study?

I answered:

First, in your Dropbox is the "Replication Recipe" article by Brandt et al. (2014, JESP), if you wish to know more about what the replication recipe is about and what it's for.

The replication recipe is about the replication you're planning and how it relates to the original article you're trying to replicate. These items about differences are asking regarding the differences between the original article and your planned replication. The items you're referring to are under the section “Documenting Differences between the Original and Replication Study”.

When it comes to instructions it refers to the instructions you are going to give your participants in your Qualtrics replication (PSYC2020: HKU students; PSYC3052: Amazon MTurk workers), and whether - as far as you know - those instructions are different from the instructions the authors in the original article gave their participants.

Measures refers to the way that the Independent Variable (IV) was manipulated or the way that the Dependent Variable (DV) was measured. Are the questions the same? Are the answer types the same? The scale? Etc.

I included comprehension questions in my questionnaire, which original article did not include. Supposedly, comprehension questions are questions checking whether participants fully understand the described scenario. I am wondering would the participants see those comprehension questions as “leading question” and affect their response. Also, my peer suggested that I should consider the effect caused by the comprehension questions and suggest me to recruit some participants via MTurk to complete the same questionnaire without comprehension questions to figure out whether the comprehension do make a difference in participants' response. Would this suggestion be feasible?

I answered:

Yeah, good questions. I get these questions from my reviewers as well.

My answer is typically to explain that if the comprehension questions are “leading” then the scenario is “leading”. If the participants do not notice and do not pay attention to these crucial details in the scenarios, then what’s the point of the manipulations? So, definitely, the comprehension questions not only test understanding but also focus attention on crucial details in the scenario, and I think that’s absolutely vital for experimental design. Some reviewers were not convinced by my arguments and forced me to rerun experiments without comprehension questions, a bit like your peer reviewer suggested here. In all of my tests so far on MTurk the results were very similar with and without comprehension questions. This suggests that Amazon MTurk workers pay very close attention to details, which is impressive. It might also suggest that for a direct replication we can maybe not use the comprehension checks.

If you ask me for my personal preference, I prefer including those, but I’ll accept either way, of course. I have no expectations of you going beyond the original design and adding things, and I see no reason to overcomplicate your projects here if you do not add comprehension checks.

First, we would like to confirm in which order should we arrange all the materials in the pre-registration Google Doc. Shall it be in the order of 1. revised fixed article analysis –> 2. pre-registration template –> 3. replication recipe #1 - #13, #15 - #29 –> 4. revised fixed Qualtrics WORD export with all options checked?

For the “B. Method – description of essential elements – planned sample” part in the pre-registration template, we are not sure about where and how (via which platform) the data will be collected. Most of the examples on pre-registration stated that MTurk will be used to recruit participants. Are we going to use the same one? Could you please explain what the relationship between Qualtrics and MTurk is?

I replied:

Still confusing, eh? 😊 Thanks for letting me know. Glad you asked.

The main deliverable is the pre-registration. The pre-registration includes/contains the article analysis. Move everything from the article analysis to the pre-registration template (if you haven’t already). If there’s somehow something in the article analysis that doesn’t fit anywhere in the pre-registration template, add that as an appendix. After that add an appendix with the replication recipe, then finally an appendix of the revised Qualtrics WORD export with all options checked.

Qualtrics is the survey software, you’re obviously using that. MTurk is a labour market platform that researchers use to find participants. The participants I recruit on MTurk answer my Qualtrics surveys. However, we are not using MTurk in PSYC2020. In a different advanced course, PSYC3052, we are, so some of my guidelines/WIKI/etc. include reference to MTurk. Again, to be clear, no MTurk in PSYC2020.

Quite a few students asked about the Recommended elements from the Pre-registration template.

I answered:

There’s no need to address the recommended elements, some of these items are more advanced than what I would expect from an undergrad class. However, I would appreciate the students trying to answer as much as they can given their knowledge/skills in statistics and experimental design. I will award bonus points to effort that I believe went above and beyond undergrad course level and showed initiative.

One group asked me what they should put for #29. Since you will be registering on their behalf, I wonder what they should put for that.

I answered:

Good point. I’ll take care of that item.

Things that don’t seem relevant or they’re not sure of, they can clearly indicate that in their submissions. Because I’ll be going over everything in detail, I’ll make adjustments I feel are necessary and fill in gaps.

Kawai has been challenging me to clearly explain the best way to address an ANOVA design with 3 conditions. There have also been some issues with some of the code I added above here to address confidence intervals for an ANOVA.

The ideal would be the following:

  1. For Confidence Intervals (CIs) of the overall effect: Please convert the f statistic (not F ANOVA but Cohen's f from GPower) to Cohen's d statistic (lots of converters online and in the Dropbox) and calculate the Cohen's d confidence intervals using the R code examples above in this WIKI.
  2. Calculate Cohen's d effects for all contrasts between the 3 conditions 1-2 1-3 2-3.
  3. For your power analyses: If there are 3 conditions, there are typically two conditions that represent the best answer to the main effect bias in that article. Please conduct a power analysis on that contrast. If you're not sure which contrast is the most important, or you think more than one is important, then report both or all. Typically, in these articles, the contrasts that are “significant” at p < .05 tend to be the more important ones.
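Step 2 above (Cohen's d for every pairwise contrast) can be sketched as follows. This is a minimal Python illustration rather than the R/JAMOVI workflow used in the course; the ratings are made up, and the function name `cohens_d` is a hypothetical helper that assumes independent groups and a pooled standard deviation.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled SD."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * stdev(a) ** 2 + (n2 - 1) * stdev(b) ** 2) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Made-up ratings for three conditions (illustration only).
conditions = {
    1: [5, 6, 7, 6, 5, 7],
    2: [4, 5, 5, 6, 4, 5],
    3: [3, 4, 4, 5, 3, 4],
}

# One effect size per contrast: 1-2, 1-3, 2-3.
contrasts = {
    (i, j): cohens_d(conditions[i], conditions[j])
    for i, j in combinations(sorted(conditions), 2)
}

for (i, j), d in contrasts.items():
    print(f"condition {i} vs {j}: d = {d:.2f}")
```

Each of these contrasts would still need its own confidence interval, and the contrast that best matches the hypothesis is the one to use for the power analysis.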

Should I input the sample size (n) as 91 or 79 (df+1), because when calculating the cohen's d

I answered:

If you see a difference between the reported sample size and the sample size that you identified based on the DFs, make a clear note of that difference, and calculate based on the sample size from the DFs. Just be sure that you understood that correctly. If you're not sure, you can report both. Sometimes even experienced researchers find it hard to understand these articles.

I am writing to ask about the Appendix part in pre-registration. Since our results include both Cohen's d and power analysis, should we separate the two kinds of results or put them together in one appendix (i.e. two appendixes or one appendix)?

Besides, in the example of “pre-registration of mere-ownership effect”, there is appendix 2 - data analyses protocol for experiment 1, do we need to include this appendix as well? But we don't quite understand what this experiment 1 for. Is it from the original study or your own study?

I answered:

The templates do not go into things with that level of details. All I can say, is that I want it to be very simple to find things in the document, and I want it to have structure that makes sense. Whatever you can do to add structure and clarity would be very much appreciated by myself and the academic community.

Some of you running ANOVA were asked to convert Cohen's f (used in GPower) or the eta-squared to Cohen's d and report the CIs around that d.

Let's take an example: if you calculated a Cohen’s d of 0.5 and the overall sample size is 105, then in R (with the MBESS package) you enter the following:

library(MBESS)
ci.sm(sm=0.5, N=105, conf.level=0.95)

Which would give you:

[1] "The 0.95 confidence limits for the standardized mean are given as:"
[1] 0.2959711

[1] 0.5

[1] 0.7018838
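
As a rough cross-check on output like this, a large-sample normal approximation can be computed by hand. The sketch below is in Python and is not what MBESS does (MBESS uses the noncentral t distribution, so its limits will differ slightly); the function name is hypothetical, and d = 0.5 with N = 105 are taken from the call above.

```python
from math import sqrt
from statistics import NormalDist


def approx_ci_standardized_mean(d, n, conf_level=0.95):
    """Approximate CI for a standardized mean (large-sample normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    se = sqrt(1 / n + d ** 2 / (2 * n))  # approximate standard error of d
    return d - z * se, d + z * se


lower, upper = approx_ci_standardized_mean(0.5, 105)
print(f"d = 0.50, approximate 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the approximation and the MBESS limits are far apart, that usually points to a wrong N or a wrong d being entered.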

For Pronin & Kugler (2010), the main effect of time should be a paired t-test. I can calculate the ESCI of it with the replication study data because the Mean and SD are available. However, in the original study the statistics were reported as F, and although I have found online ESCI calculators for paired-samples t-tests, I could not find one that is based on the F statistic and requires no SD. I would like to ask for your advice on this issue.

I answered:

About the paired. I answered something similar here:

But if I understand correctly, I think you mean that you don’t have the t-statistic, only the F. Generally, there isn’t much difference between the two effect calculations (paired and non-paired), the main difference being that you control for the correlation between the two measures when paired. Since the correlations are not reported, and you do not have the raw data, there’s not much that you can do here. Therefore, you just need to explain that since those are missing, you did the best available estimate, which is the same as the non-paired (independent) samples.
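
A minimal Python sketch of that best available estimate, assuming the reported F has 1 numerator degree of freedom (so t is the square root of F) and treating the two measures as independent groups. The function name and the numbers are hypothetical; note that F loses the sign of the effect, so the direction has to come from the reported means.

```python
from math import sqrt


def d_from_f_independent(f_value, n1, n2):
    """Cohen's d from F(1, df), treating the two sets of scores as independent groups."""
    t = sqrt(f_value)  # only valid when the numerator df is 1
    return t * sqrt(1 / n1 + 1 / n2)


# Hypothetical example: F(1, 98) = 6.25 with 50 scores per condition.
d = d_from_f_independent(6.25, n1=50, n2=50)
print(f"estimated |d| = {d:.2f}")
```

In a paired design both sets of scores come from the same participants, so n1 and n2 would both be the number of participants. Whatever you enter, state clearly in the report that this is an independent-samples estimate because the correlation was not reported.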

Students from PSYC2020 asked:

Would you mind to provide more details of the experiment participation? Are we going to complete the questionnaire during PSYC 2020 class? Will everyone complete the Qualtrics questionnaire in class instead completing it alone? Also, will all the questionnaires from different groups be packed into one questionnaire?

I answered:

Yeah, I only talked about it briefly and some of it is written in the replication guidelines, but good you asked.

I’ll combine all the experiments into one survey. The order of the experiments will be randomized. After each experiment, students will be asked if they are the ones who designed this experiment. In an anonymized survey, this is the only way for us to control who did what. Students will be sent a link to the survey, which I estimate will take 1 hour, and will be asked to do it all in one session by a certain deadline. I will then prepare the datasets for each group from the overall dataset.

From PSYC3052 running on MTurk:

For exclusion criteria, as the survey template will be used, should I list more criteria other than those in the sample from your former student, in order to be as informative as possible and make the later process of excluding participants clearer?

   eg. missing, erroneous, or overly consistent responses (example from template), participant who does not complete survey.

Also, as template will be used, for the rating of self-report English level, should I change the criteria from (self-report<3 to self-report <5) to fit the original scale as I have transformed the original scale.

I answered:

Generally, there’s no need to overdo these. Even what my former students did is not really needed with these MTurk workers, they’re very proficient in English and very serious. I generally recommend minimum exclusions, if any, when doing this on MTurk, there’s just no real need for that. However, you’re free to add and change to whatever you think is reasonable or more suitable for a scale. Just keep in mind, whatever you pre-register you’ll need to analyze later on.

Which procedure - original article or mine?

I was concerning another issue about information that have to be mentioned in the pre-registration. While I was writing the procedure part, I was thinking whether I should describe the procedure in the original article or the procedure that I will be doing. For example, in the original article the author didn't mention about reading the paragraph in Qualtrics. However, we will require our participants to undergo in Qualtrics.

I answered:

You’ll need to do both. It should be very clear what the original article did, what you’re doing, and what the differences – if any - are. You’ll also need to explain any such differences.

My last confusion is about the analysis as suggested in the pre-registration template. In this section, the template suggested that we should describe the analyses that will test each main prediction from the hypotheses section. Based on this instruction, I was wondering whether it would be identical to the design part mentioned in the previous section. In other words, do we need to directly copy-paste all the information from the design section to the “analysis plan” section?

I answered:

Design and planned data analysis are not the same. The first relates to design and procedure, the latter to the statistical analyses on the data you’ll receive. You’ll need to specify with as much detail as possible what tests you’ll run, in what way, and any other relevant information (the assumptions you’ll test, possible exclusions, if relevant, etc.). The best would be to provide R code in the exact commands you’ll run in your analysis, but for our course at this stage you need to include as much detail as possible about your plans for analysis.

For manipulation check, I wonder if it is compulsory to be included in our pre-registration? I have attempted to search for guidance through the internet, but not much information provided online. Thus, I would like to receive your guidance for this section.

I answered:

If you have a manipulation check in your design/Qualtrics, it needs to be pre-registered and you’ll need to explain why it was included and what you’re expecting it will show.

According to my understanding, the simple effect is usually found through post-hoc t-test.

However, for the 2×2 mixed-design ANOVA, the simple effect is reported as an F-statistic. In this case, can I state that the original authors performed a one-way ANOVA for these simple effects?

If the original authors used a one-way ANOVA for these simple effects, are these tests counted as post-hoc tests?

I answered:

The best would be to pre-register both what the article originally did and your own simplified analysis. For your own report, you would only need to do the simplified one, but perhaps one day later for a publication either you, me, or someone else, would like to follow up with the more advanced analysis, so it will be good to have that pre-registered. It will also show readers/reviewers that you’ve done a good job understanding the paper in advance.

Moreover, statistically, there’s not much difference between a one-way ANOVA with only two conditions and a t-test comparing two conditions. You could run that test and see for yourself when you get the data.
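
To see that equivalence with numbers, here is a small Python sketch on made-up data. It computes the classic equal-variance (Student) t-test and a one-way ANOVA by hand; with two conditions, F equals t squared. The function names are hypothetical helpers, not course code.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Independent-samples t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))


def one_way_f(*groups):
    """One-way ANOVA F statistic: between-groups MS divided by within-groups MS."""
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up ratings for two conditions (illustration only).
a = [5, 6, 7, 6, 5, 7]
b = [4, 5, 5, 6, 4, 5]
t = student_t(a, b)
f = one_way_f(a, b)
print(f"t = {t:.3f}, t squared = {t ** 2:.3f}, F = {f:.3f}")
```

The same relationship is why an F reported with 1 numerator degree of freedom can be converted back to a t by taking its square root.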

In the Guidelines for replication (v3) p.9 (Process → 3. pre-registration package → h. [response to peer review])

You mentioned about the provision of “response to peer review”. What are students required to do? In your latest announcement on moodle, it is said that “Respond and revise your submission based on the peer review received”, and I understand it as incorporating the changes into the final pre-registration report, without an extra task for students to write explicitly how they respond to the peer review.

My answer:

No need for the students to respond to their peer review per item, but I do want to see students implement all the improvements the TA and I suggested, and those the peer suggested, if they make sense (students need to make a decision about what is valuable and what isn’t).

I would like to ask about the format of the movie analysis. As it was not explicitly mentioned in the course syllabus, does the same font size (font 12) and spacing (double spacing) apply to the movie analysis as well?

My answer:

Exactly the same, yes, double spaced, 12 font.

What exactly is meant by detailed descriptions of social psychology principles. Does this refer to the principles such as cooperation, manipulation or morality that have been mentioned in the lectures? Also since I was planning to do my analysis on George Orwell's 1984 and had read the book a while back, would it be alright if one of the scenes referred to is from the book rather than the movie?

My answer:

A principle is either a phenomenon (an experiment we discussed) or a theory in one of the topics we discussed in class or is in the course book.

Since the person checking your analysis may not have read the book, please use scenes from the movie, and indicate when the scene is playing in the movie so that the person checking your submission could go have a look if needed. There are plenty of scenes in 1984 that are related to our course topics.

I would like to ask may I have a separate third page containing all my citations?

My answer:

Yes, that's fine. Citations aren't counted in the limits.

you stated that the description of one scene should not be more than one page, could i exceed the limit of 2 pages to include an introduction and conclusion?

My answer:

There is no need to do an introduction and conclusion. Just stick to the format, two principles, scene and principle description, and then explain the link between the two. No need to overthink it.

Some students asked me about whether it's okay to do motivation/goals or conformity/obedience, even though we did not go through these in detail in class (motivation/goals was cancelled, conformity/obedience was not discussed).

My answer: It is advised that you stick to topics that we did discuss, because this way there's a better chance of you addressing the things that are important. However, if it was in the syllabus and is covered by the course suggested readings (course book) then it is acceptable. Both conformity/obedience and motivation/goals are okay, if you can show a direct link to the things covered in the course book.

Student asked:

Does bad ethics/morality count as a principle that I can explain for the movie analysis?

I answered:

It was one of the topics we covered in class, but it’s a very broad topic, not a specific principle. You’ll need to go much deeper than that.

In general, please note:

I am unable to help with every little question on every analysis in every project. You need to do as much as you can by yourself, make decisions, run analyses. Do the best that you can. If you have a major question, you can email me. TAs are available to help you with minor things.

Kawai indicated that the CSV files do not open well in Excel for Chinese-based Office installations.

The explanation and fix for this can be found here:

I had just ran the data analysis using SPSS today.

Is this ok? Or should I use jamovi to do it again?

I replied:

Please re-read my last email carefully about which software to use and how it should be submitted.

Generally, SPSS output doesn’t help make sense of this. Your analysis needs to closely follow your pre-registration and only that, should be documented well, and easy to understand for someone not familiar with your research.

Also, general tests (which is what SPSS runs) with p-values are of very limited value when we have a low sample size. The most important thing that you get easily with JAMOVI/R is Cohen's d effects and their confidence intervals, which you can compare to the original effects. SPSS is horrible with those. Sticking to SPSS, you’re creating a mess for yourself to calculate these things and put them in tables, whereas in JAMOVI it all comes out of the box.

And… don’t forget to exclude your own group before you run your analyses.

An email I got said:

All the participants met the exclusion criteria and don't know if should just ignore the exclusion criteria or else wouldn't have enough participants.

My reply:

Terrific. Then that means there’s a problem in there somewhere, either with the article, the exclusion criteria, the survey, or otherwise. Exclusion criteria are not meant to exclude all participants.

Definitely need to run it on the full sample, and then revisit their exclusion criteria to see what makes sense and what doesn’t. This doesn’t. Part of the process is figuring out what’s going on and what it means. You don’t blindly follow the possibly flawed article or pre-registration to simply jump off the roof.

Yeah, I changed all kinds of things, so you'll need to address that.


We compared the final version of the Qualtrics survey with our old version and found one problem. We added a timer to record the total time each participant took to do our experiment, as we want to exclude those who took too long or too short to finish. However, in the Excel data file in the Dropbox, we did not see any data related to the time recorded in our experiment. I wonder if it is because I set the timer in a wrong way (so it did not work) or you deleted it on purpose?

I replied:

Yeah, I combined the experiments into one big experiment, so there’s no duration per experiment. Since I allowed taking breaks in between, seeing the overall time for a survey doesn’t help much. But some of the datafiles have the overall duration. There’s no need to analyze that. I would generally like to assume that since this was voluntary and anonymous, the students who did participate did so seriously (and you have a measure for that).

If you pre-registered that, you simply indicate that the course instructor made design changes that removed the timer, so you could not run this analysis, and add the explanation above, if you’d like.

Another question:

since you have made some changes to the qualtrics survey, does it means we need to work on our final report based on your version, explaining why all these changes were made? Like the structure of the scenario changed, some questions were added (e.g. question on improvement and guessing the purpose) and some questions were deleted (e.g. comprehension check question).

I replied:

Yes, I made changes, and you will need to address those, even if they deviated from your own plan. The changes you mentioned about comprehension checks and funneling are easy to address, and don’t need to be discussed in detail. But changes in the scenario are important. I can’t recall making such changes. If you noticed those and are not sure why they were made, please write me back and ask.

Students wrote:

Since in our sample the sample sizes are not equal between groups, will the assumption of homogeneity of variance be violated?

According to research online, I only know that there might be a problem in ANOVA calculation with unequal sample sizes. Should we just disregard it and analyse the data as stated in our data analysis?

I replied:

Assumptions can be tested statistically. If you have a concern, test for it. And, there are tests (Welch t-test, or non-parametric test, for example) that have lower reliance or sensitivity to such problems, if you do detect those.
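
As an illustration of the Welch option, here is a minimal Python sketch that computes the Welch t statistic and its Welch-Satterthwaite degrees of freedom for two groups of unequal size. The data are made up and the function name is hypothetical; JAMOVI and R report the same test along with a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom (no equal-variance assumption)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up groups with unequal sizes (illustration only).
group_a = [5, 6, 7, 6, 5, 7, 6, 8]
group_b = [4, 5, 5, 6, 4]
t, df = welch_t(group_a, group_b)
print(f"Welch t = {t:.2f}, df = {df:.1f}")
```

The Welch degrees of freedom always fall between the smaller group's n minus 1 and the total N minus 2, which is a quick sanity check on the output.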

Also, what’s not equal between groups? I can’t imagine the differences could be very big, the Qualtrics randomizer usually does a fairly good job. Recheck that.

Things I want to clarify about the data collection for this article:

I have a feeling that participants messed up the answers in your Wong and Kwong experiment, in that some provided answers in 10000s and some very obviously did not. I’m not sure how to best deal with this, but the key point in this experiment is that the expected value is somewhere between 7.3 and 7300 and so the anchor shifts the evaluation either up or down. Therefore, if the analyses do not show anything, I suggest you also calculate a new variable where evaluations over 10,000 will be divided by 10,000, based on the assumption that the participants misunderstood or did not pay attention. You can check if the results make more sense then.

Regardless of the findings, this would be something to discuss when you present the findings. I see this as a horrible weakness in the Wong & Kwong experimental design. Although I cannot read Chinese so I don’t know what they ran, I seriously doubt their students understood these well. Either way, you should try and report both analyses, to try and address their weak design.
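
The recoded variable suggested above could look like this. It is a minimal Python sketch with made-up responses; the function name, the threshold and the divisor are written as parameters so the rule is explicit and easy to report.

```python
def rescale_evaluation(value, threshold=10_000, divisor=10_000):
    """Divide evaluations above the threshold by the divisor; leave the rest unchanged."""
    return value / divisor if value > threshold else value


# Made-up raw evaluations (illustration only).
raw = [7.5, 12, 73_000, 6.8, 95_000, 7300]
rescaled = [rescale_evaluation(v) for v in raw]
print(rescaled)
```

Keep the raw variable alongside the recoded one, so both analyses can be run and reported.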

Students asked:

We had three positive (dependability, objectiveness and consideration for others) and three negative characteristics (snobbery, selfishness and deceptiveness) that we tested. While running the t-tests, the group was unsure about whether to calculate the results for all the positive/negative characteristics together, or to run a separate test for every characteristic.

Moreover, since the survey was made to collect data on a 9-point Likert scale, and since the positive characteristics have higher values and the negative ones have lower values, an average of higher (for the positive) and lower (for the negative) than 5 was taken when inputting the hypothesis. I have included a screenshot of the concerned problem from Jamovi. Could you please advise on whether this has been done accurately?

I responded:

This depends on what you pre-registered in your pre-registration and what the article reported. First, you need to follow the data analysis plan in your pre-registration, and attempt an exact replication of the target article’s analyses, and see whether your results are consistent with theirs. Then, you can add additional analyses.

Regardless of what the article did… To me, combining the three domains together into one measure for positive and one measure for negative sounds reasonable. Then, probably, these need to be compared to the midpoint (either higher or lower, depending on positive negative).
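
A minimal Python sketch of that idea: average the three positive items per participant, then compare the composite to the scale midpoint (5 on a 9-point scale) with a one-sample t-test. The ratings are made up and the function name is hypothetical; the negative composite would be handled the same way, with the expected direction reversed.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(scores, mu):
    """One-sample t statistic, degrees of freedom, and Cohen's d against a fixed value."""
    n = len(scores)
    d = (mean(scores) - mu) / stdev(scores)
    return d * sqrt(n), n - 1, d


# Each row is one participant's ratings on the three positive traits (made up).
positive_items = [(7, 6, 8), (6, 6, 7), (5, 7, 6), (8, 7, 7), (6, 5, 6)]
positive_composite = [mean(row) for row in positive_items]

t, df, d = one_sample_t(positive_composite, mu=5)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```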

It is up to you to check whether things have been done accurately. I’ll only be able to provide feedback on that when I go over your reports. You can consult the TAs if you feel like you need more guidance.

I would like to know how to conduct the overall t test result, such as overall effect of

1) Cognitive bias: self vs hku

2) Personal limitation: self vs hku

Because we do not find a way to do it JAMOVI. Thanks.

My reply:

Yep, that’s part of the challenge. You’ll need to figure it out, JAMOVI or R are fine.

You can meet with the TA to ask some questions, but a question as general as what you’re asking here is not helpful. T-test is perhaps the most basic stats, JAMOVI is perhaps the most friendly stats software, and there are lots of guides and videos out there. There are also other classmates I know are doing well with this. You’re resourceful, I’m sure you can do it.

We have started to do the data analysis for our replication study. When we reported the results, we used a similar way as was done in the original study.

For one of our t-tests, we reported the results as such:

Group 1: Participants (HKU students) rating of their own susceptibility to Self Serving Bias (self condition) Group 2: Participants (HKU students) rating of the average HKU students susceptibility to the Self Serving Bias (others condition)

Participants did not show a significant bias blind spot for the self serving bias. They did not report being significantly less susceptible to the bias than their collegiate peers MS=( 6.09 vs. 6.78), t(44.0)= -1.77, Cd= -0.523, p(0.083) >0.05.

We wanted to know whether we also had to add further discussion to the above when reporting our results?

I replied:

You need to be very very careful with how you describe your results, and in that small section there are already some phrases that are not careful enough, and do not reflect the way we try to summarize replications.

I’ll explain briefly.

First, the main emphasis in a replication is effect-size and confidence intervals. You would need to compare the effect-size you calculated and the effect-size in the original article. P-values mean very little when we have such a small sample, and 0.05, if you read the two articles about the replication crisis, is not a definite threshold for anything anymore. Even if it were, we do not write “they did not report being significantly less” but rather something like: the independent samples t-test comparing X to Y failed to find support for rejecting the null hypothesis under the 0.05 criterion, with an effect size of X.XX, CI [X.XX, X.XX].

Importantly, even if you try and replicate what those older articles did, you need to go far beyond that in transparency so that those analyzing your article in the future wouldn’t face the same problems you did in your article analysis. For each condition, report clearly N, M, SD, and which condition is which.

Then, you should carefully stick to the APA style guide for reporting results.

Beyond that, I want you to add as much information as possible about everything. You should aim to be as exhaustive as possible. Reporting everything about the process and results so that there would be no question and no missing information. When in doubt – report. Not sure if it’s important enough – report.

Another similar question:

After running ANOVA, we found the results of two of our scenarios to be insignificant. Do we need to run further calculations such as the post hoc tests, other calculations: effect size , confidence intervals, power analysis, f value and cohen's d? According to the article, it is recommended not to run any further statistical tests upon insignificant ANOVA results.

I replied:

Your understanding of NHST and that ‘article’ is not complete, and it’s not your fault. We are so brainwashed in our current training to think about ‘pvalues’ and ‘significant’ that we don’t stop to think what those mean. For you to only share about your findings that something is ‘significant’ or not is almost meaningless. Our sample was so small, that our ability to detect effects is severely hampered.

In a way, it is FAR more important for you to calculate effect size + confidence intervals and report your findings in great detail together with meaningful plots than whether some test you ran based on some (problematic, to say the least) p-value criteria of .05 has been reached. It is also important for you to figure out and discuss in detail what the results mean, and what conclusions you are able to draw from those.

Comparing the target article effect to our findings' effects

Students asked:

32. The replication effect size [is/is not] significantly different from the original effect size?

Does this mean we have to perform statistical analysis on all our effect sizes as compared to those of the original study to see if they are significant

I answered:

Good question, glad you asked. I wish I had a good answer. There are all kinds of debates regarding what that comparison means, and how we assess what a successful or failed replication is.

For the purpose of this course I am -not- expecting you to perform a statistical test to assess these differences.

If the effects that you found in your replication are in the same direction as the original effect and the confidence intervals do not include the null (which is close/similar, to some extent, to finding a significant effect in your findings), then there are those who would summarize this as a successful replication. You should also compare the effect-size, and make an assessment of the differences in magnitude (e.g., you found d = .2, which is considered a weak effect, compared to maybe original findings that found d = .8, which is considered strong).

Some argue that an even more convincing successful replication is if the effect-size in your replication falls within the range of the original article's confidence intervals. There are some issues with this approach, but it is something valuable to note.

A followup:

for reporting the result, should we mention the comparison between the original study and the replication? Or just do it normally without comparing the two?

to which I answered:

Not sure what “normally” is and what isn’t. But everything needs to be in regards to this being a replication. Comparing your findings to the original findings is at the core of what your project is about.

Others asked:

our data analysis got a an effect size of 0.304, while the original Cohen’s d is 0.629. So does this mean that “the replication effect size is significantly different from the original effect size”. We are confused about how to define the “significant difference” here, how to compute it, how can we define that the replication is a failure or a success regarding the difference? We have searched online, some articles suggest Bayesian evaluation, but we don’t know how to do it.

Also, though we got a significant result for our effect, the confidence interval is not similar to the original one, as we have 95% CI [0.150, 0.603], but the original study had [0.2776054, 0.9781213]. How can we interpret this, and does it mean that the replication is a failure?

I answered:

Many would consider this a successful replication, although, as you noted, the effect-size is about half that of the original findings. So, based on your own findings, the effect is closer to being considered “weak” than to being “moderate”.

Why successful? It’s in the same direction, and the confidence intervals do not include the null. Even more than that - the detected effect size (d = 0.30) is within the range of the original findings d CI [0.28, 0.98]. So, looking pretty great.
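
The three checks used in this answer can be written down explicitly. The Python sketch below uses the numbers from the students' question; the function name and the dictionary keys are hypothetical, and these are descriptive criteria only, not a statistical test of the difference between the two effects.

```python
def summarize_replication(rep_d, rep_ci, orig_d, orig_ci):
    """Descriptive comparison of a replication effect with the original effect."""
    rep_low, rep_high = rep_ci
    orig_low, orig_high = orig_ci
    return {
        "same_direction": (rep_d > 0) == (orig_d > 0),
        "replication_ci_excludes_zero": not (rep_low <= 0 <= rep_high),
        "replication_d_within_original_ci": orig_low <= rep_d <= orig_high,
        "ratio_to_original": rep_d / orig_d,
    }


summary = summarize_replication(
    rep_d=0.304, rep_ci=(0.150, 0.603),
    orig_d=0.629, orig_ci=(0.2776054, 0.9781213),
)
for check, value in summary.items():
    print(f"{check}: {value}")
```

The ratio makes the "about half" statement concrete, and it belongs in the report next to the three checks.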

Student asked:

The initial data analysis shows that the mean intentionality rating between actors and observers differ significantly (p = .003; d= -.206) (a side-note: the intentionality ratings between definition and no definition do not differ significantly). I just wonder, as you said p values cannot tell everything, but given such small p-value and also CI does not include 0, can we say we are unable to replicate the results from the original study? On the other hand, I am confused which conclusion I can draw as the effect size of the replication DOES fall into the CI of the original study ES (which we assume to be d=.2 - the CI of the original study = [-.221, .618], as noted in the pre-registration).

I replied:

If what you wrote here is accurate, it shows that the actor-observer bias does in fact occur, but that the effect is rather weak (d = .20). So, your findings are in support of the alternative hypothesis – that there is a difference, even if somehow the original article failed to find that and mistakenly concluded that as support for the null hypothesis that there is no effect. Based on that, you can say that your findings are not in line with the original authors’ findings and conclusions, but are in support of the alternative hypothesis, and the probable reason for that is low power in the original design.

When I was trying to work on the data analysis, I found that one participant replied 1983 to the question asking about his/her country of birth. I wonder whether I should exclude this participant? I think the answer may show that the participant was not paying enough attention when filling in the questionnaire, but since this question is not exactly about the experiment, I am a bit frustrated about this.

I replied:

We didn’t pre-register that, and that sort of thing, by itself, isn’t a good cause for exclusions. We consider that ‘noise’: some participants don’t pay attention and aren’t serious. If you pre-registered seriousness as an exclusion criterion and that participant indicated not being serious, then you can do the exclusion together with the other unserious participants, but in any case you need to report both samples, with and without exclusions. In large samples, like ours, I think you’ll find the differences are rather minor.

I wonder what is the meaning of direction of findings? Does it mean positive or negative results or something else? Thank you.

My reply:

Generally, direction is whether relationship is positive or negative.

Student asked:

I would like to ask: since I pre-registered excluding participants who did not complete the study, I found a problem that some of them did complete the survey but did not answer some of the questions.

I answered:

All questions in the survey were meant to be forced choice. If they weren’t, that’s an unfortunate glitch, but understandable. These things happen.

In any case, you set missing values to NA, and they’re not included in the statistical analyses. There’s usually no reason to exclude those manually. If you do find that you need to treat those differently, you take note of this decision in your analysis and run it. Pre-registration is not a jail, it’s a plan that needs to be updated as you realize things. The only thing is to closely document departures from the original plan and where those happened.
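
A small Python sketch of why missing answers are left as missing rather than coded as a number such as zero. The responses are made up, and `None` plays the role that NA plays in R or JAMOVI.

```python
from statistics import mean

# Made-up ratings; None marks an unanswered question (NA in R/JAMOVI).
responses = [7, None, 5, 6, None, 8]

observed = [r for r in responses if r is not None]
n_missing = len(responses) - len(observed)

mean_without_missing = mean(observed)
mean_if_coded_as_zero = mean(0 if r is None else r for r in responses)

print(f"{n_missing} missing, mean of observed answers = {mean_without_missing}")
print(f"mean if missing were wrongly coded as 0 = {mean_if_coded_as_zero:.2f}")
```

Report how many values were missing per variable, so readers can see how much data each analysis is actually based on.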

I need to conduct a proportional test. However, Jamovi cannot calculate Cohen's d and CIs for proportional test. Should I use other calculator to obtain those statistics?

I answered:

JAMOVI can’t do everything, so sometimes you’ll need to use other tools we had in the guidelines/Dropbox. Like MAVIS:

Also, JAMOVI can calculate Chi-square (X2) for proportions (One-sample proportions test – N outcomes X2 Goodness of fit). Then, Chi-square can be converted to Cohen’s d using a variety of calculators online and some of the Excels we have in the Dropbox. For example:

You might see different results based on the two approaches, because they use different tests. In any case, please carefully document how you reached the findings you reported.
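
For the chi-square route, one commonly used conversion for a chi-square with 1 degree of freedom goes through r = sqrt(chi-square / N) and then d = 2r / sqrt(1 - r²). The Python sketch below assumes that formula and uses hypothetical numbers; the calculators and Excel files in the Dropbox may use a different conversion, so check which one you applied and say so in the report.

```python
from math import sqrt


def d_from_chi_square(chi2, n):
    """Cohen's d from a chi-square statistic with 1 df and total sample size n."""
    if not 0 <= chi2 < n:
        raise ValueError("chi2 must be non-negative and smaller than n")
    return 2 * sqrt(chi2 / (n - chi2))


# Hypothetical example: chi-square(1) = 4.0 with N = 104.
d = d_from_chi_square(4.0, 104)
print(f"d = {d:.2f}")
```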

Student with a 3-way ANOVA asked:

A question on Cohen's d and CI. For our analysis, we are running ANOVA tests, where the effect size estimate is the eta-squared, but not Cohen's d. From our previous email, we do not need to report the CI for eta-squared. Is it still the case as after reading your announcement from Moodle, there is some analysis from our classmates which ran ANOVA, but still they included Cohen's d estimate and CI there, so I'm getting confused.

Secondly, I've been thinking about our previous conversation on p-values and effect sizes. Actually, which should we consider when estimating the true effect of the manipulation. Does it both need to be significant (large effect size and significant p-value) or we just take a look at the effect size and screw p-value?

P.S. Should we report eta-squared or report omega-squared? Seems like the better option was omega-squared but it is still not widely recognized by the psychology field and I'm afraid introducing statistics like eta-squared and omega-squared in our presentation will leave our audience very confused.

I answered:

Yeah, it is confusing, there’s a lot of flexibility in how to address this issue.

You can convert ETA to Cohen’s d, lots of calculators online and in the Dropbox for that. If we’re talking about a 3-conditions ANOVA, then these conversions aren’t accurate and the CIs calculations will be biased. So, as we discussed in the previous email, all conversions aren’t ideal, the best for a 3-way ANOVA is to report ETA and/or f (lower case, from GPower). If it’s a 2 conditions ANOVA, that’s basically a t-test, and best reported with Cohen’s d.

For the 3-conditions ANOVA, the more important part is the contrasts between the two conditions that most closely match the hypothesis. For those you’ll need to report the Cohen’s d and CIs between those two conditions and compare that to the original findings. In your case, the original study was missing some vital information for you to calculate that, but for others in the future to be able to compare their results to yours, it’s best you report those contrasts with the Cohen’s d + CI effects.

As for omega versus ETA, I think our audiences will be as confused regardless, but ETA is a bit more common.
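For reference, the relation between eta-squared and the lower-case f used in GPower can be sketched in a few lines of Python. The function names and input values are hypothetical. The step from f to d assumes exactly two groups of equal size, which is why it should not be applied to an ANOVA with three or more conditions.

```python
import math

def eta_squared_to_f(eta_sq):
    """Cohen's f from eta-squared: f = sqrt(eta_sq / (1 - eta_sq))."""
    if not 0 <= eta_sq < 1:
        raise ValueError("eta-squared must be in [0, 1)")
    return math.sqrt(eta_sq / (1 - eta_sq))

def f_to_d_two_groups(f):
    """For two equal-sized groups only, Cohen's d equals 2 * f."""
    return 2 * f

# Hypothetical eta-squared of 0.20 from a two-condition comparison.
f = eta_squared_to_f(0.20)
print(round(f, 3), round(f_to_d_two_groups(f), 3))
```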

I understood that we should also perform data analysis for data before exclusion for comparison. For this data analysis, should I still exclude those who did not complete the study due to their ineligibility (as we need American undergraduates and they did not able to proceed to the survey question at all). If we should still include their data, should we set their missing as zero not other value?

You cannot analyze something that you don’t have. The non-American participants do not have any answers for you to analyze. They cannot be included in the overall count. They just need to be removed entirely, it’s not even considered an exclusion, since they didn’t even begin to take the survey.

Screenshots/original article results

that for the reporting of result, I wonder if it is appropriate to screenshot or quote the original article for its result in APA

My reply:

Can (and should) quote and add the stats from the original article. Definitely.

Screenshots are sometimes tricky, as there might be some IP issues (unless the article is really old or open-access CC), so when not sure, it’s best to redo the tables.

Students asked how you can use a computed variable to recode an existing scale to a different scale.

You can compute a JAMOVI variable where, if the scale value is 1 or 2, it gets -1, and if it’s 4 or 5, it gets 1. How? It’s a bit tricky with JAMOVI but you can use the following equation:


You can see this sets 3 to NA, meaning those are not included in the analysis.

If you just need to reverse a column, say Scenario2raw with a scale of 1-5, then you can simply do:
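As a language-neutral illustration of the two recodes described here (this is not the JAMOVI formula itself), a minimal Python sketch could look like the following. The function names are hypothetical, and `None` stands in for NA.

```python
def recode_to_direction(score):
    """Map a 1-5 answer to -1 (for 1 or 2), 1 (for 4 or 5), or None (for 3)."""
    if score in (1, 2):
        return -1
    if score in (4, 5):
        return 1
    return None  # the midpoint (3) becomes missing and is left out of the analysis

def reverse_score(score, low=1, high=5):
    """Reverse a scale item: on a 1-5 scale, 1 becomes 5, 2 becomes 4, and so on."""
    return (low + high) - score

raw = [1, 2, 3, 4, 5]
print([recode_to_direction(s) for s in raw])
print([reverse_score(s) for s in raw])
```

The reversal works for any scale by subtracting the score from the sum of the lowest and highest scale points.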


1) I wonder for the instruction to report everything as exhaustive as we can, is it because we are currently a student that we need feedback on our assignment or exhaustive reporting is also accepted for paper publication. Because since we started learning to write article, we are always instructed to write within a certain number of word to be precise and concise. Or this full report on everything is going to be saved at the public reservoir as reference for other authors when they feel doing so? Sorry for this question that might sound offensive to the concept of data transparency that we have been dealing with throughout the semester but I would like to know how to balance between being concise while allowing transparency.

[GF>] Yes, things are changing. It used to be that papers didn’t report much, and see what happened, we can’t replicate/reproduce things. Which is why some journals are now changing to demand very comprehensive reports, but differentiating between main text – with summarized results and with a strict format, and supplementary files, with no specific format (that will have to change with time, we need structure).

You can see an example in one of my recent articles:

We need full transparency about how we did things and what we found. The length is addressed by putting all the details in supplementary material files.

2) I read that we should use active voice instead of passive voice. In the past, passive voice is always encouraged as it sounds more objective. In this case, i wonder if active voice is encouraged due to the nature or replication study.

[GF>] Active voice is part of the APA guidelines. See : The Publication Manual says to “prefer the active voice” (p. 77), and there are two main reasons why. First, the active voice clearly lays out the chain of events: Lion eats mouse. With a passive voice sentence, the reader must wait until the end of the sentence to discover who was responsible for the action. When used in a long sentence, the passive voice may confuse readers. Second, the active voice usually creates shorter sentences. Although your paper should include a variety of sentence lengths, shorter sentences are usually easier to understand than longer ones.

On the other hand, I would like to ask about several questions about the reporting format. 3) Is it preferable to report all results of original study in cluster then the replication study or separating result for each IV/DV sounds better to inform the reader or it doesn't matter?

[GF>] It’s hard to say, but it’s best to first address the original findings and pre-registration, and then add further analyses. If possible, try to structure it so that it’s easy for readers to find what they need and are interested in.

4) About p-value, should we add asterisk when it's significant or it doesn't matter here since our focus is not p-value anymore?

[GF>] Simply report the p-values as they are, to three decimals (APA style). If in a table, we usually flag different levels: one star for < .05, two stars for < .01, and three stars for < .001. P-values in a well-powered replication are valuable and convey important information, but they shouldn’t be the only thing.

5) Also, for non-significant result, I would like to get confirmation on my understanding that if i am not mistaken, we are supposed to report every result including ESCI instead of “ns” as we are aiming to compare the ESCI instead of p-value.

[GF>] Yes, please report exact p-values regardless of whether they are significant or not. You can flag non-significant (> .05) with ns. Like: “t(DF) = T.TT, p = .304ns, d = X.XX, 95% CI [Y.YY, Z.ZZ]”
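If you build many result strings or tables, a small helper keeps the p-value formatting consistent. The Python sketch below is only an illustration of the conventions mentioned in these answers (three decimals, no leading zero, stars at .05/.01/.001). The function names are hypothetical, and reporting very small values as "p < .001" is the usual APA convention rather than something specific to this course.

```python
def format_p(p):
    """Format a p-value APA style: three decimals, no leading zero."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def stars(p):
    """Table flags: * for < .05, ** for < .01, *** for < .001, empty otherwise."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

for p in (0.304, 0.03, 0.0004):
    print(format_p(p), stars(p))
```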

6) If i conducted extra test such as Welch's t-test and Wilcoxon W-test (due to violation of some assumptions) and I have already reported the ESCI in the t-test conducted previously according to analysis plan, should I still report them after the t/w statistics since the ESCI remains the same.

[GF>] Best to report the most relevant analysis in the main text, and move the less-accurate original results according to plan to the supplementary, explaining why you decided to add another test.

7) Would it be a problem if I use the same way to report every result such as “Independent two-sample t test indicated there is a main effect of self-other difference in the predictability of personal action, t = …, P = …., Cohen's d = …, 95%CI[…,…]..” for every effect including simple effect or different style of reporting is preferable for the reporting of different results.

[GF>] If you have many of those, you can move all statistics to a stats table and just summarize the results in words referring to the table. See for example my article: Tables 3, 4, and 5. It also really helps the reader see it all in one place, and keeps the text short and to the point.

8) Lastly, I would like to ask how draw line between reporting the result and discussion part in this section. This is because I have found Pronin & Kugler (2010) reported some interpretation of result in their article as below.

[GF>] Your article is published in PNAS, which is a general science journal, not psychology. Also, they have a multi-experiment format so they need to explain what they’re doing in the next experiment at the end of each section. Therefore, they have a different format from what you’re used to. Generally, best to separate the results and the discussion, especially comments like that. You have a single replication and you’re following psychology guidelines and norms.

Since our group has 7 people, our group tutor, Louis, told us that 4-5 people presenting is enough and that he would mention this to you. Is this alright if we only have 4 or 5 people presenting?

I replied:

Yeah, 5 should be fine. Not everyone has to present, but everyone needs to be involved.

When I ask a question, I’ll probably ask it of the persons not presenting and they should be able to answer well.

Also, I would like to ask for the presentation, does it mean the pair need to prepare the PPT and present together (like a group presentation)? Since you mentioned about the expected time allocation within the 10minutes, but I thought we are suppose to work individually.

I answered:

You need to work individually on your data analysis. The presentation, if you’ll check the syllabus, is a joint effort/score:

Pair project presentations: Each two students working to replicate the same target article will present together. They will integrate insights from their independent projects to give an overall analysis on the replicability of the target article.

The whole point is for you to compare your individual analyses and provide a combined summary of your separate conclusions.

Should we write the report in the format of a published article? Like do we need to format it as two columns, and the line in the top and stuff…? Or we can just stick to normal layout of writing an essay will be fine? Thanks!

I answered:

Oh, no no. Of course not. That formatting is done after publishing.

You only write it using the APA template Word file, there are some in the Dropbox. You can see the examples from my former students, like the exceptionality effect, or my working papers (; example: ) to see what journal submissions typically look like.

Do you have any suggestions on plot generators that allow us to make a forest plot or equivalent for comparing the effect size and CIs of the original with the replication?

My answer:

Sure, if you’re not familiar with R or don’t want to do anything too complicated, you can just use MAVIS (which we typically use for effect size calculations, but mainly does meta-analyses): If you have the effect size Cohen’s d, just choose the option Mean Differences (n, Effect size d) and enter the details as explained in the example on the “Input examples” tab. It calculates the CIs once you enter the N for each of the two samples.

If you want to try R, then there’s a fairly simple example here: (just click on the “show R code”)
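If you want to check the numbers going into such a plot, the Python sketch below computes an approximate 95% CI for Cohen's d from d and the two sample sizes, using a common large-sample standard error formula. This is an assumption for illustration and not necessarily what MAVIS computes internally, so small differences from its output are possible. The study values are made up.

```python
import math
from statistics import NormalDist

def cohens_d_ci(d, n1, n2, level=0.95):
    """Approximate CI for Cohen's d (independent samples, large-sample SE)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

# Hypothetical original and replication effects, side by side.
studies = {"Original": (0.50, 50, 50), "Replication": (0.20, 200, 200)}
for label, (d, n1, n2) in studies.items():
    low, high = cohens_d_ci(d, n1, n2)
    print(f"{label}: d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Printing the original and replication intervals next to each other is a text-only stand-in for a forest plot and makes it easy to see whether the intervals overlap.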

I would like to ask if the need to rephrase applied to the result section of final report as well. Or are we allowed to put in what we have written without changing the text from the data report assignment?

My reply:

The data analysis was meant to be the results section of your final report. It is fine, and intended, that you use the data analysis report as is in the final report (with extensions/revisions, if needed).

1) Rephrase content from pre-registration report

I understood from the guide that we are not allowed to repeat the sentence within our final report. If we are supposed to include the pre-registration report as the supplementary material, does that mean that everything from the pre-registration report should be rephrased for those relevant section such as methods, brief introduction of target article etc. in the main text ( 5000 word ) of the final report?

2) ESCI for paired- sample t-test/ Wilcoxon W- test

For Cohen's d CI, if one of the analysis of my replication study (Main effect of time), I have found a calculator (see below link) that can calculate its ESCI with correlation (I used JAMOVI to calculate the correlation calculation), can I still report the ESCI for this paired-sample test?

However, for this effect, it violated the assumption of normality, I have reported Wilcoxon W-test instead of t-test, if the calculation from the calculator above is valid, can I still report it as the ESCI will not change no matter using t-test or w-test (if i am not mistaken).

3) Report without-exclusion findings in the main text

Could I ask for the reason that we should report the data without exclusion instead of after exclusion in the main text?

4) Testing for assumptions

If the assumption test is important in explaining why I report the result of additional test performed, such as Welch's test and Wilcoxon W-test instead of those pre-registered (due to the violation of equal-variance assumptions and the violation of normality), could I still report them in supplementary analysis?

5) Tense According to APA 6th edition, about the tense used in discussion, should I follow the text highlighted as below, by using present tense instead of past tense? On pages 65-66 of the Publication Manual of the American Psychological Association (6th ed.), APA (2010) states that you should use verb tenses consistently: Past tense (e.g. “Smith showed”) or present perfect tense (e.g., “researchers have shown”) is appropriate for the literature review and the description of the procedure if the discussion is of past events. Stay within the chosen tense. Use past tense (e.g., “anxiety decreased significantly”) to describe the results. Use the present tense (e.g., “the results of Experiment 2 indicate”) to discuss implications of the results and to present the conclusions.

6) Limitation For limitation, I am not sure about what I consider is constructive for better replication or not. So, I would like to ask about your opinion on this.

The original study did not state that the Princeton college students are American or not that they may come from other countries while our replication study focused on Americans but I found that some of them were from university outside US, such as Oxford. So I wonder if this is some point considered important when we are designing a direct replication.

7) Table in other sections than result When reporting the effect size in the pre-registration plan of methods section, can I insert table to present the effect size and its CI of original study?

I answered:

  1. It doesn’t. You can use whatever you want from the pre-registration in the final report, even copy-pasting. Let me know which guidelines are confusing and I’ll change them.
  2. Yes, please, that would be great. For the purpose of this course, we’ll assume ESCI isn’t directly related to NHST testing and p-values.
  3. There are lots of reasons for that, but the main one to mention here is to allow comparison between the two reports of the peers and cross-validate. As you saw in the presentations, there’s just too much flexibility in exclusions, and generally they change very little. Because of that, we’ll choose the most conservative fixed comparable results.
  4. Yes, the supplementary is meant for you to elaborate on all the choices you made in analyzing your results. You could add a note that it’s detailed in the supplementary.
  5. Yes, that’s correct. I should have elaborated on that more. Please follow those APA instructions.
  6. I personally don’t think these are big issues, but for things like that your opinion matters just as much as my opinion. You can mention these if you think it’s important. If you’re not sure, maybe just mention these briefly, or even maybe just in the supplementary.
  7. I answered it somewhere, can’t remember where. There are probably copyright issues with some of those journals, especially PNAS. It’s better to recreate the table yourself with modifications, and a clear citation referring to the section/table you used.

I want to attach the Qualtrics survey we've used. However, the file is locked from any amendment. The full length of the survey is 60 pages long in words format, which contains a lot of sessions that are not related to my study (Since we've combined 3 studies into 1 survey). I don't want to attach the whole thing but Qualtrics does not allow me to delete or change any block.

Do you know how to unlock the survey or I should just separate the survey into another document?

I replied:

Oh no, that’s easy, there’s really no need for you to do much. I already did all that when I pre-registered.

The core of your experiment was added to the pre-registration on our OSF online, and it is also in the Dropbox (where the peer-reviews were). I already included only the part of the survey that is needed.

You don’t even need to import the Qualtrics to your report, you can simply add a link to the OSF file in the Appendix linking to the Qualtrics.

Students asked:

1) Rephrase content from pre-registration report I understood from the guide that we are not allowed to repeat the sentence within our final report. If we are supposed to include the pre-registration report as the supplementary material, does that mean that everything from the pre-registration report should be rephrased for those relevant section such as methods, brief introduction of target article etc. in the main text ( 5000 word ) of the final report?


As my group is writing our final report, we would like to ask if we are allowed to copy and paste hypothesis, methods, etc from our pre-registration report. We read in your instructions that “The same text cannot be repeated within the text”, may I ask do you mean that we cannot copy from pre-registration report?

To which, I answered, again:

Please see:

So - “You can use whatever you want from the pre-registration in the final report, even copy-pasting.”

When referring to not reusing sentences and phrases twice, I was referring to repetition within the final report (the 5000 words). It builds on all the things you submitted before. All the submissions you did were building up to be components in the final report, and therefore the final report can include any of that content, as is or in variations.

I was a bit confused about how I should finish the final report.

For small f in ANOVA, from what I have studied, which is the F-value in F (x, x) = “F-value” or should it be another thing?

And for the “code/analysis files”, do I simply copy the JAMOVI PDF output to the final report as supplementary and comments on it? Or should I simply start another section talking about the different sections in the JAMOVI output and what they are for?

Also, as for the part that I need to compare our result with the original study, do I need to compare every statistics i.e. mean in each conditions? Our current way of doing it is simply comparing their significance and eta-squared. Would it be too simplistic and should I include more?

Moreover, a lot of information are already addressed in our data analysis report, is it ok to ‘self-plagiarize’ or it is still necessary for us to rephrase all of them?

Lastly, should I include our word of caution in the results section or should I put it into discussion such as the reasons we did not run our additional analysis and the change in participants.

I replied:

Big F and small f aren’t the same. The small f is the effect size that is calculated in GPower. Big F is the F statistic from ANOVA.
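To make the difference concrete, the Python sketch below starts from a reported F statistic and its degrees of freedom, derives partial eta-squared, and then the small f. The function name and the example values are hypothetical. The formulas are the standard ones: partial eta-squared = (F × df_effect) / (F × df_effect + df_error), and f = sqrt(eta-squared / (1 - eta-squared)).

```python
import math

def f_stat_to_effect_sizes(f_stat, df_effect, df_error):
    """Return (partial eta-squared, Cohen's f) from an ANOVA F statistic."""
    eta_p_sq = (f_stat * df_effect) / (f_stat * df_effect + df_error)
    cohens_f = math.sqrt(eta_p_sq / (1 - eta_p_sq))
    return eta_p_sq, cohens_f

# Hypothetical result: F(2, 96) = 4.00
eta_p_sq, cohens_f = f_stat_to_effect_sizes(4.0, 2, 96)
print(round(eta_p_sq, 3), round(cohens_f, 3))
```

Here the big F is 4.00, while the small f comes out far smaller, which shows why the two should never be swapped in a power analysis.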

JAMOVI code – it’s in the instructions. Create a Dropbox/Google-drive folder, and upload everything you’ve worked on.

Comparing to original findings: Comparing p-values and effect-size+CIs is sufficient, but do so for each test conducted in the original article that I marked relevant (e.g., posthoc).

Data analysis - Again, addressed already. Data analysis is the core of your results section in the final report. Should be used as is and expanded/revised if needed.

Caution and changes – best addressed in the supplementary under a dedicated section.

b.Disclosures about open-science on p.4 of the final report guidelines. Can you explain that please?

I replied:


Data collection

Data collection was completed before conducting an analysis of the data.

Conditions reporting

All collected conditions are reported

Data exclusions

There were no data exclusions.

All data is included in the provided data.

Variables reporting

All variables collected for this study are reported and included in the provided data.

Open Science

Data and code will be shared using the Open Science Framework, and together with pre-registration files are available on: [OUR OSF PRE-REGISTRATION MATERIALS HERE]


We pre-registered the experiment on [ENTER DATE HERE] on the Open Science Framework and data collection was launched later that week.

(Malle & Knobe (1997), PSYC3052)

1. I compared my replication findings to the original with the reproduced results generated from the raw data provided by Malle (which you helped us make the JAMOVI analysis). Do you think this approach is acceptable? Since we are not “directly” comparing with the “original” results, but kind of comparing the “reproduced” results from original raw data. Do we need to add any extra explanation to that?

2. Also, last time I figured that the sample sizes for actors (N=31) and observers (N=73) from the raw data set are different from those they reported in the article (actors: N=32, observers: N=72), and I found out later that this was because in the raw data set, one item was missing for one participant, thus s/he was excluded by JAMOVI in the analysis. However, in the original study, the authors excluded another participant because he/she gave consistently negative correlations with other participants - which we do not have information about whom was excluded. How should we deal with this? Can we still proceed to comparing the replication findings with this “reproduced” findings from original study?

3. Regarding the data exclusion criteria, in the pre-registration I did not plan to exclude any participants with a low score in self-report of seriousness, therefore in the data analysis I also did not excluded any. I just wonder, do you think I should exclude these participants? Or I can just follow what I planned in the pre-registration?

4. Sorry I am a little confused in reporting the analysis before and after exclusion, should we report both in the results section, or only the analysis after exclusion (and report the results before exclusion in the supplementary materials)?

I answered:

1. Yes, that’s great, that’s even better. But you should make a note of that so that it’s clear in the text. It would also be great if you briefly detail in the supplementary whatever changes you noticed between their reports and the raw data.

2. Yes, definitely. That’s great. This is part of why replications are so important: you re-examine their raw data and compare that to their results and yours. Good investigative work! In reference to the above, do make a note of all these discoveries. These are extremely important for the academic community to know.

3. I really don’t. Follow the plan. I generally don’t think these exclusions matter much, and they very rarely affect results in a meaningful way. In my studies I generally don’t exclude based on these factors anymore, but I do report collecting them so that others can take my dataset and see if/how these affected results.

4. I think that was clear in the instructions, so go revisit those. Please report findings without exclusions in the main text, and you can report the analyses with exclusions in the supplementary. The reasons were outlined in the WIKI FAQ after someone asked, and have to do with being able to compare your report with your peer’s and addressing/removing the flexibility inherent in choices regarding exclusions.

1. As I saw from your sample, you separated the main report and the supplementary materials into two files, so when submitting the final report should we submit it in two files or we just combine it?

2. You mentioned we have to include all the pre-registration materials in an appendix, does this mean that we should have a section for it by just copy and paste the original pre-registration that we have submitted?

I replied:

  1. You can do these in the same document, or in separate ones, both are okay. If they use a different format/template, it is sometimes easier to have them as separate files.
  2. I think this was answered in the guidelines or WIKI. Briefly, the best way is to simply add an appendix and provide a link to your pre-registration online on the OSF. You can also simply add that as a separate file, and add that in your overall submission. Either way, your pre-registration should be easily retrievable for your reviewers/TA/me from your submission.

I am writing to ask about the issue of one of side of the confidence interval being infinite as I remember you spotted it out from our presentation. We checked after that and we discovered that may be due to our one-tailed setting on Jamovi as our hypotheses are H1<H2 instead of H1 not equals to H2. So is it appropriate for us to keep on using the one-tailed test and report the confidence interval with one infinite side?

I replied:

I suggest you calculate the Cohen’s d using some other tool, just like you did in your article analysis/pre-registration. Reporting a confidence interval with an infinite bound isn’t helpful or meaningful.

Also, make sure that the confidence intervals are for the Cohen’s d or other standardized effect-size and not for the mean difference.
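To see where the infinite bound comes from, the Python sketch below contrasts a one-sided and a two-sided interval for the same estimate. It uses a normal approximation instead of the t distribution, and the estimate and standard error are made-up values, so it only illustrates the logic and is not a replacement for a proper effect-size calculator.

```python
import math
from statistics import NormalDist

def one_sided_upper_ci(estimate, se, level=0.95):
    """One-sided interval for a 'less than' hypothesis: the lower bound is unbounded."""
    z = NormalDist().inv_cdf(level)
    return -math.inf, estimate + z * se

def two_sided_ci(estimate, se, level=0.95):
    """Conventional two-sided interval with two finite bounds."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se

# Hypothetical effect estimate and standard error.
print(one_sided_upper_ci(-0.40, 0.20))
print(two_sided_ci(-0.40, 0.20))
```

The one-sided setting only constrains the estimate from one direction, which is why one bound is reported as infinite. The two-sided interval is the one readers can compare with the original study.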

In our supplementary materials section, would you also like us to include all our raw data with steps, just like you had us do for our data and analysis document?

I replied:

Upload your data/code/output/analyses (R/JAMOVI/etc.) to a Dropbox/Google-Drive and provide a working public link in your document. Make sure all code/data is well documented with comments/notes so I’ll be able to follow what you did.

Please see the WIKI FAQ and guidelines for further details.

1. For “Methods”,

(a) Is writing about the process we do pre-registration unneccesary in this section? For example,

“We first did an article analysis report about the Experiment 1 of the target article and addressed the section 1 and 2 of the Replication Recipe (Question 1- 16). We then redesigned the survey on Qualtrics…”

(b)As we will include link for pre-registration plan, should we still mentioned about the experiment design with IV and DV if nothing was changed?

(c) According to my understanding, we should mention about change that is deviated from pre-registered plan in “supplementary materials”. However, should we mention it in the ” methods” section as well and refer it to the supplementary materials?

2. As you have mentioned about writing the result according to before-exclusion data to allow convenient comparison, I wonder if I pre-registered the main effect of time (Pronin & Kugler, 2010) as additional analysis, would this create problem to allow comparison with study done by my peer?

3. On the other hand, I wonder if there is any way to define the strength of p-value. This is because while doing peer review, my peer has interpreted the strength of p-value as following. If I want to see how important is a significant p-value to interpret its importance, should I use p-curve to see how the other p-values distribute and judge it according to the overall context. Thus i wonder if using p-curve is relevant here. Also. as I know, p- curve calculator needs input from a lot of study that I wonder if there is any p- curve calculator that allows us to see the trend without the input of result from unlimited replication.

4. Also, can I use post- hoc power calculator to see if our study has enough power? This is because I realized that we did power analysis based on original study effect size which we assume the effect size to be the true population effect size. However, the original study effect size might not be true. Thus, as our replication study changed in effect size, I would like to look up for its power. At the same time, i wonder if post- hoc power calculation cause opportunity bias?

I answered:

  1. No need to mention the article analysis stage, the pre-registration is enough as it is registered, public, and official. The article analysis is just us doing our thing.
  2. Not sure what you mean here, but the experimental design needs to be clear in the main part of the final report. This is what readers will look at to understand your replication experiment.
  3. You should mention in great detail what deviated from the pre-registration in the supplementary materials, you should briefly mention that – or even better – summarize that in a table, in the main text, referring readers to details in the supplementary.
  4. Differences in pre-registration for analyses do not create “problems”, they are understandable and expected. You don’t need to worry about these things, these are “higher level” things for me when I integrate your reports, just focus on doing the best in your own report. My only mention of the other’s report was in reference to some of my requests. If I didn’t make a request, don’t worry about it.
  5. A p-curve analysis is relevant when there are multiple studies, such as in a multi-experiment article or a meta-analysis. It is less relevant for a single experiment. There’s no need to do a p-curve for this project.
  6. No need to calculate power post-hoc in this project. Post hoc power has all sorts of issues, that we didn’t have time to go into, and maybe needs to be discussed in a methods course. If you already did that – great, it adds some information and definitely doesn’t ruin anything. If you didn’t – no need, leave this out.

  • hku_psyc2020_and_psyc3052.txt
  • Last modified: 2018/05/14 21:00
  • by filination