HKU PSYC2071 and PSYC3052 - Autumn 2018-9


Apart from the main effects, I would like to ask something related to my extensions. My extension, after you kindly modified it, asks participants to fill in an amount for the safe/risky option such that they would choose that option. As I was analyzing my extension data, I found some participants filling in answers like $1 million, or an amount that is the same as in the other option, which has an amount stated clearly (rendering a failure to measure risk preference).

Therefore, I am not sure how to set the exclusion criteria. I would like to ask if it is okay for me to exclude participants who give skyrocketing amounts, using the IQR? I suspect these participants are very likely to be outliers. Moreover, can I also exclude participants who fill in the same amount as stated in the other option? I am not quite sure how to define the exclusion criteria for this.

I am actually unsure if this constitutes p-hacking. Nevertheless, in reality, it seems very unlikely that the amounts for the two options could differ by that much, so I am not sure what I can do with these data.


This is a good example of the complexity of analyzing data :)

First off, just so we get the terminology right, p-hacking refers to decisions taken to affect the p-value without transparency. Since you are stating the differences from the pre-registration, marking those as exploratory and explaining every step of the way, reporting results both before and after exclusions, plus sharing all your data and code, none of that is p-hacking.

We failed to anticipate this in advance, but that is okay. There are several ways to address this:

  1. A standard criterion for addressing outliers in such cases is whether participants were ±3 standard deviations from the mean. Someone who wrote a million US$ would qualify.
  2. Examine the distribution. If the distribution is not normal (check skewness and kurtosis), which it sounds like it might not be, then you can perform a log transformation: newvariable = ln(oldvariable). In any case, I suggest plotting the distribution of these variables and reporting their skewness and kurtosis in your descriptives.
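A minimal sketch of both steps (the course tools are JAMOVI/R, but the logic is the same; the dollar amounts below are made up for illustration):

```python
import math
import statistics

# Hypothetical open-ended dollar amounts, with one extreme answer
amounts = [100 + 10 * i for i in range(30)] + [1_000_000]

mean = statistics.mean(amounts)
sd = statistics.stdev(amounts)

# Criterion: exclude answers more than 3 SDs from the mean
kept = [a for a in amounts if abs(a - mean) <= 3 * sd]

# Log transformation to reduce right skew: newvariable = ln(oldvariable)
logged = [math.log(a) for a in kept]
```

Note that with very small samples a single extreme value inflates the SD so much that it can escape the ±3 SD criterion, which is another reason to plot the distribution first.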

Not sure I understand what “the same as the other option, which has an amount stated clearly (rendering a fail to measure risk preference)” means exactly, but you could code that, see how many qualify, and send me a follow-up email.


Also, I observed that I use a different effect size (Cohen's vs. Cramér's V) compared with my peers. Should I switch to Cramér's V for a more convenient analysis?


Rather than switch, you could add that if you wish, to compare with your peer, but it's always better to have more information than less.


Besides, I found that plotting (violin & box) is applicable to means but not to proportions; would it be sufficient to use tables only?


Proportions don’t need violin plots or error bars, but it would be helpful to do a plot in Excel to show the proportions visually. See example:


You mentioned in the wiki that for our final report, we can only use R/jamovi for submission. However, does that eliminate the possibility of using more convenient online calculators, e.g. Psychometrica, for calculating effect sizes?


Your first priority should be JAMOVI/R, but if needed you can supplement your JAMOVI/R analyses with whatever you feel you need in order to produce a comprehensive, high-quality report. Our first tools of choice are JAMOVI/R/G*Power, since these are free, open-source tools that allow for open science and reproducibility, which is at the core of this course and your final report. Online calculators, SPSS, etc. are closed, expensive, or temporary, do not facilitate open science, and therefore fail to meet our aims. However, if you feel you have no choice, and JAMOVI/R somehow fail to do what you need, then you could and should use supplementary tools.

Q: A student sent me outputs comparing an online calculator's binomial z-test to R's prop.test function output


First, make sure both are testing the same effect. In your case, one is a chi-square test and one is a z-test. If you try a converter, you'll see that these are essentially the same. As for the p-value, you need to read your output carefully: your output reads “with continuity correction”, so if you add correct = FALSE

prop.test(x = c(279,269), n = c(514,512), alternative = "greater", correct = FALSE)

you'll get a very similar p-value. Any remaining differences could be due to rounding; since that online calculator isn't open, I can only say I trust R more.
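To see the equivalence concretely, here's a sketch in Python (assuming SciPy is available; the counts are the same as in the prop.test call above). Without continuity correction, the Pearson chi-square statistic on the 2x2 table is exactly the square of the pooled two-proportion z-statistic:

```python
from math import sqrt
from scipy.stats import chi2_contingency, norm

x1, n1, x2, n2 = 279, 514, 269, 512

# Two-proportion z-test with a pooled SE, no continuity correction
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_greater = 1 - norm.cdf(z)  # one-sided, alternative = "greater"

# Chi-square on the 2x2 table, also without continuity correction
table = [[x1, n1 - x1], [x2, n2 - x2]]
chi2, p_two_sided, dof, expected = chi2_contingency(table, correction=False)

# chi2 equals z**2, and the two-sided p equals 2 * p_greater (for z > 0)
```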


Are there any changes that I need to pay attention to?


You'll need to very carefully compare what we finally ran with what you planned, and examine any possible deviations.

Comparing results of original article and replication


Do we need to compare the results of the replication study with those of the original article in our data analysis?


The comparisons should be made in the discussion section in the final report (and with lots of details in the supplementary).


How do we interpret the effect size for chi-square without a confidence interval? Without a confidence interval, how can we tell if the effect size overlaps with 0, which would make the replication inconclusive?


Generally, you can do CIs for chi-square. JASP does that fairly easily, I think, and there are R packages. I don't expect this of undergrad students, but of course if you'd like to, I'd be happy to see that.

However, for now, you can simply rely on whether the findings are generally in the same direction and NHST (“significant” p-values). From what I saw in the findings below, it looks pretty clear that your findings are a successful replication.



For a chi-square goodness-of-fit test, Jamovi does not provide an effect size. How do we obtain it?


The chi-square statistic is basically an effect size, and can be converted to others that you might be more familiar with regarding CIs.

For the effect: if you recall, you calculated the effect size (and CIs) for your power analysis, and this is no different. Use the same method: R, G*Power, online calculators, or JASP. You might be asking this about the CIs and how to assess replications.
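For reference, the usual conversion for a one-degree-of-freedom chi-square is Cohen's w = sqrt(χ²/N), which equals phi/Cramér's V when df = 1. A quick sketch with made-up numbers:

```python
from math import sqrt

# Hypothetical goodness-of-fit result: chi2(1) = 5.34, N = 150
chi2_stat, n = 5.34, 150

# Cohen's w; with df = 1 this equals phi / Cramér's V
w = sqrt(chi2_stat / n)
# w comes out around 0.19, a small-to-medium effect by Cohen's benchmarks
```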



I will be running a one-sample binomial test and a test of the difference between two independent proportions. According to the manual, it is not required to report the CI. However, if we do not report it, how can we draw a conclusion about the replication results using the CI and effect size figure?


You are welcome and encouraged to compute effect sizes and CIs, but I admit that we currently have no simple tools for that aside from some R packages. I try to give some examples of how those are done in the replication guide, and there are examples online, but for the sake of an undergraduate class I do not require this of students. In such cases, the effect size together with the NHST results and the direction of the findings can be used as criteria for comparing the original results with the replication results.
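For the difference between two independent proportions, a normal-approximation 95% CI can be computed by hand. A sketch (the counts are illustrative, borrowed from the prop.test example earlier in this document):

```python
from math import sqrt

x1, n1, x2, n2 = 279, 514, 269, 512
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Unpooled standard error for the difference between two proportions
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci_lower, ci_upper = diff - 1.96 * se, diff + 1.96 * se

# If the 95% CI excludes 0, the difference is significant at alpha = .05
# (two-sided); here the CI includes 0, so this comparison alone is inconclusive
```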


I have conducted some brief data analyses using jamovi; however, I found that the confidence interval generated is not the confidence interval for Cohen's d (I think it is the confidence interval for the difference in probability estimates instead). By looking at the Cohen's d (for example the first row, in red), we can judge the effect as a successful replication. So do I still need to compute the confidence interval for the effect size (perhaps using online tools like MBESS)?


Yes, good you noticed that. We indicated that issue in the replication guide, but it’s hidden away.

Yes, you'll need to calculate those, just like in the article analysis. You can either use MBESS, or use JASP, which works exactly like JAMOVI but includes the CIs for Cohen's d as well.


Supposing my calculated required sample size is smaller than the number of participants in the data set, how do I go about such a situation?


You state your calculated required sample size and the actual sample size, just like everything else in your pre-reg. You compare the plan to the actual analysis.


My data is kind of funny. The full sample failed to generate significant results, but the sample after exclusions revealed a significant result. Should I conclude the replication to be successful or inconclusive?


Have a look at the replication guide on how to assess replications; we generally care about direction and effect size. NHST is just nice to have.

You should report it exactly as you did here: for both the full sample and the sample after exclusions, report the effects, CIs, and NHST results.

Generally, if the exclusions are what makes the difference, you should point it out as a possibly significant factor in replications.


I heard about filtering for exclusion criteria. Do we need to filter before doing the analysis?


For the presentation and data analysis, focus on the full sample.

About exclusions and filtering - Add the analysis based on exclusions to the supplementary of the final report.


In the full Qualtrics survey, there is a question asking participants whether they have seen the materials before. Two participants said yes, and one of them specifically stated that he/she was familiar with the hypothesis behind the study. Should I exclude these two participants from the data set? If so, do I directly exclude them and conduct the analysis, or do I do both with/without?


For the presentation and data analysis, focus on the full sample. About exclusions and filtering - Add the analysis based on exclusions to the supplementary of the final report.

The question was meant for that purpose. Yet with so few participants indicating that, I doubt the exclusion will do much, but in the final report do add that to your exclusion criteria and report the findings after all exclusions in the supplementary.


Do I need to use only one software consistently throughout the whole data analysis or can I use both SPSS and JAMOVI together for the analysis?


For the final report where you’ll include all your code and analyses, you will need to use either R or JAMOVI.

Since time is very short for the data analysis report and presentation, and most of you are unfamiliar with R/JAMOVI, using SPSS for the data analysis report/presentation as a temporary solution is okay. You will have to submit R/JAMOVI analyses for your final project, though.


It added more levels on top of the original coding (refer to the second image), which cannot be deleted.


About that extra line and the added labels: simply delete that line in the JAMOVI editor. These are labels and they're not needed.


As the pre-registration was edited, I'm not sure about the sample size used in the replication. Also, the replication adopted both within- and between-subject designs, which differs from the design proposed in some parts of the pre-registration.


If the actual data collection did more than you planned, then simply focus on the data that fits your planned analysis. If you analyzed within, then simply use the data for the within design.


I tried viewing the combined questionnaire. However, as the ts1992 study was imported from the library, I was unable to review the survey flow within that particular study. May I know how the survey flow for that study was arranged in the end?


I explained in the email that your survey is with your revised pre-registration. In the “survey file note.txt” it reads the following: “The combined survey imports 1-3 surveys from a Qualtrics “library”. It will therefore not show you the “flow” of each of the surveys in the “Survey flow” and only contains limited information. For detailed information and full survey flow for each of your experiments, you'll need to examine the Qualtrics of each of the imported surveys. Please go see that survey in the designated directory.”

If you’ll go to C:\Dropbox\2018-19-HKU-PSYC2071-3052-Projects\Tversky & Shafir 1992\Revised pre-registration you’ll see your pre-registered Qualtrics files with the survey flow. You’ll notice it’s a mixed design, based on what was suggested. Carefully examine the survey flow to see how I contrasted the within and between-designs.


However, opening the file I ran into some problems. I imported the CSV file into both Jamovi and Excel (as I found Excel easier for viewing all the labels/answers), and I realized the “display order” columns for ts1992 are missing. In my original experimental design I had the within-subject presentation order randomized; if I understand correctly, I should analyze the data according to the different presentation orders.


No need to analyze display order; that's not an important part of your study. Focus on what's important: the effects.

The within design doesn't have to do with order; it's not repeated measures.

I randomized order with Qualtrics. If you wanted to, you could use the fields Qualtrics created at the far right of the datafile.


I have been trying to do the data analysis for Tversky & Shafir (1992) Disjunction Effect Article.



These are tricky if you don’t have experience, it takes time to get it right.

There's a very simple solution to this. Open the CSV file with Excel, do find-replace (Ctrl+H), and replace all the commas with spaces. Then save it again as a CSV. I did that for you and created the file 20189PSYCINCFischhoff1975Hamill1980TverskyShafir_1992_legacy-nocommas.csv, which loads fine into JAMOVI.


There is one participant who mentioned that he is allergic to wool; I guess that's why he gave a rating of 0 to the questions about attractiveness of and feelings towards the gift, although it is the high-valued option and the others gave ratings of 5–6 on average. Should I exclude his response from the data set? If so, how should I explain the exclusion?


Just goes to show what kind of unexpected things can happen with questions like that 😊 Yes, that kind of exclusion sounds very reasonable; you just need to explain that it wasn't pre-registered.


In the Qualtrics, I see that you have added the value of the stimuli before asking the participants how attractive they think the stimulus is (e.g. the variable “h1998_S2_ICL_attract”). I would like to know what [QID121778270-ChoiceTextEntryValue], stated in the question, refers to.


If it wasn't yours, maybe it was your peer's. It sounds like a general rating of the coat as a gift.

You can focus on the questions that you intended to analyze, but it sounds easy to analyze this one as well. Up to you.


First, for the Qualtrics, the item of the extension that I have put in the pre-registration report has been changed, as I have just checked the dataset. … . May I know if I have to amend this in my pre-registration or my data analysis?

Another thing: I am aware that you added a note saying that you edited the report prior to pre-registration in my fellow replicator's report. Yet I am not able to see this note in my revised pre-registration report, so I am not sure if you have missed my pre-registration or if there is something wrong with my report.


Thanks for following up on this, good you reminded me, and brought this up, good questions.

I had to make a quick judgment call and replace it with something else that would capture the spirit of your extension. That was the best I could come up with that would be clear enough. I hope you can see how this relates to what you suggested, and what kind of (exploratory) information can be extracted from that.

About your questions:

  1. A pre-registration has been pre-registered, and data has been collected. There are no more amendments to the pre-registration. However, a pre-registration is not a jail, it is a plan, so deviations from that plan just need to be clarified.
  2. You should explain that the instructor made amendments, put what you planned in the pre-reg in the supplementary, and work with what's in the dataset, marked as a deviation from the pre-registration and exploratory.
  3. I am leaving it completely up to you whether you want to analyze that extension or not. Not doing so will not affect your grade; you have done plenty with your suggested extensions, and difficult decisions had to be made in real time, so it's not your responsibility. If you wish, and are curious about these results, it would be great if you analyzed it anyway. No pressure on my part at all; completely up to you.

About your report: I might have forgotten to add that; I was multi-tasking madly. Do not assume there's anything wrong with it. You already did above and beyond what the average researcher does, if they do any pre-registration at all.


  1. In the folder containing the datasets of my study (Epley and Gilovich), there are 4 documents (2 sav files and 2 csv files). I tried to open them and found no difference amongst them. Do I use only one of them for data analysis?
  2. Will my peer and I be using the same set of data now despite the differences we had in our initial qualtrics survey?


  1. No major differences other than format. For those who use R/RStudio and import the data, these small differences could be big. In terms of values, they're identical.
  2. I integrated your surveys into one. We could only run one data collection for each project. I took the best of both your projects, addressed both your extensions as best I could – if there were any – and ran only one integrated Qualtrics.


I had a question regarding this statement in the email you sent:

“Do your best, be clear about what you tried and couldn't figure out (do write what you did and where you got stuck), and state what gaps remain.”

Where should I list this information? Should I put it in the preregistration or as a comment on the google document?

I'm a little confused about the analysis plan so should I make a comment in that section?


It would be best to communicate issues to us as early as possible by email so that we can help resolve them. If we were all unable to resolve them, or they were found at the last minute, please write them in the relevant section of your pre-registration submission.

Please write it clearly in the text, highlighted, not as a Google Doc comment, which might get lost or go unnoticed when you export the document.


I understand that there are a lot of submissions that you and the tutors need to review within a short period of time, but may I ask when we will be able to get our feedback on both the Qualtrics and the article analysis? As the deadline is approaching, I just want to double-check.


Thanks for the note; there might have been some misunderstanding, which is why I sat down to write a Moodle notice that you should receive shortly. I'll copy-paste from that email:

Feedback schedule

Since this year we have ~50 students in PSYC3052 and PSYC2071 doing replications, that means a very heavy workload for the TAs and myself, and so even if we wanted to, it would be impossible for us to provide feedback for every submission. Keep in mind that the articles are also new to the TAs and me, so we are struggling to understand them with you, all of them simultaneously.

Therefore, at the beginning of last class I explained that you will receive all of the feedback on the Qualtrics, article analysis, and pre-registration, together with the feedback from your peer-review, one week after you submit your pre-registration first draft. You'll then have one week to revise it all, and submit an updated pre-registration and Qualtrics.

I hope that makes it clearer when to expect the feedback.


In order to compute chi-square in R, I need to input the actual number of participants who chose Ann and Barbara. Take the question on economic terms as an example. As reported by Shafir, the proportions of participants who chose Ann and Barbara are 71% and 29%. The actual numbers would then be Ann: 150*71% = 106.5, Barbara: 150*29% = 43.5. However, fractional counts can't occur in reality. Hence, I am wondering if I need to use rounded numbers, such as 107 & 43 OR 106 & 44, to calculate chi-square, or whether I should just use the fractions instead.


Yes, it's a shame percentages aren't reported as counts and with fuller data. I'm afraid these need to be rounded. Since what we're interested in is effect size calculations to be used in a power analysis, the best approach is to do the rounding that leads to the weaker effect – in this case 106 versus 44 – so that the effect size is smaller and our power analysis leads to a larger required sample size. The differences will not be big, but still, it's worth aiming for larger rather than smaller samples.
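A quick way to verify which rounding gives the weaker effect, sketched with Cohen's w against an assumed 50/50 expected split (the expected proportions are an assumption for illustration):

```python
from math import sqrt

def cohens_w(counts, expected_props):
    """Cohen's w for a chi-square goodness-of-fit comparison."""
    n = sum(counts)
    observed = [c / n for c in counts]
    return sqrt(sum((o - e) ** 2 / e for o, e in zip(observed, expected_props)))

# The two candidate roundings of 71% / 29% of 150 participants
w_107_43 = cohens_w([107, 43], [0.5, 0.5])
w_106_44 = cohens_w([106, 44], [0.5, 0.5])

# 106 vs 44 gives the smaller effect, hence the larger (safer) required sample
```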


In your general comments, you have instructed us to remove all exclusion criteria that are not specific to our replication study. May I clarify if the removal of exclusion criteria questions (e.g. English and seriousness) applies to both the Qualtrics survey and the exclusion criteria section of the pre-registration?


Yes, please do not include those in the Qualtrics or in the pre-registration. I’ll take care of all of those for everyone together.


In the “description of results of the original article” section, you left a comment, “No calculated section here?”, on the reported results for “Attractiveness of options”. The author used a paired-samples t-test to calculate the result of this section. However, calculating the effect size and CI for a paired-samples t-statistic requires more information (e.g. mean and SD), and I could not find an online tool to compute the effect size and CI. As a result, I left the calculated section blank.

Is there an online statistical tool to compute the effect size and CI of a paired-samples t-statistic using only the t-score and N?


Yes, there is, and it's included in the replication guide. It had a bug until today, which someone pointed out to me, but I think it's sorted now. See:


1. In the original article they did a comparison between the following three conditions, as indicated in the chart. Does that mean that for the data analysis plan we have to follow those three conditions as well, or do we choose the ones we want to address?


What did you mean by “ones we want to address”?

Since this is a replication, we want to address what the original article addressed, so we try to at least follow the analyses the researchers did in the original manuscript. You can add more if you think they missed something interesting and worth mentioning in the pre-reg.

Original article methods versus replication methods


I am now revising my pre-registration and have encountered some problems related to ordering and formatting. As you suggested in the general comments, we should first analyse the original article, then state what we will do in the replication. Yet in my study, the materials used in the classic study and the replication are basically the same. Therefore, I have written up the original methods in the format of the article analysis (the Word doc you kindly provided) and did not include the materials. Should I put the materials in the methods of the replication or in the methods of the original article? Please kindly advise.


Please enter it in a section detailing the analysis of the original article. In the section about the replication, refer to the methods of the original article above, and detail all the deviations and additional analyses/extensions.


I had another question regarding the use of G*Power. The statistics mentioned in that paragraph are χ² = 5.34 and p = 0.02. So when I input values in G*Power, should my alpha error probability be 0.05 or 0.02?


The alpha we're aiming for across all of our power analyses is 0.05, with power of 0.95, unless I requested otherwise. The p-values indicated in the article are not relevant for the a-priori power analysis we're doing here.


Thanks for your reply. I have another question about computing the CI for my ANOVA design. As in the registration manual, there is another formula used. May I know what software is used to run the statistics? When I entered the formula in the R console (as mentioned in Lakens), it showed what you can see in the second picture.

The second picture showed an error installing the MBESS package.


This is a basic issue with R; I really wish HKU psychology would train students in R. But it is also a point for me to improve in the guide, so thanks for bringing this up.

That error message only tells you that apaTables depends on a package called MBESS. This command should have installed both:

install.packages("apaTables", dependencies = TRUE)

but if it didn’t, you can also do:

install.packages("MBESS", dependencies = TRUE)

If there are other such messages, just replace MBESS in the command above with whatever package is missing.

I changed the guide to include the “dependencies = TRUE” argument.


This is our question: about the Qualtrics, do we need to copy the template of the consent form and funneling sections from the Qualtrics template for MTurk experiments into our own survey? We got contradictory answers from the replication manual and the tutor. We looked at the replication manual and found this sentence: “You do not need to concern yourself with the consent form…. Please leave that as is in the template. I'll combine or redesign…”. Does that mean we DO NEED TO COPY the template into our survey or OMIT these parts?


This helped me realize that I hadn't updated things in the replication guide; thanks for pointing it out. I removed that requirement to try to ease things for students and make things simpler for you and others.

I added this comment in the replication guide, and removed references to the QSF template:

Students pointed out to me that this is different from my instructions in class:

1 - “Should be based on a provided template by the instructor. The template *.qsf file is in: Dropbox\Qualtrics and MTurk\Qualtrics template for MTurk experiments”

2 - “Create a new survey based on template provided using import (QSF file in the Dropbox)”

They're right. I removed this from the guide. I'll take care of the consent, funneling, demographics, and debriefing. Please focus on getting your experiment right. I'll add those afterwards when I run the data collection.


Yesterday I asked about the validation of answers in Qualtrics, and what I got was that there is a need to set a range for the answers (e.g. lowest possible freezing point of vodka/water = -459F; year Washington was elected president: min. = 1, max. = 2018) so as to minimize the possibility of random answers in the data set. However, according to the original article and my understanding, wouldn't that be giving a “clue” for participants to answer within a certain range? That “clue” of not exceeding a certain value may enhance the accessibility of the answers for them (which is not what the study intends to measure and should be avoided if possible). So, should I keep that validation or not?

I have consulted Boely and she has the same concern, yet I would like to hear from you before making my final decision.


Validations to make sure answers make sense are extremely important. An answer of year 2100 isn't meaningful, and validating that a year is a year doesn't give a clue about anything beyond the fact that they need to provide a year. It is also very important for being perceived as a serious researcher by your participants: if they see they can get away with answering year 1000000, then they're likely to answer meaningless garbage in other questions as well.

One should be careful with the ranges provided. Obviously, year 2019 makes no sense, so an upper range of 2018 is good, and 0 seems low enough. So I think the ranges you chose are reasonable. Also add an upper range for the freezing point.

Definitely keep the validations. They are essential for this experiment's success. Just make sure they're equivalent and exactly the same across “conditions” and in all variants of the same question.

A student doing the replication for the hindsight bias asked:

For the extension part, do I need to think of something that has never been done before, or can I refer to other studies related to hindsight bias? I ask because I found it very difficult to think of something new to use as my extension, as most of the variables have been done in the past.

My answer:

Extensions do not need to be completely novel, but they need to be something that hasn't been done with this specific experiment, something that makes sense, and something that would be interesting to investigate. For example, it's easiest to add dependent variables. In some of the other hindsight bias experiments, the dependent variables ask participants in the hindsight conditions how surprised they are by the findings (like the one I did in the first class). You could also do something in the foresight condition: after they have guessed, show them a result and ask them if they find it surprising; if it differs from what they predicted, ask them to explain why, and examine their answers. Think of how to use this or something similar. It could be something very simple.

It should also be a different extension than that of your classmate(s) doing a replication on the same article.


I also want to know whether I should include the one simple extension at this stage in the experiment design in the Qualtrics survey. And is it true that the assignment will be graded only on the revised pre-registration, as the Professor mentioned during the lecture?


The extension will be needed for the pre-registration phase, which will also include submission of a revised Qualtrics survey that should include both corrections based on the feedback and the additional extension. I'll announce this soon.

Extensions should be reported in the pre-registration in the same format as the one requested for the commentaries (on October 1st).

UPDATE, I sent this to the students:

Getting your emails and going over your reports, I realized that all of you had difficulties in calculating the effect size. I admit that with the repeated-measures ANOVA stats in this paper's within-design, it isn't straightforward to deduce the effect size. I asked some of you to redo the analyses, but since time is short, I will simplify things for you.

The effects in this paper are very large, which leads to a very small required sample, but we would like to have a well-powered replication. Therefore, in your reports, please report all the stats in the original paper, but instead of doing an effect size calculation and a power analysis, please indicate that we will aim for a sample size that is 2.5 times the original sample size, based on the recommendations of Simonsohn (the “small telescopes” paper; Simonsohn, 2015).
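The planned sample is then just arithmetic, rounding up to a whole participant (the original N here is hypothetical):

```python
import math

original_n = 98  # hypothetical original sample size
replication_n = math.ceil(2.5 * original_n)  # aim to recruit 245 participants
```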


I have encountered two questions regarding my replication article (pluralistic ignorance; Miller & McFarland). The experiment asks each participant to rate themselves and the average other on some personality traits, and the question designs are identical. I wonder whether the two ratings are independent or dependent.

My second question is that in the original study, they ran a lot of different tests to examine many correlations. For example, they conducted an ANOVA first to detect interactions between several variables. When we do the data analysis, do we need to conduct the ANOVA as well, or can we just go to t-tests for those already shown to be correlated in the article (and thus directly compare whether the correlations conform to their findings)? Also, for the non-significant results in the original study, do we still need to test them?


The phrasing of your question confuses me a bit, so I'm not 100% sure what you mean by dependent or independent, but if you mean whether these are dependent or independent measures, then according to the original design they are definitely dependent. This is because it is a within-subject design, with each participant answering in both experimental conditions. If it were a between-subject design, where every participant answers either the self OR the other question, then it would be independent measures, because they would not be completed by the same participant. Does that answer your question? Let me know if I misunderstood.

About your second question: great, thank you for asking that. What I highlighted in yellow didn't make sense, because these are correlations with their pre-test; I removed that highlight. You do not need to address these correlations – I wouldn't know how you would do that anyway.

Regarding significant/non-significant: we do not evaluate results by whether they were significant or not, but by whether the analysis is relevant and important. It could also be that their samples weren’t well-powered enough to detect significant effects. Which is only meant to say – yes, please perform their tests and analyses regardless of the outcomes of those tests. There’s no need to do a power analysis on those, but they should be part of your data analysis plan in the pre-registration.


1. As the article uses z scores to present their results without including the mean and standard deviation, I am confused about how to find the effect size. I used the z-score effect size calculator in Psychometrica, but I am not sure what to put in as the sample size.

I ended up using the group size as the sample size to calculate the effect size for each problem (see the attached 'Effect size' file for an example). For example, 170/2 = 85 in Problem 1, because they are comparing two groups?

2. There are two findings for each problem:

(1) whether the percentage of participants choosing the enriched option is significantly greater than 100

(2) comparing whether more people chose the enriched option in the 'choosing' than 'rejecting' condition

To calculate the required sample size for each problem, I could only figure out using z-score (differences between two independent proportions) in Gpower for the second type of findings. Is that correct?

And do we have to calculate the required sample size using the first type of findings as well (i.e. greater than 100)? And which test should we use in Gpower for it?

And how do we decide the minimum required sample size for our study because there are different sample sizes for each of the eight problems in the original study? I assumed that we would use the sample size of the problem with the smallest effect size (which is Problem 3 for me and I calculated that it required 904 participants with Gpower).

3. Since the mean and SD were not given by the authors, is there a way we could validate their results using their percentage, z-score and assuming 100 as the population mean?

4. In the original study, the experimental problems were inserted among other unrelated problems. Do we have to insert unrelated problems? If yes, is it up to us to decide how many unrelated problems we include in the Qualtrics, or is there a required number?

5. The way in which the replications were conducted (the paragraph highlighted after Problems 4 and 5) seemed to be different from how the main problems were conducted.

Do we have to replicate this effect in our study as well (i.e. having participants do Problem 4 or 5 twice in the experiment)?


  1. Mean and standard deviation for… what? The dependent variable here isn’t a scale, it’s a choice between two options, and the analysis is on comparing the proportions of how many answered each option between the two groups. The statistical test is therefore either using a chi-square or a binomial Z, which only require proportions. See section “Counts/proportions independent samples between-subject (Chisquare)” in the replication guide.
    1. When we need to use the sample size for each group but it’s not provided, it is okay to do what we can to estimate that number, and dividing the total number of participants by the number of groups is a good way to do that. Just be very clear in your reports about what was missing and what you estimated, so that readers will know and understand what you did.
    2. Be sure that you’re comparing the right proportions. In Problem 1, for example, it’s 64% versus 45% and not 64%-55%.
  2. A few things here:
    1. First, see 1b above to make sure you’re comparing the right proportions.
    2. About the deviation from 100, you’ll need to calculate a one-sample binomial z-score for that. I admit I also had to Google it to refresh my stats. If you go to: , you can run it there. Because we’re comparing the sum of two proportions, the maximum is 200 (100+100) and the expectation is that they will add to 100 (50%). The observed proportion is 119/200 = 59.5%, the sample size is 170, and the null hypothesis value is 50%. If you run that, you’ll get: z-statistic = 2.477, significance level P = 0.0132, 95% CI of the observed proportion 51.71 to 66.95, which is very close to what Shafir reported.
    3. Please calculate the needed sample size based on both, and aim for the larger of the two.
  3. See above.
  4. No need for unrelated problems. But please note in your report that I instructed you not to include those and that this is a deviation from the original design.
  5. If you’re 100% sure that these are replications and there are no changes from the previous problems, then there's no need to do the problems twice. But check that carefully.
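The one-sample binomial z above can be reproduced with a few lines of Python (standard library only; the numbers are the ones from the example, and the normal approximation is used, so the confidence interval from an exact calculator will differ slightly):

```python
from math import sqrt
from statistics import NormalDist

def one_sample_proportion_z(p_hat, p0, n):
    """Binomial z-test of an observed proportion against a null value,
    using the null proportion in the standard error (normal approximation)."""
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Numbers quoted above: observed 119/200 = 59.5%, n = 170, null = 50%.
z, p = one_sample_proportion_z(0.595, 0.50, 170)
print(round(z, 3), round(p, 4))  # z ≈ 2.477, P ≈ .0132, matching the calculator
```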


I have read the comments that you have given to my other classmates on the same article, and I think I have also made some similar mistakes and will make modifications accordingly. May I clarify about the comprehension checks? We only have to add comprehension checks at the beginning of the survey, and not for each problem?


About your question below, here’s the comment I wrote about Comprehension checks:

I really appreciate your taking the effort to write these comprehension questions, but they aren't very helpful here, for various reasons, mainly that they aren't about the contrast in the choice being made. Since they constitute a deviation from the original article, the costs outweigh the benefits.

Please remove comprehension checks throughout the Qualtrics.

So, no need for comprehension checks here.


Our article is Schwarz, N., Strack, F., Hilton, D., & Naderer, G. (1991), “Base Rates, Representativeness, and the Logic of Conversation: The Contextual Relevance of ‘Irrelevant’ Information,” and we are replicating Experiment 1.

Each condition has 11 participants and 44 in total. However, SD is not provided in the original article.

If the original article didn't provide the SD and only the mean and sample size are known, do we simply not report the SD in the article analysis?

Also, because it's a 2×2 ANOVA design, do we need to run six t-tests with effect size and power analyses? Since the SD information is lacking, how should we run the statistics?


These things are tricky; we need to do the best we can given what we have, and it’s not all organized in one tool or one analysis.

Boley’s tutorial, the tools in the Dropbox, and sections in the replication guide try to address these kinds of issues, such as:

Sometimes there is some missing information in the ANOVA or t-test. In such cases, you can use calculators such as:

  1. MAVIS (see “Effect size calculator” on the menu): can be useful for calculating effect sizes if you only have the t-statistic and the p-value.
    1. “How to Calculate Effect Sizes from Published Research: A Simplified Spreadsheet” (see “Calculating Cohen D with lacking info.xls” in the Dropbox).
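As a rough sketch of what these calculators do when only the t-statistic and group sizes are available, Cohen's d for an independent-samples t-test can be recovered as d = t·sqrt(1/n1 + 1/n2); the numbers below are made up for illustration:

```python
from math import sqrt

def cohens_d_from_t(t, n1, n2):
    """Approximate Cohen's d from a reported independent-samples t and group sizes."""
    return t * sqrt(1 / n1 + 1 / n2)

# Hypothetical example: t = 2.5 with 30 participants per group.
d = cohens_d_from_t(2.5, 30, 30)
print(round(d, 3))
```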

As for your second question.

I added a new section in the guide:

Power analysis can be done on the ANOVA f effect size, rather than on the t-test, which you might not have enough stats for here. There is lots of information about that in the replication guide as well, pointing to various videos and guides on the web on how to run that with Gpower. I might be able to help if you have further questions, but I’ll need you to share with me what it is that you’ve done and tried already.
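For reference, one standard route (not specific to this article; the F and dfs below are invented) is to convert a reported ANOVA F into partial eta squared and then into Cohen's f, which is the effect size Gpower expects for F tests:

```python
from math import sqrt

def cohens_f_from_F(F, df_effect, df_error):
    """Cohen's f from a reported F statistic, via partial eta squared:
    eta^2 = F*df1 / (F*df1 + df2);  f = sqrt(eta^2 / (1 - eta^2))."""
    eta_sq = (F * df_effect) / (F * df_effect + df_error)
    return sqrt(eta_sq / (1 - eta_sq))

# Hypothetical: F(1, 40) = 4.0
f = cohens_f_from_F(4.0, 1, 40)
print(round(f, 4))
```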


I am working on Hsee (1998)'s Less is Better article. However, I have difficulty computing the effect size for the joint evaluation in Experiments 2 and 4 of the original article. These versions should be a within-group design, so I cannot use the Excel spreadsheet on Moodle to get Cohen's d (otherwise the effect size comes out far too large). I have searched online for other calculators, but it seems I also need the correlation value, which I do not have. Can you give me some advice on how to compute Cohen's d for a within-group t-test?


If I asked you to combine two experiments, then simply analyze each separately, report both required sample sizes, and then summarize that we’ll aim for the higher one. No need to do more than that at this point.

About your other question, throughout the WIKI I answered similar questions, with the suggestion that: “when we don’t know the correlation we use proxies and use “best estimates”. A typical estimate when correlation isn’t provided is 0.5.”.

Please do indicate that you “input correlation as 0.5, explaining that as the original article did not mention the correlation, typical estimate of 0.5 will be used”.
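For a paired t-test, the 0.5 estimate enters through the SD of the difference scores. A minimal sketch (the summary numbers are invented):

```python
from math import sqrt

def dz_paired(m1, sd1, m2, sd2, r=0.5):
    """Cohen's d_z for a paired design from summary stats, assuming
    correlation r between the two measures when it isn't reported."""
    sd_diff = sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)
    return (m1 - m2) / sd_diff

# Hypothetical means/SDs; with r = 0.5 and equal SDs, d_z equals (m1 - m2) / sd.
d = dz_paired(4.2, 1.0, 3.6, 1.0)
print(round(d, 2))
```

A handy check: with r = 0.5 and equal SDs, sd_diff collapses to the common SD, so d_z matches the between-subjects d.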


In Experiments 2 and 3, there are 2 versions – separate evaluation and joint evaluation. Some of us think the IV is the mode of evaluation. However, I see Experiments 2 and 3 as each containing 2 different sub-experiments – separate evaluation and joint evaluation – which is why 2 t-tests were conducted, one for separate evaluation and one for joint evaluation. So, for example, for Experiment 2 in my article analysis, there are 2 sub-experiments that share the same IV, which is the proportion of filling relative to the cup size, instead of the evaluation mode.

Because of the difference in how we see the experimental design, and because the sample size for each condition is not given (only the total sample size for the experiment as a whole), we have different ways to calculate the sample size. So in Experiment 2, I think the participants were divided into 3 groups: 2 groups in separate evaluation mode and 1 in joint evaluation. But my peer simply divides them into 2 groups: separate and joint evaluation. So I have 2 required sample sizes for Experiment 2 (one for separate and one for joint evaluation) but my peer only has 1.

Because of the big difference in the way that we analyse the article, I would like to know if both interpretations are acceptable?


Thanks for writing in and asking. These are -very- important.

This is an excellent demonstration of the difficulty of replications and the flexibility in deciding how to conduct a study. Different researchers plan data analysis and analyze studies in different ways. It also shows that with my limited time I was not able to catch things that you as peers were able to observe by comparing your different approaches.

So, excellent! This is what open-science is about.

If I understand and remember correctly, I think you all agree that there are the following conditions:

  1. Separate A
  2. Separate B
  3. Joint evaluations of A and B

I recall Hsee’s point being that separate evaluations show the bias whereas joint evaluations do not. So, analyze separate and joint separately:

  1. Compare A to B in separate. Between subject independent samples t-test.
  2. Compare A to B in joint. Within-subject dependent samples t-test.

So, there are three conditions, and two main contrasts, and you’re right that some students indicated the IV as the comparison of the tests rather than two comparisons between A and B. In a way, there are two IVs:

  1. Comparing A to B
  2. Method, between (separate) or within (joint)

It is generally possible to run one test that contrasts the two comparison effects #1 and #2, but it’s complicated because of the different experimental methods, and I think even Hsee didn’t do that. He simply reported the reversed effect size.

In any case, what you did seems closer to what Hsee did and intended. I would suggest adding all I explained above to your pre-registrations and building on that.


In the experiment, there are phase 1 and phase 2, where phase 1 is the matching procedure, and participants will answer questions in phase 2 based on their answers in phase 1.

I'm confused about the experimental design:

1. whether phase1 and phase2 share the same IV, thus resulting in a 3×1 design

IV: riskiness of options
[phase 1] DV1: amount to be won in gamble S
[phase 2] DV2: choice of gamble
[phase 2] DV3: strength of preference for the chosen option

2. As phase 2 is answered based on phase 1, should phase 1 be included as one of the DVs? Or should I just include the main DVs from phase 2?

3. Concerning the result under “Choices”, may I ask what “x-square” means, as in which type of statistical test is it?


To be clear, the manipulation (IV) does not affect Phase 1, it only affects Phase 2.

The analysis they did in “matching” indeed confirms that the random assignment to a condition had no effect. Another interesting piece of information is whether there are differences in the overall attractiveness of choice R versus choice S. Phase 1 is only intended to make the two options equally attractive. Once you get the number that makes both similarly attractive, you proceed to a between-subject assignment into one of the three conditions.

X2 is chi-square. So, comparing the proportions of the three conditions (you can get the data from figure 1).

The last part is a oneway ANOVA of three conditions with t-test contrasts of each pair of the three (1-2, 1-3, 2-3).
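If it helps to sanity-check the figures, a Pearson chi-square for such a table of condition-by-choice counts can be computed with plain Python (the counts below are hypothetical placeholders, not the values from Figure 1):

```python
def chi_square(table):
    """Pearson chi-square for an r x c table of counts (standard library only)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical choice counts across the three conditions:
table = [[40, 35, 20],   # chose option R
         [20, 25, 40]]   # chose option S
chi2 = chi_square(table)
print(round(chi2, 2))
```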


Then should I skip the analysis of results of phase 1 and move on to phase 2?


You should report and analyze everything in your target paper but focus your power analyses for sample size calculations on the effects from Phase 2.


For the results of “Matching”:

1. The F-statistic is reported as “<0.3” in the original study; should I take it as “=0.3” when calculating the effect size?

2. For “attractiveness”, I assume it is a paired-samples t-test. But the correlation was not reported, so how can I compute the effect size on Psychometrica?

For the results of “Choices”:

1. The tools given in the Dropbox only allow me to calculate the effect size of a chi-squared test with df = 1. However, one of the tests has df = 2, and I cannot find any online calculator that solves this. May I know how I can calculate the effect size with df = 2?


In response to your questions:

  1. Yes, that’s correct.
  2. As I noted, the power analyses should be on the effect size from Phase 2. But since you asked, and that is a generally very good question for paired t-test, when we don’t know the correlation we use proxies and use “best estimates”. A typical estimate when correlation isn’t provided is 0.5.
  3. Two things I’m a bit confused about from your question, so here is a bit of information, but perhaps I didn’t understand your question well.
    1. First, why do you need an online calculator? We do power calculations with Gpower, and that’s all in the guide.
    2. Please see the guide on how to conduct chi-square effect size calculations with Gpower. Let me know if something is missing.
      1. Example: Videos for conducting G*Power power analyses for different statistical tests:
    3. Second, Gpower uses w for these, but note that chi-square is generally already a form of an effect size.
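For concreteness, w as Gpower computes it (via the “Determine” drawer) is just the root of the summed squared deviations of observed from null cell proportions; the proportions below are illustrative:

```python
from math import sqrt

def w_from_proportions(p_observed, p_null):
    """Effect size w from observed vs. null cell proportions:
    w = sqrt(sum((p1 - p0)^2 / p0))."""
    return sqrt(sum((p1 - p0) ** 2 / p0 for p1, p0 in zip(p_observed, p_null)))

# Illustrative: observed 59.5%/40.5% split against a 50/50 null.
w = w_from_proportions([0.595, 0.405], [0.5, 0.5])
print(round(w, 3))
```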


In the instructions of Experiment 1 in the original article (Zeelenberg et al., 1996), the participants were asked to choose either Gamble A or Gamble B, and learned about possible feedback on foregone outcomes depending on the condition they were in. For example, in Risky ONLY, the instructions in the questionnaire stated that the participants will always learn the outcome of Gamble A and the outcome of their chosen gamble (either A or B). However, the original article did not mention whether the participants actually learned the outcome of the gambles at the end of the questionnaire.

My question is: do I need to inform the participants about the outcome of their chosen gambles at the end of my Qualtrics survey (whether the outcomes of Gamble A and/or Gamble B are revealed depends on the condition they are in and the choice they made)? Adding the outcomes of the chosen gamble simply fulfills the promise made in the instructions part of the questionnaire, and should have no effect on the research results (since it will be added at the end of the questionnaire).


That’s a really good point, I had not considered that in the design before. I really appreciate you noting this detail.

Yes, it would be fair to provide them with such feedback, but this is pretty advanced Qualtrics stuff, going beyond the call of duty for such a replication.

Let me think how to make it easiest for you. Qualtrics has a random number generator: see the section “Random Number Generator”. What you could do, depending on which condition they’re assigned to, is display one or both of the following:

  1. Risky bet result: ${rand:int/1:100} – if the number is 1-35, it means you would have received 130USD for your choice; if the number is 36-100, it means you would have received nothing.
  2. Safe bet result: ${rand:int/1:100} – if the number is 1-35, it means you would have received nothing; if the number is 36-100, it means you would have received… (ENTER PIPED SELECTION FROM BEFORE)

If you want to further complicate your life, you could also save that random number as an embedded variable and pipe that instead. (also explained in that link)
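To make the branching concrete, here is the same feedback logic sketched in Python (the function and variable names are mine; in Qualtrics this is done with the piped ${rand:int/1:100} text and display logic, not code):

```python
import random

def gamble_feedback(safe_amount):
    """Sketch of the feedback rules above: one 1-100 draw per bet,
    with 1-35 serving as the 35% branch for each."""
    risky_draw = random.randint(1, 100)
    safe_draw = random.randint(1, 100)
    risky_msg = ("you would have received 130USD for your choice"
                 if risky_draw <= 35 else "you would have received nothing")
    safe_msg = ("you would have received nothing"
                if safe_draw <= 35 else f"you would have received {safe_amount}USD")
    return risky_msg, safe_msg
```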

A simpler option, which would also be acceptable yet less fair, would be to simply note at the end that the purpose of the experiment was to determine how receiving feedback would affect choices in thought experiments, and that no actual feedback is given.


On the replication note posted, it was mentioned to “combine two demonstrations into one experiment”. I assume the two demonstrations to be the two scenarios highlighted in the article. However, may I ask why the non-highlighted demonstrations are not necessary? (One of the demonstrations was a between-subject version of the original highlighted within-subject version, and another changed the amount of the gamble.)

There are also other problems regarding the extra readings I did on disjunction effect. Below are the two readings I have consulted. I will attach their files in this email.

Kühberger, A., Komunska, D., & Perner, J. (2001). The disjunction effect: Does it exist for two-step gambles? Organizational Behavior and Human Decision Processes, 85(2), 250-264.

Lambdin, C., & Burdsal, C. (2007). The disjunction effect reexamined: Relevant methodological issues and the fallacy of unspecified percentage comparisons. Organizational Behavior and Human Decision Processes, 103(2), 268-276.

For the 2001 study, they did four experiments in total, with the first one a close replication (though changed into a between-subject design) and the remaining three more like extensions (the last experiment was a within-subject design).

The article published in 2007, on the other hand, criticized the 2001 replication for having used a between-subjects design in three of its four experiments, claiming that by definition the disjunction effect has to be “within-subject” in nature.

After reading both articles, I was confused in certain ways. Although I see the grounds for the 2007 article, I have doubts about its claims. First of all, I think the argument largely rests on the premise that “the definition of the disjunction effect is within-subject in nature”, which I am not sure Shafir and Tversky would have agreed with when they coined the term (as in 1992 they obviously did both between- and within-subjects designs). Secondly, I had a hard time understanding its assumption that “the violation of STP is by chance” (please correct me if I misunderstood the article). Thirdly, I find the discussion hardly convincing, as the sample sizes seem very small (though I know little about statistics) (e.g., “Of the total 55, 14 subjects met criteria A and B. Of these 14 subjects, 8 did not violate STP and 6 did.”). Along with the limitations mentioned in the article, it does not quite explain the results found in the 2001 paper's one within-subject experiment.

Beyond the questions raised by the two articles, going back to the 1992 article raises more questions. I think the central one so far would be: if the conceptual definition has evolved (or was simply not well elaborated back then), would doing a close replication of the original article be meaningful enough (assuming we use the definition claimed in the 2007 article)? And while there are other studies trying to replicate with extensions, is replicating the earliest publication enough to address the many newly arisen issues (brought up by the extensions)?

This is a very long email. Thank you so much. I’m sorry that the attached files have already been highlighted; I hope they do not bother you.


First of all, it’s terrific that you read up on the follow-up literature; that’s admirable. This line of thinking and questioning shows depth and understanding. In PSYC3052 we discuss all kinds of things regarding the replication crisis, why replications and pre-registrations are needed, and the importance of well-powered samples and open, transparent sharing of all data. Your/our replication, as far as I know, stands out in that regard, as we aim to do a close replication of classic findings to try and re-test the effects. What followed after that is interesting and relevant for your final writeup as you discuss your findings, but it doesn’t directly impact the importance of what you’re doing.

Regarding the question of between versus within… Since you raised this controversy, if you’d like, and you think this is interesting, what you could do as an extension is compare the between-design to the within-design. You’ll randomly assign participants to either the conditions of the between-design or a condition of the within-design (random order assignment). You’ll then be able to compare the two effects and see if there’s a difference. Your findings could, ideally, help resolve the mystery of whether this works in a within design, a between design, or both 😊 That would be a terrific contribution to the literature.

Regardless if you’re interested in this kind of extension or not, if you’re asking what design to follow, I ask that you try and follow what the original article did as closely as possible. When I go over the Qualtrics, I’ll try and assess if we need any adjustments.

Also, to clarify, when I said combine into one experiment, what I meant was that each participant will do all problems, in random order. For a between-design, make sure that for both problems they’re assigned to the same condition. If within, simply randomize the order of the entire experiments.

As to why I left out what I didn’t highlight: I wanted to keep things simple and doable. Doing the last part combined with the first two would have required a design like the one I outlined above for the extension, and I didn’t want to demand this kind of advanced setup from an undergraduate class. If you want to do that, it would be very valuable.

It’s not as easy as I made it out to be. But to simplify things - you can generally analyze them separately and then compare the effect-size (Cohen’s d) between the two designs. They are, kind of, in a way, comparable. No need to go into the fine details, it’s already above and beyond an undergrad class.


I had been working on the article analysis for Tversky and Shafir (1992), the disjunction effect, and handed in the assignment yesterday. However, after I handed it in, it occurred to me that the chi-square test might not be applicable to repeated measures. As the second experiment in the paper (choice under risk) is within-subject, Google recommended Cochran’s Q test and the McNemar test. However, as I tried to redo those parts, I could not find any online statistical tool that supports these calculations. Meanwhile, SPSS requires the entry of raw rather than summary data…

For the chi-square tests, I calculated Cramér’s V, used the formula w = V × sqrt(k − 1) to find the effect size w, and put the effect size w into Gpower for the power analysis.

However, for the Cochran’s Q and McNemar tests, Google recommended using Fisher’s exact test for the effect size calculation. Yet I have a hard time finding the right tool for power analysis with Fisher’s figures.

May I know if there are any solutions to the problems? What test should I be using ?


First off, I just want to say: great. I love seeing students take the initiative to try and figure out effect sizes and understand articles. You might think it’s not a big deal, but it is. You’d be surprised how few scholars know how to do that well, and I was surprised to see you calculating w by hand and then going in depth into the differences between chi-square and McNemar, between and within.

It would have helped if you had shared a bit more about which part of the study you thought was between and which within, and why you decided to switch between the two designs.

If we take the first experiment, if I recall correctly, it was sort of a between-design (unfortunately not randomly assigned, as it should have been, something you might want to discuss in your article analysis/pre-registration/final report), since each of the conditions was assigned to a different set of participants.

Perhaps you were referring to the second “choice under risk” experiment, and that does seem to be a (strange) within design. You, or one of the other students, reminded me that later in that article they also did a between-subjects design for that same experiment, which you could analyze in the same way you did the first, but you’re right that this isn’t the section I originally highlighted.

Per your question:


or for the odds ratio


You can run these online using a tool like:

About how you hand-calculated w: any reason why you didn’t use the built-in Gpower calculator with “Determine”?

Regardless, whichever method works best for you, it would be great if you’d consider sharing what you did with others by adding it as a suggestion to the replication guide. I would love to have students contribute more of the things they figured out while doing this that aren’t in there.
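For what it's worth, the hand formula and the Gpower route agree: Cramér's V = sqrt(χ²/(N·(k−1))) with k the smaller table dimension, so V·sqrt(k−1) reduces to sqrt(χ²/N). A quick check with invented numbers:

```python
from math import sqrt

def w_from_cramers_v(v, k):
    """The hand calculation from the question: w = V * sqrt(k - 1)."""
    return v * sqrt(k - 1)

def w_from_chi2(chi2, n):
    """Equivalent shortcut: w = sqrt(chi2 / N)."""
    return sqrt(chi2 / n)

# Hypothetical 2x3 table: chi2 = 6.0, N = 100, so k = min(2, 3) = 2.
v = sqrt(6.0 / (100 * (2 - 1)))
w1 = w_from_cramers_v(v, 2)
w2 = w_from_chi2(6.0, 100)
print(round(w1, 4), round(w2, 4))
```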


I have some questions regarding the statistical analysis for my replication study. I realized I’ve made many mistakes in the previous article analysis, and I’m trying to make things right. My assigned article is: Tversky & Shafir’s (1992) Disjunction Effect in Choice Under Uncertainty. I’ve been struggling to solve it on my own but I really need your feedback at this point.

There is no statistical information in the article; it only states the proportion for each condition. I’ve been advised to use a chi-square test for the effect size calculation, and I used RStudio for the calculation.

Experiment 1 Effect size calculation:

I used the effect size from the R-studio result on G*power application to calculate the required sample size, but the output gives: Required sample size: 5

Analysis: A priori: Compute required sample size
Input: Effect size w = 2.022, α err prob = 0.05, Power (1-β err prob) = 0.95, Df = 4
Output: Noncentrality parameter λ = 20.4424200, Critical χ² = 9.4877290, Total sample size = 5, Actual power = 0.9675599

This seems very wrong and I couldn’t find a way to solve this problem. The same issue occurs when I plug in the effect size I calculated the same way for Experiment 2. Maybe it was wrong to use chi-square in the first place? I’m sure the calculation of the effect size itself was correct, because I was advised by Boley on the calculation.

Also, I saw your wiki answer to a student’s question about the effect size of Experiment 2. Experiment 1 is between-subjects (Pass: 67, Fail: 67, Disjunctive: 66) and Experiment 2 is within-subjects (98 participants). I’ve tried the McNemar test as you suggested, but failed. It seems that the McNemar test only works for a 2×2 contingency table. However, as you can see, Experiment 2 (Choice Under Risk) has a 2×3 contingency table. I would really appreciate it if you could guide me through these issues.


Let me try and answer what I can based on what you shared:

  1. Experiment 1: the most important point here is the shift regarding the last option: Pay 5$ to retain rights, right? That’s the whole point of the article. Therefore, it’s enough to compare 61 to 30 and 31, and you can basically do your effect size calculations only with these.
  2. To verify your analyses, it’s always best to try another calculator. For example, you can use one of many chi-squared online calculators (like this one - , just be sure to convert the percentages to counts) and then convert the chi-squared to Cohen’s w used by Gpower using R commands like cohens_w ( ). You might be able to see where you went wrong there.
  3. There are all kinds of ways to do that, but I want to keep it simple for you. How about first comparing lost to disjunctive, then won to disjunctive? Report both; the results should be very similar. Do your estimations based on that, and with the required sample size for each of those, simply add half a sample for the three conditions (multiply by 1.5). That should be a good enough approximation for now.
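Point 3 can be sketched end to end in Python: turn reported percentages into counts, run a 2×2 chi-square for one pairwise comparison, and convert to w for Gpower. The percentages and group sizes below are placeholders, not the article's exact figures:

```python
from math import sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Placeholder: 61% of 67 vs. 30% of 67 choosing the option in question.
yes1, n1 = round(0.61 * 67), 67
yes2, n2 = round(0.30 * 67), 67
chi2 = chi2_2x2(yes1, n1 - yes1, yes2, n2 - yes2)
w = sqrt(chi2 / (n1 + n2))  # effect size w for Gpower's chi-square test
print(round(chi2, 2), round(w, 3))
```

Once Gpower returns the required N for the pairwise test, multiplying by 1.5 approximates the total needed for three conditions, as suggested above.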

These are excellent questions, and they really show me you’re thinking about things. I do appreciate you putting in this effort, and challenging Boley and me to also do better in understanding and tackling these issues.

Final communications by Fili given a few more emails:

(This email is meant to make it simpler for both of you replicating Tversky & Shafir, please see decision at the end of the email)

First off, I want to say I really appreciate both of you putting the effort into this.

From all the emails I can see you spent considerable time on this, and went above and beyond the call of duty for what is expected of you in this course. Regardless of the outcome, you should be commended for caring and trying, hopefully you gained something from the process as well. This is why it’s very important you communicate your difficulties to me, otherwise I wouldn’t know how much you’re struggling with these.

I should also admit, in case you think you might not know these because of lacking HKU training, that these are also very difficult for me and Boley, and I imagine they are to other researchers as well. This is not straightforward statistics. Many professors and experienced scholars would need to look these things up and face the challenges you’re facing.

I started off writing email explanations for all the issues you two have been emailing about, but finally decided to make it far simpler for you; you’ve struggled enough with these:

Please write clearly in your report that we were unable to perform an exact effect-size calculation for this study (given that this is an undergraduate project) and instead will follow the rule of thumb of 2.5 times the original sample size as suggested by Simonsohn (2015) in the article small telescopes.

I also have a request for both of you: please document your thought process and these emails at the end of your supplementary materials when you submit. I want to remember us communicating this and the effort you put in, so that Boley and I can take that into account when grading. Please point to the end of the supplementary materials to indicate what you did try in the process, for me to remember and for future reference, so that others understand the difficulties.

That should be good enough at this point of time.

Apologies that this took you so long; your effort is appreciated.


This is the question I have: is there a way to perform a two-way ANOVA with only the mean, SD, and sample size of the conditions? In the article, the authors did not perform a specific analysis on the highlighted 4 conditions, but instead did an overall analysis of the results. Bill suggested I perform a two-way ANOVA on the conditions for the data analysis part of the assignment. When I learned how to perform a two-way ANOVA in my statistics course, the raw data of all conditions was required to calculate the F statistic. However, the authors only provided the means, SDs, and sample sizes for the 4 conditions.


Not needed, but since you asked - You could calculate those by hand, but I don’t know of an easy online calculator you could use for that. Let me know if you/Bill find one.

The good news is that this isn’t really needed here, and this is more stats than I expect you to do for this class. I honestly doubt more experienced students/scholars would tackle that well.

What we're focusing on here is the t-test between the positive and negative conditions, and that's all that's really important. There are plenty of calculators for that which take N, M, and SD. Once you have calculated the effect size (Cohen's d), you can compare the effect sizes between the physician and the patient conditions, and use the smaller effect to do your power analysis, to ensure we have a large enough sample to detect both.
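As a sketch of that summary-stats route (the means, SDs, and Ns below are hypothetical placeholders, not values from the article):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups from summary statistics,
    using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical positive-vs-negative comparisons for the two scenarios:
d_physician = cohens_d(5.2, 1.1, 40, 4.6, 1.3, 40)  # ~0.50
d_patient   = cohens_d(5.0, 1.2, 40, 4.1, 1.2, 40)  # 0.75

# Use the smaller of the two effects for the power analysis,
# so the sample is large enough to detect both.
d_for_power = min(abs(d_physician), abs(d_patient))
```

The same formula is what the online calculators apply under the hood when you enter N, M, and SD for each group.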


I have some confusion while doing my article analysis. My assigned article is 'Outcome bias in decision evaluation'. I am responsible for replicating cases 1 to 4 in experiment 1. However, the results shown in the article cover all 15 cases, so I am not sure how to get the F statistic, effect size, and power analysis for the first 4 cases. Thanks, and sorry for asking so late.


The general guidelines are: do the best you can given the information/stats you have, and explain in depth everything you decided and did. You have several choices, and whichever you choose, you'll need to explain and discuss it in detail.

First, you can take the stats provided (M, SD) and calculate based on that. In a 2×2 you have a few possible contrasts. If you're missing some stats (like correlations), you can make approximations (like 0.5); see the WIKI for more.

Another option is to use the estimate from the stats on the 15 cases. The original design was a within-subject design, where all participants did all 15 scenarios. So, broadly speaking, the effect size should be comparable whether it's 4 or 15; just explain that. Your focusing on the first 4 is only to simplify the design for re-running the study; it doesn't have to affect your effect-size calculations much.

Hope that’s clear and makes sense.

I was asked about randomization and so added these instructions:


  1. Display the 7 questions in the same page.
  2. Randomize the order of display for the 7 questions. Ignore instructions for display order.
  3. Ignore instructions on page display. They didn’t have online surveying software back then and we’re not printing these.
  4. Switch from a within-subject design to a between-subject design. The MTurk participants will kill us and drop out if we do a within-subject design. We'll have to adjust and discuss the implications in the final report. So each participant will only see one of the conditions, randomly assigned, evenly presented (the checkbox in the Qualtrics randomizer). If you or the other replicators don't do this, I'll change it; it's an easy fix.


I would like to confirm how I should deal with the 'missing values' from participants who give inconsistent responses (e.g., should I simply list this as an exclusion criterion, as the authors did not specify how to deal with these missing values?). This part is on p. 47 (first paragraph under 'Results and Discussion' in the original article).


Yes, the meaning of NA is very close to that of an exclusion criterion in the pre-registration, so that works; please do that. The way statistical programs deal with NAs is that they are not included in the statistical tests.
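As an illustration with made-up responses, this is effectively what listwise NA handling does before any test is run:

```python
# Hypothetical responses; None marks an inconsistent answer coded as NA.
responses = [4, 5, None, 3, None, 4, 2]

# Statistical software drops NAs before computing the test statistics;
# excluding those participants up front gives the same analyzed sample.
analyzed = [r for r in responses if r is not None]

n_excluded = len(responses) - len(analyzed)    # 2 participants excluded
mean_response = sum(analyzed) / len(analyzed)  # mean of 4, 5, 3, 4, 2
```

So listing "inconsistent response coded as NA" as an exclusion criterion and letting the software drop NAs lead to the same analyzed data.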


My replication article involves a news article that has to be used as the "sample case" for participants in the original experiment. Yet I can only read the summary of it instead of the whole article; otherwise, I have to pay and subscribe to the New Yorker for the experiment. What am I supposed to do?


That’s the tricky business with old articles. We need to do the best we can given what we have.

The summary is really all you need, try your best to reconstruct the scenario given the summary. MTurk participants have very limited attention span, so you’ll need to use that summary.

I went ahead and paid the 12USD to subscribe to the New Yorker and get the full story.

You can see it in the Dropbox

Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Hamill et al 1980

You’ll see that it’s -very- long, so it’s just for you to read this in case you want to know more.

Focus on the summary, adjust it so that it makes sense in the experiment.


I am assigned to the Hamill, 1980 paper (Insensitivity to Sample Bias: Generalising From Atypical Cases) and wanted to ask about the dependent measures for the experiment to be replicated. Only 2 of the 7 dependent measure questions asked to subjects are provided in the paper and my partner and I both cannot seem to find the full dependent measure questionnaire used in the study anywhere online either.

Hence, I wanted to ask whether I should use only 2 questions, make up the rest of the questions myself for the replication or whether it is possible to find the actual dependent measure questionnaire provided to subjects in the study.


First off, it’s important to clarify a common misunderstanding about this project that I’ll talk about in the next class – there are no partners in these projects, these are independent projects. Here’s from the syllabus:

Students will conduct pre-registered replication and extension of classic findings in judgment and decision-making. Students will be randomly assigned an experiment in a classic article and will follow a structured procedure to attempt a replication with a simple extension.

Each classic article will be the target replication article for two students, who will work independently on the same article without any information-sharing or collaboration. This method will be used to educate students about different perspectives on conducting replication and analysis of the same article, and the two students will peer review one another's work, for both the pre-registration (with analysis and Qualtrics survey), and the final report, and will use the process to improve on their own work. The idea is not to have identical outputs, but for each of the students to do the best they can on their own and then compare their own approach to that by the other student.

I understand if there was some confusion over this for this stage, but just to clarify this from now on.

About your question:

These oldies are difficult to replicate because they don't provide all the details; that's part of the challenge. I don't think you'll find those details anywhere, so you need to do the best you can with what you have to try and reconstruct that. One of the points of having you and the other student replicate this independently is that we can increase the chances of getting this right. Please try to come up with 5 other similar items regarding attitudes towards welfare, and you'll get feedback from me, your TA, and eventually your peer in the peer review. Do the best you can given what you have in the article.

I will try to contact Prof. Wilson and Nisbett, who are still active researchers, but I’m not sure whether we’ll hear back.

UPDATE: Please see the downloaded New Yorker article I placed in the Dropbox: Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Hamill et al 1980\Article

UPDATE: Please see reply from Timothy Wilson in the Dropbox: Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Hamill et al 1980


As you may have discovered, the information provided in the original article is incomplete (probably one of the reasons why we have to do a replication), so I am confused about some of the reporting methods in the analysis plan.

1) Judgment criteria: Hamill et al. (1980) stated on p. 581, under the heading "Effects of Exposure to the Sample": …The manipulation of order of presentation of sampling information (before or after reading the article) had only trivial and nonsignificant effects on attitudes toward the welfare recipients for both typical and atypical sample conditions. Similarly, the two control groups differed only trivially, so results for the two groups were combined.

I am not sure what "differed only trivially" means, as the criterion was not mentioned in the original article. Yet I think good science practice should disclose as much meaningful information as possible. Therefore, I would like to seek your advice regarding the judgment criteria in my article.


Regarding your question: it just means that the order of presentation didn't seem to have much effect, and when this happens and two conditions have a similar "meaning", like control, they are sometimes combined in the analysis.

I agree, these shortcuts are not helpful for high-quality cumulative science, but that's how things were back then. Aim, at the very least, to do what they did, and then, if you can, try to improve. For example, it would be helpful if you ran the analyses and reported the effects for order and the differences between the control conditions, regardless of whether they were significant or not.


In my article, it says there are seven 5-7 point scales assessing attitudes toward welfare recipients, but only two were listed as examples in the article. In the General Guidelines, you mentioned not to reinvent the scales. Should I just leave it at two questions in the Qualtrics survey?


We do what we can with what we have. We don't have those stimuli, and anything we invent might just add noise to an already very sensitive replication attempt. Only use what is provided; please do not add/reinvent anything. However, please do note that as a deviation from the original, and explain why we couldn't do all seven.


In the article there are 4 problems/scenarios, with a different number of participant groups in each. I was confused about whether I need to make just 1 Qualtrics survey or 4 different ones. If it is 1, then how do I integrate the different numbers of groups?


Please do all 4 in one Qualtrics, and randomize the order of the scenarios. The number of participants is not important for our replication, we’ll randomly assign participants to one of the conditions, evenly (checkbox in the randomizer).

So, random assignment to conditions in each experiment, and then random display order of the four experiments in the same Qualtrics survey.

Boley added:

→ 1 qualtrics survey with problem 1-4 in randomised order. In terms of conditions, you will only have to randomise the three conditions for problem 4: a+b, c+d and e+f

→ Simply add the neutral condition (e+f) to problem 4. See the above for the three conditions.


  1. On page 347, paragraph 2, line 2, it writes: "We assume that the person is aware of inflation, and momentarily ignore other factors such as the possible social significance of a salary raise." I don't understand why the other considerations of the participants can be ignored. Those considerations should be confounding variables that should be calibrated, but they are very difficult to calibrate, as they may be subconscious. People may have thought about those considerations before in their own experiences, and then when they answer the question, the thoughts about other factors are like a feeling, not explicitly thought about, but they do affect the result. Why can these considerations be ignored?

  2. Also, for problem 1 (page 351), it is not explicitly stated in the question that the prices of all goods and services will rise at that percentage, so participants may not have a standard view of the changes in prices. For example, from experience, some goods do not change price that fast following inflation, so participants may think they have a period of time to purchase the items before their prices rise. How can the authors ensure that participants see the price increase as a very simplified, immediate, and uniform economic model instead of the phenomenon they experience in their daily lives?

  3. Again in problem 1, page 352, the happiness and job-attractiveness questions are framed from a third-person angle. "Who do you think was happier" requires guessing another person's feeling, and "who do you think was more likely to leave her present position" is also guessing another's decision, while the economic-terms question is a subjective evaluation that does not involve guessing others' thoughts, so the thinking process is inconsistent with the happiness and job-attractiveness questions. Framing the happiness and job-attractiveness questions from a third-person angle can only measure whether people evaluate others' money concept in nominal or real terms, not the participants' own view of money. We cannot conclude that there is a bias reflected by the study, because it is possible that many participants are aware of nominal and real terms and don't have any money illusion at all, but think they are more clever than others (than Ann and Barbara), and so perceive Ann and Barbara as having money illusion. The study is not valid in this way; why didn't the authors use a first-person angle to make sure the answers reflect the participants' own judgment instead of guesses about others' judgments or feelings?


  1. As you’ve seen in class, economists generally make all kinds of assumptions regarding the way that humans behave. We simplify the situation and humans in order to be able to illustrate an effect, or in this case, provide an example. Different scholars may emphasize different aspects of the situation. Economists, and this is an econ journal (The Quarterly Journal of Economics), tend to focus on financial factors that are fairly public, open, and clear. I wouldn’t worry too much about this, and I don’t think this is an important point for your replication.
  2. I'm not sure that I understand this question well. The inflation rates are clearly stated, as are the factors that the experimenters would like participants to compare. For economists, there is a very clear, simple solution to this problem: it's math. They would assume that monetary incentives drive happiness and job attractiveness, but it seems as though they don't go in the same direction. To economists, this is a deviation from rationality.
  3. Need to be careful with criticism of experimental design and about what's valid or not. This is a common experimental design. There are issues with asking people what they would do, and there are issues with asking people what anonymous (average) others would do. Just like you pointed out in the first two questions, if you ask them about themselves, there are all kinds of unexpected issues that might come up, but if you ask people about the average person, it might reflect an evaluation going beyond the moment. This was the first demonstration, and others can follow on that, and one of the things that can be examined is self-other differences. Here they make the argument that what people assess reflects their own attitudes.


In problem 1, 3 different groups of participants were asked to choose between 2 options about economic terms, happiness, and attractiveness, respectively. I am wondering if I should use a chi-square goodness-of-fit test to compare the responses to an even 50% split for each question. And if I need to do 3 chi-square goodness-of-fit tests for problem 1, how should I report them in the article analysis?

In problem 2, participants were asked to rank their options as 1st, 2nd, and 3rd, and the percentage of participants choosing each option is reported. I am not sure which kind of test I should use to evaluate whether the difference in their responses is significant. Should I only focus on responses ranked 1st and compare them, via a chi-square goodness-of-fit test, to an even split of 33% for each option?


Great questions, thanks for asking these! It also forces me to rethink some of these.

For the first question:

  1. There are two types of analyses that you can do with these stats. The authors did not include statistical analyses (I have no idea why), so it's up to us to make the most of what they did provide.
  2. Each of the questions can be compared to a 50-50 split to show that it deviates from a random-chance split. You can use a chi-square or a binomial z for that.
  3. The more important analysis is to compare the percentages between happiness versus economic terms, and happiness versus job attractiveness. This is to show that happiness is different from both economic terms and job evaluations. Since these are different subjects, this is a between-subjects chi-square.
    1. BTW: If these were the same subjects, then this would be a within-subject chi-square, something we call a McNemar test. Looking at this, I think that in our replication we can increase power by running all questions on all subjects (random order). This would be a deviation from the original, but a deviation that makes sense in a single data collection.

For the second question: here, again, you can do both analyses, comparing Adam, Ben, and Carl to a 1/3-1/3-1/3 split, but you'll need to do that for both the first and the last (3rd) rank, because both are revealing. In addition, you could also do a 3×3 chi-square for all of the data together, which will capture all of the data and give a more accurate estimate.
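A sketch of both kinds of test with made-up counts (none of these numbers are from the article). For df = 1, the chi-square p-value has a closed form via the complementary error function, so nothing beyond the standard library is needed:

```python
import math

def chi2_p_df1(chi2):
    """Chi-square p-value for df = 1 via the complementary error function."""
    return math.erfc(math.sqrt(chi2 / 2))

# 1) Goodness of fit against a 50-50 split
#    (hypothetical: 60 of 100 participants chose option 1).
observed = [60, 40]
expected = [50, 50]
gof = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # chi2 = 4.0
p_gof = chi2_p_df1(gof)                                          # ~.046

# 2) 2x2 chi-square comparing two independent groups
#    (hypothetical: 40/100 chose the first option in the economic-terms
#    group vs. 70/100 in the happiness group).
a, b = 40, 60  # economic-terms group: chose first option / did not
c, d = 70, 30  # happiness group: chose first option / did not
n = a + b + c + d
chi2_2x2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_2x2 = chi2_p_df1(chi2_2x2)
```

The first test answers "does this question deviate from chance?"; the second answers the more important question of whether the two groups' response proportions differ.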


In the first image, between contracts C and D, the description says that more people chose C, but the statistics in parentheses show the opposite. Similarly, in the second picture, it says more people chose E, but the stats are the opposite. Which is correct?


Good that you're keeping track and checking things, and that you're asking to make sure when something doesn't quite make sense to you. Academic reports can get a bit confusing. This just goes to show that only reading through without checking the statistics in depth can lead you to misunderstand the article, doesn't it?

Per what you asked.

Contracts C and D are equivalent to contracts A and B, respectively, except that they are framed in terms of nominal rather than real values. Contract C, in contrast to A, is framed as (nominally) riskless; Contract D, in contrast to B, now appears risky: depending on inflation you may be paid more or less than the fixed nominal price. Thus, the first decision was between a guaranteed real price (contract B) and a nominal price that could be larger or smaller than the real (contract A), whereas the second decision is between a guaranteed nominal price (contract C) and a real price that could be larger or smaller than the nominal (contract D). As expected, subjects are influenced by the frame presented in each problem, and tend to exhibit the risk-averse attitudes triggered by that frame: a larger proportion of subjects now prefer contract C, the seemingly riskless nominal contract, than previously preferred the equivalent contract A (X2 = 5.34, p = .02). The disposition to evaluate options in the frame in which…

There is an ambiguous "equivalent contract" phrase here, so there are two ways to understand the English in that paragraph. One is that C is preferred to D, which is obviously not the case, since C is 41% and D is 59%. The second, which I believe is what they mean here, is that it refers to the proportion choosing the first option (A or C) over the second option (B or D); since A was 19% and C is 41%, that is a meaningful shift in preferences.


I am working on the article about money illusion by Shafir et al. I have some problems with the preregistration.

  1. Is it okay if I report the effect size w/phi directly instead of Cohen's d? I have been doing some research online, but I can't seem to find the difference between the two kinds of effect sizes.
  2. In problem 3, most subjects chose "same", which indicates they are thinking in terms of real value. Does that mean money illusion is not present there?
  3. I still have some trouble understanding the chi-square test for problem 3. In the original paper, the authors used a 2×3 chi-square to compare the responses for buying and selling. If we want to show that participants are more reluctant to buy and more willing to sell during inflation, isn't a chi-square comparing the buying/selling responses to an even split already sufficient to illustrate the phenomenon? I have trouble understanding why the authors used a 2×3 chi-square here.
  4. I have read in another replication that sometimes people change the year to fit the present time. For example, the case in the original paper is presented in 1993; I am wondering if it's a good idea to change the year to 2018, and also the salary amounts, to fit present-time standards.


Thanks for following up. These are very important, you have helped me realize some of the challenges in this article I failed to notice myself. This goes to show, you are now a bigger expert on this article than me, possibly anyone else. Good job.

Let me try and answer…

  1. Yes, that's actually what's preferable; effect size w is an effect size and is sufficient here. I noticed a lot of students converted everything to Cohen's d. Did I put some instruction about converting to d, or indicate this in lectures/tutorials somewhere? Let me know where and I'll update that.
  2. Thanks for asking me, this forced me to revisit that scenario, and I admit that following your questions I now realized new things. First off, I agree, it’s really unclear what they did here, and it’s very confusing what test they ran.
    1. Second, rereading the scenario, I realize that the question was set up so that the 25% rate of inflation and salary increase was exactly the increase in the price of the armchair. Therefore, it is expected that there would be no change in preference for either selling or buying. And yet, 38% indicated being less likely to buy and 43% indicated being more likely to sell. The question here is: what split were they expecting? It can't be 1/3-1/3-1/3, because that's answering at random, and they were expecting everyone not to change. They might have been comparing "more" to "less".
    2. In any case, I'll make it simple for us. Please indicate the following clearly in the pre-reg: We were unable to determine the exact test conducted and the effect-size calculations. Following the instructor's decision, for the power calculations we will aim for at least the same sample size as in the original experiment (N = 362).
  3. See #2. I agree, it’s very confusing. I’m afraid I don’t have a good answer.
  4. Great question, really. I suggest you add this prior to the scenario: This scenario is a replication of a study from the 1990s, and therefore includes factors that relate to that time. Please imagine yourself living at that time and answer according to the information indicated in the scenario best you can.


regarding Problem 2 where I need to use the Goodness-of-fit test to compute effect size w for rank 1st and rank 3rd.

In the original study, the percentages for rank 1st are 37% (Adam), 17% (Ben), and 48% (Carl), whose sum exceeds 100%. The sum of the percentages for rank 3rd is less than 100%. However, G*Power requires the sum of the percentages to be exactly 100%. Could you please tell me what I should do with these percentages?


Yeah, these things are tricky. The issue here is that the percentages they're using are not the percentages you need to plug in. They're using percentages for 1st, 2nd, and 3rd, whereas you need percentages for Adam, Ben, and Carl, summing to 100%. The way to do it is to convert these to counts (round if necessary), then run the chi-square with those counts.

Code you can use to calculate with counts:
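Here is a minimal stdlib sketch of that conversion; the N of 100 is a hypothetical placeholder, so substitute the article's actual sample size. (With df = 2, the chi-square p-value has the closed form exp(-chi2/2).)

```python
import math

# Reported rank-1st percentages for Adam, Ben, Carl; they sum to 102%,
# presumably due to rounding in the original article.
percentages = [37, 17, 48]
n = 100  # hypothetical sample size; substitute the article's actual N

# Convert percentages to (rounded) counts and test against an even 1/3 split.
counts = [round(p / 100 * n) for p in percentages]
total = sum(counts)               # 102 after rounding; that's fine for the test
expected = total / 3
chi2 = sum((c - expected) ** 2 / expected for c in counts)
w = math.sqrt(chi2 / total)       # effect size w for the goodness-of-fit test

# Chi-square p-value for df = 2 has a closed form: exp(-chi2 / 2)
p = math.exp(-chi2 / 2)
```

The counts not summing to exactly N doesn't matter here; the expected counts are computed from the observed total.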

They mentioned: "The virgin rat study was presented to one set of foresight and hindsight groups. The other three studies were presented together to a second set of foresight and hindsight groups." Does this mean that virgin rat outcomes A and B are both included in the foresight and hindsight groups, or what?

My answer:

Yes, these oldies are a bit confusing sometimes. I also struggle with these. Part of the challenge is trying to figure these out and doing that well. It means there were two groups of participants, the first group only saw the rat scenario/study, and the other group saw the other 3. For the purpose of our replication, please randomly assign participants to one of the conditions. Conditions: 1 - foresight (positive+negative or negative+positive, order randomized), 2 - hindsight positive, or 3 - hindsight negative. In each of these conditions please present all four scenarios/studies to them in their assigned condition.

Please let me know if that’s different from your understanding of the design, or you have any other insights/comments.

UPDATE: Please see the reply/stimuli received from Fischhoff in the Dropbox: Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Fischhoff


I was doing the article analysis of Slovic and Fischhoff (1977) and I realized that there are no statistics (except the means) for me to analyze. Is there a way to analyze the data by asking the original researchers?


To my great surprise, Fischhoff did reply yesterday, and you can see what he sent in the Dropbox: Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Fischhoff

About Slovic and Fischhoff (1977), you might want to follow up with Boley; I discussed this with her when she voiced a similar concern. What I said was: power analyses are to determine the minimal sample size required to detect the effect at 95% power. To make sure we do that, we aim for the most conservative test, meaning the smallest effect size, which would require the largest sample size, to make sure we really capture the phenomenon if it's there. Therefore, take the most conservative estimate, with the highest p-value written in the table and the sample size that results in the smallest effect, and calculate power based on that. There are tools like MAVIS that allow you to calculate an effect size from simply having a p-value and the number of participants (Effect size → p value to effect size).


Should we use the statistics from the original study (Slovic & Fischhoff 1977) or the replication study (Davis & Fischhoff, 2013) to calculate the effect size in our article analysis?


My request has to do with the original article, so if you want to meet the basic requirements just follow the original.

If you want to go beyond that, great, thanks for that. We generally tend to go for the more conservative estimate (the smallest effect size, and hence the largest sample size) to ensure that we're well powered to detect even the smaller effect. So you could simply report both effects, indicate which of the two showed the smaller effect, and do your power calculations based on that.


About #1, my classmate talked to Boley and she confirmed to us that we should be using an independent-samples t-test. Anyway, the following is the paragraph my quotation came from; it is on p. 263 of Davis & Fischhoff's (2013) article.

I actually cannot understand why they used “repeated measures” and what “first factor” and “last factor” mean. By my understanding, the first factor would mean foresight vs hindsight, which should refer to different participants so I don't know why it is repeated measures. For the last factor, I guess it means outcome A or B which both probabilities are given in the foresight condition, but I didn't know there is a comparison between these probabilities of the two outcomes. So if there is such a comparison, it would make sense to use repeated measures but I just don't understand why they needed one.

About #2, the smallest effect size I found was 0.39, using MAVIS's "p-value to effect size" function. Since I don't have the SDs or exact p-values, and the group size for experiment 1 is 24-37, I inputted the largest group size of 37 and a p-value of 0.05 (from the "All show maternal behavior" row, as it is the row with the largest p-value), one-tailed, for the most conservative results I could find with different combinations.

However, this most conservative estimate of effect size is still very large compared to what Davis and Fischhoff found in their replication (this is from p. 264):

So I wonder if I did anything wrong in the calculation, or should we use the effect size in the replication instead? Because the effect sizes in the original article seem way too large… (the largest one, as the p-value is <.001, is found to be around 0.8!)


Good, I understand things better now. Thanks for doing the copy paste/screen-shot. Let me share with you how I understand it. It could be that I’m wrong, so try and check that together with me, and let me know if that doesn’t make sense.

There are three factors, by order:

  1. Study/case: rat, hurricane, duck, test. Repeated measures/within-subject. All subjects do all 4 studies, random order.
  2. Time: foresight (before) - hindsight (after) outcome. Between-subject. Subjects are either before OR after.
  3. Outcome: Outcome A or B. The tricky bit is that this depends on Factor #2 time:
    1. Foresight (before): Repeated measures/within-subject, all subjects rate chances for both of the 2 outcomes, random order.
      1. “For each scenario, foresight participants first judged the probability of two outcomes, such as A the rat exhibited maternal behavior and B the rat did not exhibit maternal behavior. They then judged the probability that each outcome would be replicated on all, some, or none of 10 additional observations, were it the initial observation.”
    2. Hindsight (after): Between-subject, subjects see either outcome A or outcome B.
      1. “Participants in the hindsight condition were told that a specific outcome had occurred (either A or B); then, they assessed its probability of being replicated in all, some, or none of 10 additional observations.”

Now, if you understood that, go back and re-read that confusing paragraph, and it will become clearer what they're trying to say. They just didn't explain this very well.

Another way to put it, maybe easier, is that you have three conditions for each of the 4 studies:

  1. Foresight, asking about both outcome A and outcome B in random order.
  2. Hindsight, outcome A. Asking probability of A replicated.
  3. Hindsight, outcome B. Asking probability of B replicated.

How to analyze this? For each of the studies:

  1. Foresight outcome A versus hindsight outcome A. Between-subject #1 subjects versus #2 subjects above.
  2. Foresight outcome B versus hindsight outcome B. Between-subject #1 subjects versus #3 subjects above.

Let's go back to Slovic & Fischhoff. If they had 184 participants and 3 conditions, it's about 184/3 ≈ 61 participants per condition. "Group size" doesn't matter here; it's just how many they ran in each session, so ignore that.

If we look at the first case in the table, the above means that you’re comparing:

  1. Foresight “Shows maternal behavior” versus hindsight “Shows maternal behavior”. Between-subject #1 subjects versus #2 subjects above. Let’s say 61 versus 61.
  2. Foresight “Fails to show maternal behavior” versus hindsight “Fails to show maternal behavior”. Between-subject #1 subjects versus #3 subjects above. Let’s say 61 versus 61.

So, if we have a p value of .05 and a sample size of 122, then we can indirectly estimate the effect size as d = 0.36

And you can see it's comparable to Davis & Fischhoff's table.
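You can reproduce that back-of-the-envelope conversion with the standard library, using the normal approximation to the t distribution (close for df = 120; it comes out just under the t-based 0.36):

```python
from statistics import NormalDist
import math

def d_from_p(p_two_tailed, n_total):
    """Approximate Cohen's d from a two-tailed p-value and total N
    (two equal groups), via the normal approximation to t."""
    z = NormalDist().inv_cdf(1 - p_two_tailed / 2)  # critical z for that p
    return 2 * z / math.sqrt(n_total)

d = d_from_p(0.05, 122)  # ~0.355, close to the t-based estimate of 0.36
```

Tools like MAVIS do the same conversion with the exact t distribution, which is why their answer is a touch larger.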

Does that make sense?

Generally, given that table, I would suggest you explain all of the above, include the table from Davis & Fischhoff, and do your power calculations for a two-independent-samples t-test with a Cohen's d of 0.2, the smallest effect in there. Because this is a weak effect, it requires a very large sample size, so what I would like to ask is that you do the power analysis aiming for both 0.95 and 0.8 power; I might not have sufficient funding for the sample size that 0.95 power requires.
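As a rough check of what those two power targets imply, here is a normal-approximation sketch (G*Power's exact t-based answer will be a few participants higher per group):

```python
from statistics import NormalDist
import math

def n_per_group(d, power, alpha=0.05):
    """Approximate per-group N for a two-independent-samples t-test,
    two-tailed, via the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = .05
    z_power = NormalDist().inv_cdf(power)          # e.g. 1.645 for power = .95
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

n_95 = n_per_group(0.2, 0.95)  # ~650 per group, ~1300 total
n_80 = n_per_group(0.2, 0.80)  # ~393 per group, ~786 total
```

This makes the funding concern concrete: d = 0.2 at 0.95 power needs roughly 1,300 participants in total, versus roughly 790 at 0.80 power.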


Dear Fili and Boley,

Thanks Fili for writing back. Now I understand a lot better what the replicators mean and the design of Slovic and Fischhoff's (1977) experiment 1.

However I am still quite confused about the sample size. You mentioned that we can ignore the “group size” here,

but I do think that this “group size” should be taken into account in statistical analysis because at the bottom of Table 1 (p.547) where the results are displayed, it is said that “Note. Sample size varies from 24 to 41 subjects.”

Here the author used “sample size” instead of “group size” (which covered the range of 24-37 mentioned in Experiment 1). So I suspect that the authors used “group size” and “sample size” interchangeably, and that “group size” doesn't simply mean how many participants they had in each session but how many were in each “group” (or condition, or whatever they are called… I'm getting very confused about these names) that Boley suggested in an earlier email she sent me:

Which I think is based on this paragraph in Slovic & Fischhoff's (1977) article (p.546):

So I actually thought that the original study had five “groups/conditions”, but to simplify our replication we combined them into three “groups/conditions” instead (Foresight, Hindsight A, and Hindsight B).


I admit, this stuff is very confusing. And it’s great you got me to read this again so I can revise my understanding. I really appreciate that. That’s why it’s VERY important to have independent scholars, you two, Boley, and myself read this independently and try to get to the bottom of things.

First off, about sample size = group size: can they be the same if one says 24-37 and the footnote says 24-41? 😊 And what do these numbers mean anyway? (Side note: I suggest you make a note of this strange discrepancy in your supplementary; I think it indicates a mistake in their reporting. Finding such oddities is part of what’s important in replication work.)

About Boley’s table, I’ll need more detail about that since it’s possible Boley noticed something I didn’t. But, regardless, my understanding from Table 1 is that it’s clear Rat had both Outcome A (Shows maternal behavior) and B (Fails to show maternal behavior), both for foresight and hindsight. All the values are in there, so clearly someone answered Hindsight Outcome A.

Now for that confusing paragraph: The virgin rat study was presented to one set of foresight and hindsight groups. The other three studies were presented together to a second set of foresight and hindsight groups. These hindsight subjects received either Outcome (A) of each of Studies 2, 3, and 4 or Outcome (B) for each.

To me that meant that they had 2 data collection runs, 3 groups each. 6 overall. 184/6=30 average, which is somewhere in between 24 and 37 or 41. More specifically:

  1. Data collection run #1: Rat – foresight versus hindsight A versus Hindsight B.
  2. Data collection run #2: Hurricane/Gosling/Y-test together (random order) – foresight versus hindsight A versus Hindsight B.

The meaning of that last sentence was just to make it clear, that if someone was assigned Outcome A in Hurricane, they were also Outcome A in Gosling/Y-test. If someone was assigned Outcome B in Hurricane, they were also Outcome B in Gosling/Y-test.

Now, thanks to you asking questions, I realize that 184/3 was too simplistic, because of these data collection “sets”. Actually, it’s more like 30 per condition, because they had two data collection sets.

Now, let’s say this is how it was, what do we do with that? How to do a power analysis based on that? Regardless of which scenario you look at, you have a sample size of about 30 per condition for each of these conditions:

  1. Foresight, asking about both outcome A and outcome B in random order.
  2. Hindsight, outcome A. Asking probability of A replicated.
  3. Hindsight, outcome B. Asking probability of B replicated.

How to analyze this? For each of the studies:

  1. Foresight outcome A versus hindsight outcome A. Between-subject #1 subjects versus #2 subjects above.
  2. Foresight outcome B versus hindsight outcome B. Between-subject #1 subjects versus #3 subjects above.

Let’s recalculate: 30 per group with p < .05 is more like an effect size of d = 0.52.

Here’s what I suggest. Include both of the analyses I suggested to you and indicate a possible effect size range of 0.36 to 0.52, calculate sample size needed based on power analysis of both, and say we aim for the weakest one 0.36. That would require us to aim for 336. It’s always best to aim for the most conservative test and larger sample size, and 0.36 is closer to the Davis effect sizes.


I was doing the article analysis and saw you updated the wiki saying that “Here’s what I suggest. Include both of the analyses I suggested to you and indicate a possible effect size range of 0.36 to 0.52, calculate sample size needed based on power analysis of both, and say we aim for the weakest one 0.36. That would require us to aim for 336. It’s always best to aim for the most conservative test and larger sample size, and 0.36 is closer to the Davis effect sizes.”

When I calculated the effect size of 0.36, it was from a one-tailed p-value. However, when I calculated the sample size of 336 in G*Power, it was from a two-tailed p-value. Therefore, I would like to ask whether we should use a one-tailed or a two-tailed p-value in this study?

Moreover, when the sample size is 30 per group with p < .05, the effect size is d = 0.52, 95% CI [-0.01, 1.04], which includes the null. As I remember, Boley mentioned in the tutorial that the reported confidence interval should not include the null. Therefore, should we stick with the sample size of 61 and the effect size of d = 0.36?


Good questions, thanks for asking.

For the first question:

The 336 calculation for the 0.36 is for one-tail. We typically use one-tail when we know the direction. So, when we calculate the effect size from the original article, they did a two-tail, because they didn’t know what the result would be and wanted to check both directions. In a replication, we know what direction to expect, so we do one-tail.

Here’s from Gpower:

-- Thursday, October 11, 2018 -- 21:36:46
t tests - Means: Difference between two independent means (two groups)
Analysis:	A priori: Compute required sample size 
Input:	Tail(s)	=	One
	Effect size d	=	0.36
	α err prob	=	0.05
	Power (1-β err prob)	=	0.95
	Allocation ratio N2/N1	=	1
Output:	Noncentrality parameter δ	=	3.2994545
	Critical t	=	1.6494286
	Df	=	334
	Sample size group 1	=	168
	Sample size group 2	=	168
	Total sample size	=	336
	Actual power	=	0.9503142
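If you'd rather script this than use the G*Power GUI, the same a-priori calculation can be reproduced with statsmodels (a sketch, assuming the `statsmodels` package is installed):

```python
# Reproduce the G*Power run above: one-tailed two-independent-samples
# t-test, d = 0.36, alpha = .05, target power = .95.
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.36, alpha=0.05,
                                          power=0.95, ratio=1.0,
                                          alternative='larger')
print(math.ceil(n_per_group), 2 * math.ceil(n_per_group))  # 168 per group, 336 total
```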

About the second question:

These are two different things. It’s complicated to explain by email, but generally, a one-tailed NHST p-value of .05 corresponds to a 90% confidence interval. So it’s possible that something is significant at p < .05 while the 95% confidence interval still includes the null. Regardless, this doesn’t matter here: whether the confidence interval includes the null does not matter for the power analysis. It matters when we analyze the replication data and try to determine whether, or to what degree, our replication was successful (whether there’s a “signal”, if you recall the last class).
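A quick numeric illustration of that correspondence (the t and df values are hypothetical, chosen to sit just inside the one-tailed .05 cutoff):

```python
# One-tailed p < .05 corresponds to the 90% two-sided CI excluding the
# null, while the conventional 95% CI can still include it.
from scipy import stats

t_obs, df = 1.68, 58                       # hypothetical result, 30 per group
p_one_tailed = stats.t.sf(t_obs, df)
print(p_one_tailed < 0.05)                 # True: one-tailed significant
print(t_obs > stats.t.ppf(0.95, df))       # True: outside the 90% CI
print(t_obs < stats.t.ppf(0.975, df))      # True: still inside the 95% CI
```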

Does that help clarify things?


Let's take the Virgin Rat study as an example, the original text copied from the original article is like this:

Several researchers intend to perform the following experiment: They will inject blood from a mother rat into a virgin rat immediately after the mother rat has given birth. After the injection, the virgin rat will be placed in a cage with the newly born baby rats, after removal of the actual mother. The possible outcomes were (a) the virgin rat exhibited maternal behavior or (b) the virgin rat failed to exhibit maternal behavior. Subjects estimated the probability of the initial result being replicated with all, some, or none of 10 additional virgin rats.

For the foresight group, you mentioned that we should randomise the order for them, so they either get outcome A then B, or B then A. However, I wonder whether this randomisation will truly counter the order bias if we copy the original text above, since “virgin rat exhibited maternal behaviour” comes first in their first encounter with the possible outcomes anyway. So my question is: if I randomise the order of the outcomes foresight subjects receive, should I alter the text as well? E.g., if they receive outcome B then A, should I change the text to ”… (a) the virgin rat failed to exhibit maternal behavior or (b) the virgin rat exhibited maternal behavior… “? Thanks!


No, there’s no need to change the actual scenario, just the order of the display of the questions being answered. Actually, there is some importance in the scenario being the same across these two conditions.


The following is my list of questions.

Q1) At the beginning of p. 272, it says “each participant judged the frequency of one target item and one filler item for each problem”. Taking the Linda problem as an example, does that mean that for Experiment 1 there are two conditions: condition one judges the frequency of “bank tellers” and “high school teacher”, and condition two judges the frequency of “feminist” and “high school teacher”? What is the purpose of condition two, then, if the statistical analysis focuses on whether the frequency estimate of the “conjunction” is larger than the frequency estimate of the “unlikely item” (bank tellers)?

2) If I am correct about Q1, what about the “and” and “and are” questions? For example, will there be questions like “Among 100 people, how many of them are bank tellers and high school teachers?”, or do the researchers simply mean that “high school teacher” is judged alone, and that the “and” and “and are” questions simply use the two target items (bank tellers and feminists)?

3) Checking Table 2, does it mean that for Experiment 2 the last two conjunctions were eliminated (i.e., no “who are” questions)? Why were they eliminated?

4) If the whole study is meant to find out whether frequency estimates can eliminate the conjunction effect, and the wordings (like “and”, “and are”, “who are”) have a huge effect on people's judgments, why isn't there a comparison between “probability estimates” and “frequency estimates”?

5) For the questionnaire asking people to judge the frequency of 100 people fitting Linda's description, does the total have to equal 100?

6) I don't understand how Experiment 3 works. What is the purpose of dividing the experiment into two parts, first asking participants to rank and then to estimate, rather than simply asking them to estimate in the first place? Also, how are the “and” and “and are” questions added in Experiment 3 if there are 5 filler items?

7) On page 272, paragraph 2, it says “A third problem was added in that experiment (adapted from the Bill problem in Tversky & Kahneman, 1983)”. Yet it is not shown in the statistical analysis in Table 2, nor is it mentioned later in the results and discussion section. I was wondering whether I have to include that question in our replication. If yes, can we have access to the article and that specific question?

My answers:

First off: The replication note on your article states “Replication note: Experiment 1 with both Linda and James”. After reading your questions and going back to the article, I decided to try to simplify things for you further with the following changes:

  1. No filler items needed.
  2. Only do the four conditions:
    1. Likely target
    2. Unlikely target
    3. “and”
    4. “and are”

I am not sure why, but the PDF in the Dropbox lost the highlights I added before.

I readded those, so please revisit the PDF in the Dropbox to see what I highlighted for you to focus on.

Filename: Mellers, Hertwig, & Kahneman, 2001 PsycSci Do frequency representations eliminate conjunction effects.pdf

Location: Dropbox\2018-HKU-Replications\2018-9 replication articles

Per my changes above and your questions below:

  1. Therefore, you’ll have the four conditions above, with participants randomly assigned to one of them. In each condition they’ll rate both Linda and James under the same assigned condition.
  2. The “and” and “and are” means combining the two likely and unlikely targets. So, in the Linda scenario, these will be “Feminists and bank tellers” and “Feminists and are bank tellers”.
  3. No need to address Experiments 2 and 3. The reason those items were eliminated is explained in the results section about the adversarial collaborations: they didn’t feel they added much. The same goes for the filler items.
  4. I think you may have misunderstood the effect here, or I don’t understand your question. The effect is that people estimate “bank teller” as less probable than “feminist and (are) bank tellers”, which makes absolutely no sense since the first includes the second.
  5. The total of what?
  6. Let’s skip Experiment 3 for now. I think once you constructed the Qualtrics and did the article analysis you’ll be able to better understand that experiment as well. But no need to replicate that.
  7. No need to add the Bill problem, that’s from Experiment 2. Indeed, it’s very unfortunate they didn’t report everything in full.

I really appreciate you asking all these questions, and helping clarify things. Your questions also helped me understand some of your challenges, and so I could help make things simpler for you and the other replicators. Keep asking.
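For the planned analysis, here is a sketch of what the conjunction-effect test could look like in this between-subject design (all data below are invented for illustration):

```python
# The conjunction effect: the "and" condition gives HIGHER mean frequency
# estimates than the unlikely-target condition ("bank tellers"), even
# though the conjunction is logically a subset of it.
from scipy import stats

bank_tellers = [5, 8, 3, 10, 6, 7]       # "Out of 100, how many are bank tellers?"
conjunction  = [15, 12, 18, 9, 14, 11]   # "... feminists and bank tellers?"

t, p = stats.ttest_ind(conjunction, bank_tellers)
print(t > 0, p < 0.05)   # a positive, significant t would show the effect
```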

1. Just to confirm: while two studies in the article are highlighted, the syllabus states “1a/b”, so we only need to work on either study, as the ”/” sign implies?

2. For Study 1a, participants were required to answer 12 questions (6 self-generated-anchoring questions and 6 experimenter-generated-anchoring questions). Table 1 lists the 6 self-generated-anchoring questions, the average of participants' estimated answers, the plausible ranges, and so on. However, if these are merely the self-generated-anchoring questions, why does the “Anchor” column appear on the left side?

3. Following the question above, participants also answered the other 6 experimenter-generated-anchoring questions. Since these are not recorded in the article, where can I find those questions so that I can make the comparison between the two conditions?

4. In the same study, they also mentioned that “For both sets of questions, half required upward adjustment and half required downward adjustment”. First, I am not sure whether “both sets of questions” refers to the two sets of self-generated and experimenter-generated questions?

5. If the question above is true, how can self-generated-anchoring questions require upward or downward adjustment? Don't participants arbitrarily or intuitively provide their answers? How can half of the questions require upward adjustment and the other half downward adjustment?

6. For Study 1b, participants were asked to answer 6 self-generated-anchoring questions (one group had to provide their answers as point estimates, while the other group answered with plausible ranges). I am quite confused. Why are they comparing these two conditions? Why is comparing estimated values to estimated ranges related to the anchoring-and-adjustment effect?

7. Since all participants were also asked to complete a follow-up questionnaire checking whether they knew the intended anchor value for each item and whether they were aware of how those values had affected them, I am wondering whether I will have to write those questions on my own, because there is no direct reference for me to replicate. Besides, if I create those follow-up questions myself, that may deviate greatly from the original study, even though this might not be the main focus of the research.

My answers:

  1. That’s a good question, because Studies 1a and 1b use the same items but with different instructions/experimental designs.
    1. To simplify things, I ask that you focus on Study 1b: a between-subject design either asking for an estimate or asking for a range, without an experimenter-provided anchor.
  2. It is there so that you know what anchor they were aiming for. Since these are sort of trick questions, they wanted to make sure that readers know what the trick was. For example with water, the freezing point is 32F (0°C), which is what comes to mind first, but for vodka they indicate in the table it’s -20F. They want you to know what anchor they expected participants to come up with.
  3. I understand that difficulty. I’ll make it simple for you, ignore that. See my above #1. But since you asked, it’s in: Jacowitz, K. E., & Kahneman, D. (1995). Measures of anchoring in estimation tasks. Personality and Social Psychology Bulletin, 21(11), 1161-1166.
  4. Yes, in each set half is upwards, half is downwards. You can see that from the table.
  5. Yeah, that’s tricky business to understand. My understanding is that the self-generated anchor comes from a different, easily retrievable fact that needs to be overridden to get to the right estimate. For example with freezing, people usually first think of 32F (0C), but then they need to estimate for vodka, so they adjust from that. That’s what the whole effect/article is about.
  6. See the note in the table: “Skew was calculated by dividing the difference between the estimated answer and the range endpoint nearest the intended anchor by the total range of plausible values. Estimates that were perfectly centered within the range received a score of .50 on the skew index, whereas those closer to the anchor received a score less than .50.”
    1. That is confusing, but my understanding is that they ran a one-sample t-test comparing to an average skew of .50. The “comparison” between the two groups is just to compute the skew; it is not the statistical test itself.
  7. Yes, you’ll have to reconstruct those as best you can. That’s part of the challenge in replications, and part of the frustration with the way we did science in 2006, not sharing all materials. I’ll try to contact Epley to see if he might be able to provide us with the materials.
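To make the skew index concrete, here is a sketch based on my reading of the table note (the function name and all numbers are mine):

```python
# Skew index: distance from the estimate to the plausible-range endpoint
# nearest the intended anchor, divided by the total range width.
# .50 means perfectly centered; below .50 means skewed toward the anchor.
from scipy import stats

def skew_index(estimate, low, high, anchor):
    nearest = low if abs(low - anchor) <= abs(high - anchor) else high
    return abs(estimate - nearest) / (high - low)

print(skew_index(10, low=0, high=20, anchor=0))  # 0.5, perfectly centered
print(skew_index(4, low=0, high=20, anchor=0))   # 0.2, pulled toward the anchor

# One-sample t-test of mean skew against the no-skew value of .50
skews = [0.35, 0.42, 0.31, 0.48, 0.40, 0.38]     # invented skew values
t, p = stats.ttest_1samp(skews, 0.5)
print(t < 0)  # True: mean skew sits below .50, toward the anchor
```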

Followup Q:

if they are getting both the ranges of estimates and the absolute point estimates from participants separately to produce the skew value, and then comparing it to the average skew value of 0.5,

  1. How does this comparison demonstrate the significance of the anchoring-and-adjustment effect? Why is the skew value of 0.5 a meaningful “control” condition?
  2. Moreover, if the results provided by both groups of participants are combined, how is this still a between-group study?
  3. When we are setting up the qualtrics questionnaire for the assignment this week, is it a must for us to include an introduction and consent form in the beginning?
  4. In addition to that, is it also a must for us to randomize all the questions? Or can I follow the flow according to the table 1 posted from the article?

My answer:

  1. That is the point/core of the article, and it is explained throughout. A skew of 0.5 means no skew: the estimate is perfectly in the middle of the range. If it’s not perfectly in the middle and deviates in the direction of the anchor, then this shows insufficient adjustment.
  2. It’s not a between-group study in terms of the statistical test. The test is not a two-sample independent test; it is a one-sample t-test on those who gave the point estimates.
  3. As I mentioned in class and in the guide, there is no need to include consent/debriefing/demographics; I’ll take care of those. You do need to set up an introduction briefly telling the participant what their general task is and how long it takes (number of questions/pages). Please see the example intros in the Dropbox.
  4. Yes, randomization is needed to help ensure there are no order effects.

Also, please note this exchange between Prof. Epley and myself: Epley:

One big challenge with this research for us from the very beginning is that in order to test the anchoring and adjustment hypothesis with self-generated anchors, participants need to know the self-generated anchor that you presume they are starting from. Much of this relies, then, on shared cultural knowledge. We tailored our questions to the populations we were studying so that we could ask about anchors we thought that most people would know. We were consistently surprised by how little people actually knew.

For instance, one of the questions was, “In what year was George Washington elected president of the United States?” We hypothesized that participants would anchor spontaneously on the date when the US declared its independence (1776) and hence adjust away from that to a later date. But to even potentially use this process, you have to know when the U.S. declared its independence, and then report using this number when considering your answer. A fair number of people in our US sample actually didn’t report knowing this […]. I wonder if you might also, then, want to consider generating some items where you think your own participants would be especially likely to know an anchor that they could potentially adjust from to answer a question?

I answered:

Yes, we’re aware of the challenge ahead. It’s tricky business, and will be noted in the pre-registration and taken into consideration in the power analyses and design. These replications will be conducted in the US using MTurk, and we’ll try and make it clear that we’re aiming for general knowledge, and also aim for much higher sample size than is required by effect size alone, to take into account people not knowing these things. We’ll also need to try and address possible issues with MTurkers Googling these things rather than answering intuitively.

This is all to say that we’d still like to give it a shot and see how we might be able to make it work, there’s much learning for the students from the process, but the challenges are noted and important. We might try and add a few additional anchors as extensions, but I do try to leave it up to the students how they want to try and build on that, and address such challenges. They can be quite creative at times.

So please also try and add instructions that clearly tell participants not to look up answers, and check whether they had knowledge of each of the questions. Not a must, but you are also welcome to think of two additional anchors to test this further, as suggested by Prof. Epley. (PSYC2071: you will also give these to the students in class when you run this in your week.)


I am doing the replication article for Epley & Gilovich (2006). These are the questions I have and the help I need.

  1. I am not sure why Study 1b uses a one-sample t-test. Shouldn't it be a two-sample independent t-test? And how can a one-sample t-test, comparing the skewness to 0.5, show the anchoring effect?
  2. And if Study 1b is a one-sample t-test, then does that mean it has one condition only?
  3. Shouldn't the original anchor value be included somewhere in our data analysis, so that we can showcase the anchoring effect relative to our “anchor”?


1. Yeah, I understand the confusion; that’s… understandable. I'm also a bit confused about their design. But think about it: how would you do this as a two-sample t-test if the DV is not on the same scale? One is a range, one is a specific point. What can you compare here? It’s tricky: they compute the skewness based on where the estimate sits within that range, and that is not a statistical test, it’s a computation. It’s one sample because they compare that skewness to 0.5, which is considered the expected “rational” unbiased answer. They could have done this a bit differently, and I would have, but from my understanding of the article this is what they did. If you think they did something else, show me; it’s possible I misunderstood. You can also follow up with Boley on that, if needed.

2. A one-sample t-test does indeed mean one condition, or no experimental manipulation.

3. That’s reflected in the skewness score, which deviates from 0.5 in the direction of the anchor. Read the note on the table: “Estimates perfectly centered within the range received a score of .50 on the skew index, whereas those closer to the anchor received a score less than .50.”


1. My understanding is that Group 1 (participants who gave their estimates) served as the population (not very accurate, but it is my way of thinking), while Group 2 (participants who provided their estimates as ranges) served as the sample. The calculation measures whether the mean skewness of Group 2 significantly differs, i.e., is below or above 0.5. If there's an adjustment-based anchoring effect, the skewness will not be 0.5. Am I correct?

2. On the other hand, I came across the term p(rep). I googled it, and it is the p-value of a replication. Do I need to report that value in my study? Specifically, where should I include it in the article analysis?

3. After entering the Cohen's d into G*Power (according to the original article, d = 2.91), I get a suggested sample size of 4. I do not think it is possible to do a replication with such a small sample. Moreover, an effect size of d = 2.91 does not seem possible (please see the attached screenshot). Is that normal? Should I report that value in my article analysis? If not, what should I do to obtain a 'reasonable' sample size?


Thanks for following up on these.

Let me try and answer your questions:

  1. Not sure about the way you interpret it, but let me say this about the last part: if there's an adjustment-based anchoring effect, then the skewness is meaningfully different from 0.5 in the direction of the anchor.
  2. I don’t know what p(rep) is. Would have helped to know where this is from, and why you think it’s relevant or important. In general, I always prefer more information than less information, so if you feel something is relevant, please do go ahead and include it. It’s hard for me to comment without more information.
  3. Let me see. First, if I remember this right, we need to consider that this specific analysis is conducted not on the number of participants but on the number of scenarios. t(5) doesn’t mean 6 participants, since there were actually 58 participants; rather, it is an analysis of the 6 scenarios. So the effect you’re getting is more relevant to the number of scenarios. The more revealing one for the number of participants is perhaps t(61) with an effect size of d = 1.12, which is still a very large effect. Since the required sample size is small, we can aim for a higher power of 0.99. When we run the replication, we can see whether the effect is really that large.
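Under those assumptions (d = 1.12, one-tailed, alpha = .05), the required group size at .99 power can be checked in statsmodels, and it stays small:

```python
# A-priori sample size for the participant-level effect d = 1.12 at the
# higher power target of .99 (one-tailed independent-samples t-test).
import math
from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=1.12, alpha=0.05,
                                power=0.99, alternative='larger')
print(math.ceil(n))  # roughly 25-30 per group
```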


1. Should Study 1b be a comparison study (a one-sample t-test), given that we are only required to compare the mean skewness of answers to a skewness value of 0.5?

2. On the net, I saw you answered another fellow student working on the project. You told them it's okay to set the effect size to 0.99 since the original effect size was too large (d = 2.91). With this effect size of 0.99, the required sample size is still only 13-15. However, in the original study they recruited 102 participants. Is it okay if we recruit the same number of participants, to ensure that the reliability of our study is at least as good as the original? Or should we just follow what the effect size of 0.99 gives us (13-15 participants)?


  1. Yes, it does seem like it. Let me know if you think otherwise. Perhaps a better name for it is a one-sample experiment.
  2. Good point, thanks for bringing that up. I will add to the replication guide that, as a general rule of thumb, we should never run a smaller sample than the original study. In this case, I suggest we write that, following the Small Telescopes article’s advice, we’ll aim for 2.5 times the original sample size. How does that sound?


1. I have considered adding validation to our study, basically informing participants of an acceptable range of answers. Hence, will I have to clearly state in my exclusion criteria that “if participants provide an answer beyond this range, their answer to that question will be excluded”?

2. While drafting the data analysis plan, I encountered some difficulty envisioning the process. Since participants' data will be excluded when they do not know the self-generated anchor or do not apply it when answering, the number of responses per question will be unequal. Thus, the average skewness index of each question will carry a different weight in the final computation of the one-sample t-test. I am therefore wondering whether we should first calculate the overall mean skewness of all six questions (taking their individual numbers of responses into account) and then run a one-sample t-test comparing that single skewness value with 0.5. Does this method sound okay, even though the degrees of freedom may be 0?


  1. What validation means is that participants CANNOT submit an answer that isn’t valid; it’s impossible. So it’s meaningless to exclude participants for failing validation, because they cannot submit without passing it. You should definitely implement validation as best you can.
  2. That’s a really good question. Every time you ask me something about this article, I realize what a messy and unclear way this is to examine their research question. So I just want to say it’s confusing to me as well, and I appreciate you thinking about this in depth and trying to figure it out with me. Here is what I suggest:
    1. Participant level (per scenario): Calculate the average skewness of all participants for each specific scenario. Obviously, this only includes the participants who were not excluded for that specific question. Then run 6 one-sample t-tests comparing the mean skew to 0.5 in each of those scenarios. It’s very possible the samples will differ, and therefore so will the dfs.
    2. Scenario level (per whole study): Then calculate the average skewness across all 6 scenarios using the scenario-level skew means (like the ones provided in the table), not the participant level, and compare that to 0.5. Since there are 6 scenarios, the dfs are 5.
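The two levels could be scripted roughly like this (scenario names and skew values are invented; only three of the six scenarios are shown):

```python
# Level 1: per-scenario one-sample t-tests on participant skew indices,
# with unequal ns after exclusions. Level 2: one t-test on the
# scenario-level mean skews (df = 5 when all six scenarios are in).
from scipy import stats

scenarios = {
    "washington": [0.31, 0.40, 0.28, 0.45],
    "vodka":      [0.38, 0.42, 0.35],
    "boiling":    [0.44, 0.39, 0.41, 0.37, 0.46],
}

for name, skews in scenarios.items():                 # participant level
    t, p = stats.ttest_1samp(skews, 0.5)              # df = len(skews) - 1
    print(name, round(t, 2), round(p, 3))

scenario_means = [sum(s) / len(s) for s in scenarios.values()]
t, p = stats.ttest_1samp(scenario_means, 0.5)         # scenario level
print(t < 0)  # True here: every scenario mean sits below .50
```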


  1. What criteria should we state regarding the success of our replication study? What conditions should there be?
  2. Does our replication study mainly investigate whether people are influenced by, specifically, the adjustment-based heuristic and not enhanced accessibility? Or does it explore the anchoring-and-adjustment effect in a broader sense, simply examining whether participants are influenced by their own self-generated anchor?
  3. I have doubts about point 2 because I wonder whether the self-generated anchor would enhance the accessibility of a certain numerical value too. Thus, it may not necessarily explain the intrinsic and profound nature of this heuristic.


  1. Please see the section “Evaluating replications” in the guide and use the LeBel et al. criteria.
  2. I don’t follow what you’re asking, sorry. But if you mean whether this tests against experimenter produced anchor, then Study 1b alone does not, only together with 1a if you present experiment induced anchors for comparison.
  3. I don’t follow what you’re asking, sorry. But, explaining/addressing theory isn’t needed in a pre-registration, just very clear hypotheses and experimental design to test these. Focus on that.

UPDATE: Please see the reply/stimuli received from Fischhoff in the Dropbox: Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Fischhoff


We are working on Fischhoff 1975 for our replications study and formulating the Qualtrics survey. These are the questions we have.

1. Do we have to replicate all 4 events (A, B, C, D)? If yes, do you have the original materials for Events B, C, and D? (Only Event A was included in the paper as an example.) We think we need to replicate all 4 events, but we have no clues for Events B, C, and D, and the original materials do not seem to be available online either. It is still okay for us to modify just one sentence to create the Event A material, but what should we do for Events B, C, and D?

2. The authors compare the Experiment 2 results with Experiment 1. In this case, do we need to replicate Experiment 1 as well? [I.e., have another group of participants face the same data set (Events A, B, C, D with 4 conditions each) but without the instruction to respond “as they would have had they not known the outcome”.]

3. We are not sure how to allocate conditions to participants. For Experiment 2, it seems that each participant answers all 4 events. But do we need to randomize the condition they get for each event? Say, for participant A: he might be in the before group (not knowing the outcome) for Event A and in the after (ignore) group 1 for Event B, etc. Or should we fix it so that a participant is in the before group for all 4 events, or in the after (ignore) group 1 for all 4 events, etc.? It seems the authors did not do so.

My answers:

  1. Please do what you can with Events A and B. You can find the book for Event B here: (pg. 53). It’s not a must, but I would appreciate it if you would add one of your own scenarios with similar style.
  2. No need. Just do it once, Experiment 2 only, with the “as they would have…”.
  3. Please randomize once to before and after and be consistent for Event A and Event B (+Event C if you added one). This would allow for examining the interaction of the between factor (before-after) and the within factor (Events A, B, C). You need to focus on the before-after contrasts. The interaction and differences between events are a welcome bonus on top of that, but not a must; please add that to your pre-registration.


Regarding Event B: the book describing Event B is rather lengthy. Is it possible for me to extract part of it and use it for the experiment, or should I summarise the event? And should I make up the possible outcomes as well?


Yes, please. It needs to be very brief and to the point, and with different outcomes. Do the best you can, and you’ll get feedback from me and the TA, and we’ll see what your peers did with this and combine this to try and achieve the best replication we can.

UPDATE: Please see the reply/stimuli received from Fischhoff in the Dropbox: Dropbox\2018-HKU-Replications\2018-9 replication articles\Stimuli\Fischhoff


I am wondering what sample sizes (n) should be input when I use the tools to compute the effect size.

Should they be, for example, for Event A (see the attached image),


n1: before (n=17), and

n2: after (ignore), outcome 1 (n=20)

(i.e., comparing the before group with each outcome separately, so there would be 4 comparisons in total for each event)

Or:

n1: before (n=17×4=68), and

n2: after (ignore) (n=20+15+18+18=71), the total n of all 4 outcomes


In terms of events, Experiment 1 was a between-subjects design and Experiment 2 was a within-subjects design. In terms of conditions, both were between-subjects designs.

In Experiment 1:

Subjects. Approximately equal numbers of subjects participated in each group in each subexperiment. Event A (Gurkas) was administered twice, once to a group of 100 English-speaking students recruited individually at The Hebrew University campus in Jerusalem and once to a class of 80 Hebrew-speaking subjects at the University of the Negev in Beer Sheba. Event B (riot) was administered to two separate classes at The Hebrew University, one containing 87 Hebrew-speaking psychology majors with at least one year's study of statistics and one of 100 Hebrew-speaking students with no knowledge of statistics. Event C (Mrs. Dewar) was administered to the 80 University of the Negev students; Event D (George) to the 100 Hebrew University students without statistics training.

The design was:

Method Design. The six subexperiments described in this section are identical except for the stimuli used. In each, subjects were randomly assigned to one of five experimental groups, one Before group and four After groups.

So, it’s a between-subject design. There were two groups for Event A, two groups for Event B, one group for C, and one for D, six “subexperiments”. Since there are 4 comparisons between the before and each of the afters, it’s 24 comparisons.

But the table only reports one of those, can’t imagine why:

For the sake of tabular brevity, only one subexperiment in each pair is presented.

And then the stats are:

The critical comparisons are between the outlined diagonal cells (those indicating the mean probability assigned to an outcome by subjects for whom that outcome was reported to have happened) and the Before cell in the top row above them.

Now, the results:

In each of the 24 cases, reporting an outcome increased its perceived likelihood of occurrence (p < .001; sign test). Twenty-two of these differences were individually significant (p < .025; median test). Thus the creeping determinism effect was obtained over all variations of subject population, event description, outcome reported, and truth of outcome reported. The differences between mean Before and After probabilities for reported outcomes ranged from 3.6% to 23.4%, with a mean of 10.8%. Slightly over 70% of After subjects assigned the reported outcome a higher probability than the mean assignment by the corresponding Before subjects.

In Experiment 2, it was a within:

Subjects. Eighty members of an introductory statistics class at the University of the Negev participated. Procedure. Questionnaires were randomly distributed to a single group of subjects. Each subject received one version of each of the four different events. In a test booklet, Events A, B, and C alternated systematically as the first three events, with Event D (the least interesting) always appearing last. Order was varied to reduce the chances that subjects sitting in adjoining seats either copied from one another or discovered the experimental deception. All materials were in Hebrew. Questionnaires were anonymous.

In Table 2, I have no idea why some events are 71, and some more, or less. It should be 80 in all, but it could be that some participants just didn’t finish the whole thing.

In any case, we’re comparing participants from the no-outcome group to each of the outcome groups. So, for each of these comparisons the sample size is the N of the no-outcome group plus the N of the outcome group. You can’t combine the Ns and run one high-level comparison. What you can do is take the mean/median across these comparisons, i.e., get the average sample size of all the comparisons. Of the options you wrote below, it’s option 1, but it needs to be calculated for all 16 presented comparisons in Experiment 1 (out of the real 24) and the 16 in Experiment 2. I suggest you compute the average sample size per comparison, use that with the p-values to calculate the needed sample size for one comparison, and multiply that sample size by the number of comparisons needed (16).

Is that clear? Does that make sense?

I could be getting this wrong, let me know if it doesn’t make sense to you.


I am not sure if I am interpreting it correctly. So I would need to run two sets of power analyses (namely 1. after (ignore) group vs. before group and 2. after (ignore) group vs. after group), and input 295, 368 and 296 for the sample sizes respectively (p-value: 0.001, with the Mann-Whitney U test)?


Good that you followed up on this, because I can see I wasn’t clear enough, which led to some confusion. Plus, I now realize I may have added something in there that isn’t accurate and possibly misleading.

What I wrote was: “In any case, we’re comparing participants from the no-outcome group to each of the outcome groups. So, for each of these comparisons the sample size is the N of the no-outcome group plus the N of the outcome group. You can’t combine the Ns and run one high-level comparison. What you can do is take the mean/median across these comparisons, i.e., get the average sample size of all the comparisons. Of the options you wrote below, it’s option 1, but it needs to be calculated for all 16 presented comparisons in Experiment 1 (out of the real 24) and the 16 in Experiment 2. I suggest you compute the average sample size per comparison, use that with the p-values to calculate the needed sample size for one comparison, and multiply that sample size by the number of comparisons needed (16).”

What I was thinking was: N of no-outcome + N of outcome. For example, in Table 1, Event A, it’s basically 20+20=40 for all comparisons, so that’s easy: 40×4/4=40 average per comparison, i.e., an average sample size of 20 per group and 40 overall. If I recall correctly, p was < .001, so if we enter this into the MAVIS “p-value to effect size” tool we get “d [95% CI] = 1.13 [0.44, 1.82]”. This we can enter into G*Power, and get that for d=1.13 we need a total sample size of 18+18=36. If we multiply that by 4 for the four comparisons needed, we get 144.
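To make that conversion chain reproducible, here is a hedged Python sketch of the same two steps (p-value → d, then d → required n). This is not MAVIS’s or G*Power’s internals, just the standard textbook approximations behind them; the sample-size formula is a plain normal approximation, so it lands slightly below G*Power’s exact noncentral-t answer (18 per group above).

```python
# Sketch of the p-to-d conversion and power calculation described above.
# scipy is assumed available; alpha/power defaults are my assumptions.
from math import sqrt

from scipy import stats

def d_from_p(p, n1, n2):
    """Back out Cohen's d from a two-tailed independent-t p-value."""
    t = stats.t.ppf(1 - p / 2, n1 + n2 - 2)   # t statistic matching the p-value
    return t * sqrt(1 / n1 + 1 / n2)

def n_per_group(d, alpha=0.05, power=0.90):
    """Normal-approximation sample size per group for a two-tailed t-test."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 / d ** 2

d = d_from_p(0.001, 20, 20)   # Table 1, Event A: p < .001, 20 per group
print(round(d, 2))            # ~1.13, matching the MAVIS output quoted above
print(n_per_group(d))         # ~16-17 per group; G*Power's exact answer is a bit higher
```

The alpha = .05 / power = .90 settings here are my guesses at the G*Power defaults used, which is exactly why writing the precise settings into the pre-registration matters.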

If all four events were shown to different participants, then we can just multiply that number by 4 (144×4=576), which is what I meant originally by the 16 above, and which I think is what they did in Experiment 1. Experiment 2 was a within design, showing each participant all four scenarios, so for that 144 would be sufficient.

What I suggest is that you write up all of the calculations above, including the screen captures and the links, note the difference between Experiments 1 and 2, and summarize that we’ll be aiming for a sample size of 576 even though we’re doing a within design following Experiment 2.

Is that clear enough? Does that make sense? Let me know if not.


I am not very sure whether I should choose the Wilcoxon-Mann-Whitney test (see attached) instead of the difference between two independent means (for the statistical-test part) that you suggested in the previous email, as the original article used the Mann-Whitney test. In that case, the total sample size needed will be 23+23=46, so we will need 46×4=184 for one event, and 184×4=736 in total.


Good question. And great suggestion. Thank you for that.

To be honest, I think the MAVIS p-value-to-effect-size conversion already shifts between the two tests, using a t-test approximation, because I don’t know of a tool that does this for the stated Wilcoxon-Mann-Whitney test. I was assuming the two can’t be too far off, so that effect size may already be a bit biased.

But let’s put it this way: it’s better to be more conservative and have a higher sample size than not. So, what I suggest is that you include both analyses, to be fully transparent, and then summarize that we’ll aim for the higher one. 736 it is, then.


We are not sure about the data analysis plan; specifically, I suggested using the Mann-Whitney U test and she suggested a one-way ANOVA with post-hoc tests afterwards. We are also confused about the comparisons to be made (please see attached). Do we just run tests for the highlighted pairs, i.e., 4 tests for each of the 4 events, a total of 16 tests?


Yes, that is correct. That’s what I previously emailed and posted on the wiki, and that’s what the authors did: in each event there are 4 comparisons, and there are 4 events. There is no point in running the other comparisons, and definitely not 16×4.

I am fine with you two pre-registering and doing different analyses on this; this is partly why it’s important to have two students working on one article. One thing, though. I went back to the pre-registration to check what the one-way ANOVA and post-hoc comparisons were about. I see no need for an ANOVA and post-hocs, only for tests comparing two conditions, since there are only two conditions in each comparison and these are not meaningfully related. There was no ANOVA test in the text I highlighted. But do let me know if you think I’m missing something.
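A minimal sketch of that agreed plan, with made-up placeholder numbers (the real probability judgments come from the Qualtrics export): compare the Before group against each of the four After groups per event with a Mann-Whitney U test, 16 tests in total.

```python
# Hedged sketch of the 4-comparisons-per-event analysis; all data are fake.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
results = []
for event in ["A", "B", "C", "D"]:
    before = rng.uniform(0, 100, size=20)           # Before group, % judgments
    for outcome in range(1, 5):
        after = rng.uniform(0, 100, size=20) + 5    # After (outcome k) group
        u, p = stats.mannwhitneyu(before, after, alternative="two-sided")
        results.append((event, outcome, u, p))

print(len(results))   # -> 16 Before-vs-After comparisons, as described above
```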


For Experiment 1 (poetry), I could find the poem somewhere online. However, for Experiment 2 (painting), I have tried but can’t find the paintings by “Deborah Kleven, a contemporary fine and graphic artist based in Washington” (“12 Lines” and “Big Abstract”) online. Has Kruger released his materials somewhere else / sent you the materials? Is it okay to use paintings by a relatively unknown artist to replace Kleven’s (who is not very well known either) in case we can’t find the materials?


I’ll try and contact Kruger, but I think the actual painting doesn’t matter at all as long as it’s unknown. Your last sentence is right on track. For the time being, please replace that with something else lesser known.


The instruction attached to the article in the Dropbox file asks me to “Combine Experiments 1-2 into one experiment based on the design of Experiment 1, each participant completing the experiments in a between subject design.”

I am confused by this instruction. From what I can understand, Experiment 2 in the article serves 3 additional purposes (3 main purposes that differentiate it from Experiment 1). First: extend Experiment 1 to a different artistic domain (painting). Second: compare the effort heuristic between laypeople and self-proclaimed experts. Third: examine whether the effort heuristic plays a role in comparative judgment, beyond the absolute judgment demonstrated in Experiment 1.

My first question: by combining Experiment 1 and Experiment 2 based on the design of Experiment 1, does that mean giving up the third purpose of Experiment 2, as keeping it would lead to a mixed-model design?

Second question: to keep the second purpose while combining the two experiments, do I have to add a self-proclaimed-expertise measure for evaluating the quality of the poem (Experiment 1) as well, or apply it only to the evaluation of the painting (Experiment 2)?


I admit my instructions in the replication note were confusing, so it’s very good you contacted me to ask. I’m curious why the other replicators haven’t 😊

I would like to combine the two experiments into one Qualtrics survey so that a participant is assigned to either the low-effort or the high-effort condition and is then shown both the painting and the poem for that condition. The order of display of the painting and the poem should be random.

So, two conditions:

  • 1: random display high effort poem and high effort painting
  • 2: random display low effort poem and low effort painting

I also ask that you use the DVs from both experiments, for each of the stimuli (poem / painting). If I recall correctly, they are: effort, liking, quality (these three on 1-11 scales), and price in auction (free range in USD, to be log-transformed in analysis).

There is no need to do mediation analysis for this undergrad course, but I do ask that in your pre-registration you include testing normality assumptions and corrections if there’s a violation (like log-transform, and use Welch t-test).
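A hedged Python sketch of that analysis plan (the variable names and placeholder data are mine, not from the study): check normality of the auction-price DV, log-transform it, and compare the two effort conditions with a Welch t-test.

```python
# Fake stand-in data for the two effort conditions; free-range USD prices
# are typically right-skewed, hence the lognormal placeholder.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
price_high = rng.lognormal(mean=3.0, sigma=0.8, size=40)   # high-effort condition
price_low = rng.lognormal(mean=2.5, sigma=0.8, size=40)    # low-effort condition

# Normality check on the raw DV (the pre-registered assumption test)
_, p_norm = stats.shapiro(price_high)
print(f"Shapiro-Wilk on raw prices: p = {p_norm:.3f}")

# Correction if violated: log-transform, then Welch t-test (equal_var=False)
t, p = stats.ttest_ind(np.log(price_high), np.log(price_low), equal_var=False)
print(f"Welch t-test on log prices: t = {t:.2f}, p = {p:.3f}")
```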


Q1: “I would like to combine the two experiments into one Qualtrics survey so that a participant is assigned to either the low-effort or the high-effort condition and is then shown both the painting and the poem for that condition. The order of display of the painting and the poem should be random.

So, two conditions:

1: random display high effort poem and high effort painting

2: random display low effort poem and low effort painting”

If I understand the study correctly (old studies take more time to understand, haha): for Kruger Experiment 2, participants were asked to rate the effort, liking, quality, etc. for two paintings. BOTH the “low-effort” and the “high-effort” painting (painting A 26 h, painting B 4 h; or painting A 4 h, painting B 26 h) were shown (does this mean participants go through TWO conditions?), so that participants can compare the “low-effort” painting with the “high-effort” one. Do you mean the “comparison questions” (e.g., compare the quality of the paintings [in Experiment 2, low-effort vs. high-effort], compare the effort invested in the paintings; see p. 4, lower left) should be deleted to simplify?

Q2: Also, by “combining two experiments into one”, do you mean the two experiments should be in the SAME BLOCK or DIFFERENT BLOCKS in Qualtrics (the latter is what I did for the draft)?


Q1: Right, good question, thanks for asking. It wasn’t required in my request in the replication note; I was aiming to keep things simple. However, if you’d like to add that as a bonus, that would be great and would increase the contribution. If you do, you can add that comparison between the two at the end, for both the painting and the poem.

Q2: Different blocks, please. Display should be organized by the “Survey flow”.

Editing collaborative section on a student's replication article/study

I am working on finalizing the collaborative section and realized my classmates summarized the whole article rather than the highlighted experiment. Should I just focus on the highlighted experiment in the finalization and the later article analysis? I’ve looked in the syllabus, on Moodle, at my notes from class, and online, and I asked Boley, but I am not sure whether I can just delete my classmates’ materials (including the contributors’ names in the collaborative section).

My answer:

Please focus on the highlighted experiment in terms of covering the effect, but make sure to also cover implications and real-life examples of the bias overall. Feel free to change/edit/remove anything you see fit. You will be graded based on what appears in that section, so it’s up to you to make sure it looks good. You can take or build on what your classmates wrote, or do your own thing. I removed the names for you so you’ll feel more comfortable editing/removing things.

Another problem: I realized that in jamovi, as you can see in the screenshot, the variable labels do not match the questions presented. Somehow the question row weirdly shifted to the right. This problem does not happen if I open the file in Excel (comma delimited). I am not sure if other rows shifted as well, which is also why I haven’t been using jamovi yet.
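If I had to guess, this is the usual Qualtrics-export issue: the CSV carries two extra metadata rows (question text and ImportId) under the header, which jamovi can read as data, shifting everything. A hedged pandas sketch with a mocked-up export (the column names and values are invented):

```python
# Mock of a Qualtrics CSV: line 0 = variable names, lines 1-2 = question
# text and ImportId metadata, lines 3+ = the actual responses.
import io

import pandas as pd

raw = (
    "Q1,Q2\n"
    '"How much would you pay?","How much effort was invested?"\n'
    '"{""ImportId"":""QID1""}","{""ImportId"":""QID2""}"\n'
    "12,5\n"
    "30,9\n"
)

messy = pd.read_csv(io.StringIO(raw))                    # metadata rows become data
clean = pd.read_csv(io.StringIO(raw), skiprows=[1, 2])   # drop the two metadata rows

print(messy["Q1"].tolist())   # text rows mixed in, so the column is all strings
print(clean["Q1"].tolist())   # -> [12, 30], numeric again
```

Re-exporting the cleaned frame (e.g. `clean.to_csv("clean.csv", index=False)`) and opening that in jamovi should fix the shifted labels, assuming this is indeed the cause.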
  • hku_psyc2071_and_psyc3052_-_autumn_2018-9.txt
  • Last modified: 2018/11/21 03:39
  • by filination