All research complies with all relevant ethical regulations; all studies were approved by the local lab’s Institutional Review Board. The four participating labs conducted pilot and exploratory research in the social sciences pursuing their own typical practices and research interests independently of the other labs. The labs were encouraged to investigate any aspect of social-behavioural science, with the requirements that the discoveries submitted for self-confirmatory testing and replication be novel and non-obvious and not involve deception.
The labs submitted promising discoveries for self-confirmatory tests and replication if they met the following inclusion criteria: a two-group between-subjects manipulation with one focal dependent variable, with methods administered via computer online to adults within a single 20-minute study session.
Prior to conducting the self-confirmatory test, the discovering labs preregistered the study design and analysis plan, including materials, protocol, plans for data cleaning and exclusion, and specification of the analysis model. Once a self-confirmatory test was preregistered, the lab wrote a methods section to share with the other labs. These methods sections had to include everything that the discovering lab believed would be required for an independent lab to conduct an effective replication. This was done to capture the naturalistic conditions under which a researcher reads a methods section and conducts a replication based on it.
Following preregistration, no changes could be made to the methods or procedures, and all labs were committed to replicating the protocol regardless of the outcome of the self-confirmatory test. The discovering lab conducted its self-confirmatory test with about 1,500 participants, and then the project coordinator initiated the replication process with the other labs. The labs were assigned the order to conduct replications in a Latin square design to equate lab-specific effects across the order of replications (Supplementary Information section 6).
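As an illustration of the ordering scheme, the sketch below builds a cyclic 4 × 4 Latin square in which every lab appears exactly once in each row and once in each position. The lab labels and the cyclic construction are assumptions made for illustration only; the assignment actually used is described in Supplementary Information section 6.

```python
def cyclic_latin_square(items):
    """Return a Latin square in which row i is `items` rotated left by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


# Hypothetical lab labels, used only for illustration.
labs = ["Lab A", "Lab B", "Lab C", "Lab D"]
square = cyclic_latin_square(labs)

# Each row is one ordering of labs; each column is one position in the sequence.
for row in square:
    print(" -> ".join(row))
```

Because each lab occupies each position exactly once across the four rows, lab-specific effects are balanced over the order in which replications are run.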
Sharing study descriptions
After a lab identified an ostensible discovery for a self-confirmatory test, they distributed a description of the methodological details that they believed would be required for an independent lab to run a replication. When the replicating labs considered the instructions to be ambiguous on a meaningful part of the design (71% of studies), the replicating labs sought clarifications about methodology from the discovering lab. Usually these were trivial clarifications or confirmations, but not always (Supplementary Information section 2).
Replications were done sequentially following the same protocol as the self-confirmatory tests, including preregistration. Variation from the 1,500 participants per study was due to idiosyncrasies in how the panels and labs managed participant flow and the application of preregistered exclusion criteria. In most cases, the panels allowed more participants to complete the questionnaire.
The discovering labs could specify required exclusion criteria, such as attention checks. The replicating labs could also choose to preregister and implement exclusions for attention checks following their own laboratory’s best practices. This was done to capture the natural way researchers conduct replications using their own view of best practices. To maintain the ecological validity of labs conducting research in their own style, and to maximize the independence of each replication, all sharing of materials was managed by a project coordinator to prevent unintended communication of designs or results.
Main studies
Sixteen new discoveries of social-behavioural phenomena were submitted to self-confirmatory testing and replication, four from each of the participating laboratories. Table 1 catalogues the new discoveries with a brief name, a one-sentence summary of the finding and a citation to the research. Supplementary Table 3 provides links to comprehensive information for each self-confirmatory test and replication, including the preregistration with the design and analysis plan, research materials, data, analysis code, analysis output and written reports of the methods and results.
Participants
The population of interest for the self-confirmatory tests and replications was adults living in the United States who could read and write in English. The participants were members of panels that had been recruited through non-probability sampling methods to complete online questionnaires in return for small amounts of money or redeemable ‘points’40,41. Labs contracted with different sample providers to provide participants (Stanford University: Toluna, SSI and Dynata; University of California, Santa Barbara: CriticalMix; University of California, Berkeley: Luth; University of Virginia: SoapBox Sample and Lightspeed GMI). We used different sample providers to minimize potential overlap in sampling, although we cannot rule out that some participants belonged to multiple panels and completed our studies more than once through different panels. These samples were taken from the providers’ online, opt-in, non-probability panels. The sample providers were instructed to provide American adults drawn in a stratified way with unequal probabilities of selection from the panels so that the people who completed each survey would resemble the nation’s adult population (according to the most recently available Current Population Survey, conducted by the US Census Bureau) in terms of gender, age, education, ethnicity (Hispanic versus not), race (allowing each respondent to select more than one race), region and income. This method produced samples designed to look similar to probability samples on the matched characteristics, but the samples may still have differed in unknown ways on unmatched characteristics. The sample providers may have varied in their success at achieving representativeness. A potential lack of adherence to that sampling plan was non-consequential for the conducted studies.
For none of the discoveries were the findings presumed to be limited to a subsample of adults, although there may have been a priori or post facto hypothesizing about moderation by demographic variables. For the pilot and exploratory studies, the labs used whatever samples they wished (for example, panel, MTurk or participants visiting the laboratory).
Blinding and sample-splitting manipulations
Two planned manipulations of secondary interest were included to explore potential reasons for variation in the replicability rate or its decline over time. One involved randomly assigning participant recruitment for each data collection of 1,500 participants into a first and second wave of 750 to investigate declines in ES across a single data collection. We assign less confidence to this manipulation, however, as not all panels may have consistently followed our strict protocols for this random assignment (see Supplementary Information section 7 for all additional procedures that the labs and sample providers were instructed to follow). The second manipulation randomly assigned 8 of the 16 new discoveries (2 from each team) to blind the results of the primary outcome variable from the self-confirmatory tests and replications for all team members until all replications for that finding had been completed. For the other 8 discoveries, the data were analysed and reported to the other teams as the results became available. This was to determine whether explicitly blinding research findings would moderate replicability rates and/or declining ESs across replications24,25.
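The two randomizations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure the labs or sample providers actually ran: the function names, the discovery labels, the participant identifiers and the use of a seeded pseudo-random generator are all hypothetical.

```python
import random


def assign_blinding(discoveries_by_lab, per_lab=2, seed=None):
    """Randomly pick `per_lab` discoveries from each lab to be blinded."""
    rng = random.Random(seed)
    blinded = set()
    for discoveries in discoveries_by_lab.values():
        blinded.update(rng.sample(discoveries, per_lab))
    return blinded


def split_into_waves(participant_ids, seed=None):
    """Randomly split participants into a first and a second wave of equal size."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical labels: four labs with four discoveries each (16 in total).
discoveries_by_lab = {
    f"lab_{k}": [f"lab_{k}_study_{i}" for i in range(1, 5)] for k in "ABCD"
}
blinded = assign_blinding(discoveries_by_lab, seed=1)

# Hypothetical participant identifiers for one data collection of 1,500.
wave_1, wave_2 = split_into_waves(range(1500), seed=1)
```

Sampling within each lab, rather than across all 16 discoveries at once, guarantees the 2-per-team balance described in the text.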
Confirmatory analysis
Meta-analysis
All analyses used meta-analytic models estimated with restricted maximum likelihood, as implemented in the metafor package (version 4.2-0) for R version 4.2.2 (refs. 42,43). For single-level models, Knapp–Hartung corrections for standard errors were used. For multilevel models, cluster-robust variance estimation with small-sample corrections was used to account for the limited number of independent studies40. Preregistration of the overall analysis plan is available at https://osf.io/6t9vm.
We summarized the overall distribution of effects using a multilevel meta-analysis model, including fixed effects to distinguish replications from self-confirmatory tests, with random effects for each unique discovery and each unique ES nested within discovery27. The study-level variance component describes heterogeneity in the phenomena investigated in different studies and labs. The ES-level variance component describes heterogeneity across replications of the same phenomena.
Confirmation versus self-replication and independent replications
A random-effects meta-analysis was estimated to analyse the differences between the self-confirmatory test and the replication of the same discovery by the same lab. A negative average change would be evidence of declining replication ES, even when conducted by the same investigators.
Comparing self-confirmatory tests to replication results from other labs allows for assessment of the impact of between-lab differences in replicability success. Again, a random-effects meta-analysis was used to analyse differences between the ES in the self-confirmatory test and the average ES estimate in the three independent replications. Negative average differences would be evidence of declining replication ESs in cross-lab replication. The random-effects model provides an estimate of heterogeneity in the differences between self-confirmatory tests and replications beyond what would be expected by sampling error alone. Positive heterogeneity would indicate that ESs from self-confirmatory tests could not be exactly replicated by independent labs.
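To make the comparison concrete, the sketch below computes, for one discovery, the difference between the average ES of the independent replications and the ES of the self-confirmatory test, and then pools such differences in a random-effects model. Several simplifications are assumptions of this sketch and not the reported analysis: the between-study variance is estimated with the DerSimonian-Laird moment estimator in place of restricted maximum likelihood, estimates are treated as independent, the difference is signed so that negative values indicate decline, and all function names are hypothetical.

```python
def difference_and_variance(es_self, var_self, es_reps, var_reps):
    """Mean replication ES minus self-confirmatory ES, with its sampling variance.

    Assumes all estimates are independent, so variances add.
    """
    k = len(es_reps)
    mean_rep = sum(es_reps) / k
    var_mean_rep = sum(var_reps) / k**2  # variance of an unweighted mean
    return mean_rep - es_self, var_self + var_mean_rep


def random_effects_dl(effects, variances):
    """Random-effects pooling with a DerSimonian-Laird tau^2 (stand-in for REML).

    Returns the pooled estimate, tau^2 and a Knapp-Hartung standard error.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q and the moment estimator of between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight with tau^2 added to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    # Knapp-Hartung adjusted variance of the pooled estimate.
    kh_var = sum(
        wi * (yi - pooled) ** 2 for wi, yi in zip(w_star, effects)
    ) / ((k - 1) * sum(w_star))
    return pooled, tau2, kh_var**0.5
```

In this sketch, a pooled estimate below zero corresponds to a decline from the self-confirmatory test to the replications, and a tau^2 above zero corresponds to heterogeneity in that change beyond sampling error.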
Slope across replications
According to one theory, declines in ESs over time are caused by a study being repeatedly run25. If this theory is accurate, the more studies that are run between the self-confirmatory test and the self-replication, the greater the decline should be. To examine temporal decline effects across all replications, we aggregated ES estimates from each self-confirmatory test with each of the replications and fitted a meta-analytic growth curve model. The model also included random effects for each self-confirmatory test or replication attempt of each study that were allowed to covary within study according to an auto-regressive structure. The ESs were recoded for this analysis so that all effects were positive and a slope to non-significance or weakening ES would be negative in sign.
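The recoding and the sign convention for the slope can be illustrated with a deliberately simplified sketch. It is not the multilevel auto-regressive growth curve model described above: it fits an inverse-variance weighted straight line to the estimates of a single discovery, and it assumes that recoding means flipping all estimates so that the self-confirmatory effect is positive. The function names are hypothetical.

```python
def recode_positive(effects):
    """Flip signs so the first (self-confirmatory) estimate is positive.

    Assumption of this sketch: the direction of the self-confirmatory
    estimate defines the predicted direction of the effect.
    """
    sign = 1.0 if effects[0] >= 0 else -1.0
    return [sign * e for e in effects]


def weighted_slope(orders, effects, variances):
    """Inverse-variance weighted least-squares slope of ES on study order."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, orders)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxy = sum(
        wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, orders, effects)
    )
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, orders))
    return sxy / sxx
```

After recoding, an effect that shrinks towards zero over successive studies yields a negative slope regardless of the original direction of the effect.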
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.