Best Practices for Replication Packages

Data Editors Collaborate

A number of Data Editors in the Social Sciences are coordinating reproduciblity requirements and guidelines across different journals. An example of this cooperation is this common guidance by Social Science Data Editors, and parts of the below list.

There is a set of practices that are strongly recommended for all replication packages. The following list includes some of them:

Folder Structure

We are aware that research happens in haphazard ways sometimes, without any time to waste on seemingly irrelevant housekeeping tasks. We encourage you nevertheless to try and impose a rigid folder structure on your research project from day one, when concerns about a replication package might still be far away - some folder could even remain empty for a while, but that’s ok! You will save a lot of time later on. Here are two examples:

Good.

Meaningful sub directories
top level README
code/data/output are separated

.
├── README.md
├── code
│   ├── R
│   │   ├── 0-install.R
│   │   ├── 1-main.R
│   │   ├── 2-figure2.R
│   │   └── 3-table2.R
│   ├── stata
│   │   ├── 1-main.do
│   │   ├── 2-read_raw.do
│   │   ├── 3-figure1.do
│   │   ├── 4-figure2.do
│   │   └── 5-table1.do
│   └── tex
│       ├── appendix.tex
│       └── main.tex
├── data
│   ├── processed
│   └── raw
└── output
    ├── plots
    └── tables

Notice how this example includes tex/main.tex which will produce the final product. The code folder would contain all the source code needed to produce everything in the paper, and could be placed under version control. Notice also that the content of output/ could be erased without problems - all required inputs are separated and could regenerate all results immediately.

Not So Good.

Sub directories are not helpful
File names are confusing
no README
code/data/output are not separated

.
├── 20211107ext_2v1.do
├── 20220120ext_2v1.do
├── 20221101wave1.dta
├── james
│   └── NLSY97
│       └── nlsy97_v2.do
├── mary
│   └── NLSY97
│       └── nlsy97.do
├── matlab_fortran
│   ├── graphs
│   ├── sensitivity1
│   │   ├── data.xlsx
│   │   ├── good_version.do
│   │   └── script.m
│   └── sensitivity2
│       ├── models.f90
│       ├── models.mod
│       └── nrtype.f90
├── readme.do
├── scatter1.eps
├── scatter1_1.eps
├── scatter1_2.eps
├── ts.eps
├── wave1.dta
└── wave2.dta
└── wave2regs.dta
└── wave2regs2.dta

Data

Always keep your raw data intact (i.e. read-only). Generate separate analysis datasets to perform analysis. After you copied your raw data into the package (rawdata folder), set to read only. Look here for windows tips to set read-only, on *nix systems it’s a
```
chmod 444 rawdata
```
Datasets change over time, keep a record of the date and versions you obtained. It might be difficult to obtain it in the future.

Code

Generate your analysis data using programs, many journals require you to submit them.
Separate code, data and outputs in your directory structure. This will facilitate (i) keeping versions of your code, and (ii) producing a replication package.
Stronlgy consider the benefits of version control for the reproducibility of your research. Click here for a tutorial.
Use programs/scripts as opposed to command-line instructions!
Design your code to be run all at once, and not section by section. Create a master file that calls all other subsidiary files.
Set the paths only once and at the top of the master-file, and then use either relative paths or global variables, so that users don’t have to change the paths in multiple parts of your code.
Whenever possible, write paths in a way that is compatible across different operating systems. For example, in Stata, always use forward slashes / to separate directories, even if you use Windows.
If you are using software which requires compilation of source code into executables (e.g. Fortran, C, etc.), please use make files (or else make sure you provide very detailed instructions on how to compile the different files - we accept bash shell scripts).
Name your files in a wise manner. For example, if your output generates tables and figures as new files, give them names that easily identify them. Also, name your master file as “Main”, “Master” or something similar.
Use sub-folders when your package includes a lot of files. Make sure that your package includes all relevant folders, even empty folders that will be filled with the outputs from the program.

Output

Use log-files and write to disk. Make sure that your code stores the output (both tables and figures) as files on disk (in subdirectory outputs/) as opposed to only showing it on screen while you are executing it.

Make your code produce all tables and figures as they appear in the manuscript: you will save a lot of time! We recommend splitting your code into small subunits, each of which produce a dedicated output, for instance a function figure1() which would produce and save the file outputs/figure1.pdf in your package, where outputs/figure1.pdf is included via \input{outputs/figure1.pdf} in the \(\LaTeX\) source code of your paper as as Figure 1! Of course the same holds for tables and other output. You could then include the following table in your README, greatly helping our replicators (and yourself 😉):

Output in Paper	Output in Package	Script/Program to execute
Table 1	`outputs/tables/table1.tex`	`code/table1.do`
Figure 1	`outputs/plots/figure1.pdf`	`code/figure1.do`
Figure 2	`outputs/plots/figure2.pdf`	`code/figure2.do`

Run your codes from the replication folder before you submit and make sure it runs and all your results are reproduced - ideally on another machine!
Make sure to delete all expected output from the package before you run it, so you can be sure that all output was actually produced.
Ideally, your submitted paper (your \(\LaTeX\) file which produces it) should depend on the output of your replication package, so that if a piece of output is missing, the paper cannot be compiled (or you would quickly spot the mistake).
Help us by submitting your package without any expected output, i.e. with an empty folder outputs/.