Best Practices for Replication Packages
The Data Editors of the American Economic Association (Lars Vilhuber) the Review of Economic Studies (Miklós Koren), the Canadian Journal of Economics (Marie Connolly), the Econometric Society (Joan Llull) and the Economic Journal and Econometrics Journal (Florian Oswald) are coordinating reproduciblity requirements and guidelines across different journals. An example of this cooperation is this common guidance by Social Science Data Editors, and parts of the below list.
There is a set of practices that are strongly recommended for all replication packages. The following list includes some of them:
Folder Structure
- We are aware that research happens in haphazard ways sometimes, without any time to waste on seemingly irrelevant housekeeping tasks. We encourage you nevertheless to try and impose a rigid folder structure on your research project from day one, when concerns about a replication package might still be far away - some folder could even remain empty for a while, but that’s ok! You will save a lot of time later on. Here are two examples:
Good.
- Meaningful sub directories
- top level
README
- code/data/output are separated
.
├── README.md
├── code
│ ├── R
│ │ ├── 0-install.R
│ │ ├── 1-main.R
│ │ ├── 2-figure2.R
│ │ └── 3-table2.R
│ ├── stata
│ │ ├── 1-main.do
│ │ ├── 2-read_raw.do
│ │ ├── 3-figure1.do
│ │ ├── 4-figure2.do
│ │ └── 5-table1.do
│ └── tex
│ ├── appendix.tex
│ └── main.tex
├── data
│ ├── processed
│ └── raw
└── output
├── plots
└── tables
Notice how this example includes tex/main.tex
which will produce the final product. The code
folder would contain all the source code needed to produce everything in the paper, and could be placed under version control. Notice also that the content of output/
could be erased without problems - all required inputs are separated and could regenerate all results immediately.
Not So Good.
- Sub directories are not helpful
- File names are confusing
- no
README
- code/data/output are not separated
.
├── 20211107ext_2v1.do
├── 20220120ext_2v1.do
├── 20221101wave1.dta
├── james
│ └── NLSY97
│ └── nlsy97_v2.do
├── mary
│ └── NLSY97
│ └── nlsy97.do
├── matlab_fortran
│ ├── graphs
│ ├── sensitivity1
│ │ ├── data.xlsx
│ │ ├── good_version.do
│ │ └── script.m
│ └── sensitivity2
│ ├── models.f90
│ ├── models.mod
│ └── nrtype.f90
├── readme.do
├── scatter1.eps
├── scatter1_1.eps
├── scatter1_2.eps
├── ts.eps
├── wave1.dta
└── wave2.dta
└── wave2regs.dta
└── wave2regs2.dta
Data
Always keep your raw data intact (i.e. read-only). Generate separate analysis datasets to perform analysis. After you copied your raw data into the package (
rawdata
folder), set to read only. Look here for windows tips to set read-only, on*nix
systems it’s achmod 444 rawdata
Datasets change over time, keep a record of the date and versions you obtained. It might be difficult to obtain it in the future.
Code
- Generate your analysis data using programs, many journals require you to submit them.
- Separate
code
,data
andoutputs
in your directory structure. This will facilitate (i) keeping versions of your code, and (ii) producing a replication package. - Stronlgy consider the benefits of version control for the reproducibility of your research. Click here for a tutorial.
- Use programs/scripts as opposed to command-line instructions!
- Design your code to be run all at once, and not section by section. Create a master file that calls all other subsidiary files.
- Set the paths only once and at the top of the master-file, and then use either relative paths or global variables, so that users don’t have to change the paths in multiple parts of your code.
- Whenever possible, write paths in a way that is compatible across different operating systems. For example, in Stata, always use forward slashes
/
to separate directories, even if you use Windows. - If you are using software which requires compilation of source code into executables (e.g. Fortran, C, etc.), please use
make
files (or else make sure you provide very detailed instructions on how to compile the different files - we acceptbash
shell scripts). - Name your files in a wise manner. For example, if your output generates tables and figures as new files, give them names that easily identify them. Also, name your master file as “Main”, “Master” or something similar.
- Use sub-folders when your package includes a lot of files. Make sure that your package includes all relevant folders, even empty folders that will be filled with the outputs from the program.
Output
Use log-files and write to disk. Make sure that your code stores the output (both tables and figures) as files on disk (in subdirectory
outputs/
) as opposed to only showing it on screen while you are executing it.Make your code produce all tables and figures as they appear in the manuscript: you will save a lot of time! We recommend splitting your code into small subunits, each of which produce a dedicated output, for instance a function
figure1()
which would produce and save the fileoutputs/figure1.pdf
in your package, whereoutputs/figure1.pdf
is included via\input{outputs/figure1.pdf}
in the \(\LaTeX\) source code of your paper as as Figure 1! Of course the same holds for tables and other output. You could then include the following table in yourREADME
, greatly helping our replicators (and yourself 😉):Output in Paper Output in Package Script/Program to execute Table 1 outputs/tables/table1.tex
code/table1.do
Figure 1 outputs/plots/figure1.pdf
code/figure1.do
Figure 2 outputs/plots/figure2.pdf
code/figure2.do
Run your codes from the replication folder before you submit and make sure it runs and all your results are reproduced - ideally on another machine!
Make sure to delete all expected output from the package before you run it, so you can be sure that all output was actually produced.
Ideally, your submitted paper (your \(\LaTeX\) file which produces it) should depend on the output of your replication package, so that if a piece of output is missing, the paper cannot be compiled (or you would quickly spot the mistake).
Help us by submitting your package without any expected output, i.e. with an empty folder
outputs/
.