Example usage

Overview

Imagine that on a central server we have a data repository:

├── Data folder/
│   ├── database release 1/
│   ├── database release 2/
⋮    ⋮
│   └── version index

Elsewhere, in our user directory, perhaps on another computer, things look like this:

├── latest_data/
├── latest_code/
├── results/
│   ├── old_results_with_inputs_1/
│   ├── old_results_with_inputs_2/
│   └── latest_results/
├── catalogue_results/
│   ├── TIMESTAMP1.json
│   ├── TIMESTAMP2.json
│   ├── TIMESTAMP3.json
│   └── TIMESTAMP4.json

Run analysis

We’ve just made some minor tweaks to our code and now we want to run our analysis. Before we start running any of the scripts in our code folder, we run:

catalogue engage --input_data latest_data --code latest_code

Now we run whatever we need to perform our analysis. Immediately after finishing this we run:

catalogue disengage --input_data latest_data --output_data results/latest_results --code latest_code

This will produce the following file:

// catalogue_results/TIMESTAMP5.json
{
  "timestamp": {
    "engage": "<timestamp (of .lock)>",
    "disengage": "<timestamp (new)>"
  },
  "input_data": {
    "latest_data": "<hash of directory>"
  },
  "output_data": {
    "results/latest_results": {
      "summary.txt": "<hash of file>",
      "output.csv": "<hash of file>",
      "metadata.json": "<hash of file>"
    }
  },
  "code": {
    "latest_code": "<git commit hash>"
  }
}
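
The record stores one hash for the input directory and one hash per output file. The sketch below shows one way such values could be computed. It is only an illustration: the use of SHA-256, the way the directory digest is combined, and the function names hash_file, hash_directory_files and hash_directory are all assumptions, not repro-catalogue's actual implementation.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_directory_files(directory):
    """Map each file's path (relative to the directory) to its digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def hash_directory(directory):
    """Combine relative paths and file digests into one digest for the tree."""
    digest = hashlib.sha256()
    # Assumed scheme: feed each path and its file digest in sorted order.
    for name, file_hash in hash_directory_files(directory).items():
        digest.update(name.encode("utf-8"))
        digest.update(file_hash.encode("ascii"))
    return digest.hexdigest()
```

Under these assumptions, hash_directory("latest_data") would play the role of the single input hash, and hash_directory_files("results/latest_results") the per-file output hashes.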

Check outputs

Let’s suppose that between TIMESTAMP4 and TIMESTAMP5 we modified the code to output an additional file, summary.txt, but that otherwise nothing has changed. We would like to check that our file output.csv hasn’t changed, but oops! We’ve just overwritten it. Luckily, we can compare against the json from TIMESTAMP4.

catalogue compare \
  catalogue_results/TIMESTAMP4.json \
  catalogue_results/TIMESTAMP5.json

Note that this time, because we are comparing two specific files (both generated by repro-catalogue), we specify the file locations directly.

Let us also suppose that one of the other files generated by our analysis, metadata.json, includes a timestamp. The diff would look something like this:

results differ in 3 places:
=============================
timestamp
code
results/latest_results/metadata.json

results matched in 2 places:
==============================
input_data
results/latest_results/output.csv

results could not be compared in 1 places:
============================================
results/latest_results/summary.txt

Of course this is what we want:

  • The code has been updated to produce summary.txt, and the timestamps have changed
  • Our input data and our output.csv results have not changed at all; metadata.json differs only because it includes a timestamp
  • Our new file summary.txt could not be compared as that file was not present at TIMESTAMP4
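
The three groups in the diff can be thought of as a comparison of two hash records: entries present in both with different values, entries present in both with equal values, and entries present in only one. The sketch below is a minimal illustration of that idea. The functions flatten and compare_records are hypothetical, and unlike the real output (which reports timestamp, code and input_data as single entries), this sketch reports every leaf value separately.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into {"a/b/c": value} leaf entries."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def compare_records(old, new):
    """Group leaf entries into differing, matching and uncomparable keys."""
    old_flat, new_flat = flatten(old), flatten(new)
    shared = old_flat.keys() & new_flat.keys()
    return {
        "differ": sorted(k for k in shared if old_flat[k] != new_flat[k]),
        "matched": sorted(k for k in shared if old_flat[k] == new_flat[k]),
        # Keys found in only one record have nothing to be compared against.
        "could_not_compare": sorted(old_flat.keys() ^ new_flat.keys()),
    }
```

With records loaded via json.load from TIMESTAMP4.json and TIMESTAMP5.json, summary.txt would land in the "could_not_compare" group because it appears only in the newer record.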

Alternatively, let’s suppose that our changes to the code had affected our results, so that our output.csv file had changed. In that case catalogue would inform us of the problem without us having to permanently store the output of every analysis we run. The hashes alone would not be enough to recover the original TIMESTAMP4 version. But since we have recorded the timestamp, that information can help us track down the data version, and the git commit digest tells us exactly which version of the code was used, making it easier to reproduce those results should we wish to do so.

Share outputs

We can then send a zip file of the results to a colleague along with the hash json produced during the final analysis (TIMESTAMP5.json).

They can rerun the analysis and use catalogue to check that the json they received is the same as their own:

catalogue compare TIMESTAMP5.json
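
To make the idea behind this check concrete, the sketch below re-hashes the output files listed in a received json and reports any that are missing or differ. It is not how catalogue compare is implemented: SHA-256 is an assumed hash function, and sha256_of and mismatched_outputs are hypothetical helper names.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """SHA-256 hex digest of a file (assumed algorithm, for illustration)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatched_outputs(record_path, base_dir="."):
    """List recorded output files that are missing or hash differently."""
    record = json.loads(Path(record_path).read_text(encoding="utf-8"))
    problems = []
    for folder, files in record["output_data"].items():
        for name, expected in files.items():
            target = Path(base_dir) / folder / name
            if not target.is_file() or sha256_of(target) != expected:
                problems.append(f"{folder}/{name}")
    return sorted(problems)
```

An empty list from mismatched_outputs("TIMESTAMP5.json") would mean that every recorded output file is present and unchanged under these assumptions.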

OPTIONAL: config

The config command creates a configuration file containing argument values for the previous commands. This helps ensure that the process of using repro-catalogue is itself reproducible. Another way to share outputs is to give a colleague a copy of our user directory and, separately, a copy of our configuration file. Our colleague can then freely run catalogue engage and catalogue disengage without worrying about getting the directory paths right.

For a detailed example of using config, see the Getting started with catalogue section of this documentation.