Example usage¶

Contents¶

Overview
Run analysis
Check outputs
Share outputs

Overview¶

Imagine that on a central server we have a data repository:

├── Data folder/
│   ├── database release 1/
│   ├── database release 2/
⋮    ⋮
│   └── version index

Elsewhere, in our user directory, perhaps on another computer, things look like this.

├── latest_data/
├── latest_code/
├── results/
│   ├── old_results_with_inputs_1/
│   ├── old_results_with_inputs_2/
│   └── latest_results/
├── catalogue_results/
│   ├── TIMESTAMP1.json
│   ├── TIMESTAMP2.json
│   ├── TIMESTAMP3.json
│   └── TIMESTAMP4.json

Run analysis¶

We’ve just made some minor tweaks to our code and now we want to run our analysis. Before we start running any of the scripts in our code folder, we run:

catalogue engage --input_data latest_data --code latest_code

Now we run whatever we need to perform our analysis. Immediately after it finishes, we run:

catalogue disengage --input_data latest_data --output_data results/latest_results --code latest_code

This will produce the following file:

// catalogue_results/TIMESTAMP5.json
{
  "timestamp": {
    "engage": "<timestamp (of .lock)>",
    "disengage": "<timestamp (new)>"
  },
  "input_data": {
    "latest_data": "<hash of directory>"
  },
  "output_data": {
    "results/latest_results": {
      "summary.txt": "<hash of file>",
      "output.csv": "<hash of file>",
      "metadata.json": "<hash of file>"
    }
  },
  "code": {
    "latest_code": "<git commit hash>"
  }
}
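
The record stores hashes rather than the data itself, so it stays small however large the inputs and outputs are. As a rough illustration of how such hashes could be computed, here is a minimal sketch. It assumes SHA-256, which is our choice for the example and not necessarily the algorithm catalogue uses, and the function names `hash_file`, `hash_directory` and `hash_output_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_directory(directory):
    """Return a single digest for a whole directory (as for input_data).

    Files are visited in sorted order so the result is reproducible, and
    each relative path is mixed in so that renaming a file changes the hash.
    """
    root = Path(directory)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(hash_file(path).encode("ascii"))
    return digest.hexdigest()


def hash_output_files(directory):
    """Return one digest per file (as for output_data), keyed by relative path."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Hashing inputs as one directory and outputs file by file mirrors the record above: a single changed input invalidates the whole run, whereas outputs are worth checking individually.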

Check outputs¶

Let’s suppose that between TIMESTAMP4 and TIMESTAMP5 we modified the code to output a further file, summary.txt, but that otherwise nothing has changed. We would like to check that our file output.csv hasn’t changed, but oops! We’ve just overwritten it. Luckily, we can compare against the JSON file recorded at TIMESTAMP4.

catalogue compare \
  catalogue_results/TIMESTAMP4.json \
  catalogue_results/TIMESTAMP5.json

Note that because we are comparing two specific files (both generated by repro-catalogue), we specify their locations directly.

Let us also suppose that one of the other files generated by our analysis, metadata.json, includes a timestamp. The diff would look something like this:

results differ in 3 places:
=============================
timestamp
code
results/latest_results/metadata.json

results matched in 2 places:
==============================
input_data
results/latest_results/output.csv

results could not be compared in 1 places:
============================================
results/latest_results/summary.txt

Of course this is what we want:

The code has been updated to produce summary.txt, and the timestamps have changed (including the one inside metadata.json, which changes that file’s hash)
Our input data and our output.csv results have not changed at all
Our new file summary.txt could not be compared as that file was not present at TIMESTAMP4
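
The three-way classification above can be sketched in a few lines of Python. This is only an illustration of the idea, not catalogue’s actual implementation: the helper names `flatten` and `compare_records` are hypothetical, and we assume records shaped like the JSON shown earlier, with input data compared as a whole and outputs compared file by file.

```python
def flatten(record):
    """Turn a record into a flat {name: value} mapping.

    Top-level entries are kept whole; output files are keyed by
    "<output directory>/<file name>".
    """
    flat = {}
    for key in ("timestamp", "input_data", "code"):
        if key in record:
            flat[key] = record[key]
    for directory, files in record.get("output_data", {}).items():
        for name, file_hash in files.items():
            flat[f"{directory}/{name}"] = file_hash
    return flat


def compare_records(old, new):
    """Sort every entry into 'differ', 'matched' or 'not_compared'."""
    old_flat, new_flat = flatten(old), flatten(new)
    report = {"differ": [], "matched": [], "not_compared": []}
    for key in sorted(set(old_flat) | set(new_flat)):
        if key not in old_flat or key not in new_flat:
            # Present in only one record, so there is nothing to compare against.
            report["not_compared"].append(key)
        elif old_flat[key] == new_flat[key]:
            report["matched"].append(key)
        else:
            report["differ"].append(key)
    return report
```

Treating an entry that exists in only one record as "could not be compared", rather than as a difference, is what keeps a newly added output such as summary.txt from looking like a regression.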

Alternatively, let’s suppose that our changes to the code had affected our results, so that output.csv had changed. In that case catalogue would alert us to the problem without us having to permanently store the output of every analysis we run. The hashes alone are not enough to recover the original TIMESTAMP4 version of the file. However, the recorded timestamp can help us track down which data version was used, and the git commit hash tells us exactly which version of the code was used, making it easier to reproduce those results should we wish to do so.
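
As a sketch of that last step, the details needed to attempt a reproduction can be read back from a stored record. The function name `reproduction_hints` is hypothetical, and we assume the record is plain JSON with the layout shown earlier.

```python
import json


def reproduction_hints(record_path):
    """Collect the fields of a record that help reproduce an old run."""
    with open(record_path, encoding="utf-8") as f:
        record = json.load(f)
    return {
        # Commit hash for each code directory that was tracked.
        "commits": dict(record["code"]),
        # When the analysis started, to help identify the data release in use.
        "engaged_at": record["timestamp"]["engage"],
        # Directory hashes to verify once the old data has been located.
        "input_hashes": dict(record["input_data"]),
    }
```

With the commit hash in hand, running git checkout on it inside the code directory restores the code as it was for that run. The input hash can then confirm that the data we have located is the same data that was used.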
