Example usage¶
Overview¶
Imagine that on a central server we have a data repository:
├── Data folder/
│ ├── database release 1/
│ ├── database release 2/
⋮ ⋮
│ └── version index
Elsewhere, in our user directory, perhaps on another computer, things look like this.
├── latest_data/
├── latest_code/
├── results/
│ ├── old_results_with_inputs_1/
│ ├── old_results_with_inputs_2/
│ └── latest_results/
├── catalogue_results/
│ ├── TIMESTAMP1.json
│ ├── TIMESTAMP2.json
│ ├── TIMESTAMP3.json
│ └── TIMESTAMP4.json
Run analysis¶
We’ve just made some minor tweaks to our code and now we want to run our analysis. Before we start running any of the scripts in our code folder, we run:
catalogue engage --input_data latest_data --code latest_code
Now we run whatever we need to perform our analysis. Immediately after finishing this we run:
catalogue disengage --input_data latest_data --output_data results/latest_results --code latest_code
This will produce the following file:
// catalogue_results/TIMESTAMP5.json
{
    "timestamp": {
        "engage": "<timestamp (of .lock)>",
        "disengage": "<timestamp (new)>"
    },
    "input_data": {
        "latest_data": "<hash of directory>"
    },
    "output_data": {
        "results/latest_results": {
            "summary.txt": "<hash of file>",
            "output.csv": "<hash of file>",
            "metadata.json": "<hash of file>"
        }
    },
    "code": {
        "latest_code": "<git commit hash>"
    }
}
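The record stores hashes rather than copies of the files. As a rough illustration of what "hash of file" and "hash of directory" could mean, here is a minimal Python sketch. It assumes SHA-256 and a directory digest built from sorted relative paths plus per-file digests; the names `hash_file` and `hash_directory` are hypothetical, and the scheme repro-catalogue actually uses may differ.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a single file (assumed algorithm)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_directory(directory):
    """Return one SHA-256 hex digest covering every file under a directory.

    Hypothetical scheme: files are visited in sorted order, and both the
    relative path and the file's digest feed the overall digest, so renaming
    or editing any file changes the result.
    """
    root = Path(directory)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(hash_file(path).encode("ascii"))
    return digest.hexdigest()
```

With a scheme like this, two runs over an untouched `latest_data` directory give the same digest, while any edit to any file inside it gives a different one.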
Check outputs¶
Let’s suppose that between TIMESTAMP4 and TIMESTAMP5 we modified the code to output a further file, summary.txt, but that otherwise nothing has changed. We would like to check that our file output.csv hasn’t changed, but oops! We’ve just overwritten it. Luckily, we can compare against the JSON file recorded at TIMESTAMP4.
catalogue compare \
catalogue_results/TIMESTAMP4.json \
catalogue_results/TIMESTAMP5.json
Note that because we are comparing two specific files (both generated by repro-catalogue), we specify the file locations directly.
Let us also suppose that one of the other files generated by our analysis, metadata.json, includes a timestamp. The diff would look something like this:
results differ in 3 places:
=============================
timestamp
code
results/latest_results/metadata.json
results matched in 2 places:
==============================
input_data
results/latest_results/output.csv
results could not be compared in 1 places:
============================================
results/latest_results/summary.txt
Of course this is what we want:
- The code has been updated to produce summary.txt, and the timestamps have changed
- metadata.json differs, as expected, because it includes a timestamp
- Our data and our output.csv results have not changed at all
- Our new file summary.txt could not be compared as that file was not present at TIMESTAMP4
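The three-way classification in the diff above can be sketched in Python as follows. This is only an illustration under assumptions: it takes the record layout shown for TIMESTAMP5.json, and the helper names `flatten`, `compare_records` and `compare_files` are hypothetical rather than part of repro-catalogue.

```python
import json


def flatten(record):
    """Flatten a record into {label: value} pairs, as in the example diff."""
    flat = {
        "timestamp": record["timestamp"],
        "input_data": record["input_data"],
        "code": record["code"],
    }
    # Output files are listed one by one, keyed by "<directory>/<file name>".
    for directory, files in record["output_data"].items():
        for name, file_hash in files.items():
            flat[f"{directory}/{name}"] = file_hash
    return flat


def compare_records(old, new):
    """Sort every label into 'differ', 'matched' or 'not_compared'."""
    old_flat, new_flat = flatten(old), flatten(new)
    report = {"differ": [], "matched": [], "not_compared": []}
    for label in sorted(set(old_flat) | set(new_flat)):
        if label not in old_flat or label not in new_flat:
            # Present in only one record, so there is nothing to compare with.
            report["not_compared"].append(label)
        elif old_flat[label] == new_flat[label]:
            report["matched"].append(label)
        else:
            report["differ"].append(label)
    return report


def compare_files(old_path, new_path):
    """Load two JSON records from disk and compare them."""
    with open(old_path) as f_old, open(new_path) as f_new:
        return compare_records(json.load(f_old), json.load(f_new))
```

A label that appears in only one record, such as summary.txt here, ends up under `not_compared` rather than `differ`, which mirrors the distinction the output above draws.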
Alternatively, let’s suppose that our changes to the code had affected our results, so that our output.csv file has changed. In that case catalogue would alert us to the problem without us having to permanently store the output of every analysis we run. The hashes alone are not enough to recover the original TIMESTAMP4 version of the file. However, the recorded timestamp can help us track down which version of the data was used, and the git commit hash tells us exactly which version of the code was used, making it easier to reproduce those results should we wish to do so.