Commands and usage

Catalogue overview

The catalogue tool comes with four commands (engage, disengage, compare, config). The first two commands (engage, disengage) take similar arguments and should be run as a pair: engage before an analysis and disengage after it. config is an optional command that generates a configuration file of argument values for the other three commands.

USAGE
  catalogue [-h] <command> [<arg1>] ... [<argN>]

ARGUMENTS
  <command>       The command to execute
  <arg>           The arguments of the command

GLOBAL OPTIONS
  -h (--help)     Display help message.

AVAILABLE COMMANDS
  engage          Run before an analysis. Saves hashes of `input_data` and `code`.
  disengage       Run after an analysis. Check `input_data` and `code` hashes now
                  are the same as at `engage`. Hash `output_data` and save all
                  hashes to a `TIMESTAMP.json` file.
  compare         Compare hashes.
  config          Create a configuration file of arguments that will be used
                  by the other commands.

Note that all arguments have default values, which are used if the arguments are not provided. To see the defaults for a command, use:

catalogue <command> -h

Available commands

engage

This command is run before an analysis is conducted:

catalogue engage --input_data <data directory> --code <code directory>

Replace <data directory> and <code directory> with the path to the data and code directories. In practice, this might look something like this:

catalogue engage --input_data data_dir --code code_dir

This does several things. First, it checks that the git working tree in the code directory is clean. If it is not, catalogue asks what to do:

Working directory contains uncommitted changes.
Do you want to stage and commit all changes? (y/[n])

If we choose to proceed, catalogue will stage and commit all changes in the code directory. Next, it creates a temporary .lock file in JSON format:

// catalogue_results/.lock
{
"timestamp" : {
    "engage": "<timestamp (of catalogue engage)>"
  },
"input_data": {
     "<data directory>" : "<hash of directory>"
   },
"code" : {
     "<code directory>": "<latest git commit hash>"
     }
}
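As a rough illustration of what a record like the one above contains, here is a minimal Python sketch. It is not the tool's actual implementation: the hashing scheme (SHA-256 over sorted relative paths and file contents), the timestamp format (inferred from file names such as 200510-120000) and the function names hash_directory, latest_commit_hash and build_lock_record are all assumptions made for this example.

```python
import hashlib
import subprocess
from datetime import datetime
from pathlib import Path


def hash_directory(directory):
    """Return one SHA-256 digest covering every file under `directory`.

    Assumption: files are visited in sorted order and both the relative
    path and the file contents feed the digest. The real tool may differ.
    """
    root = Path(directory)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def latest_commit_hash(code_dir):
    """Ask git for the hash of the current commit in `code_dir`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=code_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_lock_record(input_data, code, commit_hash, now=None):
    """Assemble a dict shaped like the .lock file shown above."""
    now = now or datetime.now()
    return {
        # Assumed timestamp format, e.g. 200510-120000.
        "timestamp": {"engage": now.strftime("%y%m%d-%H%M%S")},
        "input_data": {input_data: hash_directory(input_data)},
        "code": {code: commit_hash},
    }
```

Under these assumptions, build_lock_record("data_dir", "code_dir", latest_commit_hash("code_dir")) would return a dictionary that can be written out with json.dump.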

Once catalogue is engaged, you can run your analysis.

disengage

The disengage command is run immediately after finishing an analysis to version the results.

For example, suppose my analysis consists of running my code as an executable from the command prompt. Once the code has finished running, I proceed to the disengage stage:

catalogue disengage \
  --input_data <data directory> \
  --code <code directory> \
  --output_data <results directory>

Replace all <...> with a path to the directory described. In practice, the command might look something like this:

catalogue disengage --input_data data_dir --code code_dir --output_data results_dir

Running this command checks that the input_data and code hashes match the hashes in the .lock file (created during engage). If they do, it will take hashes of the files in output_data and produce the following file in a catalogue_results directory:

// catalogue_results/<TIMESTAMP>.json
{
"timestamp" : {
     "engage": "<timestamp (of .lock)>",
     "disengage": "<timestamp (new)>"
   },
"input_data": {
     "<data directory>": "<hash of directory>"
   },
"output_data": {
       "<results directory>":{
           "<output file 1>": "<hash of file>",
           "<output file 2>": "<hash of file>",
           ...
           }
     },
"code" : {
     "<code directory>": "<git commit hash>"
     }
}
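The check-then-record logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-file SHA-256 hashing, the ValueError raised on a mismatch and the names hash_output_files and build_disengage_record are hypothetical and are not taken from the tool itself.

```python
import hashlib
from pathlib import Path


def hash_output_files(output_dir):
    """Map each file under `output_dir` to a SHA-256 digest of its contents."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def build_disengage_record(lock, current, output_dir, disengage_time):
    """Build a record shaped like the <TIMESTAMP>.json file shown above.

    `lock` is the parsed .lock content; `current` holds freshly computed
    "input_data" and "code" entries in the same shape.
    """
    # Refuse to record results if data or code changed since engage.
    for key in ("input_data", "code"):
        if lock[key] != current[key]:
            raise ValueError(f"{key} has changed since engage")
    return {
        "timestamp": {
            "engage": lock["timestamp"]["engage"],
            "disengage": disengage_time,
        },
        "input_data": lock["input_data"],
        "output_data": {output_dir: hash_output_files(output_dir)},
        "code": lock["code"],
    }
```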

compare

The compare command can be used to compare two catalogue output files against each other:

catalogue compare <TIMESTAMP1>.json <TIMESTAMP2>.json

The arguments should be the paths to the two files to be compared. For example, I might want to compare results produced on different days to check nothing has changed in this period:

catalogue compare catalogue_results/200510-120000.json catalogue_results/200514-170500.json

If the hashes in the files are the same, the same analysis was run on the same data and produced the same outputs both times. In that case only the timestamps differ, and catalogue will output something like:

results differ in 1 places:
=============================
timestamp

results matched in 3 places:
==============================
input_data
code
output_data

results could not be compared in 0 places:
============================================

If only one file is provided to the compare command, then the hashes in the file are compared with hashes of the current state of the working directory. In that case, it is possible to also specify paths to the input_data, code and output_data (otherwise the default values are used).
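The two-file comparison can be pictured as sorting the top-level keys of the two records into three groups. The Python sketch below is a hypothetical illustration, not the tool's code: in particular, treating a key that is present in only one file as "could not be compared" is an assumption, as is the function name compare_records.

```python
def compare_records(first, second):
    """Group the top-level keys of two records by how they compare."""
    # Keep the key order of the first record, then any extras from the second.
    keys = list(first) + [key for key in second if key not in first]
    result = {"differ": [], "matched": [], "not_compared": []}
    for key in keys:
        if key not in first or key not in second:
            # Assumption: a key missing from one record cannot be compared.
            result["not_compared"].append(key)
        elif first[key] == second[key]:
            result["matched"].append(key)
        else:
            result["differ"].append(key)
    return result
```

The two records would typically be loaded with json.load from the files passed to compare.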

config

The other commands engage, disengage and compare use a common set of arguments:

Required: --input_data, --code, --output_data. Optional: --csv, --catalogue_results (see below).

The config command creates a configuration file with values for the above arguments. This allows you to specify the arguments just once and have them used for each of engage, disengage and compare. It is recommended you run this command first.

catalogue config --input_data <data directory> --code <code directory> --output_data <results directory> --csv <csv_file_name> --catalogue_results <versioning_files>

This will create a catalogue_config.yaml configuration file in the root of the repository with the following format:

input_data: <data directory>
code: <code directory>
output_data: <results directory>
csv: <csv_file_name>
catalogue_results: <versioning_files>

We can now run commands without specifying the full arguments. The arguments will instead be taken from the configuration file. Repro-catalogue uses the following priority ordering for arguments:

specified arguments > configuration arguments > default arguments

For example, running:

catalogue engage --input_data <data directory2>

is equivalent to running:

catalogue engage --input_data <data directory2> --code <code directory>

Here --input_data was specified on the command line, so that value is used. No value was given for --code, so the parser instead looks in the catalogue_config.yaml file to find it.
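The priority ordering can be sketched with Python's collections.ChainMap, which looks a key up in each mapping in turn and returns the first match. This is a minimal sketch, not the parser's actual code: the function name resolve_arguments and the default values in the usage example are hypothetical.

```python
from collections import ChainMap


def resolve_arguments(specified, config, defaults):
    """Merge arguments so that specified > configuration > default."""
    # Drop arguments the user did not pass so they cannot shadow the
    # values found in the configuration file or the defaults.
    given = {key: value for key, value in specified.items() if value is not None}
    return dict(ChainMap(given, config, defaults))


if __name__ == "__main__":
    resolved = resolve_arguments(
        specified={"input_data": "data_dir2", "code": None},
        config={"input_data": "data_dir", "code": "code_dir"},
        # Hypothetical defaults, for illustration only.
        defaults={"input_data": "default_data", "code": "default_code"},
    )
    print(resolved)
```

In this example input_data resolves to data_dir2 (specified on the command line) and code resolves to code_dir (taken from the configuration).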

We can also edit the catalogue_config.yaml configuration file manually, but it must keep the keyword: value format shown above to be usable. The config command uses the same priority ordering for arguments. Rerunning config overwrites any previous configuration file with a new one.

Optional arguments

--csv

It is possible to save the outputs from disengage to a csv rather than a json file. For this, use the --csv flag followed by the name of the file to save results to. Each new run will be appended as a new line to the csv file. For example:

catalogue disengage --input_data data_dir --code code_dir --output_data results_dir --csv hashes.csv

The compare command can then also be used with a --csv flag. In that case, one would provide the two timestamps to compare (these must exist in the csv file for the command to work):

catalogue compare 200510-120000 200514-170500 --csv hashes.csv

It is possible to provide just one timestamp instead of two and this will be compared against the state of the current working directory.
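The append-and-look-up behaviour described above might be sketched in Python like this. The column names in FIELDS and the helper names append_run and find_run are hypothetical; the layout of the csv file the tool actually writes is not shown in this document.

```python
import csv
from pathlib import Path

# Hypothetical column layout, for illustration only.
FIELDS = ["timestamp", "input_data", "code", "output_data"]


def append_run(csv_path, row):
    """Append one run as a new line, writing the header for a new file."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


def find_run(csv_path, timestamp):
    """Return the row recorded for `timestamp`, or raise KeyError."""
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["timestamp"] == timestamp:
                return row
    raise KeyError(f"timestamp {timestamp} not found in {csv_path}")
```

A lookup that fails with KeyError corresponds to the requirement that both timestamps must already exist in the csv file.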

--catalogue_results

By default, all files created by catalogue are saved in a catalogue_results directory. It is possible to change this by using the optional --catalogue_results flag. For example:

catalogue engage --input_data data_dir --code code_dir --catalogue_results versioning_files

Note that if you change the default --catalogue_results directory, you have to use this flag in each subsequent command. Also, this directory cannot be the same as the --code directory.