Commands and usage¶
Catalogue overview¶
The catalogue tool comes with four commands (engage, disengage, compare, config).
The first two commands (engage, disengage) take similar arguments and should be run as a pair: engage before an analysis and disengage after it.
config is an optional command that generates a configuration file of argument values for the other three commands.
USAGE
catalogue [-h] <command> [<arg1>] ... [<argN>]
ARGUMENTS
<command> The command to execute
<arg> The arguments of the command
GLOBAL OPTIONS
-h (--help) Display help message.
AVAILABLE COMMANDS
engage Run before an analysis. Saves hashes of `input_data` and `code`.
disengage Run after an analysis. Check `input_data` and `code` hashes now
are the same as at `engage`. Hash `output_data` and save all
hashes to a `TIMESTAMP.json` file.
compare Compare hashes.
config Create a configuration file of arguments that will be used by
the other commands.
Note that all arguments have default values, which are used when an argument is not provided. To see them, run:
catalogue <command> -h
Available commands¶
engage¶
This command is run before an analysis is conducted:
catalogue engage --input_data <data directory> --code <code directory>
Replace <data directory> and <code directory> with the path to the data and code directories. In practice, this might look something like this:
catalogue engage --input_data data_dir --code code_dir
This does several things. First, it checks that the git working tree in the code directory is clean. If it is not, catalogue offers a choice:
Working directory contains uncommitted changes.
Do you want to stage and commit all changes? (y/[n])
If we choose to proceed, catalogue stages and commits all changes in the code directory. Next, it creates a temporary .lock file in JSON format:
// catalogue_results/.lock
{
"timestamp" : {
"engage": "<timestamp (of catalogue engage)>"
},
"input_data": {
"<data directory>" : "<hash of directory>"
},
"code" : {
"<code directory>": "<latest git commit hash>"
}
}
Once catalogue is engaged, you can run your analysis.
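The .lock file shown above can be mimicked with a short sketch. This is not catalogue's own implementation: the choice of SHA-256, the way file paths are folded into the digest, and the function names hash_directory and build_lock_record are all assumptions made for illustration. The commit hash is passed in rather than read from git.

```python
import hashlib
from pathlib import Path


def hash_directory(directory):
    """Return one SHA-256 hex digest covering every file under `directory`.

    Assumption: catalogue's real hashing scheme may differ from this one.
    Files are visited in sorted order so the digest is reproducible.
    """
    digest = hashlib.sha256()
    root = Path(directory)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so that renaming a file changes the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def build_lock_record(input_data, code_dir, commit_hash, timestamp):
    """Assemble a dict with the same shape as the .lock file."""
    return {
        "timestamp": {"engage": timestamp},
        "input_data": {str(input_data): hash_directory(input_data)},
        "code": {str(code_dir): commit_hash},
    }
```

Hashing the whole directory into a single value matches the one-hash-per-directory layout of the input_data entry, and any change to a file's content or name produces a different digest.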
disengage¶
The disengage command is run immediately after an analysis finishes, to version the results.
For example, I run my analysis as an executable from the command prompt. Once it has finished, I proceed to the disengage stage:
catalogue disengage \
--input_data <data directory> \
--code <code directory> \
--output_data <results directory>
Replace all <...> with a path to the directory described. In practice, the command might look something like this:
catalogue disengage --input_data data_dir --code code_dir --output_data results_dir
Running this command checks that the input_data and code hashes match the hashes in the .lock file (created during engage). If they do, it will take hashes of the files in output_data and produce the following file in a catalogue_results directory:
// catalogue_results/<TIMESTAMP>.json
{
"timestamp" : {
"engage": "<timestamp (of .lock)>",
"disengage": "<timestamp (new)>"
},
"input_data": {
"<data directory>": "<hash of directory>"
},
"output_data": {
"<results directory>":{
"<output file 1>": "<hash of file>",
"<output file 2>": "<hash of file>",
...
}
},
"code" : {
"<code directory>": "<git commit hash>"
}
}
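The check-then-record logic described above can be sketched as follows. It is a minimal illustration rather than catalogue's actual code: the function names are hypothetical, and the source does not say what catalogue does when the hashes no longer match, so raising an error here is an assumption.

```python
def changed_sections(lock, current):
    """Return the sections whose hashes differ between engage and now."""
    return [
        section
        for section in ("input_data", "code")
        if lock.get(section) != current.get(section)
    ]


def build_results_record(lock, current, output_dir, output_hashes, disengage_time):
    """Build a dict shaped like <TIMESTAMP>.json.

    `lock` is the parsed .lock file, `current` holds freshly computed
    input_data and code hashes, and `output_hashes` maps each output
    file name to its hash.
    """
    changed = changed_sections(lock, current)
    if changed:
        # Assumption: the real tool may report a mismatch differently.
        raise ValueError("changed since engage: " + ", ".join(changed))
    return {
        "timestamp": {
            "engage": lock["timestamp"]["engage"],
            "disengage": disengage_time,
        },
        "input_data": lock["input_data"],
        "output_data": {output_dir: output_hashes},
        "code": lock["code"],
    }
```

Unlike input_data, the output_data entry keeps one hash per file, which is what later lets a comparison show which outputs changed.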
compare¶
The compare command can be used to compare two catalogue output files against each other:
catalogue compare <TIMESTAMP1>.json <TIMESTAMP2>.json
The arguments should be the paths to the two files to be compared. For example, I might want to compare results produced on different days to check nothing has changed in this period:
catalogue compare catalogue_results/200510-120000.json catalogue_results/200514-170500.json
If the hashes in the files are the same, the same analysis was run on the same data and produced the same outputs both times. In that case only the timestamps differ, and catalogue will output something like:
results differ in 1 places:
=============================
timestamp
results matched in 3 places:
==============================
input_data
code
output_data
results could not be compared in 0 places:
============================================
If only one file is provided to the compare command, then the hashes in the file are compared with hashes of the current state of the working directory. In that case, it is possible to also specify paths to the input_data, code and output_data (otherwise the default values are used).
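The three groups in that report can be reproduced with a small sketch that compares two parsed results files key by key. The function name compare_records is hypothetical, and the source does not define when a section "could not be compared", so treating a section that is missing from one file as not comparable is an assumption.

```python
def compare_records(first, second):
    """Sort the top-level keys of two records into three groups."""
    report = {"differ": [], "matched": [], "failed": []}
    for key in sorted(set(first) | set(second)):
        if key not in first or key not in second:
            # Assumption: a section present in only one file cannot be compared.
            report["failed"].append(key)
        elif first[key] == second[key]:
            report["matched"].append(key)
        else:
            report["differ"].append(key)
    return report
```

Two runs of the same analysis on the same data would then differ only in the timestamp section, as in the sample output above.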
config¶
The other commands engage, disengage and compare use a common set of arguments:
Required: --input_data, --code, --output_data.
Optional: --csv, --catalogue_results (see below).
The config command creates a configuration file with values for the above arguments.
This allows you to specify the arguments just once and have them used for each of
engage, disengage and compare. It is recommended you run this command first.
catalogue config --input_data <data directory> --code <code directory> --output_data <results directory> --csv <csv_file_name> --catalogue_results <versioning_files>
This will create a catalogue_config.yaml configuration file in the root of the repository with the following format:
input_data: <data directory>
code: <code directory>
output_data: <results directory>
csv: <csv_file_name>
catalogue_results: <versioning_files>
We can now run commands without specifying the full arguments. The arguments will instead be taken from the configuration file. Repro-catalogue uses the following priority ordering for arguments:
specified arguments > configuration arguments > default arguments
For example, running:
catalogue engage --input_data <data directory2>
is equivalent to running:
catalogue engage --input_data <data directory2> --code <code directory>
Here --input_data was specified explicitly, so that value is used. No value was given for --code,
so the parser looks in the catalogue_config.yaml file for it.
We can also manually edit the catalogue_config.yaml configuration file, but it needs to retain the illustrated keyword:value format
for it to be used. The config command also uses the same priority ordering for arguments. Rerunning config will overwrite any
previous config files and create a new one.
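The priority ordering specified > configuration > default can be sketched with a ChainMap lookup. This is an illustration of the ordering only: the function name resolve_arguments is hypothetical, and the default values in the usage example are placeholders, since the real defaults are not listed here.

```python
from collections import ChainMap


def resolve_arguments(specified, config, defaults):
    """Merge three argument sources, giving the leftmost source priority."""
    # Drop unset (None) values so they do not shadow lower-priority sources.
    specified = {k: v for k, v in specified.items() if v is not None}
    config = {k: v for k, v in config.items() if v is not None}
    return dict(ChainMap(specified, config, defaults))


# Usage: --input_data is given on the command line, --code is not.
resolved = resolve_arguments(
    specified={"input_data": "data_dir2", "code": None},
    config={"input_data": "data_dir", "code": "code_dir"},
    defaults={"input_data": "default_data", "code": "default_code"},  # placeholders
)
```

Here resolved takes input_data from the command line and code from the configuration file, and would fall back to the defaults only for keys that neither source sets.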
Optional arguments¶
--csv¶
It is possible to save the outputs from disengage to a csv rather than a json file. For this, use the --csv flag followed by the name of the file to save results to. Each new run will be appended as a new line to the csv file. For example:
catalogue disengage --input_data data_dir --code code_dir --output_data results_dir --csv hashes.csv
The compare command can then also be used with a --csv flag. In that case, one would provide the two timestamps to compare (these must exist in the csv file for the command to work):
catalogue compare 200510-120000 200514-170500 --csv hashes.csv
It is possible to provide just one timestamp instead of two and this will be compared against the state of the current working directory.
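The append-one-line-per-run behaviour, and the lookup by timestamp that compare then needs, can be sketched with the standard csv module. The column names in FIELDS and both function names are hypothetical. The layout catalogue actually writes is not described here, so this shows the idea rather than the real file format.

```python
import csv
from pathlib import Path

# Hypothetical column layout; the real csv written by catalogue may differ.
FIELDS = ["disengage", "engage", "input_data", "code", "output_data"]


def append_run(csv_path, row):
    """Append one run to the csv file, writing a header if the file is new."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


def find_run(csv_path, timestamp):
    """Return the row whose disengage timestamp matches, or None."""
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["disengage"] == timestamp:
                return row
    return None
```

A lookup that returns None corresponds to the case noted above, where a timestamp must already exist in the csv file for the comparison to work.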
--catalogue_results¶
By default, all files created by catalogue are saved in a catalogue_results directory. It is possible to change this by using the optional --catalogue_results flag. For example:
catalogue engage --input_data data_dir --code code_dir --catalogue_results versioning_files
Note that if you change the default --catalogue_results directory, you have to use this flag in each subsequent command. Also, this directory cannot be the same as the --code directory.