# CLI tools for SARIF processing

Each of these tools presents a high-level command-line interface to extract a specific subset of information from a SARIF file.

The main tools are: `sarif-extract-scans-runner`, `sarif-aggregate-scans`, `sarif-create-aggregate-report`.

Each tool can print its options and description, for example: `sarif-extract-scans-runner --help`.

The tools were implemented using Python 3.9.

# SARIF format information

The tools operate on SARIF generated by LGTM 1.27.0 (by default) or by the CodeQL CLI (enabled by giving the `-f` flag a value of `CLI`).
The supported format is [SARIF v2.1.0](https://docs.oasis-open.org/sarif/sarif/v2.1.0/csprd01/sarif-v2.1.0-csprd01.html).

The values that the `-f` flag accepts are `LGTM` and `CLI`.

The CLI versions used during development of the CLI support were 2.6.3, 2.9.4, and 2.11.4. Minimal tests are also run against the versions in [this build script](./build-multiple-codeql-versions.sh). Currently, those are 2.9.4, 2.12.7, 2.13.5, and 2.14.0.

The CLI SARIF **MUST** contain one additional property, `versionControlProvenance`, which needs to look like:
```
"versionControlProvenance": [
  {
    "repositoryUri": "https://github.com/testorg/testrepo.git",
    "revisionId": "testsha"
  }
]
```

The script `bin/sarif-insert-vcp` will add that entry to a SARIF file.

# Test Setup

This repository includes some test data (in `data`) and uses `git lfs` for storing those test files; installation steps are at [git-lfs](https://git-lfs.github.com). On a Mac with Homebrew, install it via

```sh
brew install git-lfs
git lfs install
```

# Tool Setup

Set up the virtual environment and install the packages:
```
python3.9 -m venv .venv
. .venv/bin/activate
```

### For development
```
# Use requirementsDEV.txt
python -m pip install -r requirementsDEV.txt
```

### For distribution
```
# Use requirements.txt
python -m pip install -r requirements.txt
```

Then install:
```
pip install -e .
```

# Tool Use

## sarif-extract-scans-runner

Parses the SARIF results into a result set of 4 CSV files located under a directory structure like:
```
├── results-log.scanlog
├── results-log.csv
├── results.sarif.scanspec
├── results.sarif.scantables
    ├── codeflows.csv
    ├── projects.csv
    ├── results.csv
    └── scans.csv
```
where `codeflows.csv`, `projects.csv`, `results.csv`, and `scans.csv` are the consumable parsed output of the analysis. `results-log.scanlog` contains a raw log of any errors encountered while parsing the SARIF, and `results-log.csv` contains a summary of the scanlog contents.

### Sample usage 1 -- no separate timestamps file
```
python bin/sarif-extract-scans-runner sarif-files.txt -o
```
where `sarif-files.txt` lists the SARIF files to process, one per line, each entry of the form `/`, like:
```
data/wxWidgets_wxWidgets__2021-11-21_16_06_30__export.sarif
data/torvalds_linux__2021-10-21_10_07_00__export.sarif
```

When called this way, `sarif-pad-aggregate` *should* be run afterwards, because it overwrites the single-date timestamps with a random 1-year range.

### Sample usage 2 -- with separate timestamps file

When a separate `timestamps.json` file is available and has the form

    timestamps = {
        "db_create_start" : "2023-07-03T00:56:15.576222",
        "db_create_stop" : ...,
        "scan_start_date" : ...,
        "scan_stop_date" : ...,
    }

or

    {
        "db_create_start": ...,
        "db_create_stop": ...,
        "scan_start": ...,
        "scan_stop": ...
    }

the runner can be called via, e.g.,
```sh
sarif-extract-scans-runner --input-signature CLI --with-timestamps - <
```

## sarif-pad-aggregate

**Optional**

Post-fills the `scans.csv` file with more realistic (but still fake) values for the following columns: `db_create_start`, `db_create_stop`, `scan_start_date`, `scan_stop_date`.

These values are not in the input SARIF, and it may be beneficial to have date values near the present. Otherwise `sarif-extract-scans-runner` will have populated these columns with the value `1970-01-01`.
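
To illustrate what this kind of padding amounts to, here is a minimal Python sketch. It is not the implementation of `sarif-pad-aggregate`: the function name `pad_scan_dates`, the one-year window ending at `now`, the one-hour spacing, and the ISO timestamp format are all assumptions made for the example. Only the four column names and the `1970-01-01` placeholder come from the description above.

```python
import csv
import io
import random
from datetime import datetime, timedelta

# Column names taken from the description above.
DATE_COLUMNS = ["db_create_start", "db_create_stop", "scan_start_date", "scan_stop_date"]


def pad_scan_dates(rows, now, rng):
    """Return copies of rows whose four date columns hold fake, increasing
    timestamps that fall within the year before `now`.

    Hypothetical helper: the real tool may choose its values differently.
    """
    padded = []
    for row in rows:
        # Start at least one day back so the latest offset (3 hours) stays before `now`.
        start = now - timedelta(days=rng.uniform(1, 365))
        new_row = dict(row)
        # Space the four events one hour apart so their order stays plausible.
        for offset_hours, column in enumerate(DATE_COLUMNS):
            new_row[column] = (start + timedelta(hours=offset_hours)).isoformat()
        padded.append(new_row)
    return padded


# Small in-memory stand-in for a scans.csv file (the `id` column is illustrative).
sample = io.StringIO(
    "id,db_create_start,db_create_stop,scan_start_date,scan_stop_date\n"
    "1,1970-01-01,1970-01-01,1970-01-01,1970-01-01\n"
)
rows = list(csv.DictReader(sample))
padded_rows = pad_scan_dates(rows, now=datetime(2023, 7, 3), rng=random.Random(0))
```

Passing the random generator and the reference time as arguments keeps the sketch reproducible; the input rows are copied rather than modified in place.
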
### Sample usage
```
python bin/sarif-pad-aggregate
```

## sarif-create-aggregate-report

Parses the `results-log.csv` files generated for some batch of input SARIF files and creates a final summary report in `summary-report.csv` (unless otherwise specified).

### Sample usage
```
python bin/sarif-create-aggregate-report sarif-files.txt -in
```

# Sample Data Information

The query results in `data/` are taken from lgtm.com, which ran the

    ql/$LANG/ql/src/codeql-suites/$LANG-lgtm.qls

queries.

The linux kernel has both single-location results (`"kind": "problem"`) and path results (`"kind": "path-problem"`). It also has results for multiple source languages. The subset of files referenced by the SARIF results is in `data/linux-small/` and is taken from
```
"versionControlProvenance": [
  {
    "repositoryUri": "https://github.com/torvalds/linux.git",
    "revisionId": "d9abdee5fd5abffd0e763e52fbfa3116de167822"
  }
]
```

The wxWidgets library has both single-location results (`"kind": "problem"`) and path results (`"kind": "path-problem"`). The subset of files referenced by the SARIF results is in `data/wxWidgets-small/` and is taken from
```
"repositoryUri": "https://github.com/wxWidgets/wxWidgets.git",
"revisionId": "7a03d5fe9bca2d2a2cd81fc0620bcbd2cbc4c7b0"
```
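
As a quick way to check that a SARIF file carries provenance like that shown above, the following Python sketch lists each run's `versionControlProvenance` entries and counts results with and without `codeFlows`. It relies only on property names defined by SARIF v2.1.0 (`runs`, `results`, `codeFlows`, `versionControlProvenance`). The function name `summarize_sarif`, the inline sample document with its rule IDs, and the assumption that path results are the ones carrying `codeFlows` are illustrative and not taken from the tools in this repository.

```python
import json


def summarize_sarif(sarif):
    """Summarize each run of a parsed SARIF v2.1.0 document.

    Hypothetical helper for illustration; not part of this repository's tools.
    """
    summaries = []
    for run in sarif.get("runs", []):
        results = run.get("results", [])
        # Assumption: path results are the ones with a non-empty `codeFlows` array.
        with_flows = sum(1 for r in results if r.get("codeFlows"))
        summaries.append({
            "provenance": [
                (vcp.get("repositoryUri"), vcp.get("revisionId"))
                for vcp in run.get("versionControlProvenance", [])
            ],
            "path_results": with_flows,
            "single_location_results": len(results) - with_flows,
        })
    return summaries


# Tiny inline document standing in for a real file such as those under data/.
sample_text = """
{
  "version": "2.1.0",
  "runs": [
    {
      "versionControlProvenance": [
        {
          "repositoryUri": "https://github.com/wxWidgets/wxWidgets.git",
          "revisionId": "7a03d5fe9bca2d2a2cd81fc0620bcbd2cbc4c7b0"
        }
      ],
      "results": [
        {"ruleId": "example/rule-a"},
        {"ruleId": "example/rule-b", "codeFlows": [{"threadFlows": []}]}
      ]
    }
  ]
}
"""
summary = summarize_sarif(json.loads(sample_text))
print(summary)
```

For a real file, parse its contents with `json.load` instead of using the inline sample. A run whose `provenance` list comes back empty is missing the `versionControlProvenance` property.
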