20 KiB
- End-to-end example of CLI use
- Database Aquisition
- Repository Selection
- Starting the server
- Running the gh-mrva command-line client
- Running the VS Code plugin
- Footnotes
End-to-end example of CLI use
This document describes a complete cycle of the MRVA workflow. The steps included are
- aquiring CodeQL databases
- selection of databases
- configuration and use of the command-line client
- server startup
- submission of the jobs
- retrieval of the results
- examination of the results
Database Aquisition
General database aquisition is beyond the scope of this document as it is very specific to an organization's environment. Here we use an example for open-source repositories, mrva-open-source-download, which downloads the top 1000 databases for each of C/C++, Java, Python – 3000 CodeQL DBs in all.
The scripts in mrva-open-source-download were used to download on two distinct dates resulting in close to 6000 databases to choose from. The DBs were directly saved to the file system, resulting in paths like
.../mrva-open-source-download/repos-2024-04-29/google/re2/code-scanning/codeql/databases/cpp/db.zip
and
.../mrva-open-source-download/repos/google/re2/code-scanning/codeql/databases/cpp/db.zip
Note that the only information in these paths are (owner, repository, download date). The databases contain more information which is used in the Repository Selection section.
To get a collection of databases follow the instructions.
Repository Selection
Here we select a small subset of those repositories using a collection scripts made for the purpose, the qldbtools package. Clone the full repository before continuing:
mkdir -p ~/work-gh/mrva/
git clone git@github.com:hohn/mrvacommander.git
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
After performing the installation steps, we can follow the command line use instructions to collect all the database information from the file system into a single table:
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
source venv/bin/activate
./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > scratch/db-info-1.csv
The csvstat tool gives a good overview1; here is a pruned version of the
output
csvstat scratch/db-info-1.csv
1. "ctime"
Type of data: DateTime
...
2. "language"
Type of data: Text
Non-null values: 6000
Unique values: 3
Longest value: 6 characters
Most common values: cpp (2000x)
java (2000x)
python (2000x)
3. "name"
...
4. "owner"
Type of data: Text
Non-null values: 6000
Unique values: 2189
Longest value: 29 characters
Most common values: apache (258x)
google (86x)
microsoft (64x)
spring-projects (56x)
alibaba (42x)
5. "path"
...
6. "size"
Type of data: Number
Non-null values: 6000
Unique values: 5354
Smallest value: 0
Largest value: 1,885,008,701
Sum: 284,766,326,993
...
Row count: 6000
The information critial for selection are the columns
- owner
- name
- language
The size column is interesting: a smallest value of 0 indicates some error while our largest DB is 1.88 GB in size
This information is not sufficient, so we collect more. The following script extracts information from every database on disk and takes more time accordingly – about 30 seconds on my laptop.
./bin/mc-db-refine-info < scratch/db-info-1.csv > scratch/db-info-2.csv
This new table is a merge of all the available meta-information with the previous table causing the increase in the number of rows. The following columns are now present
0:$ csvstat scratch/db-info-2.csv
1. "ctime"
2. "language"
3. "name"
4. "owner"
5. "path"
6. "size"
7. "left_index"
8. "baselineLinesOfCode"
Type of data: Number
Contains null values: True (excluded from calculations)
Non-null values: 11920
Unique values: 4708
Smallest value: 0
Largest value: 22,028,732
Sum: 3,454,019,142
Mean: 289,766.707
Median: 54,870.5
9. "primaryLanguage"
10. "sha"
Type of data: Text
Contains null values: True (excluded from calculations)
Non-null values: 11920
Unique values: 4928
11. "cliVersion"
Type of data: Text
Contains null values: True (excluded from calculations)
Non-null values: 11920
Unique values: 59
Longest value: 6 characters
Most common values: 2.17.0 (3850x)
2.18.0 (3622x)
2.17.2 (1097x)
2.17.6 (703x)
2.16.3 (378x)
12. "creationTime"
Type of data: Text
Contains null values: True (excluded from calculations)
Non-null values: 11920
Unique values: 5345
Longest value: 32 characters
Most common values: None (19x)
2024-03-19 01:40:14.507823+00:00 (16x)
2024-02-29 19:12:59.785147+00:00 (16x)
2024-01-30 22:24:17.411939+00:00 (14x)
2024-04-05 09:34:03.774619+00:00 (14x)
13. "finalised"
Type of data: Boolean
Contains null values: True (excluded from calculations)
Non-null values: 11617
Unique values: 2
Most common values: True (11617x)
None (322x)
14. "db_lang"
15. "db_lang_displayName"
16. "db_lang_file_count"
17. "db_lang_linesOfCode"
Row count: 11939
There are several columns that are critical, namely
- "sha"
- "cliVersion"
- "creationTime"
The others may be useful, but they are not strictly required. The critical ones deserve more explanation:
- "sha": The
gitcommit SHA of the repository the CodeQL database was created from. Required to distinguish query results over the evolution of a code base. - "cliVersion": The version of the CodeQL CLI used to create the database. Required to identify advances/regressions originating from the CodeQL binary.
- "creationTime": The time the database was created. Required (or at least very handy) for following the evolution of query results over time.
This leaves us with a row count of 11939
To start reducing that count, start with
./bin/mc-db-unique cpp < scratch/db-info-2.csv > scratch/db-info-3.csv
and get a reduced count and a new column:
csvstat scratch/db-info-3.csv
3. "CID"
Type of data: Text
Contains null values: False
Non-null values: 5344
Unique values: 5344
Longest value: 6 characters
Most common values: 1f8d99 (1x)
9ab87a (1x)
76fdc7 (1x)
b21305 (1x)
4ae79b (1x)
From the docs: 'Read a table of CodeQL DB information and produce a table with unique entries adding the Cumulative ID (CID) column.'
The CID column combines
- cliVersion
- creationTime
- language
- sha
into a single 6-character string via hashing and with (owner, repo) provides a unique index for every DB.
We still have too many rows. The tables are all in CSV format, so you can use your favorite tool to narrow the selection for your needs. For this document, we simply use a pseudo-random selection of 11 databases via
./bin/mc-db-generate-selection -n 11 \
scratch/vscode-selection.json \
scratch/gh-mrva-selection.json \
< scratch/db-info-3.csv
Note that these use pseudo-random numbers, so the selection is in fact
deterministic. The selected databases in gh-mrva-selection.json, to be used
in section Running the gh-mrva command-line client, are the following:
{
"mirva-list": [
"NLPchina/elasticsearch-sqlctsj168cc4",
"LMAX-Exchange/disruptorctsj3e75ec",
"justauth/JustAuthctsj8a6177",
"FasterXML/jackson-modules-basectsj2fe248",
"ionic-team/capacitor-pluginsctsj38d457",
"PaddlePaddle/PaddleOCRctsj60e555",
"elastic/apm-agent-pythonctsj21dc64",
"flipkart-incubator/zjsonpatchctsjc4db35",
"stephane/libmodbusctsj54237e",
"wso2/carbon-kernelctsj5a8a6e",
"apache/servicecomb-packctsj4d98f5"
]
}
Starting the server
The full instructions for building and running the server are in ../README.md under 'Steps to build and run the server'
With docker-compose set up and this repository cloned as previously described, we just run
cd ~/work-gh/mrva/mrvacommander
docker-compose up --build
and wait until the log output no longer changes.
Then, use the following command to populate the mrvacommander database storage:
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && \
./bin/mc-db-populate-minio -n 11 < scratch/db-info-3.csv
Running the gh-mrva command-line client
The first run uses the test query to verify basic functionality, but it returns no results.
Run MRVA from command line
-
Install mrva cli
mkdir -p ~/work-gh/mrva && cd ~/work-gh/mrva git clone https://github.com/hohn/gh-mrva.git cd ~/work-gh/mrva/gh-mrva && git checkout mrvacommander-end-to-end # Build it go mod edit -replace="github.com/GitHubSecurityLab/gh-mrva=$HOME/work-gh/mrva/gh-mrva" go build . # Sanity check ./gh-mrva -h -
Set up the configuration
mkdir -p ~/.config/gh-mrva cat > ~/.config/gh-mrva/config.yml <<eof # The following options are supported # codeql_path: Path to CodeQL distribution (checkout of codeql repo) # controller: NWO of the MRVA controller to use. Not used here. # list_file: Path to the JSON file containing the target repos # XX: codeql_path: $HOME/work-gh/not-used controller: not-used/mirva-controller list_file: $HOME/work-gh/mrva/gh-mrva/gh-mrva-selection.json eof -
Submit the mrva job
cp ~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/gh-mrva-selection.json \ ~/work-gh/mrva/gh-mrva/gh-mrva-selection.json cd ~/work-gh/mrva/gh-mrva/ ./gh-mrva submit --language cpp --session mirva-session-4160 \ --list mirva-list \ --query ~/work-gh/mrva/gh-mrva/FlatBuffersFunc.ql -
Check the status
cd ~/work-gh/mrva/gh-mrva/ # Check the status ./gh-mrva status --session mirva-session-4160 -
Download the sarif files, optionally also get databases. For the current query / database combination there are zero result hence no downloads.
cd ~/work-gh/mrva/gh-mrva/ # Just download the sarif files ./gh-mrva download --session mirva-session-4160 \ --output-dir mirva-session-4160 # Download the sarif files and CodeQL dbs ./gh-mrva download --session mirva-session-4160 \ --download-dbs \ --output-dir mirva-session-4160
Write query that has some results
First, get the list of paths corresponding to the previously selected databases.
cd ~/work-gh/mrva/mrvacommander/client/qldbtools
. venv/bin/activate
./bin/mc-rows-from-mrva-list scratch/gh-mrva-selection.json \
scratch/db-info-3.csv > scratch/selection-full-info
csvcut -c path scratch/selection-full-info
Use one of these databases to write a query. It need not produce results.
cd ~/work-gh/mrva/gh-mrva/
code gh-mrva.code-workspace
In this case, the trivial findPrintf query, in the file Fprintf.ql
/**
,* @name findPrintf
,* @description find calls to plain fprintf
,* @kind problem
,* @id cpp-fprintf-call
,* @problem.severity warning
,*/
import cpp
from FunctionCall fc
where
fc.getTarget().getName() = "fprintf"
select fc, "call of fprintf"
Repeat the submit steps with this query
- –
- –
-
Submit the mrva job
cp ~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/gh-mrva-selection.json \ ~/work-gh/mrva/gh-mrva/gh-mrva-selection.json cd ~/work-gh/mrva/gh-mrva/ ./gh-mrva submit --language cpp --session mirva-session-3660 \ --list mirva-list \ --query ~/work-gh/mrva/gh-mrva/Fprintf.ql -
Check the status
cd ~/work-gh/mrva/gh-mrva/ ./gh-mrva status --session mirva-session-3660This time we have results
... 0:$ Run name: mirva-session-3660 Status: succeeded Total runs: 1 Total successful scans: 11 Total failed scans: 0 Total skipped repositories: 0 Total skipped repositories due to access mismatch: 0 Total skipped repositories due to not found: 0 Total skipped repositories due to no database: 0 Total skipped repositories due to over limit: 0 Total repositories with findings: 8 Total findings: 7055 Repositories with findings: lz4/lz4ctsj2479c5 (cpp-fprintf-call): 307 Mbed-TLS/mbedtlsctsj17ef85 (cpp-fprintf-call): 6464 tsl0922/ttydctsj2e3faa (cpp-fprintf-call): 11 medooze/media-server-nodectsj5e30b3 (cpp-fprintf-call): 105 ampl/gslctsj4b270e (cpp-fprintf-call): 102 baidu/sofa-pbrpcctsjba3501 (cpp-fprintf-call): 24 dlundquist/sniproxyctsj3d83e7 (cpp-fprintf-call): 34 hyprwm/Hyprlandctsjc2425f (cpp-fprintf-call): 8 -
Download the sarif files, optionally also get databases.
cd ~/work-gh/mrva/gh-mrva/ # Just download the sarif files ./gh-mrva download --session mirva-session-3660 \ --output-dir mirva-session-3660 # Download the sarif files and CodeQL dbs ./gh-mrva download --session mirva-session-3660 \ --download-dbs \ --output-dir mirva-session-3660# And list them: \ls -la *3660* drwxr-xr-x@ 18 hohn staff 576 Nov 14 11:54 . drwxrwxr-x@ 56 hohn staff 1792 Nov 14 11:54 .. -rwxr-xr-x@ 1 hohn staff 9035554 Nov 14 11:54 Mbed-TLS_mbedtlsctsj17ef85_1.sarif -rwxr-xr-x@ 1 hohn staff 57714273 Nov 14 11:54 Mbed-TLS_mbedtlsctsj17ef85_1_db.zip -rwxr-xr-x@ 1 hohn staff 132484 Nov 14 11:54 ampl_gslctsj4b270e_1.sarif -rwxr-xr-x@ 1 hohn staff 99234414 Nov 14 11:54 ampl_gslctsj4b270e_1_db.zip -rwxr-xr-x@ 1 hohn staff 34419 Nov 14 11:54 baidu_sofa-pbrpcctsjba3501_1.sarif -rwxr-xr-x@ 1 hohn staff 55177796 Nov 14 11:54 baidu_sofa-pbrpcctsjba3501_1_db.zip -rwxr-xr-x@ 1 hohn staff 80744 Nov 14 11:54 dlundquist_sniproxyctsj3d83e7_1.sarif -rwxr-xr-x@ 1 hohn staff 2183836 Nov 14 11:54 dlundquist_sniproxyctsj3d83e7_1_db.zip -rwxr-xr-x@ 1 hohn staff 169079 Nov 14 11:54 hyprwm_Hyprlandctsjc2425f_1.sarif -rwxr-xr-x@ 1 hohn staff 21383303 Nov 14 11:54 hyprwm_Hyprlandctsjc2425f_1_db.zip -rwxr-xr-x@ 1 hohn staff 489064 Nov 14 11:54 lz4_lz4ctsj2479c5_1.sarif -rwxr-xr-x@ 1 hohn staff 2991310 Nov 14 11:54 lz4_lz4ctsj2479c5_1_db.zip -rwxr-xr-x@ 1 hohn staff 141336 Nov 14 11:54 medooze_media-server-nodectsj5e30b3_1.sarif -rwxr-xr-x@ 1 hohn staff 38217703 Nov 14 11:54 medooze_media-server-nodectsj5e30b3_1_db.zip -rwxr-xr-x@ 1 hohn staff 33861 Nov 14 11:54 tsl0922_ttydctsj2e3faa_1.sarif -rwxr-xr-x@ 1 hohn staff 5140183 Nov 14 11:54 tsl0922_ttydctsj2e3faa_1_db.zip -
Use the SARIF Viewer plugin in VS Code to open and review the results.
Prepare the source directory so the viewer can be pointed at it
cd ~/work-gh/mrva/gh-mrva/mirva-session-3660 unzip -qd ampl_gslctsj4b270e_1_db ampl_gslctsj4b270e_1_db.zip cd ampl_gslctsj4b270e_1_db/codeql_db unzip -qd src src.zipUse the viewer in VS Code
cd ~/work-gh/mrva/gh-mrva/mirva-session-3660 code ampl_gslctsj4b270e_1.sarif # For the file vegas.c, when asked, point the source viewer to find ~/work-gh/mrva/gh-mrva/mirva-session-3660/ampl_gslctsj4b270e_1_db/codeql_db/src/\ -name vegas.c # Here: ~/work-gh/mrva/gh-mrva/mirva-session-3660/ampl_gslctsj4b270e_1_db/codeql_db/src//home/runner/work/bulk-builder/bulk-builder/monte/vegas.c - (optional) Large result sets are more easily filtered via dataframes or spreadsheets. Convert the SARIF to CSV if needed; see sarif-cli.
Running the VS Code plugin
Compile and Load the Extension
cd ~/work-gh/mrva/vscode-codeql
git checkout mrva-standalone
# Install nvm
brew install nvm
[ -s "/opt/homebrew/opt/nvm/nvm.sh" ] && \. "/opt/homebrew/opt/nvm/nvm.sh"
# or
# curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
# Install correct node version
cd ./extensions/ql-vscode
nvm install
# Build the extension
cd ~/work-gh/mrva/vscode-codeql/extensions/ql-vscode
npm install
npm run build
# Install extension
cd ~/work-gh/mrva/vscode-codeql/dist
code --force --install-extension vscode-codeql-*.vsix
# Extension 'vscode-codeql-1.13.2-dev.2024.12.10.23.51.57.vsix' was successfully installed.
Continue the CLI Sample using the Extension
Start VS Code
cd ~/work-gh/mrva/gh-mrva/
code .
Set up 'variant analysis repositories', continuing from the
scratch/vscode-selection.json file formed previously:
- Select '{}' and open db selection file
-
paste
~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/vscode-selection.json
- open
Fprintf.ql - right click
>'run variant analysis'
The extension will assemble the pack, send it to the server, and display results as they arrive.
Footnotes
1The csvkit can be installed into the same Python virtual environment as
the qldbtools.