mrvacommander/cli-end-to-end-detailed.org at f809917c2e34151e5e5ddb262d49b6efc9d19bfb

Michael Hohn b61fbf8896 Small documentation update

2024-11-19 15:25:35 -08:00

End-to-end example of CLI use
Database Acquisition
Repository Selection
Starting the server
Running the gh-mrva command-line client
- Run MRVA from command line
- Write query that has some results
Footnotes

End-to-end example of CLI use

This document describes a complete cycle of the MRVA workflow. The steps included are

acquiring CodeQL databases
selection of databases
configuration and use of the command-line client
server startup
submission of the jobs
retrieval of the results
examination of the results

Database Acquisition

General database acquisition is beyond the scope of this document, as it is very specific to an organization's environment. Here we use an example for open-source repositories, mrva-open-source-download, which downloads the top 1000 databases for each of C/C++, Java, and Python (3000 CodeQL DBs in all).

The scripts in mrva-open-source-download were run on two distinct dates, resulting in close to 6000 databases to choose from. The DBs were saved directly to the file system, resulting in paths like

.../mrva-open-source-download/repos-2024-04-29/google/re2/code-scanning/codeql/databases/cpp/db.zip

and

.../mrva-open-source-download/repos/google/re2/code-scanning/codeql/databases/cpp/db.zip

Note that the only information in these paths is (owner, repository, download date). The databases themselves contain more information, which is used in the Repository Selection section.
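As a minimal sketch of how such a path can be taken apart, the following Python function splits it on the fixed `code-scanning` directory. The names `DbLocation` and `parse_db_path` are hypothetical and not part of qldbtools; the language directory visible in the example paths is extracted as well.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class DbLocation(NamedTuple):
    owner: str
    repo: str
    language: str
    download_tag: Optional[str]  # e.g. "2024-04-29"; None for a plain "repos" root


def parse_db_path(path: str) -> DbLocation:
    """Split a db.zip path of the layout shown above into its components."""
    parts = PurePosixPath(path).parts
    # "code-scanning" is the fixed marker directly after <owner>/<repo>.
    i = parts.index("code-scanning")
    root, owner, repo = parts[i - 3], parts[i - 2], parts[i - 1]
    language = parts[i + 3]  # codeql/databases/<language>/db.zip
    # The download date, when present, is the suffix of "repos-<date>".
    tag = root[len("repos-"):] if root.startswith("repos-") else None
    return DbLocation(owner, repo, language, tag)


example = (".../mrva-open-source-download/repos-2024-04-29/google/re2/"
           "code-scanning/codeql/databases/cpp/db.zip")
print(parse_db_path(example))
```

The same function returns `download_tag=None` for the second path form, which has no date suffix.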

To get a collection of databases, follow the instructions.

Repository Selection

Here we select a small subset of those repositories using a collection of scripts made for this purpose, the qldbtools package. Clone the full repository before continuing:

  mkdir -p ~/work-gh/mrva/ && cd ~/work-gh/mrva/
  git clone git@github.com:hohn/mrvacommander.git
  cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch

After performing the installation steps, we can follow the command line use instructions to collect all the database information from the file system into a single table:

  cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
  source venv/bin/activate
  ./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > scratch/db-info-1.csv

The csvstat tool gives a good overview¹; here is a pruned version of the output

  csvstat  scratch/db-info-1.csv 
    1. "ctime"
        Type of data:          DateTime
        ...

    2. "language"
      Type of data:          Text
      Non-null values:       6000
      Unique values:         3
      Longest value:         6 characters
      Most common values:    cpp (2000x)
                             java (2000x)
                             python (2000x)
    3. "name"
       ...
    4. "owner"
      Type of data:          Text
      Non-null values:       6000
      Unique values:         2189
      Longest value:         29 characters
      Most common values:    apache (258x)
                             google (86x)
                             microsoft (64x)
                             spring-projects (56x)
                             alibaba (42x)
    5. "path"
       ...
    6. "size"
      Type of data:          Number
      Non-null values:       6000
      Unique values:         5354
      Smallest value:        0
      Largest value:         1,885,008,701
      Sum:                   284,766,326,993
      ...

  Row count: 6000

The columns critical for selection are

owner
name
language

The size column is interesting: a smallest value of 0 indicates some error, while our largest DB is 1.88 GB in size.
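A size of 0 is easy to screen for before going further. The sketch below is one possible way to do it with the Python standard library; the sample rows and the function name `split_by_size` are hypothetical, and only the column names come from the csvstat output above.

```python
import csv
import io

# Hypothetical excerpt; only the column names match the real table.
SAMPLE = """\
owner,name,language,size
owner-a,repo-a,cpp,2183836
owner-b,repo-b,cpp,0
owner-c,repo-c,java,57714273
"""


def split_by_size(csv_text: str):
    """Return (usable, suspect) rows; suspect rows have a size of 0."""
    usable, suspect = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        (suspect if int(row["size"]) == 0 else usable).append(row)
    return usable, suspect


usable, suspect = split_by_size(SAMPLE)
print(len(usable), "usable,", len(suspect), "suspect")
```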

This information is not sufficient, so we collect more. The following script extracts information from every database on disk and takes more time accordingly – about 30 seconds on my laptop.

  ./bin/mc-db-refine-info < scratch/db-info-1.csv > scratch/db-info-2.csv

This new table merges all the available meta-information with the previous table, which increases the number of rows. The following columns are now present:

  csvstat  scratch/db-info-2.csv 
    1. "ctime"
    2. "language"
    3. "name"
    4. "owner"
    5. "path"
    6. "size"
    7. "left_index"
    8. "baselineLinesOfCode"
      Type of data:          Number
      Contains null values:  True (excluded from calculations)
      Non-null values:       11920
      Unique values:         4708
      Smallest value:        0
      Largest value:         22,028,732
      Sum:                   3,454,019,142
      Mean:                  289,766.707
      Median:                54,870.5
    9. "primaryLanguage"
   10. "sha"
      Type of data:          Text
      Contains null values:  True (excluded from calculations)
      Non-null values:       11920
      Unique values:         4928
   11. "cliVersion"
      Type of data:          Text
      Contains null values:  True (excluded from calculations)
      Non-null values:       11920
      Unique values:         59
      Longest value:         6 characters
      Most common values:    2.17.0 (3850x)
                             2.18.0 (3622x)
                             2.17.2 (1097x)
                             2.17.6 (703x)
                             2.16.3 (378x)
   12. "creationTime"
      Type of data:          Text
      Contains null values:  True (excluded from calculations)
      Non-null values:       11920
      Unique values:         5345
      Longest value:         32 characters
      Most common values:    None (19x)
                             2024-03-19 01:40:14.507823+00:00 (16x)
                             2024-02-29 19:12:59.785147+00:00 (16x)
                             2024-01-30 22:24:17.411939+00:00 (14x)
                             2024-04-05 09:34:03.774619+00:00 (14x)
   13. "finalised"
      Type of data:          Boolean
      Contains null values:  True (excluded from calculations)
      Non-null values:       11617
      Unique values:         2
      Most common values:    True (11617x)
                             None (322x)
   14. "db_lang"
   15. "db_lang_displayName"
   16. "db_lang_file_count"
   17. "db_lang_linesOfCode"

  Row count: 11939
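The row count grows because a merge emits one output row per matching metadata row. The sketch below illustrates this with a one-to-many join in plain Python. It assumes, without confirmation from the tool itself, that the extra rows come from per-language metadata (the `db_lang` columns); all rows and the function name `merge_one_to_many` are hypothetical.

```python
# Hypothetical rows: one file-system entry per database ...
db_rows = [
    {"left_index": 0, "owner": "owner-a", "name": "repo-a"},
    {"left_index": 1, "owner": "owner-b", "name": "repo-b"},
]
# ... and (assumed) one metadata entry per language found in each database.
lang_rows = [
    {"left_index": 0, "db_lang": "cpp"},
    {"left_index": 0, "db_lang": "python"},
    {"left_index": 1, "db_lang": "java"},
]


def merge_one_to_many(left, right, key):
    """Inner join that emits one output row per matching right-hand row."""
    by_key = {row[key]: row for row in left}
    return [{**by_key[r[key]], **r} for r in right if r[key] in by_key]


merged = merge_one_to_many(db_rows, lang_rows, "left_index")
print(len(db_rows), "->", len(merged))
```

Here two input rows become three, the same effect that takes the table from 6000 to 11939 rows.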

There are several columns that are critical, namely

"sha"
"cliVersion"
"creationTime"

The others may be useful, but they are not strictly required. The critical ones deserve more explanation:

"sha": The git commit SHA of the repository the CodeQL database was created from. Required to distinguish query results over the evolution of a code base.
"cliVersion": The version of the CodeQL CLI used to create the database. Required to identify advances/regressions originating from the CodeQL binary.
"creationTime": The time the database was created. Required (or at least very handy) for following the evolution of query results over time.

This leaves us with a row count of 11939.

To start reducing that count, run

  ./bin/mc-db-unique cpp < scratch/db-info-2.csv > scratch/db-info-3.csv

and get a reduced count and a new column:

  csvstat  scratch/db-info-3.csv 
  3. "CID"

    Type of data:          Text
    Contains null values:  False
    Non-null values:       5344
    Unique values:         5344
    Longest value:         6 characters
    Most common values:    1f8d99 (1x)
                           9ab87a (1x)
                           76fdc7 (1x)
                           b21305 (1x)
                           4ae79b (1x)

From the docs: 'Read a table of CodeQL DB information and produce a table with unique entries adding the Cumulative ID (CID) column.'

The CID column combines

cliVersion
creationTime
language
sha

into a single 6-character string via hashing. Together with (owner, repo), it provides a unique index for every DB.
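The idea of the CID can be sketched as follows. This is not the qldbtools implementation: the choice of SHA-256, the field separator, the truncation to the first 6 hex characters, and the example field values are all assumptions made for illustration.

```python
import hashlib


def cumulative_id(cli_version: str, creation_time: str,
                  language: str, sha: str) -> str:
    """Hash the four identifying fields into a short hex string.

    Assumption: SHA-256 truncated to 6 hex characters; the real tool may
    use a different hash function or field encoding.
    """
    key = " ".join([cli_version, creation_time, language, sha])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:6]


# Hypothetical field values in the formats shown by csvstat above.
cid = cumulative_id("2.17.0", "2024-03-19 01:40:14.507823+00:00",
                    "cpp", "0123456789abcdef0123456789abcdef01234567")
print(cid)
```

Six hex characters carry only 24 bits, which is why the CID is combined with (owner, repo) rather than used as an index on its own.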

We still have too many rows. The tables are all in CSV format, so you can use your favorite tool to narrow the selection for your needs. For this document, we simply use a pseudo-random selection of 11 databases via

  ./bin/mc-db-generate-selection -n 11 \
                                 scratch/vscode-selection.json \
                                 scratch/gh-mrva-selection.json \
                                 < scratch/db-info-3.csv

Note that the selection uses pseudo-random numbers, so it is in fact deterministic and repeatable. The selected databases in gh-mrva-selection.json, to be used in section Running the gh-mrva command-line client, are the following:

  {
      "mirva-list": [
          "NLPchina/elasticsearch-sqlctsj168cc4",
          "LMAX-Exchange/disruptorctsj3e75ec",
          "justauth/JustAuthctsj8a6177",
          "FasterXML/jackson-modules-basectsj2fe248",
          "ionic-team/capacitor-pluginsctsj38d457",
          "PaddlePaddle/PaddleOCRctsj60e555",
          "elastic/apm-agent-pythonctsj21dc64",
          "flipkart-incubator/zjsonpatchctsjc4db35",
          "stephane/libmodbusctsj54237e",
          "wso2/carbon-kernelctsj5a8a6e",
          "apache/servicecomb-packctsj4d98f5"
      ]
  }
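A deterministic pseudo-random selection of this kind can be sketched with a seeded generator. The seed value, the candidate names, and the function `select_names` are hypothetical; the actual seed and algorithm used by mc-db-generate-selection are not shown in this document. The `ctsj` infix simply mirrors the entries listed above.

```python
import json
import random


def select_names(names, n, seed=0):
    """Pick n entries reproducibly: the same seed always gives the same sample."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return rng.sample(sorted(names), n)


# Hypothetical candidates in the "owner/name" + "ctsj" + CID form seen above.
candidates = [f"owner{i}/repo{i}ctsj{i:06x}" for i in range(100)]
selection = {"mirva-list": select_names(candidates, 11, seed=4242)}
print(json.dumps(selection, indent=4))
```

Sorting the input first makes the result independent of the order in which the rows were read.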

Starting the server

The full instructions for building and running the server are in ../README.md under 'Steps to build and run the server'.

With docker-compose set up and this repository cloned as previously described, we just run

      cd ~/work-gh/mrva/mrvacommander
      docker-compose up --build

and wait until the log output no longer changes.

Then, use the following command to populate the mrvacommander database storage:

  cd ~/work-gh/mrva/mrvacommander/client/qldbtools && \
      ./bin/mc-db-populate-minio -n 11 < scratch/db-info-3.csv

Running the gh-mrva command-line client

The first run uses the test query to verify basic functionality, but it returns no results.

Run MRVA from command line

Install the gh-mrva CLI

  mkdir -p ~/work-gh/mrva && cd ~/work-gh/mrva
  git clone https://github.com/hohn/gh-mrva.git
  cd ~/work-gh/mrva/gh-mrva && git checkout mrvacommander-end-to-end

  # Build it
  go mod edit -replace="github.com/GitHubSecurityLab/gh-mrva=$HOME/work-gh/mrva/gh-mrva"
  go build .

  # Sanity check
  ./gh-mrva -h

Set up the configuration

  mkdir -p ~/.config/gh-mrva
  cat > ~/.config/gh-mrva/config.yml <<eof
  # The following options are supported
  # codeql_path: Path to CodeQL distribution (checkout of codeql repo)
  # controller: NWO of the MRVA controller to use.  Not used here.
  # list_file: Path to the JSON file containing the target repos

  # XX:
  codeql_path: $HOME/work-gh/not-used
  controller: not-used/mirva-controller
  list_file: $HOME/work-gh/mrva/gh-mrva/gh-mrva-selection.json
  eof

Submit the mrva job

  cp ~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/gh-mrva-selection.json \
     ~/work-gh/mrva/gh-mrva/gh-mrva-selection.json 

  cd ~/work-gh/mrva/gh-mrva/
  ./gh-mrva submit --language cpp --session mirva-session-1360    \
            --list mirva-list                                     \
            --query ~/work-gh/mrva/gh-mrva/FlatBuffersFunc.ql
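The name given to --list has to match a key in the JSON file named by list_file. A quick check before submitting might look like the sketch below; the function name `list_entries` is hypothetical, and the sample uses two entries from the selection shown earlier.

```python
import json

# Two entries copied from the selection file shown earlier.
SELECTION = """
{
    "mirva-list": [
        "stephane/libmodbusctsj54237e",
        "wso2/carbon-kernelctsj5a8a6e"
    ]
}
"""


def list_entries(selection_json: str, list_name: str):
    """Return the repositories stored under list_name, or raise KeyError."""
    data = json.loads(selection_json)
    if list_name not in data:
        raise KeyError(f"list {list_name!r} not found; available: {sorted(data)}")
    return data[list_name]


print(list_entries(SELECTION, "mirva-list"))
```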

Check the status

  cd ~/work-gh/mrva/gh-mrva/

  # Check the status
  ./gh-mrva status --session mirva-session-1360

Download the SARIF files; optionally, also get the databases. For the current query / database combination there are zero results, hence no downloads.

  cd ~/work-gh/mrva/gh-mrva/
  # Just download the sarif files
  ./gh-mrva download --session mirva-session-1360 \
            --output-dir mirva-session-1360

  # Download the sarif files and CodeQL dbs
  ./gh-mrva download --session mirva-session-1360 \
            --download-dbs \
            --output-dir mirva-session-1360

Write query that has some results

First, get the list of paths corresponding to the previously selected databases.

  cd ~/work-gh/mrva/mrvacommander/client/qldbtools 
  . venv/bin/activate
  ./bin/mc-rows-from-mrva-list scratch/gh-mrva-selection.json \
                               scratch/db-info-3.csv > scratch/selection-full-info
  csvcut -c path scratch/selection-full-info
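The lookup performed here can be approximated in a few lines of Python. This sketch assumes that each list entry has the form owner/name followed by `ctsj` and the CID, a pattern inferred from the entries shown earlier rather than from the tool's source. The table rows, paths, and function names are hypothetical.

```python
import csv
import io

# Hypothetical table excerpt; the real table has more columns.
TABLE = """\
owner,name,CID,path
owner-a,repo-a,0000aa,/hypothetical/repos/owner-a/repo-a/db.zip
owner-b,repo-b,0000bb,/hypothetical/repos/owner-b/repo-b/db.zip
"""


def split_entry(entry: str):
    """Split 'owner/name' + 'ctsj' + CID into (owner, name, CID)."""
    nwo, cid = entry.rsplit("ctsj", 1)  # assumed separator, see above
    owner, name = nwo.split("/", 1)
    return owner, name, cid


def paths_for(entries, table_csv: str):
    """Return the path column for every table row named in entries."""
    wanted = {split_entry(e) for e in entries}
    rows = csv.DictReader(io.StringIO(table_csv))
    return [r["path"] for r in rows
            if (r["owner"], r["name"], r["CID"]) in wanted]


print(paths_for(["owner-b/repo-bctsj0000bb"], TABLE))
```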

Use one of these databases to write a query. It need not produce results.

  cd ~/work-gh/mrva/gh-mrva/
  code gh-mrva.code-workspace

In this case, we use the trivial findPrintf query in the file Fprintf.ql:

  /**
   * @name findPrintf
   * @description find calls to plain fprintf
   * @kind problem
   * @id cpp-fprintf-call
   * @problem.severity warning
   */

  import cpp

  from FunctionCall fc
  where
    fc.getTarget().getName() = "fprintf"
  select fc, "call of fprintf"

Repeat the submit steps with this query.

Submit the mrva job

  cp ~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/gh-mrva-selection.json \
     ~/work-gh/mrva/gh-mrva/gh-mrva-selection.json 

  cd ~/work-gh/mrva/gh-mrva/
  ./gh-mrva submit --language cpp --session mirva-session-3650    \
            --list mirva-list                                     \
            --query ~/work-gh/mrva/gh-mrva/Fprintf.ql

Check the status

  cd ~/work-gh/mrva/gh-mrva/
  ./gh-mrva status --session mirva-session-3650

This time we have results:

          ...
  Run name: mirva-session-3650
  Status: succeeded
  Total runs: 1
  Total successful scans: 11
  Total failed scans: 0
  Total skipped repositories: 0
  Total skipped repositories due to access mismatch: 0
  Total skipped repositories due to not found: 0
  Total skipped repositories due to no database: 0
  Total skipped repositories due to over limit: 0
  Total repositories with findings: 8
  Total findings: 7055
  Repositories with findings:
    lz4/lz4ctsj2479c5 (cpp-fprintf-call): 307
    Mbed-TLS/mbedtlsctsj17ef85 (cpp-fprintf-call): 6464
    tsl0922/ttydctsj2e3faa (cpp-fprintf-call): 11
    medooze/media-server-nodectsj5e30b3 (cpp-fprintf-call): 105
    ampl/gslctsj4b270e (cpp-fprintf-call): 102
    baidu/sofa-pbrpcctsjba3501 (cpp-fprintf-call): 24
    dlundquist/sniproxyctsj3d83e7 (cpp-fprintf-call): 34
    hyprwm/Hyprlandctsjc2425f (cpp-fprintf-call): 8
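The per-repository counts in this output can be cross-checked against the reported total. The following sketch parses the text with a regular expression; the line format is taken from the output above, and the function name `findings_per_repo` is hypothetical.

```python
import re

STATUS = """\
Total findings: 7055
Repositories with findings:
  lz4/lz4ctsj2479c5 (cpp-fprintf-call): 307
  Mbed-TLS/mbedtlsctsj17ef85 (cpp-fprintf-call): 6464
  tsl0922/ttydctsj2e3faa (cpp-fprintf-call): 11
  medooze/media-server-nodectsj5e30b3 (cpp-fprintf-call): 105
  ampl/gslctsj4b270e (cpp-fprintf-call): 102
  baidu/sofa-pbrpcctsjba3501 (cpp-fprintf-call): 24
  dlundquist/sniproxyctsj3d83e7 (cpp-fprintf-call): 34
  hyprwm/Hyprlandctsjc2425f (cpp-fprintf-call): 8
"""

# Matches lines of the form "<repo> (<query id>): <count>".
LINE = re.compile(r"^\s*(\S+) \((\S+)\): (\d+)$")


def findings_per_repo(text: str) -> dict:
    """Map each repository name to its number of findings."""
    counts = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            counts[m.group(1)] = int(m.group(3))
    return counts


counts = findings_per_repo(STATUS)
print(len(counts), "repositories,", sum(counts.values()), "findings")
```

The eight counts sum to 7055, which agrees with the "Total findings" line.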

Download the sarif files, optionally also get databases.

  cd ~/work-gh/mrva/gh-mrva/
  # Just download the sarif files
  ./gh-mrva download --session mirva-session-3650 \
            --output-dir mirva-session-3650

  # Download the sarif files and CodeQL dbs
  ./gh-mrva download --session mirva-session-3650 \
            --download-dbs \
            --output-dir mirva-session-3650

  # And list them:
  \ls -la *3650*
  drwxr-xr-x@ 18 hohn  staff       576 Nov 14 11:54 .
  drwxrwxr-x@ 56 hohn  staff      1792 Nov 14 11:54 ..
  -rwxr-xr-x@  1 hohn  staff   9035554 Nov 14 11:54 Mbed-TLS_mbedtlsctsj17ef85_1.sarif
  -rwxr-xr-x@  1 hohn  staff  57714273 Nov 14 11:54 Mbed-TLS_mbedtlsctsj17ef85_1_db.zip
  -rwxr-xr-x@  1 hohn  staff    132484 Nov 14 11:54 ampl_gslctsj4b270e_1.sarif
  -rwxr-xr-x@  1 hohn  staff  99234414 Nov 14 11:54 ampl_gslctsj4b270e_1_db.zip
  -rwxr-xr-x@  1 hohn  staff     34419 Nov 14 11:54 baidu_sofa-pbrpcctsjba3501_1.sarif
  -rwxr-xr-x@  1 hohn  staff  55177796 Nov 14 11:54 baidu_sofa-pbrpcctsjba3501_1_db.zip
  -rwxr-xr-x@  1 hohn  staff     80744 Nov 14 11:54 dlundquist_sniproxyctsj3d83e7_1.sarif
  -rwxr-xr-x@  1 hohn  staff   2183836 Nov 14 11:54 dlundquist_sniproxyctsj3d83e7_1_db.zip
  -rwxr-xr-x@  1 hohn  staff    169079 Nov 14 11:54 hyprwm_Hyprlandctsjc2425f_1.sarif
  -rwxr-xr-x@  1 hohn  staff  21383303 Nov 14 11:54 hyprwm_Hyprlandctsjc2425f_1_db.zip
  -rwxr-xr-x@  1 hohn  staff    489064 Nov 14 11:54 lz4_lz4ctsj2479c5_1.sarif
  -rwxr-xr-x@  1 hohn  staff   2991310 Nov 14 11:54 lz4_lz4ctsj2479c5_1_db.zip
  -rwxr-xr-x@  1 hohn  staff    141336 Nov 14 11:54 medooze_media-server-nodectsj5e30b3_1.sarif
  -rwxr-xr-x@  1 hohn  staff  38217703 Nov 14 11:54 medooze_media-server-nodectsj5e30b3_1_db.zip
  -rwxr-xr-x@  1 hohn  staff     33861 Nov 14 11:54 tsl0922_ttydctsj2e3faa_1.sarif
  -rwxr-xr-x@  1 hohn  staff   5140183 Nov 14 11:54 tsl0922_ttydctsj2e3faa_1_db.zip

Use the SARIF Viewer plugin in VS Code to open and review the results.

Prepare the source directory so the viewer can be pointed at it

  cd ~/work-gh/mrva/gh-mrva/mirva-session-3650

  unzip -qd ampl_gslctsj4b270e_1_db  ampl_gslctsj4b270e_1_db.zip

  cd ampl_gslctsj4b270e_1_db/codeql_db
  unzip -qd src  src.zip

Use the viewer in VS Code

  cd ~/work-gh/mrva/gh-mrva/mirva-session-3650
  code ampl_gslctsj4b270e_1.sarif 

  # For the file vegas.c, when asked, point the source viewer to 
  find ~/work-gh/mrva/gh-mrva/mirva-session-3650/ampl_gslctsj4b270e_1_db/codeql_db/src/\
       -name vegas.c

  # Here: ~/work-gh/mrva/gh-mrva/mirva-session-3650/ampl_gslctsj4b270e_1_db/codeql_db/src//home/runner/work/bulk-builder/bulk-builder/monte/vegas.c

(optional) Large result sets are more easily filtered via dataframes or spreadsheets. Convert the SARIF to CSV if needed; see sarif-cli.
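For a quick look without extra tooling, a SARIF file can also be flattened with the Python standard library. The sketch below is not sarif-cli; it reads only a few fields defined by SARIF 2.1.0 (`runs`, `results`, `ruleId`, `message.text`, `locations`). The sample document is hypothetical, including its line number, and the function names are illustrative.

```python
import csv
import io
import json

# Hypothetical one-result SARIF document; the line number is made up.
SAMPLE_SARIF = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "results": [{
            "ruleId": "cpp-fprintf-call",
            "message": {"text": "call of fprintf"},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": "monte/vegas.c"},
                    "region": {"startLine": 10},
                }
            }],
        }]
    }],
})

FIELDS = ["rule", "message", "file", "line"]


def sarif_to_rows(sarif_text: str):
    """Flatten every result location into one dictionary per row."""
    rows = []
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                rows.append({
                    "rule": result.get("ruleId", ""),
                    "message": result.get("message", {}).get("text", ""),
                    "file": phys.get("artifactLocation", {}).get("uri", ""),
                    "line": phys.get("region", {}).get("startLine", ""),
                })
    return rows


def rows_to_csv(rows) -> str:
    """Serialize the flattened rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = sarif_to_rows(SAMPLE_SARIF)
print(rows_to_csv(rows))
```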

Footnotes

¹csvkit can be installed into the same Python virtual environment as qldbtools.
