# -*- coding: utf-8 -*-
* End-to-end example of CLI use
This document describes a complete cycle of the MRVA workflow. The steps
included are
1. acquiring CodeQL databases
2. selecting databases
3. configuring and using the command-line client
4. starting the server
5. submitting the jobs
6. retrieving the results
7. examining the results

* Database Acquisition
General database acquisition is beyond the scope of this document, as it is
very specific to an organization's environment. Here we use an example for
open-source repositories,
[[https://github.com/hohn/mrva-open-source-download.git][mrva-open-source-download]],
which downloads the top 1000 databases for each of C/C++, Java, and Python --
3000 CodeQL DBs in all.

The scripts in
[[https://github.com/hohn/mrva-open-source-download.git][mrva-open-source-download]]
were used to download on two distinct dates, resulting in close to 6000
databases to choose from. The DBs were saved directly to the file system,
resulting in paths like
: .../mrva-open-source-download/repos-2024-04-29/google/re2/code-scanning/codeql/databases/cpp/db.zip
and
: .../mrva-open-source-download/repos/google/re2/code-scanning/codeql/databases/cpp/db.zip

Note that the only information in these paths is (owner, repository, download
date). The databases contain more information, which is used in the
[[*Repository Selection][Repository Selection]] section.

To get a collection of databases, follow the
[[https://github.com/hohn/mrva-open-source-download?tab=readme-ov-file#mrva-download][instructions]].

* Repository Selection
Here we select a small subset of those repositories using a collection of
scripts made for the purpose, the
[[https://github.com/hohn/mrvacommander/blob/hohn-0.1.21.2-improve-structure-and-docs/client/qldbtools/README.md#installation][qldbtools]]
package. Clone the full repository before continuing:
#+BEGIN_SRC sh
mkdir -p ~/work-gh/mrva/
git clone git@github.com:hohn/mrvacommander.git
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
#+END_SRC

After performing the
[[https://github.com/hohn/mrvacommander/blob/hohn-0.1.21.2-improve-structure-and-docs/client/qldbtools/README.md#installation][installation]]
steps, we can follow the
[[https://github.com/hohn/mrvacommander/blob/hohn-0.1.21.2-improve-structure-and-docs/client/qldbtools/README.md#command-line-use][command line]]
use instructions to collect all the database information from the file system
into a single table:
#+BEGIN_SRC sh
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
source venv/bin/activate
./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > scratch/db-info-1.csv
#+END_SRC

The [[https://csvkit.readthedocs.io/en/latest/scripts/csvstat.html][=csvstat=]]
tool gives a good overview[fn:1]; here is a pruned version of the output:
#+BEGIN_SRC text
csvstat scratch/db-info-1.csv
  1. "ctime"
     Type of data: DateTime
     ...
  2. "language"
     Type of data: Text
     Non-null values: 6000
     Unique values: 3
     Longest value: 6 characters
     Most common values: cpp (2000x)
                         java (2000x)
                         python (2000x)
  3. "name"
     ...
  4. "owner"
     Type of data: Text
     Non-null values: 6000
     Unique values: 2189
     Longest value: 29 characters
     Most common values: apache (258x)
                         google (86x)
                         microsoft (64x)
                         spring-projects (56x)
                         alibaba (42x)
  5. "path"
     ...
  6. "size"
     Type of data: Number
     Non-null values: 6000
     Unique values: 5354
     Smallest value: 0
     Largest value: 1,885,008,701
     Sum: 284,766,326,993
     ...
Row count: 6000
#+END_SRC

The information critical for selection is in the columns
1. owner
2. name
3. language
The size column is also interesting: a smallest value of 0 indicates some
error, while our largest DB is 1.88 GB in size.
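Since the table is plain CSV, it can also be checked programmatically. As a
minimal sketch -- assuming the column names shown above, and assuming
=pandas= is installed in the same virtual environment (it is not part of the
qldbtools requirements) -- the zero-size entries and the per-language counts
can be inspected like this:
#+BEGIN_SRC python
import pandas as pd

# Load the table produced by mc-db-initial-info.
df = pd.read_csv("scratch/db-info-1.csv")

# A size of 0 indicates a broken database; list those first.
broken = df[df["size"] == 0]
print("zero-size databases:", len(broken))
print(broken[["owner", "name", "language"]].head())

# The per-language counts should match the csvstat output above.
print(df["language"].value_counts())
#+END_SRC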
This information is not sufficient, so we collect more. The following script
extracts information from every database on disk and accordingly takes more
time -- about 30 seconds on my laptop.
#+BEGIN_SRC sh
./bin/mc-db-refine-info < scratch/db-info-1.csv > scratch/db-info-2.csv
#+END_SRC

This new table is a merge of all the available meta-information with the
previous table, which causes the increase in the number of rows. The
following columns are now present:
#+BEGIN_SRC text
0:$ csvstat scratch/db-info-2.csv
  1. "ctime"
  2. "language"
  3. "name"
  4. "owner"
  5. "path"
  6. "size"
  7. "left_index"
  8. "baselineLinesOfCode"
     Type of data: Number
     Contains null values: True (excluded from calculations)
     Non-null values: 11920
     Unique values: 4708
     Smallest value: 0
     Largest value: 22,028,732
     Sum: 3,454,019,142
     Mean: 289,766.707
     Median: 54,870.5
  9. "primaryLanguage"
 10. "sha"
     Type of data: Text
     Contains null values: True (excluded from calculations)
     Non-null values: 11920
     Unique values: 4928
 11. "cliVersion"
     Type of data: Text
     Contains null values: True (excluded from calculations)
     Non-null values: 11920
     Unique values: 59
     Longest value: 6 characters
     Most common values: 2.17.0 (3850x)
                         2.18.0 (3622x)
                         2.17.2 (1097x)
                         2.17.6 (703x)
                         2.16.3 (378x)
 12. "creationTime"
     Type of data: Text
     Contains null values: True (excluded from calculations)
     Non-null values: 11920
     Unique values: 5345
     Longest value: 32 characters
     Most common values: None (19x)
                         2024-03-19 01:40:14.507823+00:00 (16x)
                         2024-02-29 19:12:59.785147+00:00 (16x)
                         2024-01-30 22:24:17.411939+00:00 (14x)
                         2024-04-05 09:34:03.774619+00:00 (14x)
 13. "finalised"
     Type of data: Boolean
     Contains null values: True (excluded from calculations)
     Non-null values: 11617
     Unique values: 2
     Most common values: True (11617x)
                         None (322x)
 14. "db_lang"
 15. "db_lang_displayName"
 16. "db_lang_file_count"
 17. "db_lang_linesOfCode"
Row count: 11939
#+END_SRC

There are several columns that are critical, namely
1. "sha"
2. "cliVersion"
3. "creationTime"
The others may be useful, but they are not strictly required. The critical
ones deserve more explanation:
1. "sha": the =git= commit SHA of the repository the CodeQL database was
   created from. Required to distinguish query results over the evolution of
   a code base.
2. "cliVersion": the version of the CodeQL CLI used to create the database.
   Required to identify advances/regressions originating from the CodeQL
   binary.
3. "creationTime": the time the database was created. Required (or at least
   very handy) for following the evolution of query results over time.

This leaves us with a row count of 11939. To start reducing that count, run
#+BEGIN_SRC sh
./bin/mc-db-unique cpp < scratch/db-info-2.csv > scratch/db-info-3.csv
#+END_SRC
and get a reduced count and a new column:
#+BEGIN_SRC text
csvstat scratch/db-info-3.csv
  3. "CID"
     Type of data: Text
     Contains null values: False
     Non-null values: 5344
     Unique values: 5344
     Longest value: 6 characters
     Most common values: 1f8d99 (1x)
                         9ab87a (1x)
                         76fdc7 (1x)
                         b21305 (1x)
                         4ae79b (1x)
#+END_SRC

From the docs: 'Read a table of CodeQL DB information and produce a table
with unique entries, adding the Cumulative ID (CID) column.' The CID column
combines
- cliVersion
- creationTime
- language
- sha
into a single 6-character string via hashing and, together with
(owner, repo), provides a unique index for every DB.
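The exact hashing scheme is internal to qldbtools; purely as an illustration
of the idea -- the field order, separator, and hash function here are
assumptions, not the actual implementation -- a CID-like digest over those
four fields could be computed as follows:
#+BEGIN_SRC python
import hashlib

def cid_like(cli_version: str, creation_time: str, language: str, sha: str) -> str:
    """Illustration only: collapse the four identifying fields into a short hex id."""
    key = "|".join([cli_version, creation_time, language, sha])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:6]

# Example: a cpp database built by CLI 2.17.0 from a given commit.
print(cid_like("2.17.0", "2024-03-19 01:40:14.507823+00:00", "cpp",
               "0123456789abcdef0123456789abcdef01234567"))
#+END_SRC
Only 6 hex characters are kept, so the CID alone is not globally unique; it
is the combination with (owner, repo) that serves as the index.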
We still have too many rows. The tables are all in CSV format, so you can use
your favorite tool to narrow the selection to suit your needs. For this
document, we simply use a pseudo-random selection of 11 databases via
#+BEGIN_SRC sh
./bin/mc-db-generate-selection -n 11 \
    scratch/vscode-selection.json \
    scratch/gh-mrva-selection.json \
    < scratch/db-info-3.csv
#+END_SRC
Note that these use pseudo-random numbers, so the selection is in fact
deterministic.

The selected databases in =gh-mrva-selection.json=, to be used in section
[[*Running the gh-mrva command-line client][Running the gh-mrva command-line client]],
are the following:
#+begin_src javascript
{
    "mirva-list": [
        "NLPchina/elasticsearch-sqlctsj168cc4",
        "LMAX-Exchange/disruptorctsj3e75ec",
        "justauth/JustAuthctsj8a6177",
        "FasterXML/jackson-modules-basectsj2fe248",
        "ionic-team/capacitor-pluginsctsj38d457",
        "PaddlePaddle/PaddleOCRctsj60e555",
        "elastic/apm-agent-pythonctsj21dc64",
        "flipkart-incubator/zjsonpatchctsjc4db35",
        "stephane/libmodbusctsj54237e",
        "wso2/carbon-kernelctsj5a8a6e",
        "apache/servicecomb-packctsj4d98f5"
    ]
}
#+end_src
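Each entry in the list embeds the CID from the previous step: judging from
the listing above, the repository name is followed by the literal marker
=ctsj= and the 6-character CID. As a minimal sketch -- the marker and its
placement are inferred from the listing, not taken from the qldbtools
documentation -- the (owner, name, CID) triples can be recovered like this:
#+BEGIN_SRC python
import json

# Read the selection produced by mc-db-generate-selection.
with open("scratch/gh-mrva-selection.json") as fh:
    selection = json.load(fh)

# Split each entry into owner, repository name, and trailing CID.
for entry in selection["mirva-list"]:
    owner, rest = entry.split("/", 1)
    name, cid = rest.rsplit("ctsj", 1)
    print(f"{owner:20s} {name:30s} {cid}")
#+END_SRC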
* Starting the server
The full instructions for building and running the server are in
[[../README.md]] under 'Steps to build and run the server'. With
docker-compose set up and this repository cloned as previously described, we
just run
#+BEGIN_SRC sh
cd ~/work-gh/mrva/mrvacommander
docker-compose up --build
#+END_SRC
and wait until the log output no longer changes. Then, use the following
command to populate the mrvacommander database storage:
#+BEGIN_SRC sh
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && \
    ./bin/mc-db-populate-minio -n 11 < scratch/db-info-3.csv
#+END_SRC

* Running the gh-mrva command-line client
The first run uses the test query to verify basic functionality, but it
returns no results.

** Run MRVA from command line
1. Install the mrva CLI
   #+BEGIN_SRC sh
   mkdir -p ~/work-gh/mrva && cd ~/work-gh/mrva
   git clone https://github.com/hohn/gh-mrva.git
   cd ~/work-gh/mrva/gh-mrva && git checkout mrvacommander-end-to-end

   # Build it
   go mod edit -replace="github.com/GitHubSecurityLab/gh-mrva=$HOME/work-gh/mrva/gh-mrva"
   go build .

   # Sanity check
   ./gh-mrva -h
   #+END_SRC

2. Set up the configuration
   #+BEGIN_SRC sh
   mkdir -p ~/.config/gh-mrva
   cat > ~/.config/gh-mrva/config.yml <<EOF
   ...
   EOF
   #+END_SRC

Get the full information for the selected databases and list their paths:
#+BEGIN_SRC sh
... > scratch/selection-full-info
csvcut -c path scratch/selection-full-info
#+END_SRC

Use one of these databases to write a query. It need not produce results.
#+BEGIN_SRC sh
cd ~/work-gh/mrva/gh-mrva/
code gh-mrva.code-workspace
#+END_SRC

In this case, the trivial =findPrintf= query, in the file =Fprintf.ql=:
#+BEGIN_SRC java
/**
,* @name findPrintf
,* @description find calls to plain fprintf
,* @kind problem
,* @id cpp-fprintf-call
,* @problem.severity warning
,*/

import cpp

from FunctionCall fc
where fc.getTarget().getName() = "fprintf"
select fc, "call of fprintf"
#+END_SRC

Repeat the submit steps with this query:
1. --
2. --
3. Submit the mrva job
   #+BEGIN_SRC sh
   cp ~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/gh-mrva-selection.json \
      ~/work-gh/mrva/gh-mrva/gh-mrva-selection.json
   cd ~/work-gh/mrva/gh-mrva/
   ./gh-mrva submit --language cpp --session mirva-session-3650 \
       --list mirva-list \
       --query ~/work-gh/mrva/gh-mrva/Fprintf.ql
   #+END_SRC

4. Check the status
   #+BEGIN_SRC sh
   cd ~/work-gh/mrva/gh-mrva/
   ./gh-mrva status --session mirva-session-3650
   #+END_SRC

   This time we have results:
   #+BEGIN_SRC text
   ...
   Run name: mirva-session-3650
   Status: succeeded
   Total runs: 1
   Total successful scans: 11
   Total failed scans: 0
   Total skipped repositories: 0
   Total skipped repositories due to access mismatch: 0
   Total skipped repositories due to not found: 0
   Total skipped repositories due to no database: 0
   Total skipped repositories due to over limit: 0
   Total repositories with findings: 8
   Total findings: 7055
   Repositories with findings:
       lz4/lz4ctsj2479c5 (cpp-fprintf-call): 307
       Mbed-TLS/mbedtlsctsj17ef85 (cpp-fprintf-call): 6464
       tsl0922/ttydctsj2e3faa (cpp-fprintf-call): 11
       medooze/media-server-nodectsj5e30b3 (cpp-fprintf-call): 105
       ampl/gslctsj4b270e (cpp-fprintf-call): 102
       baidu/sofa-pbrpcctsjba3501 (cpp-fprintf-call): 24
       dlundquist/sniproxyctsj3d83e7 (cpp-fprintf-call): 34
       hyprwm/Hyprlandctsjc2425f (cpp-fprintf-call): 8
   #+END_SRC

5. Download the sarif files and, optionally, the databases.
   #+BEGIN_SRC sh
   cd ~/work-gh/mrva/gh-mrva/
   # Just download the sarif files
   ./gh-mrva download --session mirva-session-3650 \
       --output-dir mirva-session-3650

   # Download the sarif files and CodeQL dbs
   ./gh-mrva download --session mirva-session-3650 \
       --download-dbs \
       --output-dir mirva-session-3650
   #+END_SRC

   #+BEGIN_SRC sh
   # And list them:
   \ls -la *3650*
   drwxr-xr-x@ 18 hohn staff       576 Nov 14 11:54 .
   drwxrwxr-x@ 56 hohn staff      1792 Nov 14 11:54 ..
   -rwxr-xr-x@  1 hohn staff   9035554 Nov 14 11:54 Mbed-TLS_mbedtlsctsj17ef85_1.sarif
   -rwxr-xr-x@  1 hohn staff  57714273 Nov 14 11:54 Mbed-TLS_mbedtlsctsj17ef85_1_db.zip
   -rwxr-xr-x@  1 hohn staff    132484 Nov 14 11:54 ampl_gslctsj4b270e_1.sarif
   -rwxr-xr-x@  1 hohn staff  99234414 Nov 14 11:54 ampl_gslctsj4b270e_1_db.zip
   -rwxr-xr-x@  1 hohn staff     34419 Nov 14 11:54 baidu_sofa-pbrpcctsjba3501_1.sarif
   -rwxr-xr-x@  1 hohn staff  55177796 Nov 14 11:54 baidu_sofa-pbrpcctsjba3501_1_db.zip
   -rwxr-xr-x@  1 hohn staff     80744 Nov 14 11:54 dlundquist_sniproxyctsj3d83e7_1.sarif
   -rwxr-xr-x@  1 hohn staff   2183836 Nov 14 11:54 dlundquist_sniproxyctsj3d83e7_1_db.zip
   -rwxr-xr-x@  1 hohn staff    169079 Nov 14 11:54 hyprwm_Hyprlandctsjc2425f_1.sarif
   -rwxr-xr-x@  1 hohn staff  21383303 Nov 14 11:54 hyprwm_Hyprlandctsjc2425f_1_db.zip
   -rwxr-xr-x@  1 hohn staff    489064 Nov 14 11:54 lz4_lz4ctsj2479c5_1.sarif
   -rwxr-xr-x@  1 hohn staff   2991310 Nov 14 11:54 lz4_lz4ctsj2479c5_1_db.zip
   -rwxr-xr-x@  1 hohn staff    141336 Nov 14 11:54 medooze_media-server-nodectsj5e30b3_1.sarif
   -rwxr-xr-x@  1 hohn staff  38217703 Nov 14 11:54 medooze_media-server-nodectsj5e30b3_1_db.zip
   -rwxr-xr-x@  1 hohn staff     33861 Nov 14 11:54 tsl0922_ttydctsj2e3faa_1.sarif
   -rwxr-xr-x@  1 hohn staff   5140183 Nov 14 11:54 tsl0922_ttydctsj2e3faa_1_db.zip
   #+END_SRC

6. Use the
   [[https://marketplace.visualstudio.com/items?itemName=MS-SarifVSCode.sarif-viewer][SARIF Viewer]]
   plugin in VS Code to open and review the results. Prepare the source
   directory so the viewer can be pointed at it:
   #+BEGIN_SRC sh
   cd ~/work-gh/mrva/gh-mrva/mirva-session-3650
   unzip -qd ampl_gslctsj4b270e_1_db ampl_gslctsj4b270e_1_db.zip
   cd ampl_gslctsj4b270e_1_db/codeql_db
   unzip -qd src src.zip
   #+END_SRC

   Use the viewer in VS Code:
   #+BEGIN_SRC sh
   cd ~/work-gh/mrva/gh-mrva/mirva-session-3650
   code ampl_gslctsj4b270e_1.sarif

   # For the file vegas.c, when asked, point the source viewer to
   find ~/work-gh/mrva/gh-mrva/mirva-session-3650/ampl_gslctsj4b270e_1_db/codeql_db/src/ \
       -name vegas.c
   # Here:
   # ~/work-gh/mrva/gh-mrva/mirva-session-3650/ampl_gslctsj4b270e_1_db/codeql_db/src//home/runner/work/bulk-builder/bulk-builder/monte/vegas.c
   #+END_SRC

7. (optional) Large result sets are more easily filtered via dataframes or
   spreadsheets. Convert the SARIF to CSV if needed; see
   [[https://github.com/hohn/sarif-cli/][sarif-cli]], or the sketch after
   this list for a starting point.
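For a first look at the downloaded results without further tooling, the SARIF
files can be read directly. A minimal sketch -- assuming the standard SARIF
2.1.0 layout and using one of the file names from the listing in step 5 --
that tallies findings per rule and flattens them into CSV rows:
#+BEGIN_SRC python
import csv
import json
from collections import Counter

# One of the SARIF files downloaded in step 5.
path = "mirva-session-3650/ampl_gslctsj4b270e_1.sarif"
with open(path) as fh:
    sarif = json.load(fh)

counts = Counter()
with open("ampl_gsl_findings.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["ruleId", "file", "line", "message"])
    for run in sarif["runs"]:
        for result in run.get("results", []):
            counts[result.get("ruleId")] += 1
            loc = result["locations"][0]["physicalLocation"]
            writer.writerow([
                result.get("ruleId"),
                loc["artifactLocation"]["uri"],
                loc.get("region", {}).get("startLine"),
                result["message"]["text"],
            ])

# Per the status output above, this should count 102 cpp-fprintf-call results.
print(counts)
#+END_SRC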
* Footnotes

[fn:1] =csvkit= can be installed into the same Python virtual environment as
=qldbtools=.