diff --git a/.gitignore b/.gitignore
index 7d222f0..76e1f4e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -47,3 +47,5 @@ go.work.sum
 venv/
 *.egg-info
 __pycache__
+cli-end-to-end.html
+README.html
diff --git a/mrvacommander.code-workspace b/mrvacommander.code-workspace
index 47363e5..c19f09d 100644
--- a/mrvacommander.code-workspace
+++ b/mrvacommander.code-workspace
@@ -6,6 +6,7 @@
     ],
     "settings": {
         "sarif-viewer.connectToGithubCodeScanning": "off",
-        "codeQL.githubDatabase.download": "never"
+        "codeQL.githubDatabase.download": "never",
+        "makefile.configureOnOpen": false
     }
 }
\ No newline at end of file
diff --git a/notes/cli-end-to-end.org b/notes/cli-end-to-end.org
new file mode 100644
index 0000000..a600c4b
--- /dev/null
+++ b/notes/cli-end-to-end.org
@@ -0,0 +1,482 @@
+# -*- coding: utf-8 -*-
+#+OPTIONS: H:2 num:t \n:nil @:t ::t |:t ^:{} f:t *:t TeX:t LaTeX:t skip:nil p:nil
+#+OPTIONS: toc:nil
+#+HTML_HEAD:
+#+HTML:
+#+TOC: headlines 2 insert TOC here, with two headline levels +#+HTML:
+# +#+HTML:
+
+* End-to-end example of CLI use
+  This document describes a complete cycle of the MRVA workflow. The steps
+  included are
+  1. acquiring CodeQL databases
+  2. selecting databases
+  3. configuring and using the command-line client
+  4. starting the server
+  5. submitting the jobs
+  6. retrieving the results
+  7. examining the results
+
+* Database Acquisition
+  General database acquisition is beyond the scope of this document, as it is very
+  specific to an organization's environment. Here we use an example for open-source
+  repositories, [[https://github.com/hohn/mrva-open-source-download.git][mrva-open-source-download]], which downloads the top 1000 databases for each of
+  C/C++, Java, and Python -- 3000 CodeQL DBs in all.
+
+  The scripts in [[https://github.com/hohn/mrva-open-source-download.git][mrva-open-source-download]] were used to download on two distinct dates,
+  resulting in close to 6000 databases to choose from. The DBs were saved
+  directly to the file system, resulting in paths like
+  : .../mrva-open-source-download/repos-2024-04-29/google/re2/code-scanning/codeql/databases/cpp/db.zip
+  and
+  : .../mrva-open-source-download/repos/google/re2/code-scanning/codeql/databases/cpp/db.zip
+  Note that the only information in these paths is (owner, repository, download
+  date). The databases contain more information, which is used in the [[*Repository Selection][Repository
+  Selection]] section.
+
+  To get a collection of databases, follow the [[https://github.com/hohn/mrva-open-source-download?tab=readme-ov-file#mrva-download][instructions]].
+
+* Repository Selection
+  Here we select a small subset of those repositories using a collection of scripts
+  made for the purpose, the [[https://github.com/hohn/mrvacommander/blob/hohn-0.1.21.2-improve-structure-and-docs/client/qldbtools/README.md#installation][qldbtools]] package.
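As a side note, the path layout shown above already encodes the (owner, name, download date) triple, so it can be recovered mechanically. A minimal sketch, assuming the directory layout matches the two example paths; =parse_db_path= is a hypothetical helper, not part of qldbtools:

```python
import re
from pathlib import PurePosixPath

def parse_db_path(path: str):
    """Recover (owner, name, date) from a db.zip path.
    Assumed layout: .../repos[-YYYY-MM-DD]/<owner>/<name>/code-scanning/codeql/databases/<lang>/db.zip
    """
    parts = PurePosixPath(path).parts
    # Find the 'repos' or 'repos-<date>' directory that anchors the layout.
    for i, part in enumerate(parts):
        m = re.fullmatch(r"repos(?:-(\d{4}-\d{2}-\d{2}))?", part)
        if m:
            owner, name = parts[i + 1], parts[i + 2]
            return owner, name, m.group(1)  # date is None for the plain 'repos' dir
    raise ValueError(f"unrecognized layout: {path}")

p = ".../mrva-open-source-download/repos-2024-04-29/google/re2/code-scanning/codeql/databases/cpp/db.zip"
print(parse_db_path(p))  # ('google', 're2', '2024-04-29')
```

The download-date component is absent in the plain =repos/= layout, so the helper returns =None= for it in that case.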
+  Clone the full repository before continuing:
+  #+BEGIN_SRC sh
+    mkdir -p ~/work-gh/mrva/ && cd ~/work-gh/mrva/
+    git clone git@github.com:hohn/mrvacommander.git
+    cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
+  #+END_SRC
+
+  After performing the [[https://github.com/hohn/mrvacommander/blob/hohn-0.1.21.2-improve-structure-and-docs/client/qldbtools/README.md#installation][installation]] steps, we can follow the [[https://github.com/hohn/mrvacommander/blob/hohn-0.1.21.2-improve-structure-and-docs/client/qldbtools/README.md#command-line-use][command line]] use
+  instructions to collect all the database information from the file system into a
+  single table:
+
+  #+BEGIN_SRC sh
+    cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
+    ./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > scratch/db-info-1.csv
+  #+END_SRC
+
+  The [[https://csvkit.readthedocs.io/en/latest/scripts/csvstat.html][=csvstat=]] tool gives a good overview[fn:1]; here is a pruned version of the
+  output:
+  #+BEGIN_SRC text
+    csvstat scratch/db-info-1.csv
+      1. "ctime"
+         Type of data:          DateTime
+         ...
+      2. "language"
+         Type of data:          Text
+         Non-null values:       6000
+         Unique values:         3
+         Longest value:         6 characters
+         Most common values:    cpp (2000x)
+                                java (2000x)
+                                python (2000x)
+      3. "name"
+         ...
+      4. "owner"
+         Type of data:          Text
+         Non-null values:       6000
+         Unique values:         2189
+         Longest value:         29 characters
+         Most common values:    apache (258x)
+                                google (86x)
+                                microsoft (64x)
+                                spring-projects (56x)
+                                alibaba (42x)
+      5. "path"
+         ...
+      6. "size"
+         Type of data:          Number
+         Non-null values:       6000
+         Unique values:         5354
+         Smallest value:        0
+         Largest value:         1,885,008,701
+         Sum:                   284,766,326,993
+      ...
+
+    Row count: 6000
+  #+END_SRC
+  The columns critical for selection are
+  1. owner
+  2. name
+  3. language
+  The size column is also interesting: a smallest value of 0 indicates some error,
+  while our largest DB is 1.88 GB in size.
+
+  This information is not sufficient, so we collect more. The following script
+  extracts information from every database on disk and accordingly takes more
+  time -- about 30 seconds on my laptop.
+  #+BEGIN_SRC sh
+    ./bin/mc-db-refine-info < scratch/db-info-1.csv > scratch/db-info-2.csv
+  #+END_SRC
+  This new table is a merge of all the available meta-information with the
+  previous table, causing the increase in the number of rows. The following
+  columns are now present:
+  #+BEGIN_SRC text
+    csvstat scratch/db-info-2.csv
+      1. "ctime"
+      2. "language"
+      3. "name"
+      4. "owner"
+      5. "path"
+      6. "size"
+      7. "left_index"
+      8. "baselineLinesOfCode"
+         Type of data:          Number
+         Contains null values:  True (excluded from calculations)
+         Non-null values:       11920
+         Unique values:         4708
+         Smallest value:        0
+         Largest value:         22,028,732
+         Sum:                   3,454,019,142
+         Mean:                  289,766.707
+         Median:                54,870.5
+      9. "primaryLanguage"
+     10. "sha"
+         Type of data:          Text
+         Contains null values:  True (excluded from calculations)
+         Non-null values:       11920
+         Unique values:         4928
+     11. "cliVersion"
+         Type of data:          Text
+         Contains null values:  True (excluded from calculations)
+         Non-null values:       11920
+         Unique values:         59
+         Longest value:         6 characters
+         Most common values:    2.17.0 (3850x)
+                                2.18.0 (3622x)
+                                2.17.2 (1097x)
+                                2.17.6 (703x)
+                                2.16.3 (378x)
+     12. "creationTime"
+         Type of data:          Text
+         Contains null values:  True (excluded from calculations)
+         Non-null values:       11920
+         Unique values:         5345
+         Longest value:         32 characters
+         Most common values:    None (19x)
+                                2024-03-19 01:40:14.507823+00:00 (16x)
+                                2024-02-29 19:12:59.785147+00:00 (16x)
+                                2024-01-30 22:24:17.411939+00:00 (14x)
+                                2024-04-05 09:34:03.774619+00:00 (14x)
+     13. "finalised"
+         Type of data:          Boolean
+         Contains null values:  True (excluded from calculations)
+         Non-null values:       11617
+         Unique values:         2
+         Most common values:    True (11617x)
+                                None (322x)
+     14. "db_lang"
+     15. "db_lang_displayName"
+     16. "db_lang_file_count"
+     17. "db_lang_linesOfCode"
+
+    Row count: 11939
+  #+END_SRC
+  There are several columns that are critical, namely
+  1. "sha"
+  2. "cliVersion"
+  3. "creationTime"
+  The others may be useful, but they are not strictly required.
+  The critical ones deserve more explanation:
+  1. "sha": The =git= commit SHA of the repository the CodeQL database was
+     created from. Required to distinguish query results over the evolution of
+     a code base.
+  2. "cliVersion": The version of the CodeQL CLI used to create the database.
+     Required to identify advances/regressions originating from the CodeQL binary.
+  3. "creationTime": The time the database was created. Required (or at least
+     very handy) for following the evolution of query results over time.
+  This still leaves us with a row count of 11939.
+
+  To start reducing that count, run
+  #+BEGIN_SRC sh
+    ./bin/mc-db-unique < scratch/db-info-2.csv > scratch/db-info-3.csv
+  #+END_SRC
+  and get a reduced count and a new column:
+  #+BEGIN_SRC text
+    csvstat scratch/db-info-3.csv
+      3. "CID"
+         Type of data:          Text
+         Contains null values:  False
+         Non-null values:       5344
+         Unique values:         5344
+         Longest value:         6 characters
+         Most common values:    1f8d99 (1x)
+                                9ab87a (1x)
+                                76fdc7 (1x)
+                                b21305 (1x)
+                                4ae79b (1x)
+
+    Row count: 5344
+  #+END_SRC
+  From the docs: 'Read a table of CodeQL DB information and produce a table with
+  unique entries adding the Cumulative ID (CID) column.'
+
+  The CID column combines
+  - cliVersion
+  - creationTime
+  - language
+  - sha
+  into a single 6-character string via hashing; together with (owner, repo) it
+  provides a unique index for every DB.
+
+  We still have too many rows.
+  The tables are all in CSV format, so you can use
+  your favorite tool to narrow the selection for your needs. For this document,
+  we simply use a pseudo-random selection of 11 databases via
+  #+BEGIN_SRC sh
+    ./bin/mc-db-generate-selection -n 11 \
+        scratch/vscode-selection.json \
+        scratch/gh-mrva-selection.json \
+        < scratch/db-info-3.csv
+  #+END_SRC
+
+  Note that this uses seeded pseudo-random numbers, so the selection is in fact
+  deterministic. The selected databases in =gh-mrva-selection.json=, to be used
+  in section [[*Running the gh-mrva command-line client][Running the gh-mrva command-line client]], are the following:
+  #+begin_src javascript
+    {
+        "mirva-list": [
+            "NLPchina/elasticsearch-sqlctsj168cc4",
+            "LMAX-Exchange/disruptorctsj3e75ec",
+            "justauth/JustAuthctsj8a6177",
+            "FasterXML/jackson-modules-basectsj2fe248",
+            "ionic-team/capacitor-pluginsctsj38d457",
+            "PaddlePaddle/PaddleOCRctsj60e555",
+            "elastic/apm-agent-pythonctsj21dc64",
+            "flipkart-incubator/zjsonpatchctsjc4db35",
+            "stephane/libmodbusctsj54237e",
+            "wso2/carbon-kernelctsj5a8a6e",
+            "apache/servicecomb-packctsj4d98f5"
+        ]
+    }
+  #+end_src
+
+* Starting the server
+  The full instructions for building and running the server are in [[../README.md]] under
+  'Steps to build and run the server'.
+
+  With docker-compose set up and this repository cloned as previously described,
+  we just run
+  #+BEGIN_SRC sh
+    cd ~/work-gh/mrva/mrvacommander
+    docker-compose up --build
+  #+END_SRC
+  and wait until the log output no longer changes.
+
+  Then, use the following command to populate the mrvacommander database storage:
+  #+BEGIN_SRC sh
+    cd ~/work-gh/mrva/mrvacommander/client/qldbtools && \
+        ./bin/mc-db-populate-minio -n 11 < scratch/db-info-3.csv
+  #+END_SRC
+
+* Running the gh-mrva command-line client
+  The first run uses the test query to verify basic functionality, but it returns
+  no results.
+** Run MRVA from the command line
+   1. Install the mrva CLI
+      #+BEGIN_SRC sh
+        mkdir -p ~/work-gh/mrva && cd ~/work-gh/mrva
+        git clone https://github.com/hohn/gh-mrva.git
+        cd ~/work-gh/mrva/gh-mrva && git checkout mrvacommander-end-to-end
+
+        # Build it
+        go mod edit -replace="github.com/GitHubSecurityLab/gh-mrva=$HOME/work-gh/mrva/gh-mrva"
+        go build .
+
+        # Sanity check
+        ./gh-mrva -h
+      #+END_SRC
+
+   2. Set up the configuration
+      #+BEGIN_SRC sh
+        mkdir -p ~/.config/gh-mrva
+        cat > ~/.config/gh-mrva/config.yml <<EOF
+        ...
+        EOF
+      #+END_SRC
+
+   ...
+
+   #+BEGIN_SRC sh
+     ... > scratch/selection-full-info
+     csvcut -c path scratch/selection-full-info
+   #+END_SRC
+
+   Use one of these databases to write a query. The query need not produce
+   results.
+   #+BEGIN_SRC sh
+     cd ~/work-gh/mrva/gh-mrva/
+     code gh-mrva.code-workspace
+   #+END_SRC
+   In this case, the trivial =findPrintf=:
+   #+BEGIN_SRC java
+     /**
+      ,* @name findPrintf
+      ,* @description find calls to plain fprintf
+      ,* @kind problem
+      ,* @id cpp-fprintf-call
+      ,* @problem.severity warning
+      ,*/
+
+     import cpp
+
+     from FunctionCall fc
+     where
+       fc.getTarget().getName() = "fprintf"
+     select fc, "call of fprintf"
+   #+END_SRC
+
+   Repeat the submit steps with this query
+   1. --
+   2. --
+   3. Submit the MRVA job
+      #+BEGIN_SRC sh
+        cp ~/work-gh/mrva/mrvacommander/client/qldbtools/scratch/gh-mrva-selection.json \
+            ~/work-gh/mrva/gh-mrva/gh-mrva-selection.json
+
+        cd ~/work-gh/mrva/gh-mrva/
+        ./gh-mrva submit --language cpp --session mirva-session-1480 \
+            --list mirva-list \
+            --query ~/work-gh/mrva/gh-mrva/Fprintf.ql
+      #+END_SRC
+   4. Check the status
+      #+BEGIN_SRC sh
+        cd ~/work-gh/mrva/gh-mrva/
+        ./gh-mrva status --session mirva-session-1480
+      #+END_SRC
+
+      This time we have results:
+      #+BEGIN_SRC text
+        ...
+ Run name: mirva-session-1480 + Status: succeeded + Total runs: 1 + Total successful scans: 11 + Total failed scans: 0 + Total skipped repositories: 0 + Total skipped repositories due to access mismatch: 0 + Total skipped repositories due to not found: 0 + Total skipped repositories due to no database: 0 + Total skipped repositories due to over limit: 0 + Total repositories with findings: 7 + Total findings: 618 + Repositories with findings: + quickfix/quickfixctsjebfd13 (cpp-fprintf-call): 5 + libfuse/libfusectsj7a66a4 (cpp-fprintf-call): 146 + xoreaxeaxeax/movfuscatorctsj8f7e5b (cpp-fprintf-call): 80 + pocoproject/pococtsj26b932 (cpp-fprintf-call): 17 + BoomingTech/Piccoloctsj6d7177 (cpp-fprintf-call): 10 + tdlib/telegram-bot-apictsj8529d9 (cpp-fprintf-call): 247 + WinMerge/winmergectsj101305 (cpp-fprintf-call): 113 + #+END_SRC + 5. Download the sarif files, optionally also get databases. + #+BEGIN_SRC sh + cd ~/work-gh/mrva/gh-mrva/ + # Just download the sarif files + ./gh-mrva download --session mirva-session-1480 \ + --output-dir mirva-session-1480 + + # Download the sarif files and CodeQL dbs + ./gh-mrva download --session mirva-session-1480 \ + --download-dbs \ + --output-dir mirva-session-1480 + + # And list them: + \ls -la *1480* + -rwxr-xr-x@ 1 hohn staff 1915857 Aug 16 14:10 BoomingTech_Piccoloctsj6d7177_1.sarif + drwxr-xr-x@ 3 hohn staff 96 Aug 16 14:15 BoomingTech_Piccoloctsj6d7177_1_db + -rwxr-xr-x@ 1 hohn staff 89857056 Aug 16 14:11 BoomingTech_Piccoloctsj6d7177_1_db.zip + -rwxr-xr-x@ 1 hohn staff 3105663 Aug 16 14:10 WinMerge_winmergectsj101305_1.sarif + -rwxr-xr-x@ 1 hohn staff 227812131 Aug 16 14:12 WinMerge_winmergectsj101305_1_db.zip + -rwxr-xr-x@ 1 hohn staff 193976 Aug 16 14:10 libfuse_libfusectsj7a66a4_1.sarif + -rwxr-xr-x@ 1 hohn staff 12930693 Aug 16 14:10 libfuse_libfusectsj7a66a4_1_db.zip + -rwxr-xr-x@ 1 hohn staff 1240694 Aug 16 14:10 pocoproject_pococtsj26b932_1.sarif + -rwxr-xr-x@ 1 hohn staff 158924920 Aug 16 14:12 
pocoproject_pococtsj26b932_1_db.zip + -rwxr-xr-x@ 1 hohn staff 888494 Aug 16 14:10 quickfix_quickfixctsjebfd13_1.sarif + -rwxr-xr-x@ 1 hohn staff 75023303 Aug 16 14:11 quickfix_quickfixctsjebfd13_1_db.zip + -rwxr-xr-x@ 1 hohn staff 1487363 Aug 16 14:10 tdlib_telegram-bot-apictsj8529d9_1.sarif + -rwxr-xr-x@ 1 hohn staff 373477635 Aug 16 14:14 tdlib_telegram-bot-apictsj8529d9_1_db.zip + -rwxr-xr-x@ 1 hohn staff 103657 Aug 16 14:10 xoreaxeaxeax_movfuscatorctsj8f7e5b_1.sarif + -rwxr-xr-x@ 1 hohn staff 9464225 Aug 16 14:10 xoreaxeaxeax_movfuscatorctsj8f7e5b_1_db.zip + #+END_SRC + + 6. Use the [[https://marketplace.visualstudio.com/items?itemName=MS-SarifVSCode.sarif-viewer][SARIF Viewer]] plugin in VS Code to open and review the results. + + Prepare the source directory so the viewer can be pointed at it + #+BEGIN_SRC sh + cd ~/work-gh/mrva/gh-mrva/mirva-session-1480 + + unzip -qd BoomingTech_Piccoloctsj6d7177_1_db BoomingTech_Piccoloctsj6d7177_1_db.zip + + cd BoomingTech_Piccoloctsj6d7177_1_db/codeql_db/ + unzip -qd src src.zip + #+END_SRC + + Use the viewer + #+BEGIN_SRC sh + code BoomingTech_Piccoloctsj6d7177_1.sarif + + # For lauxlib.c, point the source viewer to + find ~/work-gh/mrva/gh-mrva/mirva-session-1480/BoomingTech_Piccoloctsj6d7177_1_db/codeql_db/src/home/runner/work/bulk-builder/bulk-builder -name lauxlib.c + + # Here: ~/work-gh/mrva/gh-mrva/mirva-session-1480/BoomingTech_Piccoloctsj6d7177_1_db/codeql_db/src/home/runner/work/bulk-builder/bulk-builder/engine/3rdparty/lua-5.4.4/lauxlib.c + #+END_SRC + + 7. (optional) Large result sets are more easily filtered via + dataframes or spreadsheets. Convert the SARIF to CSV if needed; see [[https://github.com/hohn/sarif-cli/][sarif-cli]]. + + + + +* Footnotes +[fn:1]The =csvkit= can be installed into the same Python virtual environment as +the =qldbtools=. + +#+HTML:
diff --git a/notes/l3style.css b/notes/l3style.css
new file mode 100644
index 0000000..9b71bbd
--- /dev/null
+++ b/notes/l3style.css
@@ -0,0 +1,170 @@
+
+/* The sum of width and margin percentages must not exceed 100. */
+div#toc {
+    /* Use a moving table of contents (scrolled away for long contents) */
+    /*
+     * float: left;
+     */
+    /* OR */
+    /* use a fixed-position toc */
+    position: fixed;
+    top: 80px;
+    left: 0px;
+
+    /* match toc, org-content, postamble */
+    width: 26%;
+    margin-right: 1%;
+    margin-left: 1%;
+}
+
+div#org-content {
+    float: right;
+    width: 70%;
+    /* match toc, org-content, postamble */
+    margin-left: 28%;
+}
+
+div#postamble {
+    float: right;
+    width: 70%;
+    /* match toc, org-content, postamble */
+    margin-left: 28%;
+}
+
+
+p.author {
+    clear: both;
+    font-size: 1em;
+    margin-left: 25%;
+}
+
+p.date {
+    clear: both;
+    font-size: 1em;
+    margin-left: 25%;
+}
+
+#toc * {
+    font-size: 1em;
+}
+
+#toc h3 {
+    font-weight: normal;
+    margin: 1em 0 0 0;
+    padding: 4px 0;
+    border-bottom: 1px solid #666;
+    text-transform: uppercase;
+}
+
+#toc ul, #toc li {
+    margin: 0;
+    padding: 0;
+    list-style: none;
+}
+
+#toc li {
+    display: inline;
+}
+
+#toc ul li a {
+    text-decoration: none;
+    display: block;
+    margin: 0;
+    padding: 4px 6px;
+    color: #990000;
+    border-bottom: 1px solid #aaa;
+}
+
+#toc ul ul li a {
+    padding-left: 18px;
+    color: #666;
+}
+
+#toc ul li a:hover {
+    background-color: #F6F6F6;
+}
+
+
+/* Description lists. */
+dt {
+    font-weight: bold;
+    background-color: #F6F6F6;
+}
+
+
+/* From org-mode page. */
+body {
+    font-family: avenir, Lao Sangam MN, Myanmar Sangam MN, Songti SC, Kohinoor Devanagari, Menlo, avenir, helvetica, verdana, sans-serif;
+    font-size: 100%;
+    margin-top: 5%;
+    margin-bottom: 8%;
+    background: white; color: black;
+    margin-left: 3% !important; margin-right: 3% !important;
+}
+
+h1 {
+    font-size: 2em;
+    color: #cc8c00;
+/*    padding-top: 5px; */
+    border-bottom: 2px solid #aaa;
+    width: 70%;
+    /* match toc, org-content, postamble */
+    margin-left: 28%; /* Align with div#content */
+}
+
+h2 {
+    font-size: 1.5em;
+    padding-top: 1em;
+    border-bottom: 1px solid #ccc;
+}
+
+h3 {
+    font-size: 1.2em;
+    padding-top: 0.5em;
+    border-bottom: 1px solid #eee;
+}
+
+.todo, .deadline { color: red; font-style: italic }
+.done { color: green; font-style: italic }
+.timestamp { color: grey }
+.timestamp-kwd { color: CadetBlue; }
+.tag { background-color: lightblue; font-weight: normal; }
+
+.target { background-color: lavender; }
+
+.menu {
+    color: #666;
+}
+
+.menu a:link {
+    color: #888;
+}
+.menu a:active {
+    color: #888;
+}
+.menu a:visited {
+    color: #888;
+}
+
+/* Center images ('align' is not a valid CSS property). */
+img { display: block; margin-left: auto; margin-right: auto; }
+
+pre {
+    padding: 5pt;
+    font-family: andale mono, vera sans mono, monospace, courier;
+    font-size: 0.8em;
+    background-color: #f0f0f0;
+}
+
+code {
+    font-family: andale mono, vera sans mono, monospace, courier;
+    font-size: 0.8em;
+    background-color: #f0f0f0;
+}
+
+table { border-collapse: collapse; }
+
+td, th {
+    vertical-align: top;
+    border: 1pt solid #ADB9CC;
+}