189 Commits

Author SHA1 Message Date
Michael Hohn
f60b55f181 Storate container simplification
Only one is really needed for large storage for the dbstore container.  The demo containers can contain their own data -- it's small
and the containers are made for demonstration anyway.
2024-09-13 12:04:30 -07:00
Michael Hohn
727381dc5a Fix: use explicit file names instead of $@ 2024-09-13 11:55:02 -07:00
Michael Hohn
a35fc619e6 Use mk. prefix for Makefile time stamps and make git ignore them 2024-09-13 09:44:08 -07:00
Michael Hohn
8dd6c94918 Set up and push fully configured vs code container 2024-09-12 14:05:59 -07:00
Michael Hohn
34958e4cf4 WIP: Working individual containers and docker compose demo 2024-09-12 09:49:25 -07:00
Michael Hohn
259bac55fb Add container mapping diagram 2024-09-12 09:46:28 -07:00
Michael Hohn
41f6db5de0 Add Makefile to push mrva agent container image 2024-09-06 14:42:31 -07:00
Michael Hohn
19330c3a0f Add Makefile to push mrva server container image 2024-09-06 11:40:13 -07:00
Michael Hohn
1e2df515e3 Set up and push Docker containers for demonstration purposes
These containers take the place of a desktop install
2024-09-04 15:52:18 -07:00
Michael Hohn
681fcdab8c Add new containers to streamline setup 2024-08-29 13:22:59 -07:00
Michael Hohn
5021fc824b Fix: include minio in requirements.txt 2024-08-23 08:17:27 -07:00
Michael Hohn
7d27b910cd Fix: include CID when filtering in mc-rows-from-mrva-list 2024-08-22 13:58:07 -07:00
Michael Hohn
0d3f4c5e40 Updated requirements for container 2024-08-21 16:25:13 -07:00
Michael Hohn
a86f955aab Clarify notes/cli-end-to-end.org -> notes/cli-end-to-end-detailed.org 2024-08-21 11:19:00 -07:00
Michael Hohn
c556605e44 Run containers without mitmweb proxy 2024-08-21 11:17:20 -07:00
Michael Hohn
7b06484b29 get github to render cli-end-to-end.org 2024-08-16 15:06:10 -07:00
Michael Hohn
fc751ae08f Add full walkthrough description in notes/cli-end-to-end.org 2024-08-16 14:39:44 -07:00
Michael Hohn
d956f47db3 Fix: Produce complete SARIF output in agent
The problem was missing fields in the SARIF output.  After the debugging
below, the cause were conversions from JSON to Go to JSON; the Go ->
JSON conversion only output the fields defined in the Go struct.

Because SARIF has so many optional fields, no attempt is made to enforce
a statically-defined structure.  Instead, the JSON -> Go conversion is
now to a fully dynamic structure; unused fields are simply passed through

Debugging:

       Comparing two SARIF files shows

           {
             "$schema" : "https://json.schemastore.org/sarif-2.1.0.json",
             "version" : "2.1.0",
             "runs" : [ {...
             } ]
           }

       and

           {
             "runs": [...
             ]
           }

       so there are missing fields.

    The Problem
     1. Problem origin

          // Modify the sarif: start by extracting
          var sarif Sarif
          if err := json.Unmarshal(sarifData, &sarif); err != nil {
              return nil, fmt.Errorf("failed to unmarshal SARIF: %v", err)
          }
          ...
          // now inject version control info
          ...
          // and write it back
          sarifBytes, err := json.Marshal(sarif)
          if err != nil {
              return nil, fmt.Errorf("failed to marshal SARIF: %v", err)
          }

     2. But the struct only has one of the needed fields

          type Sarif struct {
              Runs []SarifRun `json:"runs"`
          }

     3. From the docs:

          // To unmarshal JSON into a struct, Unmarshal matches incoming object
          // keys to the keys used by [Marshal] (either the struct field name or its tag),
          // preferring an exact match but also accepting a case-insensitive match. By
          // default, object keys which don't have a corresponding struct field are
          // ignored (see [Decoder.DisallowUnknownFields] for an alternative).
2024-08-16 14:27:46 -07:00
Michael Hohn
0a52b729cd Expand the codeql db download response
End-to-end testing contained an unhandled CodeQL database download
request.  The handlers are added in this patch.  Debugging info is below
for reference.

The mrvacommander *server* fails with the following.  The source code is
: func setupEndpoints(c CommanderAPI)
See mrvacommander/pkg/server/server.go, endpoints for getting a URL to download artifacts.

  Original

          Downloading artifacts for tdlib_telegram-bot-apictsj8529d9_2
          ...
          Downloading database tdlib/telegram-bot-apictsj8529d9 cpp mirva-session-1400 tdlib_telegram-bot-apictsj8529d9_2
          ...
          2024/08/13 12:31:38 >> GET http://localhost:8080/repos/tdlib/telegram-bot-apictsj8529d9/code-scanning/codeql/databases/cpp
          ...
          2024/08/13 12:31:38 << 404 http://localhost:8080/repos/tdlib/telegram-bot-apictsj8529d9/code-scanning/codeql/databases/cpp
          ...
          -rwxr-xr-x@  1 hohn  staff  169488 Aug 13 12:29 tdlib_telegram-bot-apictsj8529d9_2.sarif*
          -rwxr-xr-x@  1 hohn  staff      10 Aug 13 12:31 tdlib_telegram-bot-apictsj8529d9_2_db.zip*

  Server log

          server         | 2024/08/13 19:31:38 ERROR Unhandled endpoint method=GET uri=/repos/tdlib/telegram-bot-apictsj8529d9/code-scanning/codeql/databases/cpp

  Try a manual download from the server

          8:$ wget http://localhost:8080/repos/tdlib/telegram-bot-apictsj8529d9/code-scanning/codeql/databases/cpp
          --2024-08-13 12:56:05--  http://localhost:8080/repos/tdlib/telegram-bot-apictsj8529d9/code-scanning/codeql/databases/cpp
          Resolving localhost (localhost)... ::1, 127.0.0.1
          Connecting to localhost (localhost)|::1|:8080... connected.
          HTTP request sent, awaiting response... 404 Not Found
          2024-08-13 12:56:05 ERROR 404: Not Found.

          server         | 2024/08/13 19:56:05 ERROR Unhandled endpoint method=GET uri=/repos/tdlib/telegram-bot-apictsj8529d9/code-scanning/codeql/databases/cpp

  The full info for the DB
          tdlib,telegram-bot-api,8529d9,2.17.0,2024-05-09 08:02:49.545174+00:00,cpp,f95d406da67adb8ac13d9c562291aa57c65398e0,306106.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/tdlib/telegram-bot-api/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1244.0,306106.0,2024-05-13T15:54:54.749093,cpp,True,3375,373477635

The gh-mrva *client* sends the following.  The source is
gh-mrva/utils/utils.go,
    client.Get(fmt.Sprintf("http://localhost:8080/repos/%s/code-scanning/codeql/databases/%s", task.Nwo, task.Language))

We have
  cd /Users/hohn/work-gh/mrva/gh-mrva
  0:$ rg 'repos/.*/code-scanning/codeql/databases'

          ...
          utils/utils.go
          625:	// resp, err := client.Get(fmt.Sprintf("https://api.github.com/repos/%s/code-scanning/codeql/databases/%s", task.Nwo, task.Language))
          626:	resp, err := client.Get(fmt.Sprintf("http://localhost:8080/repos/%s/code-scanning/codeql/databases/%s", task.Nwo, task.Language))

  And
          resp, err := client.Get(fmt.Sprintf("http://localhost:8080/repos/%s/code-scanning/codeql/databases/%s", task.Nwo, task.Language))

The original DB upload was
  cd ~/work-gh/mrva/mrvacommander/client/qldbtools && \
      ./bin/mc-db-populate-minio -n 11 < scratch/db-info-3.csv

  ...
  2024-08-14 09:29:19 [INFO] Uploaded /Users/hohn/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/tdlib/telegram-bot-api/code-scanning/codeql/databases/cpp/db.zip as tdlib$telegram-bot-apictsj8529d9.zip to bucket qldb
  ...
2024-08-14 13:01:15 -07:00
Michael Hohn
6bebf4abfc Remove interactive debug statements 2024-08-13 09:27:13 -07:00
Michael Hohn
9d60489908 wip: Handle varying CodeQL DB formats. This code contains debugging features
This patch fixes the following

     - [X] Wrong db metadata path.  Fixed via
       : globRecursively(databasePath, "codeql-database.yml")

       The log output for reference:

                 agent          | 2024/08/09 21:16:40 DEBUG XX:getDataBaseMetadata databasePath=/tmp/ce523549-a217-4b54-a118-7224ce444870/db "Waiting for SIGUSR1 or SIGUSR2..."=<nil>
                 agent          | 2024/08/09 21:16:40 DEBUG XX:getDataBaseMetadata databasePath=/tmp/bc24fe72-b520-4e72-9634-a98d630cb75e/db "Waiting for SIGUSR1 or SIGUSR2..."=<nil>
                 agent          | 2024/08/09 21:16:40 DEBUG Received signal: %s "user defined signal 1"=<nil>
                 agent          | 2024/08/09 21:16:40 DEBUG XX:getDataBaseMetadata databasePath=/tmp/41fcf5cc-e151-4a11-bccc-481d599aa426/db "Waiting for SIGUSR1 or SIGUSR2..."=<nil>

            From

                 func getDatabaseMetadata(databasePath string) (*DatabaseMetadata, error) {
                 data, err := os.ReadFile(filepath.Join(databasePath, "codeql-database.yml"))
                 ...}

            And some inspection:

                 root@3fa4b8013336:~# find /tmp |grep ql-datab
                 /tmp/27f09b9f-254f-4ef5-abf5-9a1a2927906b/db/cpp/codeql-database.yml
                 /tmp/d7e14cd4-8789-4176-81bc-2ac1957ed9fd/db/codeql_db/codeql-database.yml
                 /tmp/41fcf5cc-e151-4a11-bccc-481d599aa426/db/codeql_db/codeql-database.yml
                 /tmp/bc24fe72-b520-4e72-9634-a98d630cb75e/db/codeql_db/codeql-database.yml
                 /tmp/ce523549-a217-4b54-a118-7224ce444870/db/codeql_db/codeql-database.yml

     - [X] Wrong db path.  Fixed via
       : findDBDir(databasePath)

       The log output for reference:

                 agent          | 2024/08/09 21:51:09 ERROR Failed to run analysis job error="failed to run analysis: failed to run queries: exit status 2\nOutput: A fatal error occurred: /tmp/91c61e0b-dfd9-4dd3-a3ad-cb77dbc1cbfd/db is not a recognized CodeQL database.\n"
                 agent          | 2024/08/09 21:51:09 INFO Running analysis job job="{Spec:{SessionID:1 NameWithOwner:{Owner:USCiLab Repo:cerealctsj264953}} QueryPackLocation:{Key:1 Bucket:packs} QueryLanguage:cpp}"
                 agent          | 2024/08/09 21:51:09 ERROR Failed to run analysis job error="failed to run analysis: failed to run queries: exit status 2\nOutput: A fatal error occurred: /tmp/1b8ffeba-8ad1-465e-8ec7-36cda449a5f5/db is not a recognized CodeQL database.\n"
                 ...

            This is easily confirmed:

                 root@171b5417e05f:~# /opt/codeql/codeql database upgrade  /tmp/7ed27578-d7ea-42e0-902a-effbc4df05f2/
                 A fatal error occurred: /tmp/7ed27578-d7ea-42e0-902a-effbc4df05f2 is not a recognized CodeQL database.

            Another try:

                 root@171b5417e05f:~# /opt/codeql/codeql database upgrade  /tmp/7ed27578-d7ea-42e0-902a-effbc4df05f2/database.zip
                 A fatal error occurred: Database root /tmp/7ed27578-d7ea-42e0-902a-effbc4df05f2/database.zip is not a directory.

             This one is correct:

                 root@171b5417e05f:~# /opt/codeql/codeql database upgrade /tmp/7ed27578-d7ea-42e0-902a-effbc4df05f2/db/codeql_db
                 /tmp/7ed27578-d7ea-42e0-902a-effbc4df05f2/db/codeql_db/db-cpp is up to date.

     - [X] Wrong database source prefix.  Also fixed via
       : findDBDir(databasePath)

       Similar log entries:

                 agent          | 2024/08/13 15:40:14 ERROR Failed to run analysis job error="failed to run analysis: failed to get source location prefix: failed to resolve database: exit status 2\nOutput: A fatal error occurred: /tmp/da420844-a284-4d82-9470-fa189a5b4ee6/db is not a recognized CodeQL database.\n"
                 agent          | 2024/08/13 15:40:14 INFO Worker stopping due to reduction in worker count
                 agent          | 2024/08/13 15:40:18 ERROR Failed to run analysis job error="failed to run analysis: failed to get source location prefix: failed to resolve database: exit status 2\nOutput: A fatal error occurred: /tmp/eebfc52c-3ecf-490d-bbf4-23c305d6ba18/db is not a recognized CodeQL database.\n"

            and
                 agent          | 2024/08/13 15:49:33 ERROR Failed to resolve database err="exit status 2" output="A fatal error occurred: /tmp/b5c4941a-5692-4640-aa79-9810bcab39f4/db is not a recognized CodeQL database.\n"
                 agent          | 2024/08/13 15:49:33 DEBUG XX: RunQuery failed to get source location prefixdatabasePath=/tmp/b5c4941a-5692-4640-aa79-9810bcab39f4/db "Waiting for SIGUSR1 or SIGUSR2..."=<nil>
                 agent          | 2024/08/13 15:49:35 INFO Modifying worker count current=3 new=2
                 agent          | 2024/08/13 15:49:35 ERROR Failed to resolve database err="exit status 2" output="A fatal error occurred: /tmp/eda30582-81a3-4582-8897-65f8904e8501/db is not a recognized CodeQL database.\n"
                 agent          | 2024/08/13 15:49:35 DEBUG XX: RunQuery failed to get source location prefixdatabasePath=/tmp/eda30582-81a3-4582-8897-65f8904e8501/db "Waiting for SIGUSR1 or SIGUSR2..."=<nil>

            And this fails

                 root@51464985499f:~# /opt/codeql/codeql resolve database /tmp/eda30582-81a3-4582-8897-65f8904e8501/db/
                 A fatal error occurred: /tmp/eda30582-81a3-4582-8897-65f8904e8501/db is not a recognized CodeQL database.

            But this works:

                 root@51464985499f:~# /opt/codeql/codeql resolve database /tmp/eda30582-81a3-4582-8897-65f8904e8501/db/codeql_db/
                 {
                   "sourceLocationPrefix" : "/home/runner/work/bulk-builder/bulk-builder",
                   "columnKind" : "utf8",
                   "unicodeNewlines" : false,
                   "sourceArchiveZip" : "/tmp/eda30582-81a3-4582-8897-65f8904e8501/db/codeql_db/src.zip",
                   "sourceArchiveRoot" : "/tmp/eda30582-81a3-4582-8897-65f8904e8501/db/codeql_db/src",
                   "datasetFolder" : "/tmp/eda30582-81a3-4582-8897-65f8904e8501/db/codeql_db/db-cpp",
                   "logsFolder" : "/tmp/eda30582-81a3-4582-8897-65f8904e8501/db/codeql_db/log",
                   "languages" : [
                     "cpp"
                   ],
                   "scratchDir" : "/tmp/eda30582-81a3-4582-8897-65f8904e8501/db/codeql_db/working"
                }
2024-08-13 09:22:24 -07:00
Michael Hohn
35100f89a7 Add html generation/view targets 2024-08-13 00:03:16 -07:00
Michael Hohn
742b059a49 Add script to list full details for a mrva-list file 2024-08-09 08:37:31 -07:00
Michael Hohn
d1f56ae196 Add explicit language selection 2024-08-09 08:36:48 -07:00
Michael Hohn
6262197c8d Introduce distinct doc/ and notes/ directories 2024-08-05 18:44:16 -07:00
Michael Hohn
781571044d fix minor doc typo 2024-08-05 18:43:17 -07:00
Michael Hohn
b183cee78d Reformat / rearrange comments 2024-08-02 14:10:54 -07:00
Michael Hohn
5a95f0ea08 Add module comment 2024-08-02 14:04:06 -07:00
Michael Hohn
349d758c14 Move session scripts to separate directory 2024-08-02 13:56:47 -07:00
Michael Hohn
582d933130 Improve example data layout and README 2024-08-01 14:30:40 -07:00
Michael Hohn
b7b4839fe0 Enforce CID uniqueness and save raw refined info immediately
Previously, the refined info was collected and the CID computed before saving.
This was a major development time sink, so the CID is now computed in the
following step (bin/mc-db-unique).

The columns previously chosen for the CID are not enough.  If these columns are
empty for any reason, the CID repeats.  Just including the owner/name won't help,
because those are duplicates.

Some possibilities considered and rejected:
1. Could use a random number for missing columns.  But this makes
   the CID nondeterministic.
2. Switch to the file system ctime?  Not unique across owner/repo pairs,
   but unique within one.  Also, this could be changed externally and cause
   *very* subtle bugs.
3. Use the file system path?  It has to be unique at ingestion time, but
   repo collections can move.

Instead, this patch
4. Drops rows that don't have the
   | cliVersion   |
   | creationTime |
   | language     |
   | sha          |
   columns.  There are very few (16 out of 6000) and their DBs are
   quesionable.
2024-08-01 11:09:04 -07:00
Michael Hohn
06dcf50728 Sort utils.cid_hash() entries for legibility 2024-07-31 15:20:43 -07:00
Michael Hohn
8f151ab002 Comment update 2024-07-30 16:08:05 -07:00
Michael Hohn
65cdf9a883 Merge branch 'hohn-0.1.21-static-type-afstore' into hohn-0.1.20-codeql-db-selector 2024-07-30 12:36:36 -07:00
Michael Hohn
1e1daf9330 Include custom id (CID) to distinguish CodeQL databases
The current api (<2024-07-26 Fri>) is set up only for (owner,name).  This is
insufficient for distinguishing CodeQL databases.

Other differences must be considered;  this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into one called CID.  The CID field is a hash of these others and therefore can be
changed in the future without affecting workflows or the server.

The cid is combined with the owner/name to form one
identifier.  This requires no changes to server or client -- the db
selection's interface is separate from VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same owner/repo pairs from multiple directories.  In this case, from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/
The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &

Other code modifications:
    - Push (owner,repo,cid) names to minio
    - Generate databases.json for use in vs code extension
    -  Generate list-databases.json for use by gh-mrva client
2024-07-30 10:47:29 -07:00
Michael Hohn
b4f1a2b8a6 Minor comment fix 2024-07-29 13:53:12 -07:00
Michael Hohn
f652a6719c Comment fix 2024-07-29 13:41:15 -07:00
Michael Hohn
81c44ab14a Add mc-db-unique as default single-(owner,repo) selector 2024-07-26 14:18:14 -07:00
Michael Hohn
92ca709458 Add mc-db-view-info to view available DBs 2024-07-26 08:40:41 -07:00
Michael Hohn
242ba3fc1e Add script to populate minio using dataframe previously chosen 2024-07-25 15:14:37 -07:00
Michael Hohn
26dd69c976 minor doc update 2024-07-23 15:18:32 -07:00
Michael Hohn
731b44b187 Add scripts for automatic codeql db data and metadata collection
- updated instructions
- cli scripts mirror the interactive session*.py files
2024-07-23 15:05:03 -07:00
Michael Hohn
aaeafa9e88 Automate metadata collection for all DBs
Several errors are handled; on extraction
    ExtractNotZipfile:
    ExtractNoCQLDB:

On detail extraction
    DetailsMissing:
2024-07-22 19:12:12 -07:00
Michael Hohn
129b8cc302 interim: collect metadata from one DB zip file 2024-07-22 12:54:57 -07:00
Michael Hohn
d64522d168 Collect CodeQL database information from the file system and save as CSV
This collection already provides significant meta-information

    ctime : str = '2024-05-13T12:04:01.593586'
    language : str = 'cpp'
    name : str = 'nanobind'
    owner : str = 'wjakob'
    path : Path = Path('/Users/hohn/work-gh/mrva/mrva-open-source-download/repos/wjakob/nanobind/code-scanning/codeql/databases/cpp/db.zip')
    size : int = 63083064

There is some more in the db.zip files, to be added
2024-07-22 11:07:00 -07:00
Michael Hohn
6b4e753e69 Experiment with formats for saving/loading the database index
The .csv.gz format is the simplest and most universal.  It's also the smallest
on disk.
The comparison of saved/reloaded dataframe shows no difference.
The ctime_raw column caused serialization problems, so only ctime (in
iso-8601 format) is used.
2024-07-12 14:41:05 -07:00
Michael Hohn
3df1cac5ae Clean up package info 2024-07-10 15:38:59 -07:00
Michael Hohn
dcc32ea8ab Add documentation style sheet and Makefile entry 2024-07-10 15:27:09 -07:00
Michael Hohn
3c8db9cbe4 Put the DB code into a package 2024-07-10 15:04:09 -07:00
Michael Hohn
be1304bdd9 Add to .dockerignore to reduce build time 2024-07-10 13:23:08 -07:00