Commit Graph

212 Commits

Author SHA1 Message Date
Michael Hohn
5a95f0ea08 Add module comment 2024-08-02 14:04:06 -07:00
Michael Hohn
349d758c14 Move session scripts to separate directory 2024-08-02 13:56:47 -07:00
Michael Hohn
582d933130 Improve example data layout and README 2024-08-01 14:30:40 -07:00
Michael Hohn
b7b4839fe0 Enforce CID uniqueness and save raw refined info immediately
Previously, the refined info was collected and the CID computed before saving.
This was a major development time sink, so the CID is now computed in the
following step (bin/mc-db-unique).

The columns previously chosen for the CID are not enough.  If these columns are
empty for any reason, the CID repeats.  Just including the owner/name won't help,
because those are duplicates.

Some possibilities considered and rejected:
1. Could use a random number for missing columns.  But this makes
   the CID nondeterministic.
2. Switch to the file system ctime?  Not unique across owner/repo pairs,
   but unique within one.  Also, this could be changed externally and cause
   *very* subtle bugs.
3. Use the file system path?  It has to be unique at ingestion time, but
   repo collections can move.

Instead, this patch
4. Drops rows that don't have the
   | cliVersion   |
   | creationTime |
   | language     |
   | sha          |
   columns.  There are very few (16 out of 6000) and their DBs are
   quesionable.
2024-08-01 11:09:04 -07:00
Michael Hohn
06dcf50728 Sort utils.cid_hash() entries for legibility 2024-07-31 15:20:43 -07:00
Michael Hohn
8f151ab002 Comment update 2024-07-30 16:08:05 -07:00
Michael Hohn
65cdf9a883 Merge branch 'hohn-0.1.21-static-type-afstore' into hohn-0.1.20-codeql-db-selector 2024-07-30 12:36:36 -07:00
Michael Hohn
1e1daf9330 Include custom id (CID) to distinguish CodeQL databases
The current api (<2024-07-26 Fri>) is set up only for (owner,name).  This is
insufficient for distinguishing CodeQL databases.

Other differences must be considered;  this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into one called CID.  The CID field is a hash of these others and therefore can be
changed in the future without affecting workflows or the server.

The cid is combined with the owner/name to form one
identifier.  This requires no changes to server or client -- the db
selection's interface is separate from VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same owner/repo pairs from multiple directories.  In this case, from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/
The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &

Other code modifications:
    - Push (owner,repo,cid) names to minio
    - Generate databases.json for use in vs code extension
    -  Generate list-databases.json for use by gh-mrva client
2024-07-30 10:47:29 -07:00
Michael Hohn
b4f1a2b8a6 Minor comment fix 2024-07-29 13:53:12 -07:00
Michael Hohn
f652a6719c Comment fix 2024-07-29 13:41:15 -07:00
Michael Hohn
81c44ab14a Add mc-db-unique as default single-(owner,repo) selector 2024-07-26 14:18:14 -07:00
Michael Hohn
92ca709458 Add mc-db-view-info to view available DBs 2024-07-26 08:40:41 -07:00
Michael Hohn
242ba3fc1e Add script to populate minio using dataframe previously chosen 2024-07-25 15:14:37 -07:00
Michael Hohn
26dd69c976 minor doc update 2024-07-23 15:18:32 -07:00
Michael Hohn
731b44b187 Add scripts for automatic codeql db data and metadata collection
- updated instructions
- cli scripts mirror the interactive session*.py files
2024-07-23 15:05:03 -07:00
Michael Hohn
aaeafa9e88 Automate metadata collection for all DBs
Several errors are handled; on extraction
    ExtractNotZipfile:
    ExtractNoCQLDB:

On detail extraction
    DetailsMissing:
2024-07-22 19:12:12 -07:00
Michael Hohn
129b8cc302 interim: collect metadata from one DB zip file 2024-07-22 12:54:57 -07:00
Michael Hohn
d64522d168 Collect CodeQL database information from the file system and save as CSV
This collection already provides significant meta-information

    ctime : str = '2024-05-13T12:04:01.593586'
    language : str = 'cpp'
    name : str = 'nanobind'
    owner : str = 'wjakob'
    path : Path = Path('/Users/hohn/work-gh/mrva/mrva-open-source-download/repos/wjakob/nanobind/code-scanning/codeql/databases/cpp/db.zip')
    size : int = 63083064

There is some more in the db.zip files, to be added
2024-07-22 11:07:00 -07:00
Michael Hohn
6b4e753e69 Experiment with formats for saving/loading the database index
The .csv.gz format is the simplest and most universal.  It's also the smallest
on disk.
The comparison of saved/reloaded dataframe shows no difference.
The ctime_raw column caused serialization problems, so only ctime (in
iso-8601 format) is used.
2024-07-12 14:41:05 -07:00
Michael Hohn
3df1cac5ae Clean up package info 2024-07-10 15:38:59 -07:00
Michael Hohn
dcc32ea8ab Add documentation style sheet and Makefile entry 2024-07-10 15:27:09 -07:00
Michael Hohn
3c8db9cbe4 Put the DB code into a package 2024-07-10 15:04:09 -07:00
Michael Hohn
be1304bdd9 Add to .dockerignore to reduce build time 2024-07-10 13:23:08 -07:00
Michael Hohn
8965725e42 Replaced the dynamic table type ArtifactLocation with struct keys
The original is present in comment form for reference
2024-07-10 13:08:40 -07:00
Michael Hohn
2df48b9f98 Collect DB information from file system and render it 2024-07-10 09:11:21 -07:00
Michael Hohn
8d80272922 Add client/ setup and plan 2024-07-09 10:37:41 -07:00
Michael Hohn
e3f4d9f012 Use QL_DB_BUCKET_NAME in shell and go 2024-07-08 14:23:25 -07:00
Michael Hohn
3566f5169e Type checking fix: Restrict the keys / values for ArtifactLocation and centralize the common ones 2024-07-08 12:07:46 -07:00
Michael Hohn
b3cf7a4f65 Introduce explicit type QueryLanguage = string and update code to clarify
Previously:
- There is confusion between nameWithOwner and queryLanguage.  Both are strings.
  Between

        runResult, err := codeql.RunQuery(databasePath, job.QueryLanguage, queryPackPath, tempDir)
    (agent.go l205)

  and

        func RunQuery(database string, nwo string, queryPackPath string, tempDir string) (*RunQueryResult, error)

  QueryLanguage is suddenly name with owner in the code.

  Added some debugging, the value is the query language in the two places it gets used:

        server         | 2024/07/03 18:30:15 DEBUG Processed request info location="{Data:map[bucket:packs key:1]}" language=cpp
        ...
        agent          | 2024/07/03 18:30:15 DEBUG XX: is nwo a name/owner, or the original callers' queryLanguage? nwo=cpp
        ...
        agent          | 2024/07/03 18:30:19 DEBUG XX: 2: is nwo a name/owner, or the original callers' queryLanguage? nwo=cpp

Changes:
- Introduce explicit type QueryLanguage = string and update code to clarify
- inline trivial function
2024-07-03 13:30:02 -07:00
Michael Hohn
07f93f3d27 Tested zero-length repository list
By submitting a non-existent target repositories.
2024-07-03 08:55:27 -07:00
Michael Hohn
7413e23bab Add reverse proxy and configuration for replay capability 2024-07-01 12:03:24 -07:00
Michael Hohn
380e90135a Add the submitEmptyStatusResponse special case 2024-07-01 10:54:46 -07:00
Michael Hohn
1642894ccf Added note about querypackurl 2024-06-27 14:53:52 -07:00
Michael Hohn
c54bda8432 fix regression from 0cffb3c8 2024-06-27 14:22:52 -07:00
Michael Hohn
17bf9049e4 Add note about docker-compose force rebuild 2024-06-27 11:59:34 -07:00
Michael Hohn
62a7b227f0 Add 'revive' linter setup 2024-06-26 10:35:10 -07:00
Michael Hohn
b543cebfac Add golangci linter setup 2024-06-26 10:21:52 -07:00
Michael Hohn
d145731c4b WIP: marked special case of 0 jobs 2024-06-26 09:27:27 -07:00
Michael Hohn
0cffb3c849 Simplify struct SessionInfo and adjoining code 2024-06-25 18:57:27 -07:00
Michael Hohn
9d1a891c72 Add to README, add Makefile with development targets 2024-06-25 12:04:46 -07:00
Nicolas Will
b4d9833da3 Resolve status logic error and refactor server.go 2024-06-24 22:31:19 -04:00
Nicolas Will
e0cbc01d21 Fully implement local and container MRVA 2024-06-24 01:31:28 -04:00
Nicolas Will
ef7552c43f Update .gitignore 2024-06-17 13:16:06 +02:00
Nicolas Will
50da8eefe8 Fix ZipSlip vuln and integer conversion issue 2024-06-17 12:55:28 +02:00
Nicolas Will
fc9fcc7ae6 Add server queue logic and refactor 2024-06-17 11:30:46 +02:00
Michael Hohn
8b310e43ad Fix storage modules types and interfaces to compile server 2024-06-16 20:16:26 -07:00
Michael Hohn
6229c08900 Remove postgres and references to it 2024-06-16 19:43:29 -07:00
Michael Hohn
b756668e70 Fix merge so server compiles 2024-06-16 19:36:31 -07:00
Michael Hohn
2c5ecd3a1e Merge the agent-impl branch into the server branch 2024-06-16 19:21:42 -07:00
Michael Hohn
3be3229ecd Remove duplicate test DB 2024-06-16 19:09:32 -07:00