17 Commits

Author SHA1 Message Date
Michael Hohn
06dcf50728 Sort utils.cid_hash() entries for legibility 2024-07-31 15:20:43 -07:00
Michael Hohn
8f151ab002 Comment update 2024-07-30 16:08:05 -07:00
Michael Hohn
1e1daf9330 Include custom id (CID) to distinguish CodeQL databases
The current API (as of <2024-07-26 Fri>) identifies a database only by
(owner,name).  This is insufficient for distinguishing CodeQL databases.

Other differences must be considered;  this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into one called CID.  The CID is a hash of these fields, so the set of
hashed fields can change in the future without affecting workflows or
the server.

The CID is combined with the owner/name to form a single identifier.
This requires no changes to the server or client -- the DB selection
interface is separate from VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same
owner/repo pairs from multiple directories.  In this case, from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/
The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &

Other code modifications:
    - Push (owner,repo,cid) names to minio
    - Generate databases.json for use in the VS Code extension
    - Generate list-databases.json for use by the gh-mrva client
2024-07-30 10:47:29 -07:00
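
The CID scheme described in 1e1daf9330 can be sketched roughly as follows.  This is an illustration only: the function names, the SHA-256 choice, the separator, the truncation length, and the owner/name/cid layout are all assumptions, and the repository's own utils.cid_hash() may work differently.

```python
import hashlib


def cid_sketch(cli_version, creation_time, language, sha, length=16):
    """Hypothetical helper: fold four metadata fields into one short id."""
    fields = (cli_version, creation_time, language, sha)
    # A unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(v) for v in fields)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return digest[:length]


def db_identifier(owner, name, cid):
    """Hypothetical layout for the combined (owner, name, cid) identifier."""
    return f"{owner}/{name}/{cid}"


# Metadata values below are made up; owner/name are taken from the log.
cid = cid_sketch("2.17.0", "2024-05-13T12:04:01", "cpp", "0123abcd")
print(db_identifier("wjakob", "nanobind", cid))
```

Because callers only see the opaque CID string, the list of hashed fields could be changed inside cid_sketch() without touching anything that consumes the identifier.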
Michael Hohn
b4f1a2b8a6 Minor comment fix 2024-07-29 13:53:12 -07:00
Michael Hohn
f652a6719c Comment fix 2024-07-29 13:41:15 -07:00
Michael Hohn
81c44ab14a Add mc-db-unique as default single-(owner,repo) selector 2024-07-26 14:18:14 -07:00
Michael Hohn
92ca709458 Add mc-db-view-info to view available DBs 2024-07-26 08:40:41 -07:00
Michael Hohn
242ba3fc1e Add script to populate minio using dataframe previously chosen 2024-07-25 15:14:37 -07:00
Michael Hohn
26dd69c976 minor doc update 2024-07-23 15:18:32 -07:00
Michael Hohn
731b44b187 Add scripts for automatic codeql db data and metadata collection
- updated instructions
- cli scripts mirror the interactive session*.py files
2024-07-23 15:05:03 -07:00
Michael Hohn
aaeafa9e88 Automate metadata collection for all DBs
Several errors are handled.

On extraction:
    ExtractNotZipfile:
    ExtractNoCQLDB:

On detail extraction:
    DetailsMissing:
2024-07-22 19:12:12 -07:00
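
The three error names in aaeafa9e88 suggest a two-stage check.  A minimal sketch follows, assuming they are exception classes and that a CodeQL database is recognized by a codeql-database.yml entry inside the zip.  The conditions that raise each error, the function names, and the required-key list are guesses inferred from the names, not taken from the repository.

```python
import zipfile
from pathlib import PurePosixPath


class ExtractNotZipfile(Exception):
    """Assumed meaning: the file is not a readable zip archive."""


class ExtractNoCQLDB(Exception):
    """Assumed meaning: the archive holds no CodeQL database."""


class DetailsMissing(Exception):
    """Assumed meaning: expected metadata keys are absent."""


def find_db_yaml(zip_path):
    """Return the archive member name of codeql-database.yml."""
    if not zipfile.is_zipfile(zip_path):
        raise ExtractNotZipfile(str(zip_path))
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            if PurePosixPath(member).name == "codeql-database.yml":
                return member
    raise ExtractNoCQLDB(str(zip_path))


def require_details(details,
                    keys=("cliVersion", "creationTime", "language", "sha")):
    """Raise DetailsMissing unless every expected key is present."""
    missing = [k for k in keys if k not in details]
    if missing:
        raise DetailsMissing(", ".join(missing))
    return details
```

With distinct exception types, a batch run over all DBs can catch each case, record it, and continue with the next file.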
Michael Hohn
129b8cc302 interim: collect metadata from one DB zip file 2024-07-22 12:54:57 -07:00
Michael Hohn
d64522d168 Collect CodeQL database information from the file system and save as CSV
This collection already provides significant meta-information:

    ctime : str = '2024-05-13T12:04:01.593586'
    language : str = 'cpp'
    name : str = 'nanobind'
    owner : str = 'wjakob'
    path : Path = Path('/Users/hohn/work-gh/mrva/mrva-open-source-download/repos/wjakob/nanobind/code-scanning/codeql/databases/cpp/db.zip')
    size : int = 63083064

The db.zip files contain more metadata, to be added later.
2024-07-22 11:07:00 -07:00
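
The fields listed in d64522d168 can all be read from the path and the file's stat record.  A rough sketch, assuming the directory layout seen in the example path (<root>/<owner>/<name>/.../<language>/db.zip) and assuming ctime comes from st_ctime; the helper names are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def describe_db(root, zip_path):
    """Hypothetical helper: derive one index record from a db.zip path."""
    root, zip_path = Path(root), Path(zip_path)
    # Assumed layout: <root>/<owner>/<name>/.../databases/<language>/db.zip
    parts = zip_path.relative_to(root).parts
    st = zip_path.stat()
    return {
        "owner": parts[0],
        "name": parts[1],
        "language": parts[-2],
        "path": zip_path,
        "size": st.st_size,
        # st_ctime is the metadata-change time on Unix.
        "ctime": datetime.fromtimestamp(st.st_ctime).isoformat(),
    }


def collect(root):
    """Walk root and describe every db.zip found beneath it."""
    return [describe_db(root, p) for p in sorted(Path(root).rglob("db.zip"))]
```

Each record is a plain dict, so the list returned by collect() can be written out as CSV rows directly.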
Michael Hohn
6b4e753e69 Experiment with formats for saving/loading the database index
The .csv.gz format is the simplest and most universal.  It's also the smallest
on disk.
Comparing the saved and reloaded dataframes shows no difference.
The ctime_raw column caused serialization problems, so only ctime (in
ISO 8601 format) is used.
2024-07-12 14:41:05 -07:00
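
The .csv.gz round trip from 6b4e753e69 can be illustrated with the standard library alone.  Note that this sketch swaps in the csv and gzip modules rather than the dataframe library the commit refers to; the column list and function names are assumptions, and ctime is kept as an ISO 8601 string as the commit describes.

```python
import csv
import gzip

# Assumed subset of index columns, for illustration.
FIELDS = ["owner", "name", "language", "ctime", "size"]


def save_index(path, rows):
    """Write the index rows as gzip-compressed CSV."""
    # newline="" lets the csv module control line endings itself.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_index(path):
    """Read the index back, restoring the integer size column."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        rows = [dict(r) for r in csv.DictReader(f)]
    for row in rows:
        row["size"] = int(row["size"])  # CSV stores everything as text
    return rows
```

Saving a list of rows and comparing it with load_index() of the same file is the equivalent of the saved/reloaded comparison mentioned in the commit.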
Michael Hohn
3df1cac5ae Clean up package info 2024-07-10 15:38:59 -07:00
Michael Hohn
dcc32ea8ab Add documentation style sheet and Makefile entry 2024-07-10 15:27:09 -07:00
Michael Hohn
3c8db9cbe4 Put the DB code into a package 2024-07-10 15:04:09 -07:00