Include custom ID (CID) to distinguish CodeQL databases

The current API (<2024-07-26 Fri>) identifies databases only by (owner,name).
This is insufficient for distinguishing CodeQL databases, because several
databases can exist for the same owner/repo pair.

Other differences must be considered;  this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into a single field called CID.  The CID is a hash of these fields, so the
set of hashed fields can be changed in the future without affecting workflows
or the server.
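
A minimal sketch of how such a CID could be derived.  The hash function,
field order, separator, and truncation length are assumptions for
illustration, and make_cid is a hypothetical helper, not code from this patch:

```python
import hashlib


def make_cid(cli_version: str, creation_time: str, language: str, sha: str,
             length: int = 6) -> str:
    """Hypothetical helper: hash four metadata fields into a short id."""
    # Join with a separator that is unlikely to occur inside any field,
    # so that ("ab", "c") and ("a", "bc") do not hash to the same value.
    joined = "\x1f".join([cli_version, creation_time, language, sha])
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    # Truncate for readability; a shorter id raises the collision risk.
    return digest[:length]
```

Because callers only see the resulting string, the fields that feed the hash
can be changed later without touching anything that consumes the CID.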

The CID is combined with the owner/name to form a single identifier.  This
requires no changes to the server or client -- the database selection
interface is separate from VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same owner/repo
pairs from multiple directories, in this case from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/
The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &
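
The effect of the extra key can be illustrated with a small stand-alone
sketch.  The sample rows and the count_unique helper are made up for this
example; only the column names owner, name, CID, and creationTime come from
the table used here:

```python
import csv
import io


def count_unique(rows, key_fields):
    """Count the distinct key tuples found in the rows."""
    return len({tuple(row[f] for f in key_fields) for row in rows})


# Made-up sample: two databases for octo/demo, one for octo/other.
sample = """owner,name,CID,creationTime
octo,demo,aaa111,2024-01-01
octo,demo,bbb222,2024-04-29
octo,other,ccc333,2024-01-01
"""

rows = list(csv.DictReader(io.StringIO(sample)))
by_repo = count_unique(rows, ["owner", "name"])             # 2
by_repo_cid = count_unique(rows, ["owner", "name", "CID"])  # 3
```

Keying on (owner,name) alone merges the two octo/demo databases into one
entry; adding the CID keeps them apart.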

Other code modifications:
    - Push (owner,repo,CID) names to MinIO
    - Generate databases.json for use in the VS Code extension
    - Generate list-databases.json for use by the gh-mrva client
This commit is contained in:
Michael Hohn
2024-07-30 10:47:29 -07:00
committed by =Michael Hohn
parent b4f1a2b8a6
commit 1e1daf9330
8 changed files with 322 additions and 52 deletions

@@ -1,7 +1,8 @@
#!/usr/bin/env python
""" Read a table of CodeQL DB information,
-group entries by (owner,name), sort each group by
-creationTime and keep only the top (newest) element.
+group entries by (owner,name,CID),
+sort each group by creationTime,
+and keep only the top (newest) element.
"""
import argparse
import logging
@@ -32,8 +33,8 @@ import sys
df0 = pd.read_csv(sys.stdin)
-df_sorted = df0.sort_values(by=['owner', 'name', 'creationTime'])
-df_unique = df_sorted.groupby(['owner', 'name']).first().reset_index()
+df_sorted = df0.sort_values(by=['owner', 'name', 'CID', 'creationTime'])
+df_unique = df_sorted.groupby(['owner', 'name', 'CID']).first().reset_index()
df_unique.to_csv(sys.stdout, index=False)