Include custom id (CID) to distinguish CodeQL databases

The current API (<2024-07-26 Fri>) is set up only for (owner,name). This is
insufficient for distinguishing CodeQL databases.

Other differences must be considered; this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into one called CID. The CID field is a hash of these others and therefore can
be changed in the future without affecting workflows or the server.

The CID is combined with the owner/name to form one identifier. This requires
no changes to server or client -- the db selection's interface is separate from
VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same owner/repo
pairs from multiple directories. In this case, from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/
The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &

Other code modifications:
    - Push (owner,repo,cid) names to minio
    - Generate databases.json for use in VS Code extension
    - Generate list-databases.json for use by gh-mrva client
2024-07-30 10:47:29 -07:00
parent b4f1a2b8a6
commit 1e1daf9330
8 changed files with 322 additions and 52 deletions
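
The commit message describes the CID as a hash over cliVersion, creationTime, language, and sha, combined with owner/name into one identifier. A minimal sketch of that idea follows. The hash function (SHA-256), the digest length, the field separator, and the names `cid_hash` and `full_name` are all assumptions made for illustration; the patch's actual implementation may differ.

```python
import hashlib


def cid_hash(cli_version: str, creation_time: str, language: str, sha: str,
             length: int = 8) -> str:
    """Hypothetical CID: a short hex digest over the four distinguishing fields."""
    # Join with a separator that should not occur inside the fields, so that
    # ("ab", "c") and ("a", "bc") do not produce the same payload.
    payload = "\x1f".join([cli_version, creation_time, language, sha])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def full_name(owner: str, name: str, cid: str) -> str:
    """Hypothetical combined identifier; the '/' separator is an assumption."""
    return f"{owner}/{name}/{cid}"


# Illustrative values only, not taken from the real database table.
cid_a = cid_hash("2.17.0", "2024-04-29T09:30:00", "cpp", "0123abcd")
cid_b = cid_hash("2.17.0", "2024-04-29T09:30:00", "cpp", "4567ef01")
ident = full_name("acme", "widget", cid_a)
```

Because the identifier depends only on the hash output, the set of hashed fields can change later without changing how consumers treat the combined name, which is the property the commit message relies on.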
--- a/client/qldbtools/bin/mc-db-unique
+++ b/client/qldbtools/bin/mc-db-unique
@@ -1,7 +1,8 @@
 #!/usr/bin/env python
 """ Read a table of CodeQL DB information, 
-    group entries by (owner,name),  sort each group by
-    creationTime and keep only the top (newest) element.
+    group entries by (owner,name,CID),  
+    sort each group by creationTime,
+    and keep only the top (newest) element.
 """
 import argparse
 import logging
@@ -32,8 +33,8 @@ import sys

 df0 = pd.read_csv(sys.stdin)

-df_sorted = df0.sort_values(by=['owner', 'name', 'creationTime'])
-df_unique = df_sorted.groupby(['owner', 'name']).first().reset_index()
+df_sorted = df0.sort_values(by=['owner', 'name', 'CID', 'creationTime'])
+df_unique = df_sorted.groupby(['owner', 'name', 'CID']).first().reset_index()

 df_unique.to_csv(sys.stdout, index=False)
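
The grouping step in this diff can be sketched without pandas as follows. This is an illustration under assumptions, not the script's code: the helper `newest_per_key` and the sample rows are hypothetical, and creationTime is assumed to be an ISO 8601 string in a single timezone, so that string comparison matches chronological order. Note that `sort_values` sorts ascending by default, so `.first()` in the diff returns the earliest creationTime in each group; the sketch selects the latest explicitly, as the docstring describes.

```python
import csv
import io


def newest_per_key(rows, key_fields=("owner", "name", "CID"),
                   time_field="creationTime"):
    """Keep, for each key, the row with the greatest time_field value."""
    newest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        kept = newest.get(key)
        # Replace the kept row only when this one is strictly newer.
        if kept is None or row[time_field] > kept[time_field]:
            newest[key] = row
    return list(newest.values())


# Hypothetical sample data: two copies of one database and one with another CID.
SAMPLE_CSV = """owner,name,CID,creationTime
acme,widget,1a2b3c,2024-01-15T10:00:00
acme,widget,1a2b3c,2024-04-29T09:30:00
acme,widget,9f8e7d,2023-11-02T08:00:00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
unique_rows = newest_per_key(rows)
```

With (owner,name) as the key, these three rows would collapse to one; adding CID keeps two, which is the kind of increase in unique databases the commit message reports.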