Include custom id (CID) to distinguish CodeQL databases

The current API (<2024-07-26 Fri>) is set up only for (owner,name). This is
insufficient for distinguishing CodeQL databases.

Other differences must be considered; this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into one called CID. The CID field is a hash of these others and therefore
can be changed in the future without affecting workflows or the server.

The CID is combined with the owner/name to form one identifier. This
requires no changes to server or client -- the db selection's interface is
separate from VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same owner/repo
pairs from multiple directories. In this case, from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/

The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &

Other code modifications:
    - Push (owner,repo,cid) names to minio
    - Generate databases.json for use in the VS Code extension
    - Generate list-databases.json for use by the gh-mrva client
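
To illustrate the idea, here is a minimal sketch of how a CID could be derived and combined with owner/name. The body of utils.cid_hash is not part of the hunk shown here, so everything in this sketch is an assumption: the use of SHA-256, the 10-character truncation, the field separator, the `owner/name:cid` identifier format, and the sample rows are all hypothetical. Only the four input fields and their order are taken from the patch.

```python
import hashlib

# Field order follows the tuple passed to utils.cid_hash in the patch.
CID_FIELDS = ("creationTime", "sha", "cliVersion", "language")


def cid_hash(values, length=10):
    """Hypothetical stand-in for utils.cid_hash.

    Joins the field values with a separator that is unlikely to occur in
    the data, hashes the result, and keeps a short hex prefix.
    """
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]


def db_identifier(row):
    """Combine owner/name with the CID (the format is an assumption)."""
    cid = cid_hash(tuple(row[f] for f in CID_FIELDS))
    return f"{row['owner']}/{row['name']}:{cid}"


# Two made-up snapshots of the same owner/repo pair, as would be found when
# importing the same repository from two download directories.
rows = [
    {"owner": "octo", "name": "demo", "cliVersion": "2.16.1",
     "creationTime": "2024-03-01T10:00:00", "language": "cpp",
     "sha": "1111111"},
    {"owner": "octo", "name": "demo", "cliVersion": "2.17.0",
     "creationTime": "2024-04-29T10:00:00", "language": "cpp",
     "sha": "2222222"},
]

by_owner_name = {(r["owner"], r["name"]) for r in rows}
by_identifier = {db_identifier(r) for r in rows}

print(len(by_owner_name), "database(s) when keyed by (owner,name)")
print(len(by_identifier), "database(s) when keyed by (owner,name,CID)")
```

Because the CID is an opaque hash, the set of fields feeding it can change later without touching anything that only treats the identifier as a string, which is the property the message above relies on.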
2024-07-30 10:47:29 -07:00
parent b4f1a2b8a6
commit 1e1daf9330
8 changed files with 322 additions and 52 deletions
--- a/client/qldbtools/bin/mc-db-refine-info
+++ b/client/qldbtools/bin/mc-db-refine-info
@@ -43,6 +43,14 @@ for left_index in range(0, len(d)-1):
 joiners_df = pd.concat(joiners, axis=0)
 full_df = pd.merge(d, joiners_df, left_index=True, right_on='left_index', how='outer')    

+#** Add single uniqueness field -- CID (Cumulative ID)
+full_df['CID'] = full_df.apply(lambda row: 
+                               utils.cid_hash( (row['creationTime'],
+                                                row['sha'], 
+                                                row['cliVersion'], 
+                                                row['language'])
+                                              ), axis=1)
+
 #** Re-order the dataframe columns by importance
 # - Much of the data
 #   1. Is only conditionally present
@@ -70,11 +78,13 @@ full_df = pd.merge(d, joiners_df, left_index=True, right_on='left_index', how='o
 #     | primaryLanguage     |
 #     | finalised           |

-final_df = full_df.reindex(columns=['owner', 'name', 'language', 'size', 'cliVersion',
-	                                'creationTime', 'sha', 'baselineLinesOfCode', 'path',
-	                                'db_lang', 'db_lang_displayName', 'db_lang_file_count',
-	                                'db_lang_linesOfCode', 'ctime', 'primaryLanguage',
-	                                'finalised', 'left_index'])
+final_df = full_df.reindex( columns=['owner', 'name', 'cliVersion',
+                                     'creationTime', 'language', 'sha','CID',
+                                     'baselineLinesOfCode', 'path', 'db_lang',
+                                     'db_lang_displayName', 'db_lang_file_count',
+                                     'db_lang_linesOfCode', 'ctime',
+                                     'primaryLanguage', 'finalised', 'left_index',
+                                     'size'])

 final_df.to_csv(sys.stdout, index=False)