Enforce CID uniqueness and save raw refined info immediately

Previously, the refined info was collected and the CID computed before saving.
This was a major development time sink, so the raw refined info is now saved
immediately and the CID is computed in the following step (bin/mc-db-unique).

The columns previously chosen for the CID are not sufficient.  If these columns
are empty for any reason, the CID repeats.  Adding just the owner/name won't
help, because the colliding rows share the same owner/name.

Some possibilities considered and rejected:
1. Could use a random number for missing columns.  But this makes
   the CID nondeterministic.
2. Switch to the file system ctime?  Not unique across owner/repo pairs,
   but unique within one.  Also, this could be changed externally and cause
   *very* subtle bugs.
3. Use the file system path?  It has to be unique at ingestion time, but
   repo collections can move.

Instead, this patch
4. Drops rows that are missing any of the
   | cliVersion   |
   | creationTime |
   | language     |
   | sha          |
   columns.  There are very few such rows (16 out of 6000) and their DBs are
   questionable.
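To illustrate why dropping those rows keeps the CID deterministic, a CID can be
derived by hashing exactly these four columns.  This is a hypothetical sketch
(`compute_cid` is an invented helper; the actual bin/mc-db-unique logic may
differ):

```python
import hashlib

import pandas as pd

def compute_cid(row):
    # Join the four required columns into one stable key; a NaN in any of
    # them would make the key depend on NaN formatting, which is why rows
    # missing these columns are dropped first.
    key = "|".join(str(row[c]) for c in
                   ("cliVersion", "creationTime", "language", "sha"))
    # Hash the key; the same inputs always yield the same CID.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

df = pd.DataFrame([{"cliVersion": "2.17.0",
                    "creationTime": "2024-07-30T12:00:00",
                    "language": "python",
                    "sha": "06dcf50728"}])
df["CID"] = df.apply(compute_cid, axis=1)
```

Unlike a random fill-in for missing columns (rejected option 1 above), this
stays deterministic across runs.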
This commit is contained in:
Michael Hohn
2024-08-01 11:09:04 -07:00
committed by =Michael Hohn
parent 06dcf50728
commit b7b4839fe0
4 changed files with 117 additions and 59 deletions


@@ -1,13 +1,38 @@
-# Experimental work with utils.py, to be merged into it.
-from utils import *
+# Experimental work for ../bin/mc-db-unique, to be merged into it.
+import qldbtools.utils as utils
+from pprint import pprint
+import pandas as pd
 # cd ../
 #* Reload CSV file to continue work
-df2 = pd.read_csv('db-info-2.csv')
+df2 = df_refined = pd.read_csv('db-info-2.csv')
+# Identify rows missing specific entries
+rows = ( df2['cliVersion'].isna() |
+         df2['creationTime'].isna() |
+         df2['language'].isna() |
+         df2['sha'].isna() )
+df2[rows]
+df3 = df2[~rows]
+df3
+#* post-save work
+df4 = pd.read_csv('db-info-3.csv')
+# Sort and group
+df_sorted = df4.sort_values(by=['owner', 'name', 'CID', 'creationTime'])
+df_unique = df_sorted.groupby(['owner', 'name', 'CID']).first().reset_index()
+# Find duplicates
+df_dups = df_unique[df_unique['CID'].duplicated(keep=False)]
+len(df_dups)
+df_dups['CID']
 # Set display options
 pd.set_option('display.max_colwidth', None)
 pd.set_option('display.max_columns', None)
 pd.set_option('display.width', 140)
 df_sorted = df2.sort_values(by=['owner', 'name', 'creationTime'])
 df_unique = df_sorted.groupby(['owner', 'name']).first().reset_index()
 #
 # Local Variables:
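The row-filtering step in the script above can be exercised on a tiny in-memory
CSV.  The sample rows below are invented for illustration; only the column
names come from the patch:

```python
from io import StringIO

import pandas as pd

# Two sample rows: the second is missing cliVersion and should be dropped.
csv = StringIO(
    "owner,name,cliVersion,creationTime,language,sha\n"
    "octo,repo,2.17.0,2024-07-30,python,abc123\n"
    "octo,repo,,2024-07-30,python,def456\n"
)
df2 = pd.read_csv(csv)

# Flag rows missing any of the four columns required for a stable CID.
required = ["cliVersion", "creationTime", "language", "sha"]
missing = df2[required].isna().any(axis=1)

# Keep only complete rows, mirroring df3 = df2[~rows] in the script.
df3 = df2[~missing]
```

Using `isna().any(axis=1)` over the column list is equivalent to OR-ing the
four `isna()` masks by hand, as the script does.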