Commit Graph

18 Commits

Author SHA1 Message Date
Michael Hohn
a22d8d77f2 Indent data extraction / assembly 2024-11-27 23:06:17 -08:00
Michael Hohn
3db629e2ca A go version of hepc-serve 2024-11-27 22:56:47 -08:00
Michael Hohn
95d2638546 Add hepc-{init,serve} to provide codeql database access via http 2024-11-27 13:52:59 -08:00
Michael Hohn
ff96b34f5e minor doc update 2024-11-23 00:30:06 -08:00
Michael Hohn
537ebdf19d Change hepc-init to Python and add debugging configuration
This heavily uses plumbum to retain a shell-script style but add data
structures that will be needed
2024-11-20 11:02:38 -08:00
Michael Hohn
dd776e312a Add type information 2024-11-19 15:24:41 -08:00
Michael Hohn
18333bfdb1 Start hepc-init: the data collector for DBs on the file system 2024-11-19 15:23:20 -08:00
Michael Hohn
7d27b910cd Fix: include CID when filtering in mc-rows-from-mrva-list 2024-08-22 13:58:07 -07:00
Michael Hohn
742b059a49 Add script to list full details for a mrva-list file 2024-08-09 08:37:31 -07:00
Michael Hohn
d1f56ae196 Add explicit language selection 2024-08-09 08:36:48 -07:00
Michael Hohn
b7b4839fe0 Enforce CID uniqueness and save raw refined info immediately
Previously, the refined info was collected and the CID computed before saving.
This was a major development time sink, so the CID is now computed in the
following step (bin/mc-db-unique).

The columns previously chosen for the CID are not enough.  If these columns are
empty for any reason, the CID repeats.  Just including the owner/name won't help,
because those are duplicates.

Some possibilities considered and rejected:
1. Could use a random number for missing columns.  But this makes
   the CID nondeterministic.
2. Switch to the file system ctime?  Not unique across owner/repo pairs,
   but unique within one.  Also, this could be changed externally and cause
   *very* subtle bugs.
3. Use the file system path?  It has to be unique at ingestion time, but
   repo collections can move.

Instead, this patch
4. Drops rows that don't have the
   | cliVersion   |
   | creationTime |
   | language     |
   | sha          |
   columns.  There are very few (16 out of 6000) and their DBs are
   quesionable.
2024-08-01 11:09:04 -07:00
Michael Hohn
06dcf50728 Sort utils.cid_hash() entries for legibility 2024-07-31 15:20:43 -07:00
Michael Hohn
8f151ab002 Comment update 2024-07-30 16:08:05 -07:00
Michael Hohn
1e1daf9330 Include custom id (CID) to distinguish CodeQL databases
The current api (<2024-07-26 Fri>) is set up only for (owner,name).  This is
insufficient for distinguishing CodeQL databases.

Other differences must be considered;  this patch combines the fields
    | cliVersion   |
    | creationTime |
    | language     |
    | sha          |
into one called CID.  The CID field is a hash of these others and therefore can be
changed in the future without affecting workflows or the server.

The cid is combined with the owner/name to form one
identifier.  This requires no changes to server or client -- the db
selection's interface is separate from VS Code and gh-mrva in any case.

To test this, this version imports multiple versions of the same owner/repo pairs from multiple directories.  In this case, from
    ~/work-gh/mrva/mrva-open-source-download/repos
and
    ~/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/
The unique database count increases from 3000 to 5360 -- see README.md,
    ./bin/mc-db-view-info < db-info-3.csv &

Other code modifications:
    - Push (owner,repo,cid) names to minio
    - Generate databases.json for use in vs code extension
    -  Generate list-databases.json for use by gh-mrva client
2024-07-30 10:47:29 -07:00
Michael Hohn
81c44ab14a Add mc-db-unique as default single-(owner,repo) selector 2024-07-26 14:18:14 -07:00
Michael Hohn
92ca709458 Add mc-db-view-info to view available DBs 2024-07-26 08:40:41 -07:00
Michael Hohn
242ba3fc1e Add script to populate minio using dataframe previously chosen 2024-07-25 15:14:37 -07:00
Michael Hohn
731b44b187 Add scripts for automatic codeql db data and metadata collection
- updated instructions
- cli scripts mirror the interactive session*.py files
2024-07-23 15:05:03 -07:00