Files
mrvacommander/client/qldbtools/README.md
Michael Hohn b7b4839fe0 Enforce CID uniqueness and save raw refined info immediately
Previously, the refined info was collected and the CID computed before saving.
This was a major development time sink, so the CID is now computed in the
following step (bin/mc-db-unique).

The columns previously chosen for the CID are not enough.  If these columns are
empty for any reason, the CID repeats.  Just including the owner/name won't help,
because those are duplicates.

Some possibilities considered and rejected:
1. Could use a random number for missing columns.  But this makes
   the CID nondeterministic.
2. Switch to the file system ctime?  Not unique across owner/repo pairs,
   but unique within one.  Also, this could be changed externally and cause
   *very* subtle bugs.
3. Use the file system path?  It has to be unique at ingestion time, but
   repo collections can move.

Instead, this patch
4. Drops rows that don't have the
   | cliVersion   |
   | creationTime |
   | language     |
   | sha          |
   columns.  There are very few (16 out of 6000) and their DBs are
   quesionable.
2024-08-01 11:09:04 -07:00

94 lines
3.3 KiB
Markdown

# qldbtools
qldbtools is a Python package for working with CodeQL databases
## Installation
- Set up the virtual environment and install tools
cd ~/work-gh/mrva/mrvacommander/client/qldbtools/
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip
# From requirements.txt
pip install -r requirements.txt
# Or explicitly
pip install jupyterlab pandas ipython
pip install lckr-jupyterlab-variableinspector
- Run jupyterlab
cd ~/work-gh/mrva/mrvacommander/client
source venv/bin/activate
jupyter lab &
The variable inspector is a right-click on an open console or notebook.
The `jupyter` command produces output including
Jupyter Server 2.14.1 is running at:
http://127.0.0.1:8888/lab?token=4c91308819786fe00a33b76e60f3321840283486457516a1
Use this to connect multiple front ends
- Local development
```bash
cd ~/work-gh/mrva/mrvacommander/client/qldbtools
source venv/bin/activate
pip install --editable .
```
The `--editable` *should* use symlinks for all scripts; use `./bin/*` to be sure.
- Full installation
```bash
pip install qldbtools
```
## Use as library
```python
import qldbtools as ql
```
## Command-line use
Initial information collection requires a unique file path so it can be run
repeatedly over DB collections with the same (owner,name) but other differences
-- namely, in one or more of
- creationTime
- sha
- cliVersion
- language
Those fields are collected and a single name addenum formed in
`bin/mc-db-refine-info`.
The command sequence, grouped by data files, is
cd ~/work-gh/mrva/mrvacommander/client/qldbtools
./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > db-info-1.csv
./bin/mc-db-refine-info < db-info-1.csv > db-info-2.csv
./bin/mc-db-view-info < db-info-2.csv &
./bin/mc-db-unique < db-info-2.csv > db-info-3.csv
./bin/mc-db-view-info < db-info-3.csv &
./bin/mc-db-populate-minio -n 23 < db-info-3.csv
./bin/mc-db-generate-selection -n 23 vscode-selection.json gh-mrva-selection.json < db-info-3.csv
## Notes
The preview-data plugin for VS Code has a bug; it displays `0` instead of
`0e3379` for the following. There are other entries with similar malfunction.
CleverRaven,Cataclysm-DDA,0e3379,2.17.0,2024-05-08 12:13:10.038007+00:00,cpp,5ca7f4e59c2d7b0a93fb801a31138477f7b4a761,578098.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/CleverRaven/Cataclysm-DDA/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1228.0,578098.0,2024-05-13T12:14:54.650648,cpp,True,4245,563435469
CleverRaven,Cataclysm-DDA,3231f7,2.18.0,2024-07-18 11:13:01.673231+00:00,cpp,db3435138781937e9e0e999abbaa53f1d3afb5b7,579532.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos/CleverRaven/Cataclysm-DDA/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1239.0,579532.0,2024-07-24T02:33:23.900885,cpp,True,1245,573213726