* Introduction to hepc -- HTTP End Point for CodeQL
  #+BEGIN_SRC sh 
    1:$ ./bin/hepc-init --db_collection_dir db-collection --starting_path ~/work-gh/mrva/mrva-open-source-download
    [2024-11-19 14:12:06] [INFO] searching for db.zip files
    [2024-11-19 14:12:08] [INFO] collecting information from db.zip files
    [2024-11-19 14:12:08] [INFO] Extracting from /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/aircrack-ng/aircrack-ng/code-scanning/codeql/databases/cpp/db.zip
    [2024-11-19 14:12:08] [INFO] Adding record to db-collection/metadata.json
  #+END_SRC

* Introduction to qldbtools
=qldbtools= is a Python package for selecting sets of CodeQL databases
to work on. It uses a (pandas) dataframe in the implementation, but all
results sets are available as CSV files to provide flexibility in the
tools you want to work with.

The rationale is simple: When working with larger collections of CodeQL
databases, spread over time, languages, etc., many criteria can be used
to select the subset of interest. This package addresses that aspect of
MRVA (multi repository variant analysis).

For example, consider this scenario from an enterprise. We have 10,000
repositories in C/C++, 5,000 in Python. We build CodeQL dabases weekly
and keep the last 2 years worth. This means for the last 2 years there
are

#+begin_example
(10000 + 5000) * 52 * 2 = 1560000
#+end_example

databases to select from for a single MRVA run. 1.5 Million rows are
readily handled by a pandas (or R) dataframe.

The full list of criteria currently encoded via the columns is

- owner
- name
- CID
- cliVersion
- creationTime
- language
- sha -- git commit sha of the code the CodeQL database is built against
- baselineLinesOfCode
- path
- db_lang
- db_lang_displayName
- db_lang_file_count
- db_lang_linesOfCode
- ctime
- primaryLanguage
- finalised
- left_index
- size

The minimal criteria needed to distinguish databases in the above
scenario are

- cliVersion
- creationTime
- language
- sha

These are encoded in the single custom id column 'CID'.

Thus, a database can be fully specified using a (owner,name,CID) tuple
and this is encoded in the names used by the MRVA server and clients.
The selection of databases can of course be done using the whole table.

For an example of the workflow, see [[#command-line-use][section
'command line use']].

A small sample of a full table:

|   | owner    | name           | CID    | cliVersion | creationTime                     | language | sha                                      | baselineLinesOfCode | path                                                                                                                          | db_lang     | db_lang_displayName | db_lang_file_count | db_lang_linesOfCode | ctime                      | primaryLanguage | finalised | left_index | size     |
|---+----------+----------------+--------+------------+----------------------------------+----------+------------------------------------------+---------------------+-------------------------------------------------------------------------------------------------------------------------------+-------------+---------------------+--------------------+---------------------+----------------------------+-----------------+-----------+------------+----------|
| 0 | 1adrianb | face-alignment | 1f8d99 | 2.16.1     | 2024-02-08 14:18:20.983830+00:00 | python   | c94dd024b1f5410ef160ff82a8423141e2bbb6b4 | 1839                | /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/1adrianb/face-alignment/code-scanning/codeql/databases/python/db.zip | python      | Python              | 25                 | 1839                | 2024-07-24T14:09:02.187201 | python          | 1         | 1454       | 24075001 |
| 1 | 2shou    | TextGrocery    | 9ab87a | 2.12.1     | 2023-02-17T11:32:30.863093193Z   | cpp      | 8a4e41349a9b0175d9a73bc32a6b2eb6bfb51430 | 3939                | /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/2shou/TextGrocery/code-scanning/codeql/databases/cpp/db.zip          | no-language | no-language         | 0                  | -1                  | 2024-07-24T06:25:55.347568 | cpp             | nan       | 1403       | 3612535  |
| 2 | 3b1b     | manim          | 76fdc7 | 2.17.5     | 2024-06-27 17:37:20.587627+00:00 | python   | 88c7e9d2c96be1ea729b089c06cabb1bd3b2c187 | 19905               | /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/3b1b/manim/code-scanning/codeql/databases/python/db.zip              | python      | Python              | 94                 | 19905               | 2024-07-24T13:23:04.716286 | python          | 1         | 1647       | 26407541 |

** Installation
- Set up the virtual environment and install tools

  #+begin_example
          cd ~/work-gh/mrva/mrvacommander/client/qldbtools/
          python3.11 -m venv venv
          source venv/bin/activate
          pip install --upgrade pip

          # From requirements.txt
          pip install -r requirements.txt
          # Or explicitly
          pip install jupyterlab pandas ipython
          pip install lckr-jupyterlab-variableinspector
  #+end_example

- Local development

  #+begin_example
  ```bash
  cd ~/work-gh/mrva/mrvacommander/client/qldbtools
  source venv/bin/activate
  pip install --editable .
  ```

  The `--editable` *should* use symlinks for all scripts; use `./bin/*` to be sure.
  #+end_example

- Full installation

  #+begin_example
  ```bash
  pip install qldbtools
  ```
  #+end_example

** Use as library
The best way to examine the code is starting from the high-level scripts
in =bin/=.

** Command line use
Initial information collection requires a unique file path so it can be
run repeatedly over DB collections with the same (owner,name) but other
differences -- namely, in one or more of

- creationTime
- sha
- cliVersion
- language

Those fields are collected in =bin/mc-db-refine-info=.

An example workflow with commands grouped by data files follows.

#+begin_example
    cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
    ./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > scratch/db-info-1.csv
    ./bin/mc-db-refine-info < scratch/db-info-1.csv > scratch/db-info-2.csv
   
    ./bin/mc-db-view-info < scratch/db-info-2.csv &
    ./bin/mc-db-unique cpp < scratch/db-info-2.csv > scratch/db-info-3.csv
    ./bin/mc-db-view-info < scratch/db-info-3.csv &

    ./bin/mc-db-populate-minio -n 11 < scratch/db-info-3.csv
    ./bin/mc-db-generate-selection -n 11 \
        scratch/vscode-selection.json \
        scratch/gh-mrva-selection.json \
        < scratch/db-info-3.csv 
#+end_example

To see the full information for a selection, use
=mc-rows-from-mrva-list=:

#+begin_example
    ./bin/mc-rows-from-mrva-list scratch/gh-mrva-selection.json \
        scratch/db-info-3.csv > scratch/selection-full-info
#+end_example

To check, e.g., the =language= column:

#+begin_example
    csvcut -c language scratch/selection-full-info 
#+end_example

** Notes
The =preview-data= plugin for VS Code has a bug; it displays =0= instead
of =0e3379= for the following. There are other entries with similar
malfunction.

#+begin_example
    CleverRaven,Cataclysm-DDA,0e3379,2.17.0,2024-05-08 12:13:10.038007+00:00,cpp,5ca7f4e59c2d7b0a93fb801a31138477f7b4a761,578098.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/CleverRaven/Cataclysm-DDA/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1228.0,578098.0,2024-05-13T12:14:54.650648,cpp,True,4245,563435469
    CleverRaven,Cataclysm-DDA,3231f7,2.18.0,2024-07-18 11:13:01.673231+00:00,cpp,db3435138781937e9e0e999abbaa53f1d3afb5b7,579532.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos/CleverRaven/Cataclysm-DDA/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1239.0,579532.0,2024-07-24T02:33:23.900885,cpp,True,1245,573213726
#+end_example