Improve example data layout and README
Committed by: Michael Hohn
Parent: b7b4839fe0
Commit: 582d933130
@@ -1,6 +1,69 @@
|
|||||||
# qldbtools
|
# Introduction to qldbtools
|
||||||
|
|
||||||
qldbtools is a Python package for working with CodeQL databases
|
`qldbtools` is a Python package for selecting sets of CodeQL databases to work on.
|
||||||
|
It uses a (pandas) dataframe in the implementation, but all result sets are
|
||||||
|
available as CSV files to provide flexibility in the tools you want to work with.
|
||||||
|
|
||||||
|
The rationale is simple: When working with larger collections of CodeQL databases,
|
||||||
|
spread over time, languages, etc., many criteria can be used to select the subset
|
||||||
|
of interest. This package addresses that aspect of MRVA (multi-repository
|
||||||
|
variant analysis).
|
||||||
|
|
||||||
|
For example, consider this scenario from an enterprise. We have 10,000
|
||||||
|
repositories in C/C++, 5,000 in Python. We build CodeQL databases weekly and keep
|
||||||
|
the last 2 years' worth.
|
||||||
|
This means for the last 2 years there are
|
||||||
|
|
||||||
|
(10000 + 5000) * 52 * 2 = 1560000
|
||||||
|
|
||||||
|
databases to select from for a single MRVA run. About 1.56 million rows are readily
|
||||||
|
handled by a pandas (or R) dataframe.
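
The arithmetic for this scenario can be checked with a few lines of Python. This is only a sketch of the numbers quoted above; the variable names are illustrative and are not part of `qldbtools`.

```python
# Illustrative names only; none of these come from qldbtools.
cpp_repos = 10_000
python_repos = 5_000
builds_per_year = 52   # one CodeQL database build per week
years_kept = 2

total_databases = (cpp_repos + python_repos) * builds_per_year * years_kept
print(f"{total_databases:,}")  # 1,560,000
```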
|
||||||
|
|
||||||
|
The full list of criteria currently encoded via the columns is
|
||||||
|
|
||||||
|
- owner
|
||||||
|
- name
|
||||||
|
- CID
|
||||||
|
- cliVersion
|
||||||
|
- creationTime
|
||||||
|
- language
|
||||||
|
- sha -- git commit sha of the code the CodeQL database is built against
|
||||||
|
- baselineLinesOfCode
|
||||||
|
- path
|
||||||
|
- db_lang
|
||||||
|
- db_lang_displayName
|
||||||
|
- db_lang_file_count
|
||||||
|
- db_lang_linesOfCode
|
||||||
|
- ctime
|
||||||
|
- primaryLanguage
|
||||||
|
- finalised
|
||||||
|
- left_index
|
||||||
|
- size
|
||||||
|
|
||||||
|
The minimal criteria needed to distinguish databases in the above scenario are
|
||||||
|
|
||||||
|
- cliVersion
|
||||||
|
- creationTime
|
||||||
|
- language
|
||||||
|
- sha
|
||||||
|
|
||||||
|
These are encoded in the single custom id column 'CID'.
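
How the four fields are combined into a `CID` is not described here. As a rough illustration only, one way to fold several fields into a short identifier is to hash their concatenation and keep a prefix. The helper `make_cid`, the choice of SHA-256, the separator, and the 6-character length are all assumptions made for this sketch, not the package's actual method.

```python
import hashlib


def make_cid(cli_version, creation_time, language, sha, length=6):
    """Hypothetical helper: fold four fields into a short hex identifier."""
    key = "|".join([cli_version, creation_time, language, sha])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length]


cid = make_cid(
    "2.16.1",
    "2024-02-08 14:18:20.983830+00:00",
    "python",
    "c94dd024b1f5410ef160ff82a8423141e2bbb6b4",
)
# A 6-character hex string; it is not expected to match any real CID value.
print(cid)
```

The point of such a scheme is that two databases for the same (owner, name) get different identifiers whenever any of the four fields differs.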
|
||||||
|
|
||||||
|
Thus, a database can be fully specified using an (owner, name, CID) tuple, and this is
|
||||||
|
encoded in the names used by the MRVA server and clients. The selection of
|
||||||
|
databases can of course be done using the whole table.
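
Because the result sets are plain CSV, a selection over the table does not require any particular tool. The sketch below uses only the Python standard library; the function name `select_databases` is hypothetical, and the rows are an abbreviated in-memory sample with a few of the documented columns rather than a real `db-info` file.

```python
import csv
import io

# Abbreviated sample rows; only a few of the documented columns are kept.
SAMPLE_CSV = """owner,name,CID,cliVersion,language
1adrianb,face-alignment,1f8d99,2.16.1,python
2shou,TextGrocery,9ab87a,2.12.1,cpp
3b1b,manim,76fdc7,2.17.5,python
"""


def select_databases(csv_text, language):
    """Return (owner, name, CID) tuples for rows matching `language`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["owner"], row["name"], row["CID"])
        for row in reader
        if row["language"] == language
    ]


selected = select_databases(SAMPLE_CSV, "python")
for owner, name, cid in selected:
    print(owner, name, cid)
```

Any other column in the table could be used in the filter in the same way.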
|
||||||
|
|
||||||
|
For an example of the workflow, see [section 'command line use'](#command-line-use).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
A small sample of a full table:
|
||||||
|
|
||||||
|
| | owner | name | CID | cliVersion | creationTime | language | sha | baselineLinesOfCode | path | db_lang | db_lang_displayName | db_lang_file_count | db_lang_linesOfCode | ctime | primaryLanguage | finalised | left_index | size |
|---:|:---------|:---------------|:-------|:-------------|:---------------------------------|:-----------|:-----------------------------------------|----------------------:|:------------------------------------------------------------------------------------------------------------------------------|:------------|:----------------------|---------------------:|----------------------:|:---------------------------|:------------------|------------:|-------------:|---------:|
| 0 | 1adrianb | face-alignment | 1f8d99 | 2.16.1 | 2024-02-08 14:18:20.983830+00:00 | python | c94dd024b1f5410ef160ff82a8423141e2bbb6b4 | 1839 | /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/1adrianb/face-alignment/code-scanning/codeql/databases/python/db.zip | python | Python | 25 | 1839 | 2024-07-24T14:09:02.187201 | python | 1 | 1454 | 24075001 |
| 1 | 2shou | TextGrocery | 9ab87a | 2.12.1 | 2023-02-17T11:32:30.863093193Z | cpp | 8a4e41349a9b0175d9a73bc32a6b2eb6bfb51430 | 3939 | /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/2shou/TextGrocery/code-scanning/codeql/databases/cpp/db.zip | no-language | no-language | 0 | -1 | 2024-07-24T06:25:55.347568 | cpp | nan | 1403 | 3612535 |
| 2 | 3b1b | manim | 76fdc7 | 2.17.5 | 2024-06-27 17:37:20.587627+00:00 | python | 88c7e9d2c96be1ea729b089c06cabb1bd3b2c187 | 19905 | /Users/hohn/work-gh/mrva/mrva-open-source-download/repos/3b1b/manim/code-scanning/codeql/databases/python/db.zip | python | Python | 94 | 19905 | 2024-07-24T13:23:04.716286 | python | 1 | 1647 | 26407541 |
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
@@ -17,21 +80,6 @@ qldbtools is a Python package for working with CodeQL databases
|
|||||||
pip install jupyterlab pandas ipython
|
pip install jupyterlab pandas ipython
|
||||||
pip install lckr-jupyterlab-variableinspector
|
pip install lckr-jupyterlab-variableinspector
|
||||||
|
|
||||||
- Run jupyterlab
|
|
||||||
|
|
||||||
cd ~/work-gh/mrva/mrvacommander/client
|
|
||||||
source venv/bin/activate
|
|
||||||
jupyter lab &
|
|
||||||
|
|
||||||
The variable inspector is a right-click on an open console or notebook.
|
|
||||||
|
|
||||||
The `jupyter` command produces output including
|
|
||||||
|
|
||||||
Jupyter Server 2.14.1 is running at:
|
|
||||||
http://127.0.0.1:8888/lab?token=4c91308819786fe00a33b76e60f3321840283486457516a1
|
|
||||||
|
|
||||||
Use this to connect multiple front ends
|
|
||||||
|
|
||||||
- Local development
|
- Local development
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@@ -51,12 +99,10 @@ qldbtools is a Python package for working with CodeQL databases
|
|||||||
|
|
||||||
|
|
||||||
## Use as library
|
## Use as library
|
||||||
|
The best way to examine the code is starting from the high-level scripts in
|
||||||
|
`bin/`.
|
||||||
|
|
||||||
```python
|
## Command line use
|
||||||
import qldbtools as ql
|
|
||||||
```
|
|
||||||
|
|
||||||
## Command-line use
|
|
||||||
|
|
||||||
Initial information collection requires a unique file path so it can be run
|
Initial information collection requires a unique file path so it can be run
|
||||||
repeatedly over DB collections with the same (owner,name) but other differences
|
repeatedly over DB collections with the same (owner,name) but other differences
|
||||||
@@ -67,26 +113,28 @@ import qldbtools as ql
|
|||||||
- cliVersion
|
- cliVersion
|
||||||
- language
|
- language
|
||||||
|
|
||||||
Those fields are collected and a single name addenum formed in
|
Those fields are collected in `bin/mc-db-refine-info`.
|
||||||
`bin/mc-db-refine-info`.
|
|
||||||
|
|
||||||
The command sequence, grouped by data files, is
|
An example workflow with commands grouped by data files follows.
|
||||||
|
|
||||||
cd ~/work-gh/mrva/mrvacommander/client/qldbtools
|
cd ~/work-gh/mrva/mrvacommander/client/qldbtools && mkdir -p scratch
|
||||||
./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > db-info-1.csv
|
./bin/mc-db-initial-info ~/work-gh/mrva/mrva-open-source-download > scratch/db-info-1.csv
|
||||||
./bin/mc-db-refine-info < db-info-1.csv > db-info-2.csv
|
./bin/mc-db-refine-info < scratch/db-info-1.csv > scratch/db-info-2.csv
|
||||||
|
|
||||||
./bin/mc-db-view-info < db-info-2.csv &
|
./bin/mc-db-view-info < scratch/db-info-2.csv &
|
||||||
./bin/mc-db-unique < db-info-2.csv > db-info-3.csv
|
./bin/mc-db-unique < scratch/db-info-2.csv > scratch/db-info-3.csv
|
||||||
./bin/mc-db-view-info < db-info-3.csv &
|
./bin/mc-db-view-info < scratch/db-info-3.csv &
|
||||||
|
|
||||||
./bin/mc-db-populate-minio -n 23 < db-info-3.csv
|
./bin/mc-db-populate-minio -n 23 < scratch/db-info-3.csv
|
||||||
./bin/mc-db-generate-selection -n 23 vscode-selection.json gh-mrva-selection.json < db-info-3.csv
|
./bin/mc-db-generate-selection -n 23 \
|
||||||
|
scratch/vscode-selection.json \
|
||||||
|
scratch/gh-mrva-selection.json \
|
||||||
|
< scratch/db-info-3.csv
|
||||||
|
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
The preview-data plugin for VS Code has a bug; it displays `0` instead of
|
The `preview-data` plugin for VS Code has a bug; it displays `0` instead of
|
||||||
`0e3379` for the following. There are other entries with similar malfunction.
|
`0e3379` for the following. There are other entries with similar malfunction.
|
||||||
|
|
||||||
CleverRaven,Cataclysm-DDA,0e3379,2.17.0,2024-05-08 12:13:10.038007+00:00,cpp,5ca7f4e59c2d7b0a93fb801a31138477f7b4a761,578098.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/CleverRaven/Cataclysm-DDA/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1228.0,578098.0,2024-05-13T12:14:54.650648,cpp,True,4245,563435469
|
CleverRaven,Cataclysm-DDA,0e3379,2.17.0,2024-05-08 12:13:10.038007+00:00,cpp,5ca7f4e59c2d7b0a93fb801a31138477f7b4a761,578098.0,/Users/hohn/work-gh/mrva/mrva-open-source-download/repos-2024-04-29/CleverRaven/Cataclysm-DDA/code-scanning/codeql/databases/cpp/db.zip,cpp,C/C++,1228.0,578098.0,2024-05-13T12:14:54.650648,cpp,True,4245,563435469
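
One plausible cause, offered here as an assumption rather than a confirmed diagnosis of the plugin, is that `0e3379` has the shape of a number in scientific notation (zero times ten to the power 3379). The sketch below shows that Python's `float` reads the text as zero, while the standard `csv` module, which does no type inference, keeps it as text.

```python
import csv
import io

cid = "0e3379"

# Read as a number, the text means 0 * 10**3379, which is simply zero.
as_number = float(cid)
print(as_number)  # 0.0

# The csv module does no type inference, so the identifier survives as text.
row = next(csv.reader(io.StringIO("CleverRaven,Cataclysm-DDA,0e3379\n")))
print(row[2])  # 0e3379
```

When loading the file with a tool that infers column types, declaring the `CID` column as a string (for example, `dtype={"CID": str}` in `pandas.read_csv`) avoids this class of problem.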
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ import pandas as pd
|
|||||||
# cd ../
|
# cd ../
|
||||||
|
|
||||||
#* Reload CSV file to continue work
|
#* Reload CSV file to continue work
|
||||||
df2 = df_refined = pd.read_csv('db-info-2.csv')
|
df2 = df_refined = pd.read_csv('scratch/db-info-2.csv')
|
||||||
|
|
||||||
# Identify rows missing specific entries
|
# Identify rows missing specific entries
|
||||||
rows = ( df2['cliVersion'].isna() |
|
rows = ( df2['cliVersion'].isna() |
|
||||||
@@ -17,7 +17,7 @@ df3 = df2[~rows]
|
|||||||
df3
|
df3
|
||||||
|
|
||||||
#* post-save work
|
#* post-save work
|
||||||
df4 = pd.read_csv('db-info-3.csv')
|
df4 = pd.read_csv('scratch/db-info-3.csv')
|
||||||
|
|
||||||
# Sort and group
|
# Sort and group
|
||||||
df_sorted = df4.sort_values(by=['owner', 'name', 'CID', 'creationTime'])
|
df_sorted = df4.sort_values(by=['owner', 'name', 'CID', 'creationTime'])
|
||||||
|
|||||||
@@ -13,7 +13,7 @@ import numpy as np
|
|||||||
import importlib
|
import importlib
|
||||||
importlib.reload(utils)
|
importlib.reload(utils)
|
||||||
|
|
||||||
df0 = pd.read_csv('db-info-3.csv')
|
df0 = pd.read_csv('scratch/db-info-3.csv')
|
||||||
|
|
||||||
# Use num_entries, chosen via pseudo-random numbers
|
# Use num_entries, chosen via pseudo-random numbers
|
||||||
df1 = df0.sample(n=3, random_state=np.random.RandomState(4242))
|
df1 = df0.sample(n=3, random_state=np.random.RandomState(4242))
|
||||||
|
|||||||
@@ -9,7 +9,7 @@ from pathlib import Path
|
|||||||
#
|
#
|
||||||
#* Collect the information and select subset
|
#* Collect the information and select subset
|
||||||
#
|
#
|
||||||
df = pd.read_csv('db-info-2.csv')
|
df = pd.read_csv('scratch/db-info-2.csv')
|
||||||
seed = 4242
|
seed = 4242
|
||||||
if 0:
|
if 0:
|
||||||
# Use all entries
|
# Use all entries
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ import pandas as pd
|
|||||||
#
|
#
|
||||||
#* Collect the information
|
#* Collect the information
|
||||||
#
|
#
|
||||||
df1 = pd.read_csv("db-info-2.csv")
|
df1 = pd.read_csv("scratch/db-info-2.csv")
|
||||||
|
|
||||||
# Add single uniqueness field -- CID (Cumulative ID) -- using
|
# Add single uniqueness field -- CID (Cumulative ID) -- using
|
||||||
# - creationTime
|
# - creationTime
|
||||||
|
|||||||
@@ -135,7 +135,7 @@ python-json-logger==2.0.7
|
|||||||
pytz==2024.1
|
pytz==2024.1
|
||||||
PyYAML==6.0.1
|
PyYAML==6.0.1
|
||||||
pyzmq==26.0.3
|
pyzmq==26.0.3
|
||||||
-e git+ssh://git@github.com/advanced-security/mrvacommander.git@26dd69c9767c315a8ffb782eedf3b55eac574d45#egg=qldbtools&subdirectory=client/qldbtools
|
-e git+ssh://git@github.com/advanced-security/mrvacommander.git@b7b4839fe0760287b80b8e2887c29d736c5cae33#egg=qldbtools&subdirectory=client/qldbtools
|
||||||
qtstylish==0.1.5
|
qtstylish==0.1.5
|
||||||
referencing==0.35.1
|
referencing==0.35.1
|
||||||
requests==2.32.3
|
requests==2.32.3
|
||||||
@@ -155,6 +155,7 @@ stack-data==0.6.3
|
|||||||
statsmodels==0.14.2
|
statsmodels==0.14.2
|
||||||
strsimpy==0.2.1
|
strsimpy==0.2.1
|
||||||
tables==3.9.2
|
tables==3.9.2
|
||||||
|
tabulate==0.9.0
|
||||||
tenacity==8.5.0
|
tenacity==8.5.0
|
||||||
terminado==0.18.1
|
terminado==0.18.1
|
||||||
threadpoolctl==3.5.0
|
threadpoolctl==3.5.0
|
||||||
|
|||||||