mirror of
https://github.com/hohn/sarif-cli.git
synced 2025-12-16 17:23:03 +01:00
Before, the query_id in ==> results.csv <== was: query_id STRING, -- git commit id of the ql query set. Now it is: query_id STRING, -- @id from the CodeQL query
327 lines
8.7 KiB
Org Mode
# -*- coding: utf-8 -*-
# Created [Apr-19-2022]
#+TITLE:
#+AUTHOR: Michael Hohn
#+LANGUAGE: en
#+TEXT:
#+OPTIONS: ^:{} H:2 num:t \n:nil @:t ::t |:t ^:nil f:t *:t TeX:t LaTeX:t skip:nil p:nil
#+OPTIONS: toc:nil
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="./l3style.css"/>
#+HTML: <div id="toc">
#+TOC: headlines 2 insert TOC here, with two headline levels
#+HTML: </div>
#
#+HTML: <div id="org-content">

* Overview
The current =project.csv= table is broken: it combines project meta
information (to be moved to a separate =project= table) with the entry point
to a single =project='s sarif results.

* Currently Exported Tables
Tables exported by sarif-extract-multi, commit d5390bb87, [Mar-23-2022]

#+BEGIN_SRC text
==> artifacts.csv <==
artifacts_id
index
uri
uriBaseId
#+END_SRC

#+BEGIN_SRC text
==> codeflows.csv <==
codeflow_id
codeflow_index
threadflow_index
location_index
endColumn
endLine
startColumn
startLine
artifact_index
uri
uriBaseId
message
#+END_SRC

#+BEGIN_SRC text
==> kind_pathproblem.csv <==
results_array_id
results_array_index
codeFlows_id
ruleId
ruleIndex
location_array_index
location_id
location_endColumn
location_endLine
location_startColumn
location_startLine
location_index
location_uri
location_uriBaseId
location_message
relatedLocation_array_index
relatedLocation_id
relatedLocation_endColumn
relatedLocation_endLine
relatedLocation_startColumn
relatedLocation_startLine
relatedLocation_index
relatedLocation_uri
relatedLocation_uriBaseId
relatedLocation_message
message_text
primaryLocationLineHash
primaryLocationStartColumnFingerprint
rule_id
rule_index

#+END_SRC

#+BEGIN_SRC text
==> kind_problem.csv <==
results_array_id
results_array_index
ruleId
ruleIndex
location_array_index
location_id
location_endColumn
location_endLine
location_startColumn
location_startLine
location_index
location_uri
location_uriBaseId
location_message
relatedLocation_array_index
relatedLocation_id
relatedLocation_endColumn
relatedLocation_endLine
relatedLocation_startColumn
relatedLocation_startLine
relatedLocation_index
relatedLocation_uri
relatedLocation_uriBaseId
relatedLocation_message
message_text
primaryLocationLineHash
primaryLocationStartColumnFingerprint
rule_id
rule_index

#+END_SRC

The fields above =$schema= in the =project.csv= table are ad hoc, and the
information for them is not yet collected; they can be discarded.
#+BEGIN_SRC text
==> project.csv <==
creation_date
primary_language
project_name
query_commit_id
sarif_file_name
scan_id
scan_start_date
scan_stop_date
tool_name
tool_version
$schema
sarif_version
run_index
artifacts
columnKind
results
semmle.formatSpecifier
semmle.sourceLanguage
driver_name
organization
rules
driver_version
repositoryUri
revisionId

#+END_SRC

#+BEGIN_SRC text
==> relatedLocations.csv <==
struct_id
uri
startLine
startColumn
endLine
endColumn
message

#+END_SRC

#+BEGIN_SRC text
==> rules.csv <==
rules_array_id
rules_array_index
id
name
enabled
level
fullDescription
shortDescription
kind
precision
security-severity
severity
sub-severity
tag_index
tag_text
#+END_SRC

* Tables or entries to be removed
The top of the [Mar-23-2022] =project.csv= table, enumerated below, is ad hoc
and is covered by the other tables below; the information for its fields is
not yet collected, so it can be discarded.

#+BEGIN_SRC text
==> project-meta.csv <==
creation_date
primary_language
project_name
query_commit_id
sarif_file_name
scan_id
scan_start_date
scan_stop_date
tool_name
tool_version
#+END_SRC

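As a minimal sketch of this split, the following Python separates the ad hoc
leading columns from the columns that begin at =$schema=. The helper name
=split_project_columns= is hypothetical and not part of sarif-cli; the column
list is copied from the [Mar-23-2022] =project.csv= table.

#+BEGIN_SRC python
# Hypothetical helper, not part of sarif-cli: split the exported column
# list at the first SARIF-derived column, "$schema".
PROJECT_COLUMNS = [
    "creation_date", "primary_language", "project_name", "query_commit_id",
    "sarif_file_name", "scan_id", "scan_start_date", "scan_stop_date",
    "tool_name", "tool_version",
    "$schema", "sarif_version", "run_index", "artifacts", "columnKind",
    "results", "semmle.formatSpecifier", "semmle.sourceLanguage",
    "driver_name", "organization", "rules", "driver_version",
    "repositoryUri", "revisionId",
]

def split_project_columns(columns, marker="$schema"):
    """Return (ad_hoc, sarif_derived); the marker column starts the second list."""
    i = columns.index(marker)
    return columns[:i], columns[i:]

ad_hoc, sarif_derived = split_project_columns(PROJECT_COLUMNS)
#+END_SRC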
This information was used to expand the sarif tree (see Struct3452 and Array7481
in typegraph-multi-with-tables.pdf and the code). In retrospect, that was a
poor choice. All additional information needed can be represented by one or
more tables, so the sarif-extract* tools after commit 30e3dd3a3 do so.

The minimal information required to drive the sarif-to-table conversion is:
| project_id      | 13243                      |   |
| scan_id         | 123456                     |   |
| sarif_file_name | "2021-12-09/results.sarif" |   |

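One way to carry these three values through a conversion run is a small
immutable record. This is only a sketch: the class name =ScanSpec= is
hypothetical, and the values are the ones from the table above.

#+BEGIN_SRC python
from dataclasses import dataclass, asdict

# Hypothetical record type, not part of sarif-cli.
@dataclass(frozen=True)
class ScanSpec:
    project_id: int
    scan_id: int
    sarif_file_name: str

spec = ScanSpec(project_id=13243, scan_id=123456,
                sarif_file_name="2021-12-09/results.sarif")

# A plain dict is convenient for writing the values out as one table row.
row = asdict(spec)
#+END_SRC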
* New tables to be exported
This section enumerates new tables intended for reporting infrastructure.

Using the GitHub API starting points
#+BEGIN_SRC python
# gith, GET, pathval, owner and repo are assumed to be defined elsewhere;
# they are not shown in this document.

# Code scanning information
# Get the full list:
r02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses')

# Work with one entry
_, analysis_id = pathval(r02, 0, 'id')
r02s01 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}')

# The same entry, requested as SARIF
r02s02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}',
              headers = {'Accept': 'application/sarif+json'})

# Repository information via GET /repos/{owner}/{repo}
r03 = gith(GET, f'/repos/{owner}/{repo}')
#+END_SRC
we can populate the =project.csv= and =scans.csv= tables:
#+BEGIN_SRC sql
==> project.csv <==
id
project_name        -- pathval(r03, 'full_name')
creation_date       -- pathval(r03, 'created_at')
owner               -- r03
repo                -- r03 = gith(GET, f'/repos/{owner}/{repo}')
repository_url      -- pathval(r03, 'clone_url')
primary_language    -- pathval(r03, 'language')
languages_analyzed  --
#+END_SRC
#+BEGIN_SRC sql
==> scans.csv <==
id                    --
commit_id             -- pathval(r02s01, 'commit_sha')
project_id            -- project.id
db_create_start       -- pathval(r02s01, 'created_at')
db_create_stop
scan_start_date
scan_stop_date
tool_name             -- pathval(r02s01, 'tool', 'name')
tool_version          -- pathval(r02s01, 'tool', 'version')
tool_query_commit_id  -- pathval(r02, 0, 'tool', 'version') is sufficient
sarif_content         -- r02s02
sarif_file_name       -- used on upload
sarif_id              -- pathval(r02s01, 'sarif_id')
results_count         -- pathval(r02s01, 'results_count')
rules_count           -- pathval(r02s01, 'rules_count')
#+END_SRC

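The mapping in the =scans.csv= comments can be sketched with plain dictionary
access. This is an illustration under stated assumptions, not the project's
code: =scans_row= is a hypothetical helper, the =analysis= values are made up,
and columns with no source noted in the table are left as =None=.

#+BEGIN_SRC python
# Hypothetical sample shaped like one code-scanning analysis record (r02s01);
# all values are made up for illustration.
analysis = {
    "commit_sha": "0000000000000000000000000000000000000000",
    "created_at": "2022-04-19T00:00:00Z",
    "tool": {"name": "CodeQL", "version": "0.0.0"},
    "sarif_id": "example-sarif-id",
    "results_count": 3,
    "rules_count": 2,
}

def scans_row(analysis, scan_id, project_id):
    """Build one scans.csv row; scan_id and project_id are supplied by the caller."""
    tool = analysis.get("tool", {})
    return {
        "id": scan_id,
        "commit_id": analysis.get("commit_sha"),
        "project_id": project_id,
        "db_create_start": analysis.get("created_at"),
        "db_create_stop": None,       # no source noted in the table
        "scan_start_date": None,      # no source noted in the table
        "scan_stop_date": None,       # no source noted in the table
        "tool_name": tool.get("name"),
        "tool_version": tool.get("version"),
        "tool_query_commit_id": tool.get("version"),  # per the table comment
        "sarif_content": None,        # would hold the SARIF response (r02s02)
        "sarif_file_name": None,      # only known at upload time
        "sarif_id": analysis.get("sarif_id"),
        "results_count": analysis.get("results_count"),
        "rules_count": analysis.get("rules_count"),
    }

row = scans_row(analysis, scan_id=123456, project_id=13243)
#+END_SRC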
The sarif upload from a CodeQL analysis to GitHub uses the following API and
parameters, which are naturally also the minimal parameters needed to run the
analysis.
#+BEGIN_SRC python
# untested
r04 = gith(POST, f'/repos/{owner}/{repo}/code-scanning/sarifs',
           fields={'commit_sha': 'aa22233',
                   'ref': 'refs/heads/<branch name>',
                   'sarif': 'gzip < sarif | base64 -w0',
                   'tool_name' : 'codeql',
                   'started_at': 'when the analysis started',
                   },
           headers = {'Accept': 'application/sarif+json'})
#+END_SRC

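The ='sarif'= value above is described by the shell pipeline
=gzip < sarif | base64 -w0=. A rough Python equivalent using only the standard
library is sketched below; the helper names =encode_sarif= and =decode_sarif=
and the tiny sample document are assumptions made for illustration.

#+BEGIN_SRC python
import base64
import gzip
import json

def encode_sarif(sarif_bytes):
    """gzip-compress, then base64-encode; the result is one unwrapped ASCII line."""
    return base64.b64encode(gzip.compress(sarif_bytes)).decode("ascii")

def decode_sarif(encoded):
    """Inverse of encode_sarif, useful for checking a payload before upload."""
    return gzip.decompress(base64.b64decode(encoded))

# Made-up minimal document standing in for a real results.sarif file.
sample = json.dumps({"version": "2.1.0", "runs": []}).encode("utf-8")
encoded = encode_sarif(sample)
#+END_SRC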
The scan results from =project.csv= are the root of the sarif tree, so this is a
required base table.
#+BEGIN_SRC sql
==> project-scan-result.csv <==
$schema
sarif_version
run_index
artifacts
columnKind
results
semmle.formatSpecifier
semmle.sourceLanguage
driver_name
organization
rules
driver_version
repositoryUri
revisionId
#+END_SRC

Using joins of the =project-scan-result.csv= table and the
other [[*Currently Exported Tables][Currently Exported Tables]], the =results.csv= table can be formed:
#+BEGIN_SRC sql
==> results.csv <==
id              INT,     -- primary key
scan_id         INT,     -- scans.id
query_id        STRING,  -- @id from the CodeQL query
location        STRING,
message         STRING,
message_object  OBJ,
-- for kind_path_problem, use distinct source / sink
-- for kind_problem, use the same location for both
result_type     STRING,  -- kind_problem | kind_path_problem
-- link to codeflows (kind_pathproblem.csv only, NULL otherwise)
codeFlow_id     INT,
--
source_startLine  INT,
source_startCol   INT,
source_endLine    INT,
source_endCol     INT,
--
sink_startLine    INT,
sink_startCol     INT,
sink_endLine      INT,
sink_endCol       INT,
--
source_object   STRING,  -- higher-level info: 'args', 'request', etc.
sink_object     STRING,  -- higher level: 'execute', 'sql statement', etc.
#+END_SRC

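The source / sink rule in the comments above can be sketched as follows. This
is not the sarif-cli implementation. It assumes that =query_id= is taken from
the =rule_id= column, and that for path problems the source and sink are the
first and last locations of the first code flow and thread flow in
=codeflows.csv=, matched on =codeFlows_id=. The helper names and sample rows
are hypothetical.

#+BEGIN_SRC python
def span(row, prefix=""):
    """Return (startLine, startColumn, endLine, endColumn) from a table row."""
    return tuple(row[prefix + k] for k in
                 ("startLine", "startColumn", "endLine", "endColumn"))

def result_row(kind, row, codeflows=()):
    """Build the source/sink part of one results.csv row."""
    if kind == "kind_problem":
        # Same location for both source and sink.
        source = sink = span(row, "location_")
        codeflow_id = None
    else:
        # Assumption: use the first code flow / thread flow, ordered by
        # location_index; first entry is the source, last is the sink.
        flow = sorted((c for c in codeflows
                       if c["codeflow_id"] == row["codeFlows_id"]
                       and c["codeflow_index"] == 0
                       and c["threadflow_index"] == 0),
                      key=lambda c: c["location_index"])
        source, sink = span(flow[0]), span(flow[-1])
        codeflow_id = row["codeFlows_id"]
    cols = ("startLine", "startCol", "endLine", "endCol")
    out = {"query_id": row["rule_id"],        # assumed to hold the query @id
           "message": row["message_text"],
           "result_type": kind,
           "codeFlow_id": codeflow_id}
    out.update({"source_" + c: v for c, v in zip(cols, source)})
    out.update({"sink_" + c: v for c, v in zip(cols, sink)})
    return out

# Made-up rows for illustration only.
problem = {"rule_id": "py/example-problem", "message_text": "example",
           "location_startLine": 5, "location_startColumn": 1,
           "location_endLine": 5, "location_endColumn": 10}
path = {"rule_id": "py/example-path", "message_text": "example flow",
        "codeFlows_id": 7}
codeflows = [
    {"codeflow_id": 7, "codeflow_index": 0, "threadflow_index": 0,
     "location_index": 1, "startLine": 9, "startColumn": 2,
     "endLine": 9, "endColumn": 8},
    {"codeflow_id": 7, "codeflow_index": 0, "threadflow_index": 0,
     "location_index": 0, "startLine": 3, "startColumn": 4,
     "endLine": 3, "endColumn": 12},
]

r1 = result_row("kind_problem", problem)
r2 = result_row("kind_path_problem", path, codeflows)
#+END_SRC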
#+HTML: </div>