Overview
The current project.csv table is broken: it combines project meta information
(to be moved to a separate project table) with the entry point to a single
project's SARIF results.
Currently Exported Tables
Tables exported by sarif-extract-multi, commit d5390bb87, [Mar-23-2022]
==> artifacts.csv <==
artifacts_id
index
uri
uriBaseId
==> codeflows.csv <==
codeflow_id
codeflow_index
threadflow_index
location_index
endColumn
endLine
startColumn
startLine
artifact_index
uri
uriBaseId
message
==> kind_pathproblem.csv <==
results_array_id
results_array_index
codeFlows_id
ruleId
ruleIndex
location_array_index
location_id
location_endColumn
location_endLine
location_startColumn
location_startLine
location_index
location_uri
location_uriBaseId
location_message
relatedLocation_array_index
relatedLocation_id
relatedLocation_endColumn
relatedLocation_endLine
relatedLocation_startColumn
relatedLocation_startLine
relatedLocation_index
relatedLocation_uri
relatedLocation_uriBaseId
relatedLocation_message
message_text
primaryLocationLineHash
primaryLocationStartColumnFingerprint
rule_id
rule_index
==> kind_problem.csv <==
results_array_id
results_array_index
ruleId
ruleIndex
location_array_index
location_id
location_endColumn
location_endLine
location_startColumn
location_startLine
location_index
location_uri
location_uriBaseId
location_message
relatedLocation_array_index
relatedLocation_id
relatedLocation_endColumn
relatedLocation_endLine
relatedLocation_startColumn
relatedLocation_startLine
relatedLocation_index
relatedLocation_uri
relatedLocation_uriBaseId
relatedLocation_message
message_text
primaryLocationLineHash
primaryLocationStartColumnFingerprint
rule_id
rule_index
The fields above $schema in the project.csv table below are ad-hoc; the
information for those fields is not yet collected, so they can be discarded.
==> project.csv <==
creation_date
primary_language
project_name
query_commit_id
sarif_file_name
scan_id
scan_start_date
scan_stop_date
tool_name
tool_version
$schema
sarif_version
run_index
artifacts
columnKind
results
semmle.formatSpecifier
semmle.sourceLanguage
driver_name
organization
rules
driver_version
repositoryUri
revisionId
==> relatedLocations.csv <==
struct_id
uri
startLine
startColumn
endLine
endColumn
message
==> rules.csv <==
rules_array_id
rules_array_index
id
name
enabled
level
fullDescription
shortDescription
kind
precision
security-severity
severity
sub-severity
tag_index
tag_text
Tables or entries to be removed
The top of the [Mar-23-2022] project.csv table, enumerated below as
project-meta.csv, is ad-hoc and its contents are covered by the other tables
below; the information for its fields is not yet collected, so it can be
discarded.
==> project-meta.csv <==
creation_date
primary_language
project_name
query_commit_id
sarif_file_name
scan_id
scan_start_date
scan_stop_date
tool_name
tool_version
New tables to be exported
This section enumerates new tables intended for reporting infrastructure.
Using the GitHub API starting points
# Code scanning information
# Get the full list:
r02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses')
# Work with one entry
_, analysis_id = pathval(r02, 0, 'id')
r02s01 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}')
r02s02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}',
headers = {'Accept': 'application/sarif+json'})
# Repository information via GET /repos/{owner}/{repo}
r03 = gith(GET, f'/repos/{owner}/{repo}')
we can populate the project.csv and scans.csv tables:
==> project.csv <==
id
project_name -- pathval(r03, 'full_name')
creation_date -- pathval(r03, 'created_at')
owner -- r03
repo -- r03 = gith(GET, f'/repos/{owner}/{repo}')
repository_url -- pathval(r03, 'clone_url')
primary_language -- pathval(r03, 'language')
languages_analyzed --
==> scans.csv <==
id --
commit_id -- pathval(r02s01, 'commit_sha')
project_id -- project.id
db_create_start -- pathval(r02s01, 'created_at')
db_create_stop
scan_start_date
scan_stop_date
tool_name -- pathval(r02s01, 'tool', 'name')
tool_version -- pathval(r02s01, 'tool', 'version')
tool_query_commit_id -- pathval(r02, 0, 'tool', 'version') is sufficient
sarif_content -- r02s02
sarif_file_name -- used on upload
sarif_id -- pathval(r02s01, 'sarif_id')
results_count -- pathval(r02s01, 'results_count')
rules_count -- pathval(r02s01, 'rules_count')
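The `gith` and `pathval` helpers used above are not defined in this note; a
minimal sketch of what they might look like, assuming `gith` wraps an
authenticated GitHub REST call via `urllib.request` and `pathval` walks a
parsed JSON structure by successive keys and indices, returning the path and
the value (matching `_, analysis_id = pathval(r02, 0, 'id')` above):

```python
import json
import urllib.request

GET, POST = 'GET', 'POST'

def gith(method, path, headers=None, fields=None, token=None):
    """Call the GitHub REST API and return the parsed JSON response.

    Hypothetical wrapper matching the calls in this note; the real helper
    may differ (error handling, pagination, auth are omitted here).
    """
    req = urllib.request.Request(
        'https://api.github.com' + path,
        data=json.dumps(fields).encode() if fields else None,
        method=method,
        headers={'Accept': 'application/vnd.github+json', **(headers or {})},
    )
    if token:
        req.add_header('Authorization', f'Bearer {token}')
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def pathval(obj, *path):
    """Walk nested dicts/lists by key or index; return (path, value)."""
    for step in path:
        obj = obj[step]
    return path, obj
```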
The SARIF upload from a CodeQL analysis to GitHub uses the following API and
parameters, which are naturally the minimal parameters needed to run the
analysis.
# untested
r04 = gith(POST, f'/repos/{owner}/{repo}/code-scanning/sarifs',
fields={'commit_sha': 'aa22233',
'ref': 'refs/heads/<branch name>',
'sarif': 'gzip < sarif | base64 -w0',
'tool_name' : 'codeql',
'started_at': 'when the analysis started',
},
headers = {'Accept': 'application/sarif+json'})
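The `sarif` field must carry the gzip-compressed, base64-encoded SARIF file
(the `gzip < sarif | base64 -w0` shell pipeline shown above); a sketch of
producing that value in Python, using only the standard library:

```python
import base64
import gzip

def encode_sarif(sarif_bytes: bytes) -> str:
    """Gzip-compress and base64-encode a SARIF payload, as the
    code-scanning sarifs upload endpoint requires."""
    return base64.b64encode(gzip.compress(sarif_bytes)).decode('ascii')
```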
The scan results from project.csv are the root of the SARIF tree, so this is a
required base table.
==> project-scan-result.csv <==
$schema
sarif_version
run_index
artifacts
columnKind
results
semmle.formatSpecifier
semmle.sourceLanguage
driver_name
organization
rules
driver_version
repositoryUri
revisionId
Using joins of the project-scan-result.csv table and the
other Currently Exported Tables, the results.csv table can be formed:
==> results.csv <==
id INT, -- primary key
scan_id INT, -- scans.id
query_id STRING, -- git commit id of the ql query set
location STRING,
message STRING,
message_object OBJ,
-- for kind_path_problem, use distinct source / sink
-- for kind_problem, use the same location for both
result_type STRING, -- kind_problem | kind_path_problem
-- link to codeflows (kind_pathproblem.csv only, NULL otherwise)
codeFlow_id INT,
--
source_startLine int,
source_startCol int,
source_endLine int,
source_endCol int,
--
sink_startLine int,
sink_startCol int,
sink_endLine int,
sink_endCol int,
--
source_object STRING, -- higher-level info: 'args', 'request', etc.
sink_object STRING, -- higher-level info: 'execute', 'sql statement', etc.
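As a sketch of the join described above, the mapping from one kind_problem.csv
row to a results.csv row might look as follows. The function name and the
`query_id` parameter are hypothetical; the column names come from the
kind_problem.csv listing above. For kind_problem results there is no code flow,
so codeFlow_id is NULL and the primary location fills both the source and sink
spans, as noted in the schema comments:

```python
def problem_to_result(row, scan_id, query_id):
    """Map one kind_problem.csv row to a results.csv row (a sketch).

    kind_problem results carry no code flow, so codeFlow_id is None and
    the primary location is used for both the source and the sink span.
    """
    return {
        'scan_id': scan_id,                      # scans.id
        'query_id': query_id,                    # commit id of the ql query set
        'location': row['location_uri'],
        'message': row['message_text'],
        'result_type': 'kind_problem',
        'codeFlow_id': None,                     # kind_pathproblem only
        'source_startLine': row['location_startLine'],
        'source_startCol': row['location_startColumn'],
        'source_endLine': row['location_endLine'],
        'source_endCol': row['location_endColumn'],
        'sink_startLine': row['location_startLine'],
        'sink_startCol': row['location_startColumn'],
        'sink_endLine': row['location_endLine'],
        'sink_endCol': row['location_endColumn'],
    }
```

A kind_pathproblem row would differ only in setting codeFlow_id and taking the
source span from the first codeflow location and the sink span from the last.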