* Overview
The current ==> project.csv <== table is broken. It's a combination of project meta info (to be moved to a separate =project= table) and the entry point to a single =project='s sarif results.

* Currently Exported Tables
Tables exported by sarif-extract-multi, commit d5390bb87, [Mar-23-2022]

#+BEGIN_SRC text
==> artifacts.csv <==
    artifacts_id
    index
    uri
    uriBaseId
#+END_SRC

#+BEGIN_SRC text
==> codeflows.csv <==
    codeflow_id
    codeflow_index
    threadflow_index
    location_index
    endColumn
    endLine
    startColumn
    startLine
    artifact_index
    uri
    uriBaseId
    message
#+END_SRC

#+BEGIN_SRC text
==> kind_pathproblem.csv <==
    results_array_id
    results_array_index
    codeFlows_id
    ruleId
    ruleIndex
    location_array_index
    location_id
    location_endColumn
    location_endLine
    location_startColumn
    location_startLine
    location_index
    location_uri
    location_uriBaseId
    location_message
    relatedLocation_array_index
    relatedLocation_id
    relatedLocation_endColumn
    relatedLocation_endLine
    relatedLocation_startColumn
    relatedLocation_startLine
    relatedLocation_index
    relatedLocation_uri
    relatedLocation_uriBaseId
    relatedLocation_message
    message_text
    primaryLocationLineHash
    primaryLocationStartColumnFingerprint
    rule_id
    rule_index
#+END_SRC

#+BEGIN_SRC text
==> kind_problem.csv <==
    results_array_id
    results_array_index
    ruleId
    ruleIndex
    location_array_index
    location_id
    location_endColumn
    location_endLine
    location_startColumn
    location_startLine
    location_index
    location_uri
    location_uriBaseId
    location_message
    relatedLocation_array_index
    relatedLocation_id
    relatedLocation_endColumn
    relatedLocation_endLine
    relatedLocation_startColumn
    relatedLocation_startLine
    relatedLocation_index
    relatedLocation_uri
    relatedLocation_uriBaseId
    relatedLocation_message
    message_text
    primaryLocationLineHash
    primaryLocationStartColumnFingerprint
    rule_id
    rule_index
#+END_SRC

The columns above =$schema= in the =project.csv= table are ad-hoc, and the information for those fields is not yet collected. They can be discarded.
#+BEGIN_SRC text
==> project.csv <==
    creation_date
    primary_language
    project_name
    query_commit_id
    sarif_file_name
    scan_id
    scan_start_date
    scan_stop_date
    tool_name
    tool_version
    $schema
    sarif_version
    run_index
    artifacts
    columnKind
    results
    semmle.formatSpecifier
    semmle.sourceLanguage
    driver_name
    organization
    rules
    driver_version
    repositoryUri
    revisionId
#+END_SRC

#+BEGIN_SRC text
==> relatedLocations.csv <==
    struct_id
    uri
    startLine
    startColumn
    endLine
    endColumn
    message
#+END_SRC

#+BEGIN_SRC text
==> rules.csv <==
    rules_array_id
    rules_array_index
    id
    name
    enabled
    level
    fullDescription
    shortDescription
    kind
    precision
    security-severity
    severity
    sub-severity
    tag_index
    tag_text
#+END_SRC

* Tables or entries to be removed
The top of the [Mar-23-2022] =project.csv= table, enumerated below, is ad-hoc and included in the other tables below; the information for its fields is not yet collected, so it can be discarded.

#+BEGIN_SRC text
==> project-meta.csv <==
    creation_date
    primary_language
    project_name
    query_commit_id
    sarif_file_name
    scan_id
    scan_start_date
    scan_stop_date
    tool_name
    tool_version
#+END_SRC

This information was used to expand the sarif tree (see Struct3452 and Array7481 in typegraph-multi-with-tables.pdf and the code). In retrospect, that was a poor choice. All additional information needed can be represented by one or more tables, so the sarif-extract* tools after commit 30e3dd3a3 do so.

The minimal information required to drive the sarif-to-table conversion is:
| project_id      | 13243                      |   |
| scan_id         | 123456                     |   |
| sarif_file_name | "2021-12-09/results.sarif" |   |

* New tables to be exported
This section enumerates new tables intended for reporting infrastructure.
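One of these tables, =project-scan-result.csv=, is the current =project.csv= with the ad-hoc columns before =$schema= dropped. A minimal sketch of that split follows. It assumes the exported file is plain CSV with a header row carrying the column names listed earlier; the names =split_at_schema= and =drop_ad_hoc_columns= are hypothetical and not part of sarif-extract-multi.

#+BEGIN_SRC python
import csv
import io

# Column names of the current project.csv, as listed above.
PROJECT_COLUMNS = [
    "creation_date", "primary_language", "project_name", "query_commit_id",
    "sarif_file_name", "scan_id", "scan_start_date", "scan_stop_date",
    "tool_name", "tool_version",
    "$schema", "sarif_version", "run_index", "artifacts", "columnKind",
    "results", "semmle.formatSpecifier", "semmle.sourceLanguage",
    "driver_name", "organization", "rules", "driver_version",
    "repositoryUri", "revisionId",
]


def split_at_schema(columns):
    """Return (ad_hoc_columns, scan_result_columns), cutting at '$schema'."""
    cut = columns.index("$schema")
    return columns[:cut], columns[cut:]


def drop_ad_hoc_columns(csv_text):
    """Keep only the columns from '$schema' onward (assumes a header row)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cut = header.index("$schema")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header[cut:])
    for row in reader:
        writer.writerow(row[cut:])
    return out.getvalue()


meta_columns, scan_result_columns = split_at_schema(PROJECT_COLUMNS)
#+END_SRC

Here =meta_columns= holds the ten ad-hoc fields and =scan_result_columns= the fields that form the root of the sarif tree.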
Using the GitHub API starting points

#+BEGIN_SRC python
# Code scanning information
# Get the full list:
r02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses')

# Work with one entry
_, analysis_id = pathval(r02, 0, 'id')
r02s01 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}')
r02s02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}',
              headers = {'Accept': 'application/sarif+json'})

# Repository information via GET /repos/{owner}/{repo}
r03 = gith(GET, f'/repos/{owner}/{repo}')
#+END_SRC

we can populate the =project.csv= and =scans.csv= tables:

#+BEGIN_SRC sql
==> project.csv <==
    id
    project_name         -- pathval(r03, 'full_name')
    creation_date        -- pathval(r03, 'created_at')
    owner                -- r03
    repo                 -- r03 = gith(GET, f'/repos/{owner}/{repo}')
    repository_url       -- pathval(r03, 'clone_url')
    primary_language     -- pathval(r03, 'language')
    languages_analyzed   --
#+END_SRC

#+BEGIN_SRC sql
==> scans.csv <==
    id                   --
    commit_id            -- pathval(r02s01, 'commit_sha')
    project_id           -- project.id
    db_create_start      -- pathval(r02s01, 'created_at')
    db_create_stop
    scan_start_date
    scan_stop_date
    tool_name            -- pathval(r02s01, 'tool', 'name')
    tool_version         -- pathval(r02s01, 'tool', 'version')
    tool_query_commit_id -- pathval(r02, 0, 'tool', 'version') is sufficient
    sarif_content        -- r02s02
    sarif_file_name      -- used on upload
    sarif_id             -- pathval(r02s01, 'sarif_id')
    results_count        -- pathval(r02s01, 'results_count')
    rules_count          -- pathval(r02s01, 'rules_count')
#+END_SRC

The sarif upload from a codeql analysis to GitHub uses the following API and parameters, which naturally are the minimal parameters needed to run the analysis.
#+BEGIN_SRC python
# untested
r04 = gith(POST, f'/repos/{owner}/{repo}/code-scanning/sarifs',
           fields={'commit_sha': 'aa22233',
                   'ref': 'refs/heads/',
                   'sarif': 'gzip < sarif | base64 -w0',
                   'tool_name' : 'codeql',
                   'started_at': 'when the analysis started',
                   },
           headers = {'Accept': 'application/sarif+json'})
#+END_SRC

The scan results from =project.csv= are the root of the sarif tree, so this is a required base table.

#+BEGIN_SRC sql
==> project-scan-result.csv <==
    $schema
    sarif_version
    run_index
    artifacts
    columnKind
    results
    semmle.formatSpecifier
    semmle.sourceLanguage
    driver_name
    organization
    rules
    driver_version
    repositoryUri
    revisionId
#+END_SRC

Using joins of the =project-scan-result.csv= table and the other [[*Currently Exported Tables][Currently Exported Tables]], the =results.csv= table can be formed:

#+BEGIN_SRC sql
==> results.csv <==
    id INT,               -- primary key
    scan_id INT,          -- scans.id
    query_id STRING,      -- @id from the CodeQL query
    location STRING,
    message STRING,
    message_object OBJ,
    -- for kind_path_problem, use distinct source / sink
    -- for kind_problem, use the same location for both
    result_type STRING,   -- kind_problem | kind_path_problem
    -- link to codeflows (kind_pathproblem.csv only, NULL otherwise)
    codeFlow_id INT,
    --
    source_startLine int,
    source_startCol int,
    source_endLine int,
    source_endCol int,
    --
    sink_startLine int,
    sink_startCol int,
    sink_endLine int,
    sink_endCol int,
    --
    source_object STRING, -- higher-level info: 'args', 'request', etc.
    sink_object string,   -- higher level: 'execute', 'sql statement', etc.
#+END_SRC

#+HTML:
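The =kind_problem= case of the =results.csv= mapping can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function name =result_from_kind_problem= is hypothetical, rows are assumed to be dicts keyed by the =kind_problem.csv= column names, and the choice of =ruleId= for =query_id= and =location_uri= for =location= is a guess. Path problems would additionally need distinct source and sink locations and a =codeFlow_id=, which this sketch does not cover.

#+BEGIN_SRC python
# Map of results.csv position suffixes to kind_problem.csv location columns.
POSITION_FIELDS = {
    "startLine": "location_startLine",
    "startCol": "location_startColumn",
    "endLine": "location_endLine",
    "endCol": "location_endColumn",
}


def result_from_kind_problem(row, result_id, scan_id):
    """Build one results.csv row (a dict) from one kind_problem.csv row."""
    result = {
        "id": result_id,
        "scan_id": scan_id,
        "query_id": row["ruleId"],        # assumption: ruleId carries the @id
        "location": row["location_uri"],  # assumption: uri is the location
        "message": row["message_text"],
        "result_type": "kind_problem",
        "codeFlow_id": None,              # only set for kind_pathproblem rows
    }
    # For kind_problem, source and sink share the single result location.
    for suffix, column in POSITION_FIELDS.items():
        result[f"source_{suffix}"] = row[column]
        result[f"sink_{suffix}"] = row[column]
    return result


# Made-up example row, for illustration only.
example_row = {
    "ruleId": "py/example-query",
    "location_uri": "src/app.py",
    "message_text": "example message",
    "location_startLine": "10",
    "location_startColumn": "5",
    "location_endLine": "10",
    "location_endColumn": "20",
}
example_result = result_from_kind_problem(example_row, result_id=1, scan_id=123456)
#+END_SRC

Values are passed through as read from the CSV (strings here); converting them to the column types in the =results.csv= layout is left to the loader.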