sarif-cli/notes/tables.org


Overview

The current project.csv table is broken: it is a combination of project meta information (to be moved to a separate project table) and the entry point to a single project's sarif results.

Currently Exported Tables

Tables exported by sarif-extract-multi, commit d5390bb87, [Mar-23-2022]

  ==> artifacts.csv <==
  artifacts_id
  index
  uri
  uriBaseId
  ==> codeflows.csv <==
  codeflow_id
  codeflow_index
  threadflow_index
  location_index
  endColumn
  endLine
  startColumn
  startLine
  artifact_index
  uri
  uriBaseId
  message
  ==> kind_pathproblem.csv <==
  results_array_id
  results_array_index
  codeFlows_id
  ruleId
  ruleIndex
  location_array_index
  location_id
  location_endColumn
  location_endLine
  location_startColumn
  location_startLine
  location_index
  location_uri
  location_uriBaseId
  location_message
  relatedLocation_array_index
  relatedLocation_id
  relatedLocation_endColumn
  relatedLocation_endLine
  relatedLocation_startColumn
  relatedLocation_startLine
  relatedLocation_index
  relatedLocation_uri
  relatedLocation_uriBaseId
  relatedLocation_message
  message_text
  primaryLocationLineHash
  primaryLocationStartColumnFingerprint
  rule_id
  rule_index
  ==> kind_problem.csv <==
  results_array_id
  results_array_index
  ruleId
  ruleIndex
  location_array_index
  location_id
  location_endColumn
  location_endLine
  location_startColumn
  location_startLine
  location_index
  location_uri
  location_uriBaseId
  location_message
  relatedLocation_array_index
  relatedLocation_id
  relatedLocation_endColumn
  relatedLocation_endLine
  relatedLocation_startColumn
  relatedLocation_startLine
  relatedLocation_index
  relatedLocation_uri
  relatedLocation_uriBaseId
  relatedLocation_message
  message_text
  primaryLocationLineHash
  primaryLocationStartColumnFingerprint
  rule_id
  rule_index

The parts above $schema in the project.csv table are ad-hoc, and the information for those fields is not yet collected; they can be discarded.

  ==> project.csv <==
  creation_date
  primary_language
  project_name
  query_commit_id
  sarif_file_name
  scan_id
  scan_start_date
  scan_stop_date
  tool_name
  tool_version
  $schema
  sarif_version
  run_index
  artifacts
  columnKind
  results
  semmle.formatSpecifier
  semmle.sourceLanguage
  driver_name
  organization
  rules
  driver_version
  repositoryUri
  revisionId
  ==> relatedLocations.csv <==
  struct_id
  uri
  startLine
  startColumn
  endLine
  endColumn
  message
  ==> rules.csv <==
  rules_array_id
  rules_array_index
  id
  name
  enabled
  level
  fullDescription
  shortDescription
  kind
  precision
  security-severity
  severity
  sub-severity
  tag_index
  tag_text

Tables or entries to be removed

The top of the [Mar-23-2022] project.csv table, enumerated below, is ad-hoc and included in the other tables; the information for its fields is not yet collected, so it can be discarded.

  ==> project-meta.csv <==
  creation_date
  primary_language
  project_name
  query_commit_id
  sarif_file_name
  scan_id
  scan_start_date
  scan_stop_date
  tool_name
  tool_version

New tables to be exported

This section enumerates new tables intended for reporting infrastructure.

Using the GitHub API starting points

  # Code scanning information
  # Get the full list:
  r02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses')

  # Work with one entry
  _, analysis_id = pathval(r02, 0, 'id')
  r02s01 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}')

  r02s02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}',
                headers = {'Accept': 'application/sarif+json'})

  # Repository information via GET /repos/{owner}/{repo}
  r03 = gith(GET, f'/repos/{owner}/{repo}')

we can populate the project.csv and scans.csv tables:

  ==> project.csv <==
  id
  project_name                    -- pathval(r03, 'full_name')
  creation_date                   -- pathval(r03, 'created_at')
  owner                           -- r03
  repo                            -- r03 = gith(GET, f'/repos/{owner}/{repo}')
  repository_url                  -- pathval(r03, 'clone_url')
  primary_language                -- pathval(r03, 'language')
  languages_analyzed              --
  ==> scans.csv <==
  id                              --
  commit_id                       -- pathval(r02s01, 'commit_sha')
  project_id                      -- project.id
  db_create_start                 -- pathval(r02s01, 'created_at')
  db_create_stop
  scan_start_date
  scan_stop_date
  tool_name                       -- pathval(r02s01, 'tool', 'name')
  tool_version                    -- pathval(r02s01, 'tool', 'version')
  tool_query_commit_id            -- pathval(r02, 0, 'tool', 'version') is sufficient
  sarif_content                   -- r02s02
  sarif_file_name                 -- used on upload
  sarif_id                        -- pathval(r02s01, 'sarif_id')
  results_count                   -- pathval(r02s01, 'results_count')
  rules_count                     -- pathval(r02s01, 'rules_count')
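The `gith` and `pathval` helpers above are this document's shorthand; the following is a plausible sketch of `pathval` and of assembling a project.csv row from the repository response. Note that the real helper seems to return a pair (see the `_, analysis_id = pathval(...)` unpacking above), while this sketch returns only the value; the sample `r03` fields mirror the mapping listed above.

```python
# Minimal sketch of the pathval helper: walk a decoded JSON response
# through a sequence of dict keys / list indices.  (The real helper
# appears to return a pair; this sketch returns only the value.)
def pathval(obj, *path):
    for step in path:
        obj = obj[step]
    return obj

# Sample response shaped like GET /repos/{owner}/{repo}, limited to the
# fields referenced by the project.csv mapping.
r03 = {
    "full_name": "octo-org/octo-repo",
    "created_at": "2020-01-01T00:00:00Z",
    "clone_url": "https://github.com/octo-org/octo-repo.git",
    "language": "C",
}

project_row = {
    "project_name": pathval(r03, "full_name"),
    "creation_date": pathval(r03, "created_at"),
    "repository_url": pathval(r03, "clone_url"),
    "primary_language": pathval(r03, "language"),
}
```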

The sarif upload from a codeql analysis to GitHub uses the following API and parameters, which are naturally the minimal parameters needed to run the analysis.

  # untested
  r04 = gith(POST, f'/repos/{owner}/{repo}/code-scanning/sarifs',
             fields={'commit_sha': 'aa22233',
                     'ref': 'refs/heads/<branch name>',
                     'sarif': 'gzip < sarif | base64 -w0',
                     'tool_name' : 'codeql',
                     'started_at': 'when the analysis started',
                     },
             headers = {'Accept': 'application/sarif+json'})
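The 'sarif' field above holds the output of `gzip < sarif | base64 -w0`; a sketch of producing the same payload in Python (the `encode_sarif` name is an assumption):

```python
import base64
import gzip
import json

def encode_sarif(sarif_dict):
    """Python equivalent of `gzip < sarif | base64 -w0`: serialize the
    sarif log, gzip it, and base64-encode the result for the 'sarif'
    field of the upload POST."""
    raw = json.dumps(sarif_dict).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")
```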

The scan results from project.csv are the root of the sarif tree, so this is a required base table.

  ==> project-scan-result.csv <==
  $schema
  sarif_version
  run_index
  artifacts
  columnKind
  results
  semmle.formatSpecifier
  semmle.sourceLanguage
  driver_name
  organization
  rules
  driver_version
  repositoryUri
  revisionId

Using joins of the project-scan-result.csv table and the other Currently Exported Tables, the results.csv table can be formed:

  ==> results.csv <==
  id INT,                  -- primary key
  scan_id INT,             -- scans.id
  query_id STRING,         -- git commit id of the ql query set
  location STRING,
  message STRING,
  message_object OBJ,
  -- for kind_path_problem, use distinct source / sink
  -- for kind_problem, use the same location for both
  result_type STRING,      -- kind_problem | kind_path_problem
  -- link to codeflows (kind_pathproblem.csv only, NULL otherwise)
  codeFlow_id INT,
  --
  source_startLine int,
  source_startCol int,
  source_endLine int,
  source_endCol int,
  --
  sink_startLine int,
  sink_startCol int,
  sink_endLine int,
  sink_endCol int,
  --
  source_object STRING, -- higher-level info: 'args', 'request', etc.
  sink_object STRING, -- higher-level info: 'execute', 'sql statement', etc.
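As one concrete case of the join, a kind_problem row maps to a results.csv row with identical source and sink spans and a NULL codeflow link, per the comments above. A hedged sketch; `result_row` is a hypothetical helper, and the column names follow the table listings in this file:

```python
# Hypothetical helper: map one kind_problem.csv row (column names as
# listed above) to a results.csv row.  A kind_problem result uses the
# same location for both source and sink, and has no codeflow link.
def result_row(kp, scan_id, result_id):
    row = {
        "id": result_id,
        "scan_id": scan_id,
        "location": kp["location_uri"],
        "message": kp["message_text"],
        "result_type": "kind_problem",
        "codeFlow_id": None,  # kind_pathproblem only; NULL otherwise
    }
    for end in ("source", "sink"):  # identical spans for kind_problem
        row[f"{end}_startLine"] = kp["location_startLine"]
        row[f"{end}_startCol"] = kp["location_startColumn"]
        row[f"{end}_endLine"] = kp["location_endLine"]
        row[f"{end}_endCol"] = kp["location_endColumn"]
    return row
```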