sarif-cli/notes/tables.org

# -*- coding: utf-8 -*-
# Created [Apr-19-2022]
#+TITLE:
#+AUTHOR: Michael Hohn
#+LANGUAGE:  en
#+TEXT:
#+OPTIONS: ^:{} H:2 num:t \n:nil @:t ::t |:t ^:nil f:t *:t TeX:t LaTeX:t skip:nil p:nil
#+OPTIONS: toc:nil
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="./l3style.css"/>
#+HTML: <div id="toc">
#+TOC: headlines 2        insert TOC here, with two headline levels
#+HTML: </div>
#
#+HTML: <div id="org-content">

* Overview
  The current ==> project.csv <== table is broken.  It's a combination of project
  meta info (to be moved to a separate =project= table) and the entry point to a
  single =project='s sarif results

* Currently Exported Tables
  Tables exported by sarif-extract-multi, commit d5390bb87, [Mar-23-2022]

  #+BEGIN_SRC text
    ==> artifacts.csv <==
    artifacts_id
    index
    uri
    uriBaseId
  #+END_SRC

  #+BEGIN_SRC text
    ==> codeflows.csv <==
    codeflow_id
    codeflow_index
    threadflow_index
    location_index
    endColumn
    endLine
    startColumn
    startLine
    artifact_index
    uri
    uriBaseId
    message
  #+END_SRC

  #+BEGIN_SRC text
    ==> kind_pathproblem.csv <==
    results_array_id
    results_array_index
    codeFlows_id
    ruleId
    ruleIndex
    location_array_index
    location_id
    location_endColumn
    location_endLine
    location_startColumn
    location_startLine
    location_index
    location_uri
    location_uriBaseId
    location_message
    relatedLocation_array_index
    relatedLocation_id
    relatedLocation_endColumn
    relatedLocation_endLine
    relatedLocation_startColumn
    relatedLocation_startLine
    relatedLocation_index
    relatedLocation_uri
    relatedLocation_uriBaseId
    relatedLocation_message
    message_text
    primaryLocationLineHash
    primaryLocationStartColumnFingerprint
    rule_id
    rule_index

  #+END_SRC

  #+BEGIN_SRC text
    ==> kind_problem.csv <==
    results_array_id
    results_array_index
    ruleId
    ruleIndex
    location_array_index
    location_id
    location_endColumn
    location_endLine
    location_startColumn
    location_startLine
    location_index
    location_uri
    location_uriBaseId
    location_message
    relatedLocation_array_index
    relatedLocation_id
    relatedLocation_endColumn
    relatedLocation_endLine
    relatedLocation_startColumn
    relatedLocation_startLine
    relatedLocation_index
    relatedLocation_uri
    relatedLocation_uriBaseId
    relatedLocation_message
    message_text
    primaryLocationLineHash
    primaryLocationStartColumnFingerprint
    rule_id
    rule_index

  #+END_SRC

  The parts above =$schema= in the =projects.csv= table is ad-hoc and the
  information for those fields is not yet collected.  They can be discarded.
  #+BEGIN_SRC text
    ==> project.csv <==
    creation_date
    primary_language
    project_name
    query_commit_id
    sarif_file_name
    scan_id
    scan_start_date
    scan_stop_date
    tool_name
    tool_version
    $schema
    sarif_version
    run_index
    artifacts
    columnKind
    results
    semmle.formatSpecifier
    semmle.sourceLanguage
    driver_name
    organization
    rules
    driver_version
    repositoryUri
    revisionId

  #+END_SRC


  #+BEGIN_SRC text
    ==> relatedLocations.csv <==
    struct_id
    uri
    startLine
    startColumn
    endLine
    endColumn
    message

  #+END_SRC


  #+BEGIN_SRC text
    ==> rules.csv <==
    rules_array_id
    rules_array_index
    id
    name
    enabled
    level
    fullDescription
    shortDescription
    kind
    precision
    security-severity
    severity
    sub-severity
    tag_index
    tag_text
  #+END_SRC

* Tables or entries to be removed
  The top of the [Mar-23-2022] =projects.csv= table, enumerated below, is ad-hoc
  and included in the other tables below; the information for its fields is not
  yet collected so it can be discarded.

  #+BEGIN_SRC text
    ==> project-meta.csv <==
    creation_date
    primary_language
    project_name
    query_commit_id
    sarif_file_name
    scan_id
    scan_start_date
    scan_stop_date
    tool_name
    tool_version
  #+END_SRC

  This information was used to expand the sarif tree (see Struct3452 and Array7481
  in typegraph-multi-with-tables.pdf and the code).  In retrospect, that was a
  poor choice.  All additional information needed can be represented by one or
  more tables, so sarif-extract* post commit 30e3dd3a3 do so.

  The minimal information required to drive the sarif-to-table conversion is
  | project_id      |                      13243 |   |
  | scan_id         |                     123456 |   |
  | sarif_file_name | "2021-12-09/results.sarif" |   |


* New tables to be exported
  This section enumerates new tables intended for reporting infrastructure.

  Using the github API starting points
  #+BEGIN_SRC python
    # Code scanning information
    # Get the full list:
    r02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses')

    # Work with one entry
    _, analysis_id = pathval(r02, 0, 'id')
    r02s01 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}')

    r02s02 = gith(GET, f'/repos/{owner}/{repo}/code-scanning/analyses/{analysis_id}',
                  headers = {'Accept': 'application/sarif+json'})

    # Repository information via GET /repos/{owner}/{repo}
    r03 = gith(GET, f'/repos/{owner}/{repo}')
  #+END_SRC
  we can populate the =project.csv= and =scans.csv= tables:
  #+BEGIN_SRC sql
    ==> project.csv <==
    id
    project_name                    -- pathval(r03, 'full_name')
    creation_date                   -- pathval(r03, 'created_at')
    owner                           -- r03
    repo                            -- r03 = gith(GET, f'/repos/{owner}/{repo}')
    repository_url                  -- pathval(r03, 'clone_url')
    primary_language                -- pathval(r03, 'language')
    languages_analyzed              --
  #+END_SRC
  #+BEGIN_SRC sql
    ==> scans.csv <==
    id                              --
    commit_id                       -- pathval(r02s01, 'commit_sha')
    project_id                      -- project.id
    db_create_start                 -- pathval(r02s01, 'created_at')
    db_create_stop
    scan_start_date
    scan_stop_date
    tool_name                       -- pathval(r02s01, 'tool', 'name')
    tool_version                    -- pathval(r02s01, 'tool', 'version')
    tool_query_commit_id            -- pathval(r02, 0, 'tool', 'version') is sufficient
    sarif_content                   -- r02s02
    sarif_file_name                 -- used on upload
    sarif_id                        -- pathval(r02s01, 'sarif_id')
    results_count                   -- pathval(r02s01, 'results_count')
    rules_count                     -- pathval(r02s01, 'rules_count')
  #+END_SRC

  The sarif upload from codeql analysis to github uses the following API and
  parameters which naturally are the minimal parameters needed to run the
  analysis.
  #+BEGIN_SRC python
    # untested
    r04 = gith(POST, f'/repos/{owner}/{repo}/code-scanning/sarifs',
               fields={'commit_sha': 'aa22233',
                       'ref': 'refs/heads/<branch name>',
                       'sarif': 'gzip < sarif | base64 -w0',
                       'tool_name' : 'codeql',
                       'started_at': 'when the analysis started',
                       },
               headers = {'Accept': 'application/sarif+json'})
  #+END_SRC

  The scan results from =project.csv= are the root of the sarif tree, so this is a
  required base table.
  #+BEGIN_SRC sql
    ==> project-scan-result.csv <==
    $schema
    sarif_version
    run_index
    artifacts
    columnKind
    results
    semmle.formatSpecifier
    semmle.sourceLanguage
    driver_name
    organization
    rules
    driver_version
    repositoryUri
    revisionId
  #+END_SRC

  Using joins of the =project-scan-result.csv= table and the
  other [[*Currently Exported Tables][Currently Exported Tables]], the =results.csv= table can be formed:
  #+BEGIN_SRC sql
    ==> results.csv <==
    id INT,                  -- primary key
    scan_id INT,             -- scans.id
    query_id STRING,         -- @id from the CodeQL query
    location STRING,
    message STRING,
    message_object OBJ,
    -- for kind_path_problem, use distinct source / sink
    -- for kind_problem, use the same location for both
    result_type STRING,      -- kind_problem | kind_path_problem
    -- link to codeflows (kind_pathproblem.csv only, NULL otherwise)
    codeFlow_id INT,
    --
    source_startLine int,
    source_startCol int,
    source_endLine int,
    source_endCol int,
    --
    sink_startLine int,
    sink_startCol int,
    sink_endLine int,
    sink_endCol int,
    --
    source_object STRING, -- higher-level info: 'args', 'request', etc.
    sink_object string, -- higher level: 'execute', 'sql statement', etc.
  #+END_SRC

#+HTML: </div>