Commit Graph

99 Commits

Author SHA1 Message Date
Michael Hohn
82a8e7a6dc fix: set id and scan_id type to uint64 to suppress float conversion 2022-06-01 13:00:37 -07:00
Michael Hohn
0fc6eb3cce Improve error reporting in sarif destructuring routines 2022-05-30 00:09:13 -07:00
Michael Hohn
f5e258de52 Enhance the fillsig() routines to supplement lgtm.com/lgtm enterprise signature differences 2022-05-30 00:08:09 -07:00
Michael Hohn
eb8e2f18e9 Initial version of sarif-extract-scans, to be tested
Running

    cd ~/local/sarif-cli/data/treeio
    sarif-extract-scans scan-spec-0.json test-scan

produces the two derived tables and one sarif-based table (codeflows.csv):

    ls test-scan/
    codeflows.csv  results.csv  scans.csv

Adding -r via

    sarif-extract-scans -r scan-spec-0.json test-scan

writes all tables:

    ls test-scan/
    artifacts.csv  kind_pathproblem.csv  project.csv           results.csv  scans.csv
    codeflows.csv  kind_problem.csv      relatedLocations.csv  rules.csv
2022-05-16 18:58:53 -07:00
Michael Hohn
154b0bdc56 WIP: assemble derived 'results' table 2022-05-13 17:01:18 -07:00
Michael Hohn
b212423907 WIP: sarif-extract-scans: back to single sarif file handling, incorporate multi-file libraries 2022-05-10 19:01:38 -07:00
Michael Hohn
8e5d9c464b Add snowflake implementation 2022-04-11 19:24:12 -07:00
Michael Hohn
d5390bb87e Full revision of the base tables derived from multiple sarif input files
The new base tables produced by `sarif-extract-multi` are
    artifacts
    codeflows
    kind_pathproblem
    kind_problem
    project
    relatedLocations
    rules

The revised table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb

The file notes/typegraph-multi-with-tables.pdf illustrates which original (sarif)
tables are used to form the base (derived) tables.
2022-03-23 16:37:41 -07:00
Michael Hohn
db00f17137 Some cleanup based on pyflakes output 2022-03-17 17:23:53 -07:00
Michael Hohn
b82c620a1e Add overview of the base tables derived from multi-sarif input; add rules.csv
The table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb and makes use of some formatting
customizations to actually get an overview.

The initial `projects` table had far too many entries; the `rules` part
is now in a separate `rules` table.
2022-03-16 16:54:14 -07:00
Michael Hohn
926e083991 Added field to multi-file signature; the steps are documented in adding-to-typegraph.org 2022-03-15 12:30:05 -07:00
Michael Hohn
0f070a6ae4 sarif-extract-multi: extract combined tables from multiple sarif files
This command introduces a new tree structure that pulls in a collection
of sarif files.  In yaml format, an example is

    - creation_date: '2021-12-09'   # Repository creation date
      primary_language: javascript  # By lines of code
      project_name: treeio/treeio   # Repo name-short name
      query_commit_id: fa9571646c   # Commit id for custom (non-library) queries
      sarif_content: {}             # The sarif content will be attached here
      sarif_file_name: 2021-12-09/results.sarif # Path to sarif file
      scan_start_date: '2021-12-09'             # Beginning date/time of scan
      scan_stop_date:  '2021-12-10'             # End date/time of scan
      tool_name: codeql
      tool_version: v1.27

    - creation_date: '2022-02-25'
      primary_language: javascript
      ...

At run time,

    cd ~/local/sarif-cli/data/treeio
    sarif-extract-multi multi-sarif-01.json test-multi-table

will load the specified sarif files and put them in place of
`sarif_content`, then build tables against the new signature found in
sarif_cli/signature_multi.py, and merge those into 6 larger tables.  The
exported tables are

    artifacts.csv  path-problem.csv  project.csv
    codeflows.csv  problem.csv       related-locations.csv

and they have join keys for further operations.
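
The loading step described above can be sketched roughly as follows. This is not the code in sarif-extract-multi; it assumes the spec is a JSON list of entries shaped like the yaml example and that `sarif_file_name` is relative to a base directory. `attach_sarif_content` is a hypothetical helper name.

```python
import json
from pathlib import Path


def attach_sarif_content(entries, base_dir):
    """Return copies of the spec entries with sarif_content filled in.

    Hypothetical helper: each entry's sarif_file_name is read relative to
    base_dir and the parsed JSON replaces the empty sarif_content value.
    """
    filled = []
    for entry in entries:
        sarif_path = Path(base_dir) / entry["sarif_file_name"]
        with open(sarif_path, encoding="utf-8") as f:
            content = json.load(f)
        # Copy the entry so the caller's spec is left untouched.
        filled.append({**entry, "sarif_content": content})
    return filled
```

Returning copies keeps the original spec reusable, which helps when the same spec is loaded more than once during testing.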

The new typegraph is rendered in

    notes/typegraph-multi.pdf

using the instructions in

    sarif_cli/signature_multi.py
2022-03-11 23:00:53 -08:00
Michael Hohn
ad738abed3 sarif-extract-tables: also output relatedLocations table
With --related-locations,

    ../../bin/sarif-results-summary -r results.sarif

produces the details

    RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:
    72:722:73: Character ''' is repeated [here](1) in the same character class.
    Character ''' is repeated [here](2) in the same character class.
    Character ''' is repeated [here](3) in the same character class.
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:74:722:75: here
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:76:722:77: here
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:78:722:79: here

Via
    ../../bin/sarif-extract-tables results.sarif tables

sarif-extract-tables now produces two output tables,

    tables/
    ├── messages.csv
    └── relatedLocations.csv

that contain the relevant information and can be joined or otherwise combined on
the struct_id_4055 key.

For example, adding to the end of sarif-extract-tables:
    import IPython
    IPython.embed()

    msg = d2[d2.message.str.startswith("Character ''' is repeated [here]")]
    dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]

    In [24]: msg
    Out[24]:
         struct_id_4055  ...                                            message
    180      4796917312  ...  Character ''' is repeated [here](1) in the sam...

    [1 rows x 7 columns]

    In [25]: dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
    Out[25]:
         struct_id_4055                                                uri  startLine  startColumn  endLine  endColumn message
    180      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           74      722         75    here
    181      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           76      722         77    here
    182      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           78      722         79    here

or manually from the shell:

    # pick up the struct_id_4055:
    0:$ grep "static.*Character ''' is repeated \[here\]" tables/messages.csv
    180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,72,722,73,"Character ''' is repeated [here](1) in the same character class.

    # and find relatedLocations:
    0:$ grep 4927448704 tables/relatedLocations.csv
    180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,74,722,75,here
    181,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,76,722,77,here
    182,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,78,722,79,here
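
The same lookup can be written as a pandas join on the struct_id_4055 key. The sketch below uses small in-memory stand-ins for the two csv files; the rows are abbreviated and the second message row is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for tables/messages.csv and tables/relatedLocations.csv.
# Column names follow the output shown above; the second message row and
# its id are invented so that the join has something to filter out.
messages = pd.DataFrame({
    "struct_id_4055": [4927448704, 4927448999],
    "message": ["Character ''' is repeated [here](1) ...", "other"],
})
related = pd.DataFrame({
    "struct_id_4055": [4927448704] * 3,
    "startColumn": [74, 76, 78],
    "endColumn": [75, 77, 79],
})

# Inner join: one output row per (message, related location) pair.
joined = messages.merge(related, on="struct_id_4055", how="inner")
print(joined[["message", "startColumn", "endColumn"]])
```

With the real files, `pd.read_csv` on each table would replace the two literal DataFrames.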

Changes:
- Introduce scli-dyys, a random id string for later identification and removal of
  dummy table rows.

- Keep the struct_id_4055 column to join tables as needed.

- Output is now written to a directory as there are always multiple files.
2022-02-16 17:03:58 -08:00
Michael Hohn
f246f06d4e sarif-extract-tables: interim commit: form tables
Tables are now formed and kept in the Typegraph instance.
These will be tested using pandas operations to form one of the previous outputs.
2022-02-04 23:56:01 -08:00
Michael Hohn
7a517fa06c sarif-extract-tables: interim commit
Internal destructuring and array aggregation run, but need to be tested.
Tables need to be formed, and pandas selections/joins/etc. used for custom table output.
2022-02-04 14:44:55 -08:00
Michael Hohn
cf8096446b sarif-to-dot: cleanup and preparation for sarif table extraction 2022-02-01 22:42:25 -08:00
Michael Hohn
119f9a5c18 sarif-to-dot: add more support for --fill-structure option
Expand

  ('Struct4827', ('struct', ('physicalLocation', 'Struct4963'))),

to have fields

  ( 'Struct2683',
    ( 'struct',
      ('id', 'Int'),
      ('message', 'Struct2774'),
      ('physicalLocation', 'Struct4963')))

and avoid a redundant table.
2022-01-27 18:55:02 -08:00
Michael Hohn
eb53ede8b1 sarif-to-dot: add more support for --fill-structure option
Common to all:
| ('locations', 'Array008')            |
| ('message', 'Struct009')             |
| ('partialFingerprints', 'Struct010') |
| ('rule', 'Struct011')                |
| ('ruleId', 'String'),                |
| ('ruleIndex', 'Int')))               |

Only some problems and flow problems have
| ('relatedLocations', 'Array014') |

Add dummy value for relatedLocations to reduce to two result categories,
@kind flow problem and @kind problem.
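
A minimal sketch of the dummy-value idea, applied to a single result. The placeholder content and the helper name `with_related_locations` are assumptions; the actual dummy value is not shown in this log.

```python
import copy

# Hypothetical placeholder; the real dummy value is not shown in this log.
DUMMY_RELATED_LOCATIONS = [{"message": {"text": "dummy"}}]


def with_related_locations(result):
    """Return a copy of a sarif result that always has 'relatedLocations'."""
    filled = dict(result)
    # Only add the placeholder when the key is missing; real data is kept.
    filled.setdefault("relatedLocations", copy.deepcopy(DUMMY_RELATED_LOCATIONS))
    return filled
```

After this step every result has the same set of top-level fields, so results differ only by kind.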
2022-01-27 18:18:43 -08:00
Michael Hohn
80b22001ce sarif-to-dot: make signature names order-independent
To create entire subtrees conforming to a signature, first make the
signature names order-independent.  Use hashes to name the signatures.
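
One possible way to derive such names, as a sketch: sort the fields, hash the canonical form, and use part of the hash as the suffix. The function name, the choice of SHA-1, and the four-digit suffix are assumptions, not the scheme used in sarif-to-dot.

```python
import hashlib


def signature_name(kind, fields):
    """Name a signature by a hash of its sorted fields (hypothetical scheme).

    kind:   'struct' or 'array'
    fields: iterable of (field_name, type_name) pairs, in any order
    """
    canonical = repr((kind, tuple(sorted(fields))))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    # A short numeric suffix, loosely in the style of 'Struct4963'.
    return f"{kind.capitalize()}{int(digest, 16) % 10000:04d}"
```

A four-digit suffix can collide for large signature sets; a longer slice of the digest would reduce that risk.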
2022-01-27 17:53:14 -08:00
Michael Hohn
0b13a297a5 sarif-to-dot: add more support for --fill-structure option
Ensure

    ('Array003', ('array', (0, 'String'))),

is always present, collapse the following into one:

( 'Struct032',
  ( 'struct',
    ('artifacts', 'Array002'),
    ('columnKind', 'String'),
    ('newlineSequences', 'Array003'),
    ('properties', 'Struct004'),
    ('results', 'Array023'),
    ('tool', 'Struct029'),
    ('versionControlProvenance', 'Array031'))),

( 'Struct033',
  ( 'struct',
    ('artifacts', 'Array002'),
    ('columnKind', 'String'),
    ('properties', 'Struct004'),
    ('results', 'Array023'),
    ('tool', 'Struct029'),
    ('versionControlProvenance', 'Array031')))
2022-01-26 22:27:07 -08:00
Michael Hohn
2adf0dfa21 sarif-to-dot: increase graph ranksep to get intelligible edges 2022-01-26 16:15:42 -08:00
Michael Hohn
2c98cf0d41 sarif-to-dot: add more support for --fill-structure option
When both

   ('message', 'Struct009'),
   ('physicalLocation', 'Struct006'))),

are present, ensure

      ('id', 'Int'),

also is.
2022-01-26 16:06:15 -08:00
Michael Hohn
2b75988b9a sarif-to-dot: add more support for --fill-structure option
Expand all 'properties' objects to common signature; instead of the 3
entries, get one:

( 'struct',
('kind', 'String'),
('precision', 'String'),
('severity', 'String'),
('tags', 'Array003')))

( 'struct',
('kind', 'String'),
('precision', 'String'),
('security-severity', 'String'),
('severity', 'String'),
('tags', 'Array003'))

( 'struct',
('kind', 'String'),
('precision', 'String'),
('severity', 'String'),
('sub-severity', 'String'),
('tags', 'Array003'))
2022-01-26 15:41:26 -08:00
Michael Hohn
153eba8346 sarif-to-dot: to reduce graph clutter, add option --no-edges-to-scalars 2022-01-26 00:41:31 -08:00
Michael Hohn
d7d566c5db sarif-to-dot: add more support for --fill-structure option
Collapse multiple 'physicalLocation's into one; from
 ( 'Struct006',
    ('struct', ('artifactLocation', 'Struct000'), ('region', 'Struct005'))),

 ('Struct036', ('struct', ('artifactLocation', 'Struct000'))),

to

 ( 'Struct006',
    ('struct', ('artifactLocation', 'Struct000'), ('region', 'Struct005'))),
2022-01-25 23:43:43 -08:00
Michael Hohn
b816705574 sarif-to-dot: add --fill-structure option and initial library support
This collapses the rightmost column of the signature output from

    ../../bin/sarif-to-dot -u -t -d -f results.sarif | dot -Tpdf

which has multiple distinct entries

 ('Struct030', ('struct', ('endColumn', 'Int'), ('startLine', 'Int'))),
 ( 'Struct016',
    ( 'struct',
      ('endColumn', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),
 ( 'Struct025',
    ( 'struct',
      ('endColumn', 'Int'),
      ('endLine', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),
 ('Struct030', ('struct', ('endColumn', 'Int'), ('startLine', 'Int'))),

to a single entry,

  ( 'Struct005',
    ( 'struct',
      ('endColumn', 'Int'),
      ('endLine', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),

when using

    ../../bin/sarif-to-dot results.sarif -u -t -f
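
The collapse can be pictured as filling each partial struct signature up to the full field set. The sketch below works on the tuple form printed above; `fill_struct` and `REGION_FIELDS` are hypothetical names, not the library's API.

```python
def fill_struct(signature, required):
    """Return a struct signature that contains every field in `required`.

    signature: ('struct', (name, type), ...) as printed by sarif-to-dot
    required:  dict mapping field name to type name
    """
    tag, *fields = signature
    merged = dict(fields)
    for name, typ in required.items():
        merged.setdefault(name, typ)
    # Sort so that equal field sets always produce the same tuple.
    return (tag, *sorted(merged.items()))


REGION_FIELDS = {
    "endColumn": "Int",
    "endLine": "Int",
    "startColumn": "Int",
    "startLine": "Int",
}
```

Applying `fill_struct` with `REGION_FIELDS` to each of the partial region signatures yields the same four-field tuple, which is why they reduce to a single entry.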
2022-01-25 23:18:20 -08:00
Michael Hohn
edfe1f3363 sarif-to-dot: move signature functions into their own module 2022-01-25 17:57:44 -08:00
Michael Hohn
113fa483ca traverse: add file header 2022-01-16 13:23:33 -08:00
Michael Hohn
ef08825b43 Processing in stages: Move the initial sarif_cli code to sarif_cli/traverse 2021-12-22 18:03:34 -08:00
Michael Hohn
9590d0a677 Add newline after dbg(message) output 2021-12-18 14:19:38 -08:00
Michael Hohn
f1d21e4a43 Fix missing 'region' key in relatedLocations: use whole-file output
The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
2021-12-08 16:02:31 -08:00
Michael Hohn
1271589bc4 Fix class NoFile: comment 2021-12-06 15:34:03 -08:00
Michael Hohn
92d904ee10 Add quick check to verify that input is sarif
An occasional output from LGTM is
    {"code":404,"error":"The specified analysis could not be found"}

With this patch, the csv output is now
    "ERROR","invalid json contents %s","some-file.json"

and the plain text output becomes
    ERROR: invalid json contents in some-file.json
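
A quick check of this kind might look like the sketch below. It assumes that a usable sarif file is a JSON object with top-level `version` and `runs` keys and that `runs` is a list; the commit does not say which keys the actual check uses, and `looks_like_sarif` is a hypothetical name.

```python
import json


def looks_like_sarif(text):
    """Cheap structural check; this does not validate against the SARIF schema."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and "version" in data
        and isinstance(data.get("runs"), list)
    )
```

The LGTM error response above is valid JSON but has neither key, so it is rejected before any traversal starts.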
2021-12-06 14:24:08 -08:00
Michael Hohn
120e673424 Fix: handle relatedLocations without physicalLocations (files)
Problem:
    The
        artifact = get(related_location, 'physicalLocation', 'artifactLocation')
    requested by
        message, artifact, region = S.get_relatedlocation_message_info(location)
    is incomplete:
        ipdb> p related_location
        {'message': {'text': 'request'}}

Fix:
    Introduce the NoFile class to propagate this and handle it where needed.
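
A sketch of the sentinel idea follows. Only the name NoFile and the `get(...)` call shape come from this commit; the bodies of `get` and `relatedlocation_artifact` are assumptions made for illustration.

```python
class NoFile:
    """Sentinel for a location that has no physicalLocation (no file)."""

    def __repr__(self):
        return "<NoFile>"


def get(sarif_struct, *path):
    """Follow keys/indices into nested dicts and lists (assumed behavior)."""
    res = sarif_struct
    for p in path:
        res = res[p]
    return res


def relatedlocation_artifact(related_location):
    """Return the artifactLocation, or a NoFile instance if it is absent."""
    if "physicalLocation" not in related_location:
        return NoFile()
    return get(related_location, "physicalLocation", "artifactLocation")
```

Callers can then test with `isinstance(artifact, NoFile)` and print the sentinel directly instead of catching a KeyError.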

Now simply report <NoFile> as appropriate.
    For plain text output:

        RESULT: src/optionsparser/ ..
        FLOW STEP 0: <NoFile>: request
        FLOW STEP 1: <NoFile>: request_mp
        FLOW STEP 2: src/....

    For csv output:

        "result","src/optionsparser/...","116","26","116","34","`& ...` used as ..."
        "flow_step","0","<NoFile>","-1","-1","-1","-1","request"
        "flow_step","1","<NoFile>","-1","-1","-1","-1","request_mp"
        "flow_step","2","src/foo.cpp","119","97","119","104","request"
2021-12-06 12:37:35 -08:00
Michael Hohn
2c3ca3c0eb Fix for KeyError: 'region', caused by result without region
Region / line / column information is present in most messages.  The one that
caused this error refers to the whole file:

    ipdb> p sarif_struct

    {'ruleId': 'com.lgtm/cpp-queries:cpp/missing-header-guard', 'ruleIndex': 12,
    'message': {'text': 'This header file should contain a header guard to prevent
    multiple inclusion.'}, 'locations': [{'physicalLocation': {'artifactLocation':
    {'uri': 'diff/cmpbuf.h', 'uriBaseId': '%SRCROOT%', 'index': 13}}}],
    'partialFingerprints': {'primaryLocationLineHash': 'd04cb834fa64727d:1',
    'primaryLocationStartColumnFingerprint': '0'}}

The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
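
The fixed-structure idea, sketched under assumptions: `region_tuple` is a hypothetical helper, a missing startColumn defaults to 1 and a missing endLine to startLine (the SARIF defaults), and endColumn is assumed to be present whenever a region exists.

```python
WHOLE_FILE = (-1, -1, -1, -1)


def region_tuple(physical_location):
    """Return (startLine, startColumn, endLine, endColumn) or WHOLE_FILE."""
    region = physical_location.get("region")
    if region is None:
        # The result refers to the whole file; keep the output shape fixed.
        return WHOLE_FILE
    start_line = region["startLine"]
    return (
        start_line,
        region.get("startColumn", 1),
        region.get("endLine", start_line),
        region["endColumn"],  # assumed present in this sketch
    )
```

Every result then yields four integers, so the csv and plain text writers need no special case.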
2021-12-06 11:48:53 -08:00
Michael Hohn
ffcacec630 sarif-results-summary: add csv output option 2021-12-06 11:48:53 -08:00
Michael Hohn
f0aa815a9a Fix encoding read error
Reading with
: with open(fname, 'r') as file:
hits the accented letter á in Vrána in the file
: data/wxWidgets-small/src/stc/scintilla/lexers/LexCSS.cxx
and results in a
: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 119: invalid continuation byte

We are reading source code, so we likely don't care about exact handling of non-ascii characters; using
: with codecs.open(fname, 'r', encoding="latin-1") as file:
avoids this problem, since latin-1 accepts every byte value.
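
The difference between the two codecs can be reproduced without the source file. The bytes below are built for illustration and are not read from LexCSS.cxx.

```python
# b'Vr\xe1na': the name as it would be stored in a latin-1 encoded file.
raw = "Vrána".encode("latin-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xe1 starts a multi-byte utf-8 sequence that 'n' cannot continue.
    utf8_ok = False

# latin-1 maps every byte 0x00-0xff to a code point, so decoding cannot fail.
text = raw.decode("latin-1")
print(utf8_ok, text)
```

The trade-off is that a file that really is utf-8 would decode without error under latin-1 but show its multi-byte characters as several wrong ones.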
2021-12-06 11:48:53 -08:00
Michael Hohn
85ddaaafe1 sarif-results-summary: add codeFlow (path-problem) output, remove meta-data
The per-language result counts are removed; they belong in a separate sarif-info script.
2021-12-06 11:48:53 -08:00
Michael Hohn
6147e57260 Introduce get_relatedlocation_message_info to co-locate tree information 2021-11-17 16:34:20 -08:00
Michael Hohn
1f7e78b049 refactor: introduce get_location_message_info 2021-11-17 16:28:43 -08:00
Michael Hohn
90758f769f factor common code into display_underlined 2021-11-17 15:56:43 -08:00
Michael Hohn
9f3be7bcb0 Log missing files, but try to continue execution 2021-11-16 21:45:54 -08:00
Michael Hohn
e36874cb54 sarif-results-summary: underline affected code region
Using
    sarif-results-summary -s data/linux-small data/torvalds_linux__2021-10-21_10_07_00__export.sarif |less
now underscores the indicated regions, e.g.

tools/cgroup/iocost_monitor.py:64:5:64:27: Normal methods should have 'self', rather than 'blkcg', as their first parameter.

    def blkcg_name(blkcg):
    ^^^^^^^^^^^^^^^^^^^^^^
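
The caret line can be built from the column numbers alone. This sketch assumes 1-based columns with an exclusive end column, which is consistent with the 64:5:64:27 example above; `underline` is a hypothetical name and multi-line regions are not handled.

```python
def underline(start_column, end_column):
    """Build a caret line for 1-based columns; end_column is exclusive."""
    return " " * (start_column - 1) + "^" * (end_column - start_column)


line = "    def blkcg_name(blkcg):"
print(line)
print(underline(5, 27))
```

The carets only line up if each source character occupies one output column, so tabs in the source line need to be normalized first.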
2021-11-15 14:16:23 -08:00
Michael Hohn
a756abbb09 Consistency with tabs in Python source code
In load_lines, use 1 space for each tab
2021-11-15 14:00:18 -08:00
Michael Hohn
912f75c52a fix load_lines: only strip newlines 2021-11-15 13:41:51 -08:00
Michael Hohn
b69eec404d sarif-results-summary -s: include source file lines in output 2021-11-09 16:10:12 -08:00
Michael Hohn
ab1d7c27ef Use sensible values for start/end line/columns for empty entries in the sarif 'region' structure. 2021-11-09 15:04:36 -08:00
Michael Hohn
a0af2c8c59 fix: traverse all languages 2021-11-09 14:29:31 -08:00
Michael Hohn
3032fe3fcd pre-alpha versions of bin/sarif-{digest,labeled,list-files,results-summary} 2021-11-09 12:21:12 -08:00