Commit Graph

177 Commits

Author SHA1 Message Date
Michael Hohn
0f070a6ae4 sarif-extract-multi: extract combined tables from multiple sarif files
This command introduces a new tree structure that pulls in a collection
of sarif files.  In yaml format, an example is

    - creation_date: '2021-12-09'   # Repository creation date
      primary_language: javascript  # By lines of code
      project_name: treeio/treeio   # Repo name-short name
      query_commit_id: fa9571646c   # Commit id for custom (non-library) queries
      sarif_content: {}             # The sarif content will be attached here
      sarif_file_name: 2021-12-09/results.sarif # Path to sarif file
      scan_start_date: '2021-12-09'             # Beginning date/time of scan
      scan_stop_date:  '2021-12-10'             # End date/time of scan
      tool_name: codeql
      tool_version: v1.27

    - creation_date: '2022-02-25'
      primary_language: javascript
      ...

At run time,

    cd ~/local/sarif-cli/data/treeio
    sarif-extract-multi multi-sarif-01.json test-multi-table

will load the specified sarif files and put them in place of
`sarif_content`, then build tables against the new signature found in
sarif_cli/signature_multi.py, and merge those into 6 larger tables.  The
exported tables are

    artifacts.csv  path-problem.csv  project.csv
    codeflows.csv  problem.csv       related-locations.csv

and they have join keys for further operations.

The new typegraph is rendered in

    notes/typegraph-multi.pdf

using the instructions in

    sarif_cli/signature_multi.py
2022-03-11 23:00:53 -08:00
Michael Hohn
9c151e295b sarif-extract-tables: include relatedLocations from both sources
With the addition of the path-problem output, include both as sources (left joins)
for relatedLocations:

    pd.concat([sf(4055)[['relatedLocations', 'struct_id']],
              sf(9699)[['relatedLocations', 'struct_id']]])
2022-02-22 17:35:39 -08:00
Michael Hohn
1dbd240b5b sarif-extract-tables: Form the codeFlows dataframe and write it out
One of the shorter multi-path results from
     cd ~/local/sarif-cli/data/treeio
     ../../bin/sarif-results-summary -r results.sarif |less
follows; the dataframe formed here starts with the codeFlows-containing table 9699
and has the content of the PATH * output below.

    RESULT: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: [DOM text](1) is reinte
    rpreted as HTML without escaping meta-characters.
    [DOM text](2) is reinterpreted as HTML without escaping meta-characters.
    [DOM text](3) is reinterpreted as HTML without escaping meta-characters.
    REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:90:17:90:27: DOM text
    REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:91:17:91:28: DOM text
    REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:92:17:92:31: DOM text
    PATH 0
    FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:90:17:90:27: name.val()
    FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>"  ... "</tr>
    "
    PATH 1
    FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:91:17:91:28: email.val()
    FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>"  ... "</tr>"
    PATH 2
    FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:92:17:92:31: password.val()
    FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>"  ... "</tr>"
2022-02-22 16:50:44 -08:00
Michael Hohn
ad738abed3 sarif-extract-tables: also output relatedLocations table
With --related-locations,

    ../../bin/sarif-results-summary -r results.sarif

produces the details

    RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:
    72:722:73: Character ''' is repeated [here](1) in the same character class.
    Character ''' is repeated [here](2) in the same character class.
    Character ''' is repeated [here](3) in the same character class.
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:74:722:75: here
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:76:722:77: here
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:78:722:79: here

Via
    ../../bin/sarif-extract-tables results.sarif tables

sarif-extract-tables now produces two output tables,

    tables/
    ├── messages.csv
    └── relatedLocations.csv

that contain the relevant information and can be joined or otherwise combined on
the struct_id_4055 key.

For example, adding to the end of sarif-extract-tables:
    import IPython
    IPython.embed()

    msg = d2[d2.message.str.startswith("Character ''' is repeated [here]")]
    dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]

    In [24]: msg
    Out[24]:
         struct_id_4055  ...                                            message
    180      4796917312  ...  Character ''' is repeated [here](1) in the sam...

    [1 rows x 7 columns]

    In [25]: dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
    Out[25]:
         struct_id_4055                                                uri  startLine  startColumn  endLine  endColumn message
    180      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           74      722         75    here
    181      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           76      722         77    here
    182      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           78      722         79    here

or manually from the shell:

    # pick up the struct_id_4055:
    0:$ grep "static.*Character ''' is repeated \[here\]" tables/messages.csv
    180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,72,722,73,"Character ''' is repeated [here](1) in the same character class.

    # and find relatedLocations:
    0:$ grep 4927448704 tables/relatedLocations.csv
    180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,74,722,75,here
    181,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,76,722,77,here
    182,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,78,722,79,here

Changes:
- Introduce scli-dyys, a random id string for later identification and removal of
  dummy table rows.

- Keep the struct_id_4055 column to join tables as needed.

- Output is now written to a directory as there are always multiple files.
2022-02-16 17:03:58 -08:00
Michael Hohn
ec9a0b5590 sarif-extract-tables: initial version, reproduces known output as table
Reproduce the

    file:line:col:line:col: message

output from

    ../../bin/sarif-results-summary results.sarif | grep size

as test/example.

Original sample output is

    RESULT: static/js/fileuploader.js:1214:13:1214:17: Unused variable size.
    RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js:438:30:438:34: Unused variable size.

The table result here is

    0:$ ../../bin/sarif-extract-tables results.sarif | grep size
    0,static/js/fileuploader.js,1214,13,1214,17,Unused variable size.
    34,static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js,438,30,438,34,Unused variable size.
2022-02-08 20:04:28 -08:00
Michael Hohn
f5e73e90ba sarif-extract-tables: interim commit: first joins
These joins construct the table needed for sarif-results-summary output
2022-02-07 17:11:55 -08:00
Michael Hohn
f246f06d4e sarif-extract-tables: interim commit: form tables
Tables are now formed and kept in the Typegraph instance.
These will be tested using pandas operations to form one of the previous outputs.
2022-02-04 23:56:01 -08:00
Michael Hohn
7a517fa06c sarif-extract-tables: interim commit
Internal destructuring and array aggregration run, but need to be tested.
Tables need to be formed, and pandas selections/joins/etc. used for custom table output.
2022-02-04 14:44:55 -08:00
Michael Hohn
cf8096446b sarif-to-dot: cleanup for and preparation for sarif table extraction 2022-02-01 22:42:25 -08:00
Michael Hohn
c664ae2f8f .gitignore: ignore temporary files 2022-02-01 22:31:10 -08:00
Michael Hohn
119f9a5c18 sarif-to-dot: add more support for --fill-structure option
Expand

  ('Struct4827', ('struct', ('physicalLocation', 'Struct4963'))),

to have fields

  ( 'Struct2683',
    ( 'struct',
      ('id', 'Int'),
      ('message', 'Struct2774'),
      ('physicalLocation', 'Struct4963')))

and avoid a redundant table.
2022-01-27 18:55:02 -08:00
Michael Hohn
eb53ede8b1 sarif-to-dot: add more support for --fill-structure option
Common to all:
| ('locations', 'Array008')            |
| ('message', 'Struct009')             |
| ('partialFingerprints', 'Struct010') |
| ('rule', 'Struct011')                |
| ('ruleId', 'String'),                |
| ('ruleIndex', 'Int')))               |

Only some problems and flow problems have
| ('relatedLocations', 'Array014') |

Add dummy value for relatedLocations to reduce to two result categories,
@kind flow problem and @kind problem.
2022-01-27 18:18:43 -08:00
Michael Hohn
80b22001ce sarif-to-dot: make signature names order-independent
To create entire subtrees conforming to a signature, first make the
signature names order-independent.  Use hashes to name the signatures.
2022-01-27 17:53:14 -08:00
Michael Hohn
3e5d3ff5de Added interesting sarif structure diagram to notes/ 2022-01-26 23:25:30 -08:00
Michael Hohn
0b13a297a5 sarif-to-dot: add more support for --fill-structure option
Ensure

    ('Array003', ('array', (0, 'String'))),

is always present, collapse the following into one:

( 'Struct032',
  ( 'struct',
    ('artifacts', 'Array002'),
    ('columnKind', 'String'),
    ('newlineSequences', 'Array003'),
    ('properties', 'Struct004'),
    ('results', 'Array023'),
    ('tool', 'Struct029'),
    ('versionControlProvenance', 'Array031'))),

( 'Struct033',
  ( 'struct',
    ('artifacts', 'Array002'),
    ('columnKind', 'String'),
    ('properties', 'Struct004'),
    ('results', 'Array023'),
    ('tool', 'Struct029'),
    ('versionControlProvenance', 'Array031')))
2022-01-26 22:27:07 -08:00
Michael Hohn
2adf0dfa21 sarif-to-dot: increase graph ranksep to get intelligible edges 2022-01-26 16:15:42 -08:00
Michael Hohn
2c98cf0d41 sarif-to-dot: add more support for --fill-structure option
When both

   ('message', 'Struct009'),
   ('physicalLocation', 'Struct006'))),

are present, ensure

      ('id', 'Int'),

also is.
2022-01-26 16:06:15 -08:00
Michael Hohn
2b75988b9a sarif-to-dot: add more support for --fill-structure option
Expand all 'properties' objects to common signature; instead of the 3
entries, get one:

( 'struct',
('kind', 'String'),
('precision', 'String'),
('severity', 'String'),
('tags', 'Array003')))

( 'struct',
('kind', 'String'),
('precision', 'String'),
('security-severity', 'String'),
('severity', 'String'),
('tags', 'Array003'))

( 'struct',
('kind', 'String'),
('precision', 'String'),
('severity', 'String'),
('sub-severity', 'String'),
('tags', 'Array003'))
2022-01-26 15:41:26 -08:00
Michael Hohn
153eba8346 sarif-to-dot: to reduce graph clutter, add option --no-edges-to-scalars 2022-01-26 00:41:31 -08:00
Michael Hohn
d7d566c5db sarif-to-dot: add more support for --fill-structure option
Collapse multipl 'physicalLocation's into one; from
 ( 'Struct006',
    ('struct', ('artifactLocation', 'Struct000'), ('region', 'Struct005'))),

 ('Struct036', ('struct', ('artifactLocation', 'Struct000'))),

to

 ( 'Struct006',
    ('struct', ('artifactLocation', 'Struct000'), ('region', 'Struct005'))),
2022-01-25 23:43:43 -08:00
Michael Hohn
b816705574 sarif-to-dot: add --fill-structure option and initial library support
This collapses the rightmost column of the signature output from

    ../../bin/sarif-to-dot -u -t -d -f results.sarif | dot -Tpdf

which has multiple distinct entries

 ('Struct030', ('struct', ('endColumn', 'Int'), ('startLine', 'Int'))),
 ( 'Struct016',
    ( 'struct',
      ('endColumn', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),
 ( 'Struct025',
    ( 'struct',
      ('endColumn', 'Int'),
      ('endLine', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),
 ('Struct030', ('struct', ('endColumn', 'Int'), ('startLine', 'Int'))),

to a single entry,

  ( 'Struct005',
    ( 'struct',
      ('endColumn', 'Int'),
      ('endLine', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),

when using

    ../../bin/sarif-to-dot results.sarif -u -t -f
2022-01-25 23:18:20 -08:00
Michael Hohn
edfe1f3363 sarif-to-dot: move signature functions into their own module 2022-01-25 17:57:44 -08:00
Michael Hohn
0444a87076 sarif-to-dot: remove module-variable references 2022-01-25 17:49:07 -08:00
Michael Hohn
86caa3f56f sarif-to-dot: small renaming 2022-01-24 17:24:30 -08:00
Michael Hohn
939ba9bd8a sarif-to-dot: output array signatures as nodes, not edges; fix raise statements 2022-01-20 18:09:45 -08:00
Michael Hohn
cef9b47b58 sarif-to-dot: produce dot output using -d option
The command
   ../../bin/sarif-to-dot results.sarif -u -t -d | dot -Tpdf > raw-nested-types.pdf
produces a good illustration of the problems arising when optional values are absent.
To clean this up, structures missing fields have to be supplemented with those fields,
from right to left in the graph.
This is basically what sarif-results-summary does on the fly, it just has to be applied
to the input tree before collecting the signatures and producing this graph.
Once that is done, the types collected here can be used in SQL table export.
2022-01-16 14:21:23 -08:00
Michael Hohn
113fa483ca traverse: add file header 2022-01-16 13:23:33 -08:00
Michael Hohn
d64b100101 sarif-to-dot: move processing code to the end 2022-01-16 01:39:24 -08:00
Michael Hohn
b94be6a21e sarif-to-dot: map values to their typedf 2022-01-16 01:26:24 -08:00
Michael Hohn
afca6b341a sarif-to-dot: improved output, add three options
-   Full view with some clean-up:
    45608 lines

        cd ~/local/sarif-cli/data/treeio
        ../../bin/sarif-to-dot results.sarif | tr -d "',[]"  |less

-   Only show unique array entry signatures
    1573 lines

        cd ~/local/sarif-cli/data/treeio
        ../../bin/sarif-to-dot results.sarif -u | tr -d "',[]"  |less

-   Only show unique array entry signatures, typedef object signatures
    338 lines

        cd ~/local/sarif-cli/data/treeio
        ../../bin/sarif-to-dot results.sarif -u -t | tr -d "',[]"  |less
2022-01-16 01:07:24 -08:00
Michael Hohn
706e4cdd54 sarif-to-dot: Print the type signature of a sarif file, at various levels of verbosity. 2022-01-15 23:05:06 -08:00
Michael Hohn
ef08825b43 Processing in stages: Move the initial sarif_cli code to sarif_cli/traverse 2021-12-22 18:03:34 -08:00
Michael Hohn
7d49c3bd08 Update the sarif-results-summary examples 2021-12-22 17:48:24 -08:00
Michael Hohn
558e218d3b Add endpoints-only option for path output and a collection of usage samples 2021-12-21 14:05:27 -08:00
Michael Hohn
79649a6226 Add treeio/ files referenced in sarif 2021-12-18 14:58:51 -08:00
Michael Hohn
979042ff5c Add a 3 =relatedLocations= and 3 =threadFlows= example 2021-12-18 14:58:10 -08:00
Michael Hohn
f0e52753f6 Illustration of the steps needed to pull in used source files only 2021-12-18 14:56:39 -08:00
Michael Hohn
9590d0a677 Add newline after dbg(message) output 2021-12-18 14:19:38 -08:00
Michael Hohn
291726dd58 Add smaller sarif test files 2021-12-18 13:19:11 -08:00
Michael Hohn
68a661fffb Added notes on more thorough examination of multiple results 2021-12-18 00:33:38 -08:00
Michael Hohn
7e66e29f53 Fix editing error 2021-12-15 14:02:27 -08:00
Michael Hohn
62ae8dca4a Correct the =sarif-results-summary= commands 2021-12-10 11:56:10 -08:00
Michael Hohn
780def7063 Add utility scripts to retrieve sarif files from lgtm 2021-12-10 11:25:03 -08:00
Michael Hohn
5386310b1b Prepend path index to data flow results; use single newlines 2021-12-08 16:28:32 -08:00
Michael Hohn
f1d21e4a43 Fix missing 'region' key in relatedLocations: use whole-file output
The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
2021-12-08 16:02:31 -08:00
Michael Hohn
1271589bc4 Fix class NoFile: comment 2021-12-06 15:34:03 -08:00
Michael Hohn
92d904ee10 Add quick check to verify that input is serif
An occasional output from LGTM is
    {"code":404,"error":"The specified analysis could not be found"}

With this patch, the csv output is now
    "ERROR","invalid json contents %s","some-file.json"

and the plain text output becomes
    ERROR: invalid json contents in some-file.json
2021-12-06 14:24:08 -08:00
Michael Hohn
120e673424 Fix: handle relatedLocations without physicalLocations (files)
Problem:
    The
        artifact = get(related_location, 'physicalLocation', 'artifactLocation')
    requested by
        message, artifact, region = S.get_relatedlocation_message_info(location)
    is incomplete:
        ipdb> p related_location
        {'message': {'text': 'request'}}

Fix:
    Introduce the NoFile class to propagate this and handle it where needed.

Now simply report <NoFile> as appropriate.
    For plain text output:

        RESULT: src/optionsparser/ ..
        FLOW STEP 0: <NoFile>: request
        FLOW STEP 1: <NoFile>: request_mp
        FLOW STEP 2: src/....

    For csv output:

        "result","src/optionsparser/...","116","26","116","34","`& ...` used as ..."
        "flow_step","0","<NoFile>","-1","-1","-1","-1","request"
        "flow_step","1","<NoFile>","-1","-1","-1","-1","request_mp"
        "flow_step","2","src/foo.cpp","119","97","119","104","request"
2021-12-06 12:37:35 -08:00
Michael Hohn
2c3ca3c0eb Fix for KeyError: 'region', caused by result without region
Region / line / column information are present in most messages.  The one that
caused this error refers to the whole file:

    ipdb> p sarif_struct

    {'ruleId': 'com.lgtm/cpp-queries:cpp/missing-header-guard', 'ruleIndex': 12,
    'message': {'text': 'This header file should contain a header guard to prevent
    multiple inclusion.'}, 'locations': [{'physicalLocation': {'artifactLocation':
    {'uri': 'diff/cmpbuf.h', 'uriBaseId': '%SRCROOT%', 'index': 13}}}],
    'partialFingerprints': {'primaryLocationLineHash': 'd04cb834fa64727d:1',
    'primaryLocationStartColumnFingerprint': '0'}}

The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
2021-12-06 11:48:53 -08:00
Michael Hohn
ffcacec630 sarif-results-summary: add csv output option 2021-12-06 11:48:53 -08:00