Commit Graph

76 Commits

Author SHA1 Message Date
Kristen Newbury
eb50bdf834 Merge branch 'main' 2023-05-15 13:09:21 -04:00
Kristen Newbury
953d47edd3 Fix extract scans interface CLI default 2023-03-02 11:43:25 -05:00
Benjamin Groh
e8123903f6 Use repositoryUri instead of org/repo 2023-01-18 16:40:39 -05:00
Kristen Newbury
04e3dedb77 Merge pull request #2 from dbeer/exceptions
Fix exception reraising
2023-01-12 12:23:00 -05:00
Kristen Newbury
7dad175d4d Fix tool to default CLI not LGTM sarif input
update readme minor improvement
2023-01-12 12:03:51 -05:00
Kristen Newbury
1a915e4de8 Update how project_id is generated
previously relied on assumption:
naming like: <org>/<project> in
repositoryUri
now just uses full repositoryUri
2023-01-05 16:37:55 -05:00
Daniel Beer
6b475becd9 Fix exception reraising 2022-12-30 12:40:07 -05:00
Kristen Newbury
04a5aae14d Add CLI support
enabled by -f flag with CLI value
tested on sarif from CodeQL CLIs:
2.6.3, 2.9.4, 2.11.4
MUST contain versionControlProvenance property however
2022-12-15 19:12:58 -05:00
Kristen Newbury
009cf12d2c Fix load error csv output error 2022-12-12 17:15:49 -05:00
Kristen Newbury
2bda917a4e Improve error handling on signature mismatch cases
and cleanup old todos that have been addressed
2022-11-23 14:06:23 -05:00
Kristen Newbury
678219beb7 Add csv status aggregate tool 2022-11-15 10:18:12 -05:00
Kristen Newbury
d9bdcc8724 Fix runner defaults and setup more options
sarif-extract-scans-runner now takes specific outer
output dir
bin/sarif-aggregate-scans now takes specific directory
to summarize from
2022-11-14 14:30:55 -05:00
Kristen Newbury
066fcb8248 Add error handling csv writer
writer generates status csv per sarif
2022-11-14 13:02:36 -05:00
Kristen Newbury
a9d84ce26c Make sarif-aggregate-scans executable 2022-11-10 10:51:30 -05:00
Kristen Newbury
1caf03f5f0 Rework project name format and project id format 2022-11-07 13:56:50 -05:00
Kristen Newbury
4121072088 Rework project and scan id generation
goal:
deterministic across multiple instances of scan on same sarif file
no collisions between sarif files from different scan instances (regardless of for same project or not)

assumption sarif file naming will follow: <project>/<unique_filename_per_analysis> format
2022-10-26 12:00:38 -04:00
Kristen Newbury
d9116eba6a Move flakegen scan id to outermost bin tool runner 2022-10-25 10:40:25 -04:00
Kristen Newbury
4285b7a834 Add unique flakegen scan id 2022-10-21 12:16:44 -04:00
Michael Hohn
203343df07 Add sarif-pad-aggregate to fill scan values
Fills the scans table's db_create_start/stop and scan_start/stop_date
columns with realistic random values.
2022-08-31 21:19:02 -07:00
Michael Hohn
235acf6b93 Quote all non-numeric CSV output 2022-08-10 17:44:29 -07:00
Michael Hohn
03a9ef0477 Rewrite sarif-combine-tables.py as full tool, bin/sarif-aggregate-scans 2022-08-10 17:34:35 -07:00
Michael Hohn
7e996e746c Rewrite sarif-runner as full tool, sarif-extract-scans-runner 2022-08-08 14:47:25 -07:00
Michael Hohn
560b9ecf35 Enforce types when forming the scan tables (internal and output formatting)
Force all column types to ensure appropriate formatting for writing.  In
particular, no character data in place of integers, no floats, no
objects in place of strings.

Table formation for the functions
- st.joins_for_results
- st.joins_for_scans
- st.joins_for_projects
enforces types.
2022-08-07 19:04:13 -07:00
Michael Hohn
ef00559408 Bring sarif-extract-tables up to date with sarif-extract-scans 2022-07-19 15:42:26 -07:00
Michael Hohn
741be0cfe1 Include project table in output of sarif-extract-scans; add commit_id to scans table 2022-06-02 16:45:04 -07:00
Michael Hohn
eb8e2f18e9 Initial version of sarif-extract-scans, to be tested
Running

    cd ~/local/sarif-cli/data/treeio
    sarif-extract-scans scan-spec-0.json test-scan

produces the 2 derived and one sarif-based table (codeflows.csv):

    ls test-scan/
    codeflows.csv  results.csv  scans.csv

Adding -r via

    sarif-extract-scans -r scan-spec-0.json test-scan

writes all tables:

    ls test-scan/
    artifacts.csv  kind_pathproblem.csv  project.csv           results.csv  scans.csv
    codeflows.csv  kind_problem.csv      relatedLocations.csv  rules.csv
2022-05-16 18:58:53 -07:00
Michael Hohn
154b0bdc56 WIP: assemble derived 'results' table 2022-05-13 17:01:18 -07:00
Michael Hohn
b212423907 WIP: sarif-extract-scans: back to single sarif file handling, incorporate multi-file libraries 2022-05-10 19:01:38 -07:00
Michael Hohn
30e3dd3a37 Replace internal ids with snowflake ids before writing tables 2022-04-29 22:39:25 -07:00
Michael Hohn
d5390bb87e Full revision of the base tables derived from multiple sarif input files
The new base tables produced by `sarif-extract-multi` are
    artifacts
    codeflows
    kind_pathproblem
    kind_problem
    project
    relatedLocations
    rules

The revised table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb

The file notes/typegraph-multi-with-tables.pdf illustrates what original (sarif)
tables are used to form the base (derived) tables.
2022-03-23 16:37:41 -07:00
Michael Hohn
db00f17137 Some cleanup based on pyflakes output 2022-03-17 17:23:53 -07:00
Michael Hohn
b82c620a1e Add overview of the base tables derived from multi-sarif input; add rules.csv
The table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb and makes use of some formatting
customizations to actually get an overview.

The initial `projects` table had far too many entries; the `rules` part
is now in a separate `rules` table.
2022-03-16 16:54:14 -07:00
Michael Hohn
0f070a6ae4 sarif-extract-multi: extract combined tables from multiple sarif files
This command introduces a new tree structure that pulls in a collection
of sarif files.  In yaml format, an example is

    - creation_date: '2021-12-09'   # Repository creation date
      primary_language: javascript  # By lines of code
      project_name: treeio/treeio   # Repo name-short name
      query_commit_id: fa9571646c   # Commit id for custom (non-library) queries
      sarif_content: {}             # The sarif content will be attached here
      sarif_file_name: 2021-12-09/results.sarif # Path to sarif file
      scan_start_date: '2021-12-09'             # Beginning date/time of scan
      scan_stop_date:  '2021-12-10'             # End date/time of scan
      tool_name: codeql
      tool_version: v1.27

    - creation_date: '2022-02-25'
      primary_language: javascript
      ...

At run time,

    cd ~/local/sarif-cli/data/treeio
    sarif-extract-multi multi-sarif-01.json test-multi-table

will load the specified sarif files and put them in place of
`sarif_content`, then build tables against the new signature found in
sarif_cli/signature_multi.py, and merge those into 6 larger tables.  The
exported tables are

    artifacts.csv  path-problem.csv  project.csv
    codeflows.csv  problem.csv       related-locations.csv

and they have join keys for further operations.

The new typegraph is rendered in

    notes/typegraph-multi.pdf

using the instructions in

    sarif_cli/signature_multi.py
2022-03-11 23:00:53 -08:00
Michael Hohn
9c151e295b sarif-extract-tables: include relatedLocations from both sources
With the addition of the path-problem output, include both as sources (left joins)
for relatedLocations:

    pd.concat([sf(4055)[['relatedLocations', 'struct_id']],
              sf(9699)[['relatedLocations', 'struct_id']]])
2022-02-22 17:35:39 -08:00
Michael Hohn
1dbd240b5b sarif-extract-tables: Form the codeFlows dataframe and write it out
One of the shorter multi-path results from
     cd ~/local/sarif-cli/data/treeio
     ../../bin/sarif-results-summary -r results.sarif |less
follows; the dataframe formed here starts with the codeFlows-containing table 9699
and has the content of the PATH * output below.

    RESULT: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: [DOM text](1) is reinte
    rpreted as HTML without escaping meta-characters.
    [DOM text](2) is reinterpreted as HTML without escaping meta-characters.
    [DOM text](3) is reinterpreted as HTML without escaping meta-characters.
    REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:90:17:90:27: DOM text
    REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:91:17:91:28: DOM text
    REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:92:17:92:31: DOM text
    PATH 0
    FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:90:17:90:27: name.val()
    FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>"  ... "</tr>
    "
    PATH 1
    FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:91:17:91:28: email.val()
    FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>"  ... "</tr>"
    PATH 2
    FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:92:17:92:31: password.val()
    FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>"  ... "</tr>"
2022-02-22 16:50:44 -08:00
Michael Hohn
ad738abed3 sarif-extract-tables: also output relatedLocations table
With --related-locations,

    ../../bin/sarif-results-summary -r results.sarif

produces the details

    RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:
    72:722:73: Character ''' is repeated [here](1) in the same character class.
    Character ''' is repeated [here](2) in the same character class.
    Character ''' is repeated [here](3) in the same character class.
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:74:722:75: here
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:76:722:77: here
    REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:78:722:79: here

Via
    ../../bin/sarif-extract-tables results.sarif tables

sarif-extract-tables now produces two output tables,

    tables/
    ├── messages.csv
    └── relatedLocations.csv

that contain the relevant information and can be joined or otherwise combined on
the struct_id_4055 key.

For example, adding to the end of sarif-extract-tables:
    import IPython
    IPython.embed()

    msg = d2[d2.message.str.startswith("Character ''' is repeated [here]")]
    dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]

    In [24]: msg
    Out[24]:
         struct_id_4055  ...                                            message
    180      4796917312  ...  Character ''' is repeated [here](1) in the sam...

    [1 rows x 7 columns]

    In [25]: dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
    Out[25]:
         struct_id_4055                                                uri  startLine  startColumn  endLine  endColumn message
    180      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           74      722         75    here
    181      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           76      722         77    here
    182      4796917312  static/js/tinymce/jscripts/tiny_mce/plugins/pa...        722           78      722         79    here

or manually from the shell:

    # pick up the struct_id_4055:
    0:$ grep "static.*Character ''' is repeated \[here\]" tables/messages.csv
    180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,72,722,73,"Character ''' is repeated [here](1) in the same character class.

    # and find relatedLocations:
    0:$ grep 4927448704 tables/relatedLocations.csv
    180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,74,722,75,here
    181,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,76,722,77,here
    182,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,78,722,79,here

Changes:
- Introduce scli-dyys, a random id string for later identification and removal of
  dummy table rows.

- Keep the struct_id_4055 column to join tables as needed.

- Output is now written to a directory as there are always multiple files.
2022-02-16 17:03:58 -08:00
Michael Hohn
ec9a0b5590 sarif-extract-tables: initial version, reproduces known output as table
Reproduce the

    file:line:col:line:col: message

output from

    ../../bin/sarif-results-summary results.sarif | grep size

as test/example.

Original sample output is

    RESULT: static/js/fileuploader.js:1214:13:1214:17: Unused variable size.
    RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js:438:30:438:34: Unused variable size.

The table result here is

    0:$ ../../bin/sarif-extract-tables results.sarif | grep size
    0,static/js/fileuploader.js,1214,13,1214,17,Unused variable size.
    34,static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js,438,30,438,34,Unused variable size.
2022-02-08 20:04:28 -08:00
Michael Hohn
f5e73e90ba sarif-extract-tables: interim commit: first joins
These joins construct the table needed for sarif-results-summary output
2022-02-07 17:11:55 -08:00
Michael Hohn
f246f06d4e sarif-extract-tables: interim commit: form tables
Tables are now formed and kept in the Typegraph instance.
These will be tested using pandas operations to form one of the previous outputs.
2022-02-04 23:56:01 -08:00
Michael Hohn
7a517fa06c sarif-extract-tables: interim commit
Internal destructuring and array aggregration run, but need to be tested.
Tables need to be formed, and pandas selections/joins/etc. used for custom table output.
2022-02-04 14:44:55 -08:00
Michael Hohn
cf8096446b sarif-to-dot: cleanup for and preparation for sarif table extraction 2022-02-01 22:42:25 -08:00
Michael Hohn
80b22001ce sarif-to-dot: make signature names order-independent
To create entire subtrees conforming to a signature, first make the
signature names order-independent.  Use hashes to name the signatures.
2022-01-27 17:53:14 -08:00
Michael Hohn
153eba8346 sarif-to-dot: to reduce graph clutter, add option --no-edges-to-scalars 2022-01-26 00:41:31 -08:00
Michael Hohn
b816705574 sarif-to-dot: add --fill-structure option and initial library support
This collapses the rightmost column of the signature output from

    ../../bin/sarif-to-dot -u -t -d -f results.sarif | dot -Tpdf

which has multiple distinct entries

 ('Struct030', ('struct', ('endColumn', 'Int'), ('startLine', 'Int'))),
 ( 'Struct016',
    ( 'struct',
      ('endColumn', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),
 ( 'Struct025',
    ( 'struct',
      ('endColumn', 'Int'),
      ('endLine', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),
 ('Struct030', ('struct', ('endColumn', 'Int'), ('startLine', 'Int'))),

to a single entry,

  ( 'Struct005',
    ( 'struct',
      ('endColumn', 'Int'),
      ('endLine', 'Int'),
      ('startColumn', 'Int'),
      ('startLine', 'Int'))),

when using

    ../../bin/sarif-to-dot results.sarif -u -t -f
2022-01-25 23:18:20 -08:00
Michael Hohn
edfe1f3363 sarif-to-dot: move signature functions into their own module 2022-01-25 17:57:44 -08:00
Michael Hohn
0444a87076 sarif-to-dot: remove module-variable references 2022-01-25 17:49:07 -08:00
Michael Hohn
86caa3f56f sarif-to-dot: small renaming 2022-01-24 17:24:30 -08:00
Michael Hohn
939ba9bd8a sarif-to-dot: output array signatures as nodes, not edges; fix raise statements 2022-01-20 18:09:45 -08:00
Michael Hohn
cef9b47b58 sarif-to-dot: produce dot output using -d option
The command
   ../../bin/sarif-to-dot results.sarif -u -t -d | dot -Tpdf > raw-nested-types.pdf
produces a good illustration of the problems arising when optional values are absent.
To clean this up, structures missing fields have to be supplemented with those fields,
from right to left in the graph.
This is basically what sarif-results-summary does on the fly, it just has to be applied
to the input tree before collecting the signatures and producing this graph.
Once that is done, the types collected here can be used in SQL table export.
2022-01-16 14:21:23 -08:00
Michael Hohn
d64b100101 sarif-to-dot: move processing code to the end 2022-01-16 01:39:24 -08:00