The new base tables produced by `sarif-extract-multi` are
artifacts
codeflows
kind_pathproblem
kind_problem
project
relatedLocations
rules
The revised table overview is in the Jupyter notebook
scripts/multi-table-overview.ipynb
The file notes/typegraph-multi-with-tables.pdf illustrates which original (sarif)
tables are used to form the base (derived) tables.
The table overview is in the Jupyter notebook
scripts/multi-table-overview.ipynb, which uses some formatting
customizations to make the overview readable.
The initial `projects` table had far too many entries; the `rules` part
is now in a separate `rules` table.
This command introduces a new tree structure that pulls in a collection
of sarif files. In YAML format, an example is
- creation_date: '2021-12-09' # Repository creation date
primary_language: javascript # By lines of code
project_name: treeio/treeio # Repo name-short name
query_commit_id: fa9571646c # Commit id for custom (non-library) queries
sarif_content: {} # The sarif content will be attached here
sarif_file_name: 2021-12-09/results.sarif # Path to sarif file
scan_start_date: '2021-12-09' # Beginning date/time of scan
scan_stop_date: '2021-12-10' # End date/time of scan
tool_name: codeql
tool_version: v1.27
- creation_date: '2022-02-25'
primary_language: javascript
...
At run time,
cd ~/local/sarif-cli/data/treeio
sarif-extract-multi multi-sarif-01.json test-multi-table
will load the specified sarif files and put them in place of
`sarif_content`, then build tables against the new signature found in
sarif_cli/signature_multi.py, and merge those into 6 larger tables. The
exported tables are
artifacts.csv path-problem.csv project.csv
codeflows.csv problem.csv related-locations.csv
and they have join keys for further operations.
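The loading step described above can be sketched as follows. This is a minimal illustration and not the code in sarif-extract-multi: the function name `attach_sarif_content` and the `base_dir` parameter are hypothetical; only the `sarif_file_name` and `sarif_content` keys come from the example above.

```python
import json
from pathlib import Path


def attach_sarif_content(entries, base_dir="."):
    """Return copies of `entries` with each sarif file loaded into `sarif_content`.

    Hypothetical helper. Assumes every entry is a dict whose
    `sarif_file_name` is a path relative to `base_dir`.
    """
    loaded = []
    for entry in entries:
        path = Path(base_dir) / entry["sarif_file_name"]
        with open(path, encoding="utf-8") as fh:
            content = json.load(fh)
        # Copy the entry so the caller's list is left untouched.
        loaded.append({**entry, "sarif_content": content})
    return loaded
```

Returning copies keeps the original list reusable; the actual command may well update the tree in place.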
The new typegraph is rendered in
notes/typegraph-multi.pdf
using the instructions in
sarif_cli/signature_multi.py
With the addition of the path-problem output, include both as sources (left joins)
for relatedLocations:
pd.concat([sf(4055)[['relatedLocations', 'struct_id']],
sf(9699)[['relatedLocations', 'struct_id']]])
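A self-contained version of the concatenation above, for experimentation. The two frames are stand-ins for sf(4055) and sf(9699); their values and the extra columns are made up, and `ignore_index=True` is an addition of this sketch.

```python
import pandas as pd

# Stand-ins for sf(4055) (problem) and sf(9699) (path-problem).
# The real frames come from the signature-indexed tables.
problem = pd.DataFrame({"struct_id": [1, 2],
                        "relatedLocations": [10, 11],
                        "message": ["a", "b"]})
path_problem = pd.DataFrame({"struct_id": [3],
                             "relatedLocations": [12],
                             "codeFlows": [20]})

# Only the two shared columns are kept, so the frames stack cleanly.
cols = ["relatedLocations", "struct_id"]
sources = pd.concat([problem[cols], path_problem[cols]], ignore_index=True)
```

Selecting the shared columns before concatenating avoids NaN-filled columns for fields that exist in only one of the two sources.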
One of the shorter multi-path results from
cd ~/local/sarif-cli/data/treeio
../../bin/sarif-results-summary -r results.sarif |less
follows; the dataframe formed here starts with the codeFlows-containing table 9699
and has the content of the PATH * output below.
RESULT: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: [DOM text](1) is reinterpreted as HTML without escaping meta-characters.
[DOM text](2) is reinterpreted as HTML without escaping meta-characters.
[DOM text](3) is reinterpreted as HTML without escaping meta-characters.
REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:90:17:90:27: DOM text
REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:91:17:91:28: DOM text
REFERENCE: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:92:17:92:31: DOM text
PATH 0
FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:90:17:90:27: name.val()
FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>" ... "</tr>"
PATH 1
FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:91:17:91:28: email.val()
FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>" ... "</tr>"
PATH 2
FLOW STEP 0: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:92:17:92:31: password.val()
FLOW STEP 1: static/js/jquery-ui-1.10.3/demos/dialog/modal-form.html:89:35:93:14: "<tr>" ... "</tr>"
With --related-locations,
../../bin/sarif-results-summary -r results.sarif
produces the details
RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:72:722:73: Character ''' is repeated [here](1) in the same character class.
Character ''' is repeated [here](2) in the same character class.
Character ''' is repeated [here](3) in the same character class.
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:74:722:75: here
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:76:722:77: here
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:78:722:79: here
Via
../../bin/sarif-extract-tables results.sarif tables
sarif-extract-tables now produces two output tables,
tables/
├── messages.csv
└── relatedLocations.csv
that contain the relevant information and can be joined or otherwise combined on
the struct_id_4055 key.
For example, adding to the end of sarif-extract-tables:
import IPython
IPython.embed()
msg = d2[d2.message.str.startswith("Character ''' is repeated [here]")]
dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
In [24]: msg
Out[24]:
struct_id_4055 ... message
180 4796917312 ... Character ''' is repeated [here](1) in the sam...
[1 rows x 7 columns]
In [25]: dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
Out[25]:
struct_id_4055 uri startLine startColumn endLine endColumn message
180 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 74 722 75 here
181 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 76 722 77 here
182 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 78 722 79 here
or manually from the shell:
# pick up the struct_id_4055:
0:$ grep "static.*Character ''' is repeated \[here\]" tables/messages.csv
180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,72,722,73,"Character ''' is repeated [here](1) in the same character class.
# and find relatedLocations:
0:$ grep 4927448704 tables/relatedLocations.csv
180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,74,722,75,here
181,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,76,722,77,here
182,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,78,722,79,here
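The same lookup can be done as a pandas join on struct_id_4055. The sketch below builds the two frames inline from the rows shown above instead of reading the csv files, because the header layout of the files is not shown here; the column names are taken from the IPython output and the suffixes are arbitrary.

```python
import pandas as pd

columns = ["struct_id_4055", "uri", "startLine", "startColumn",
           "endLine", "endColumn", "message"]
uri = "static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js"

# One row from messages.csv and its three rows from relatedLocations.csv.
messages = pd.DataFrame(
    [[4927448704, uri, 722, 72, 722, 73,
      "Character ''' is repeated [here](1) in the same character class."]],
    columns=columns)
related = pd.DataFrame(
    [[4927448704, uri, 722, 74, 722, 75, "here"],
     [4927448704, uri, 722, 76, 722, 77, "here"],
     [4927448704, uri, 722, 78, 722, 79, "here"]],
    columns=columns)

# Inner join: each result row is repeated once per related location.
joined = messages.merge(related, on="struct_id_4055",
                        suffixes=("_result", "_related"))
```

Each row of `joined` carries the result region in the `_result` columns and one related location in the `_related` columns.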
Changes:
- Introduce scli-dyys, a random id string for later identification and removal of
dummy table rows.
- Keep the struct_id_4055 column to join tables as needed.
- Output is now written to a directory as there are always multiple files.
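Removing the dummy rows later could look like the sketch below. It assumes the marker shows up as a complete cell value in the affected rows, which is not confirmed by these notes; `drop_dummy_rows` is a hypothetical helper name.

```python
import pandas as pd

DUMMY_MARKER = "scli-dyys"  # marker string named in the change list above


def drop_dummy_rows(df, marker=DUMMY_MARKER):
    """Drop every row in which any cell equals `marker` (hypothetical helper)."""
    # Compare as strings so numeric columns do not interfere.
    is_dummy = df.astype(str).eq(marker).any(axis=1)
    return df[~is_dummy].reset_index(drop=True)


table = pd.DataFrame({"struct_id_4055": [1, 2, 3],
                      "message": ["here", "scli-dyys", "request"]})
cleaned = drop_dummy_rows(table)
```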
Reproduce the
file:line:col:line:col: message
output from
../../bin/sarif-results-summary results.sarif | grep size
as test/example.
Original sample output is
RESULT: static/js/fileuploader.js:1214:13:1214:17: Unused variable size.
RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js:438:30:438:34: Unused variable size.
The table result here is
0:$ ../../bin/sarif-extract-tables results.sarif | grep size
0,static/js/fileuploader.js,1214,13,1214,17,Unused variable size.
34,static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js,438,30,438,34,Unused variable size.
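Going the other way, the table rows can be turned back into the summary format. A minimal sketch: the column names are assumed (they are not shown in the csv output above) and `format_row` is a hypothetical helper.

```python
import pandas as pd

# Assumed column names; the rows are the two shown above.
rows = pd.DataFrame(
    [["static/js/fileuploader.js", 1214, 13, 1214, 17,
      "Unused variable size."],
     ["static/js/tinymce/jscripts/tiny_mce/plugins/media/js/media.js",
      438, 30, 438, 34, "Unused variable size."]],
    columns=["uri", "startLine", "startColumn",
             "endLine", "endColumn", "message"])


def format_row(row):
    """Format one table row as `RESULT: file:line:col:line:col: message`."""
    return (f"RESULT: {row.uri}:{row.startLine}:{row.startColumn}:"
            f"{row.endLine}:{row.endColumn}: {row.message}")


lines = [format_row(row) for row in rows.itertuples(index=False)]
```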
Internal destructuring and array aggregation run, but need to be tested.
Tables need to be formed, and pandas selections/joins/etc. used for custom table output.
The command
../../bin/sarif-to-dot results.sarif -u -t -d | dot -Tpdf > raw-nested-types.pdf
produces a good illustration of the problems arising when optional values are absent.
To clean this up, structures missing fields have to be supplemented with those fields,
from right to left in the graph.
This is basically what sarif-results-summary does on the fly; it just has to be applied
to the input tree before collecting the signatures and producing this graph.
Once that is done, the types collected here can be used in SQL table export.
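One way to supplement missing fields is a recursive fill against a template of the complete structure. This is a sketch of the idea only: `supplement`, the template, and its placeholder values are hypothetical and not taken from sarif-cli.

```python
import copy


def supplement(struct, template):
    """Return a copy of `struct` with fields missing relative to `template` added."""
    if isinstance(struct, dict) and isinstance(template, dict):
        filled = {}
        for key, value in struct.items():
            if key in template:
                filled[key] = supplement(value, template[key])
            else:
                filled[key] = copy.deepcopy(value)
        # Add whatever the template has and the input lacks.
        for key, default in template.items():
            if key not in struct:
                filled[key] = copy.deepcopy(default)
        return filled
    if isinstance(struct, list) and isinstance(template, list) and template:
        # The first template element describes every list item.
        return [supplement(item, template[0]) for item in struct]
    return copy.deepcopy(struct)


# Hypothetical template with placeholder values.
template = {"message": {"text": ""},
            "physicalLocation": {"artifactLocation": {"uri": ""},
                                 "region": {"startLine": -1,
                                            "startColumn": -1}}}
location = {"message": {"text": "request"}}
filled = supplement(location, template)
```

After this pass every structure of a given kind has the same set of fields, so the collected signatures no longer split on absent optional values.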
An occasional output from LGTM is
{"code":404,"error":"The specified analysis could not be found"}
With this patch, the csv output is now
"ERROR","invalid json contents %s","some-file.json"
and the plain text output becomes
ERROR: invalid json contents in some-file.json
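A sketch of how such input could be detected and reported. The two output lines are the ones shown above; the detection rule (valid JSON object with a top-level `runs` key) and the function names are assumptions of this sketch, not the patch itself.

```python
import csv
import io
import json


def error_line(file_name, as_csv=False):
    """Build the error line in csv or plain text form."""
    if as_csv:
        buf = io.StringIO()
        writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerow(["ERROR", "invalid json contents %s", file_name])
        return buf.getvalue().rstrip("\n")
    return "ERROR: invalid json contents in %s" % file_name


def check_sarif(text, file_name, as_csv=False):
    """Return (data, None) for usable input, else (None, error line)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None, error_line(file_name, as_csv)
    # Assumption: anything without a top-level "runs" key is not sarif.
    if not isinstance(data, dict) or "runs" not in data:
        return None, error_line(file_name, as_csv)
    return data, None
```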
Problem:
The
artifact = get(related_location, 'physicalLocation', 'artifactLocation')
requested by
message, artifact, region = S.get_relatedlocation_message_info(location)
is incomplete:
ipdb> p related_location
{'message': {'text': 'request'}}
Fix:
Introduce the NoFile class to propagate this and handle it where needed.
Now simply report <NoFile> as appropriate.
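A minimal sketch of the sentinel idea. Only the class name NoFile comes from the fix; `get_artifact` and `flow_step_text` are hypothetical helpers, and region output is left out to keep the example short.

```python
class NoFile:
    """Sentinel for a location that has no artifactLocation."""

    def __str__(self):
        return "<NoFile>"


def get_artifact(related_location):
    """Return the artifactLocation dict, or a NoFile instance if absent."""
    try:
        return related_location["physicalLocation"]["artifactLocation"]
    except KeyError:
        return NoFile()


def flow_step_text(index, related_location):
    """Format one flow step; the region part is omitted in this sketch."""
    artifact = get_artifact(related_location)
    message = related_location["message"]["text"]
    if isinstance(artifact, NoFile):
        return f"FLOW STEP {index}: {artifact}: {message}"
    return f"FLOW STEP {index}: {artifact['uri']}: {message}"
```

Passing a sentinel object instead of None lets each output routine decide how to render the missing file.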
For plain text output:
RESULT: src/optionsparser/ ..
FLOW STEP 0: <NoFile>: request
FLOW STEP 1: <NoFile>: request_mp
FLOW STEP 2: src/....
For csv output:
"result","src/optionsparser/...","116","26","116","34","`& ...` used as ..."
"flow_step","0","<NoFile>","-1","-1","-1","-1","request"
"flow_step","1","<NoFile>","-1","-1","-1","-1","request_mp"
"flow_step","2","src/foo.cpp","119","97","119","104","request"
Region / line / column information is present in most messages. The one that
caused this error refers to the whole file:
ipdb> p sarif_struct
{'ruleId': 'com.lgtm/cpp-queries:cpp/missing-header-guard', 'ruleIndex': 12,
'message': {'text': 'This header file should contain a header guard to prevent
multiple inclusion.'}, 'locations': [{'physicalLocation': {'artifactLocation':
{'uri': 'diff/cmpbuf.h', 'uriBaseId': '%SRCROOT%', 'index': 13}}}],
'partialFingerprints': {'primaryLocationLineHash': 'd04cb834fa64727d:1',
'primaryLocationStartColumnFingerprint': '0'}}
The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
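The fixed-width fallback can be sketched as below. `region_columns` is a hypothetical helper; filling individual missing keys with -1 as well is a choice of this sketch.

```python
REGION_KEYS = ("startLine", "startColumn", "endLine", "endColumn")


def region_columns(location):
    """Return the four region values, using -1 for anything missing."""
    region = location.get("physicalLocation", {}).get("region")
    if region is None:
        # Whole-file result: no region at all.
        return (-1, -1, -1, -1)
    return tuple(region.get(key, -1) for key in REGION_KEYS)


# The location from the header-guard result above has no region.
whole_file = {"physicalLocation": {"artifactLocation": {
    "uri": "diff/cmpbuf.h", "uriBaseId": "%SRCROOT%", "index": 13}}}
columns = region_columns(whole_file)
```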
Using
sarif-results-summary -s data/linux-small data/torvalds_linux__2021-10-21_10_07_00__export.sarif |less
now underlines the indicated regions, e.g.
tools/cgroup/iocost_monitor.py:64:5:64:27: Normal methods should have 'self', rather than 'blkcg', as their first parameter.
def blkcg_name(blkcg):
^^^^^^^^^^^^^^^^^^^^^^
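The marker line for a single-line region can be computed as in this sketch. It assumes 1-based columns with an exclusive end column, which matches the example above (64:5:64:27 gives 22 carets), and a source line indented with four spaces rather than tabs; `underline` is a hypothetical helper.

```python
def underline(start_column, end_column):
    """Caret line for a single-line region (1-based, end exclusive)."""
    return " " * (start_column - 1) + "^" * (end_column - start_column)


# Assumed original indentation: column 5 implies four leading characters.
source_line = "    def blkcg_name(blkcg):"
marker = underline(5, 27)
```

Multi-line regions and tab-indented lines need extra handling that is not covered here.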