The new base tables produced by `sarif-extract-multi` are
artifacts
codeflows
kind_pathproblem
kind_problem
project
relatedLocations
rules
The revised table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb; it makes use of some formatting
customizations to produce a readable overview.
The file notes/typegraph-multi-with-tables.pdf illustrates which original (sarif)
tables are used to form the base (derived) tables.
The initial `projects` table had far too many entries; the `rules` part
is now in a separate `rules` table.
The `sarif-extract-multi` command introduces a new tree structure that pulls in
a collection of sarif files. In YAML form, an example is
- creation_date: '2021-12-09' # Repository creation date
primary_language: javascript # By lines of code
project_name: treeio/treeio # Repo name-short name
query_commit_id: fa9571646c # Commit id for custom (non-library) queries
sarif_content: {} # The sarif content will be attached here
sarif_file_name: 2021-12-09/results.sarif # Path to sarif file
scan_start_date: '2021-12-09' # Beginning date/time of scan
scan_stop_date: '2021-12-10' # End date/time of scan
tool_name: codeql
tool_version: v1.27
- creation_date: '2022-02-25'
primary_language: javascript
...
At run time,
cd ~/local/sarif-cli/data/treeio
sarif-extract-multi multi-sarif-01.json test-multi-table
will load the specified sarif files and put them in place of
`sarif_content`, then build tables against the new signature found in
sarif_cli/signature_multi.py, and merge those into 6 larger tables. The
exported tables are
artifacts.csv path-problem.csv project.csv
codeflows.csv problem.csv related-locations.csv
and they have join keys for further operations.
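Conceptually, the first step -- loading each sarif file into the entry's
`sarif_content` field -- amounts to the following sketch; this is not the code
in sarif-extract-multi, and the function name attach_sarif_content is an
illustrative assumption:

import json

def attach_sarif_content(spec_path):
    # Read the collection spec (a list of project entries as in the YAML
    # example above, here assumed to be stored as json, e.g. multi-sarif-01.json).
    with open(spec_path) as spec_file:
        entries = json.load(spec_file)
    # Replace the empty `sarif_content` placeholder with the parsed sarif file.
    for entry in entries:
        with open(entry["sarif_file_name"]) as sarif_file:
            entry["sarif_content"] = json.load(sarif_file)
    return entries
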
The new typegraph is rendered in
notes/typegraph-multi.pdf
using the instructions in
sarif_cli/signature_multi.py
With --related-locations,
../../bin/sarif-results-summary -r results.sarif
produces the details
RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:72:722:73: Character ''' is repeated [here](1) in the same character class.
Character ''' is repeated [here](2) in the same character class.
Character ''' is repeated [here](3) in the same character class.
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:74:722:75: here
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:76:722:77: here
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:78:722:79: here
Via
../../bin/sarif-extract-tables results.sarif tables
sarif-extract-tables now produces two output tables,
tables/
├── messages.csv
└── relatedLocations.csv
that contain the relevant information and can be joined or otherwise combined on
the struct_id_4055 key.
For example, adding
import IPython
IPython.embed()
to the end of sarif-extract-tables drops into an interactive session; there,
msg = d2[d2.message.str.startswith("Character ''' is repeated [here]")]
dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
select the message row and its related locations:
In [24]: msg
Out[24]:
struct_id_4055 ... message
180 4796917312 ... Character ''' is repeated [here](1) in the sam...
[1 rows x 7 columns]
In [25]: dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
Out[25]:
struct_id_4055 uri startLine startColumn endLine endColumn message
180 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 74 722 75 here
181 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 76 722 77 here
182 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 78 722 79 here
or manually from the shell:
# pick up the struct_id_4055:
0:$ grep "static.*Character ''' is repeated \[here\]" tables/messages.csv
180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,72,722,73,"Character ''' is repeated [here](1) in the same character class.
# and find relatedLocations:
0:$ grep 4927448704 tables/relatedLocations.csv
180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,74,722,75,here
181,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,76,722,77,here
182,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,78,722,79,here
Changes:
- Introduce scli-dyys, a random id string for later identification and removal of
dummy table rows (see the sketch after this list).
- Keep the struct_id_4055 column to join tables as needed.
- Output is now written to a directory as there are always multiple files.
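For instance, the dummy rows tagged with the scli-dyys id can be dropped again
after the joins; a minimal pandas sketch, where the exact marker value and the
helper name are illustrative and not the sarif-cli implementation:

import pandas as pd

SCLI_DUMMY_MARKER = "scli-dyys"  # random id string embedded in dummy rows

def drop_dummy_rows(table: pd.DataFrame) -> pd.DataFrame:
    # Remove every row that mentions the dummy marker in any column.
    is_dummy = (table.astype(str)
                .apply(lambda col: col.str.contains(SCLI_DUMMY_MARKER))
                .any(axis=1))
    return table[~is_dummy]
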
Internal destructuring and array aggregation run, but still need to be tested.
Tables need to be formed, and pandas selections/joins/etc. used for custom table output.
Common to all:
| ('locations', 'Array008') |
| ('message', 'Struct009') |
| ('partialFingerprints', 'Struct010') |
| ('rule', 'Struct011') |
| ('ruleId', 'String') |
| ('ruleIndex', 'Int') |
Only some problem and path-problem results have
| ('relatedLocations', 'Array014') |
Add a dummy value for relatedLocations to reduce the results to two
categories, @kind path-problem and @kind problem.
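A sketch of that normalization step, assuming results are handled as plain
dicts parsed from the sarif json; the shape and text of the dummy entry are
assumptions, not the exact values used by sarif-cli:

def add_dummy_related_locations(result):
    # Give every result a relatedLocations array so that all results fall
    # into one of the two fixed signatures.
    if "relatedLocations" not in result:
        result["relatedLocations"] = [
            {"message": {"text": "scli-dyys dummy related location"}}
        ]
    return result
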
An occasional output from LGTM is
{"code":404,"error":"The specified analysis could not be found"}
With this patch, the csv output is now
"ERROR","invalid json contents %s","some-file.json"
and the plain text output becomes
ERROR: invalid json contents in some-file.json
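Roughly, the guard behind this looks as follows; the function name, the
csv_output flag, and the exact detection of the LGTM error document are
illustrative assumptions:

import json
import sys

def load_sarif(fname, csv_output=False):
    # An LGTM export may be an error document such as
    # {"code":404,"error":"The specified analysis could not be found"}
    # instead of sarif, or not valid json at all.
    try:
        with open(fname) as f:
            content = json.load(f)
    except json.JSONDecodeError:
        content = None
    if content is None or "runs" not in content:
        if csv_output:
            # The message field keeps the literal %s, matching the csv output above.
            print('"ERROR","invalid json contents %s","{}"'.format(fname))
        else:
            print("ERROR: invalid json contents in {}".format(fname), file=sys.stderr)
        return None
    return content
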
Problem:
The lookup
artifact = get(related_location, 'physicalLocation', 'artifactLocation')
requested by
message, artifact, region = S.get_relatedlocation_message_info(location)
fails when the related location is incomplete, carrying only a message:
ipdb> p related_location
{'message': {'text': 'request'}}
Fix:
Introduce the NoFile class to propagate this and handle it where needed.
Now simply report <NoFile> as appropriate.
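A minimal sketch of such a sentinel and its use; the actual NoFile class in
sarif-cli may differ in detail:

class NoFile:
    """Sentinel for a related location that has no artifactLocation."""
    def __str__(self):
        return "<NoFile>"
    __repr__ = __str__

def get_artifact_uri(related_location):
    # Return the artifact uri, or the NoFile sentinel when the related
    # location only carries a message (as in the ipdb session above).
    try:
        return related_location["physicalLocation"]["artifactLocation"]["uri"]
    except KeyError:
        return NoFile()
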
For plain text output:
RESULT: src/optionsparser/ ..
FLOW STEP 0: <NoFile>: request
FLOW STEP 1: <NoFile>: request_mp
FLOW STEP 2: src/....
For csv output:
"result","src/optionsparser/...","116","26","116","34","`& ...` used as ..."
"flow_step","0","<NoFile>","-1","-1","-1","-1","request"
"flow_step","1","<NoFile>","-1","-1","-1","-1","request_mp"
"flow_step","2","src/foo.cpp","119","97","119","104","request"
Region / line / column information is present in most messages. The one that
caused this error refers to the whole file:
ipdb> p sarif_struct
{'ruleId': 'com.lgtm/cpp-queries:cpp/missing-header-guard', 'ruleIndex': 12,
'message': {'text': 'This header file should contain a header guard to prevent
multiple inclusion.'}, 'locations': [{'physicalLocation': {'artifactLocation':
{'uri': 'diff/cmpbuf.h', 'uriBaseId': '%SRCROOT%', 'index': 13}}}],
'partialFingerprints': {'primaryLocationLineHash': 'd04cb834fa64727d:1',
'primaryLocationStartColumnFingerprint': '0'}}
The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
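A sketch of how the fixed fields can be filled in; the helper name is
illustrative and not taken from sarif-cli:

def get_region_fields(physical_location):
    # Whole-file results, like the missing-header-guard message above,
    # carry no 'region'; report -1 for each field so the number and
    # meaning of the output columns stays fixed.
    region = physical_location.get("region", {})
    return (region.get("startLine", -1),
            region.get("startColumn", -1),
            region.get("endLine", -1),
            region.get("endColumn", -1))
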
Reading a file via
: with open(fname, 'r') as file:
fails on the accented letter á in Vrána in
: data/wxWidgets-small/src/stc/scintilla/lexers/LexCSS.cxx
with a
: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 119: invalid continuation byte
We are reading source code, so we likely don't care about exact non-ascii
handling; using
: with codecs.open(fname, 'r', encoding="latin-1") as file:
avoids the error, since latin-1 decodes every byte value.
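As a self-contained sketch of that fix (the helper name is not from sarif-cli):

import codecs

def read_source_lines(fname):
    # Source files occasionally contain non-UTF-8 bytes; 0xE1 here is 'á'
    # in a Latin-1 encoded file.  latin-1 maps every byte value to a
    # character, so decoding never raises UnicodeDecodeError.
    with codecs.open(fname, 'r', encoding="latin-1") as file:
        return file.readlines()
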
Using
sarif-results-summary -s data/linux-small data/torvalds_linux__2021-10-21_10_07_00__export.sarif |less
now underscores the indicated regions, e.g.
tools/cgroup/iocost_monitor.py:64:5:64:27: Normal methods should have 'self', rather than 'blkcg', as their first parameter.
def blkcg_name(blkcg):
^^^^^^^^^^^^^^^^^^^^^^
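The underlining itself can be built from the region's columns; a sketch, not
the actual sarif-results-summary code:

def underline_region(source_line, start_column, end_column):
    # Sarif columns are 1-based, and endColumn points one past the region.
    start = max(start_column - 1, 0)
    width = max(end_column - start_column, 1)
    return source_line.rstrip("\n") + "\n" + " " * start + "^" * width

# For the example above, underline_region("    def blkcg_name(blkcg):", 5, 27)
# returns the line followed by a second line of 22 carets under columns 5-26.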