Before, the query_id was
==> results.csv <==
query_id STRING, -- git commit id of the ql query set
now, it's
query_id STRING, -- @id from the CodeQL query
Force all column types to ensure appropriate formatting for writing. In
particular, no character data in place of integers, no floats, no
objects in place of strings.
Table formation for the functions
- st.joins_for_results
- st.joins_for_scans
- st.joins_for_projects
enforces types.
The new base tables produced by `sarif-extract-multi` are
artifacts
codeflows
kind_pathproblem
kind_problem
project
relatedLocations
rules
The revised table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb
The file notes/typegraph-multi-with-tables.pdf illustrates what original (sarif)
tables are used to form the base (derived) tables.
The table overview is in the jupyter notebook
scripts/multi-table-overview.ipynb and makes use of some formatting
customizations to actually get an overview.
The initial `projects` table had far too many entries; the `rules` part
is now in a separate `rules` table.
This command introduces a new tree structure that pulls in a collection
of sarif files. In yaml format, an example is
- creation_date: '2021-12-09' # Repository creation date
primary_language: javascript # By lines of code
project_name: treeio/treeio # Repo name-short name
query_commit_id: fa9571646c # Commit id for custom (non-library) queries
sarif_content: {} # The sarif content will be attached here
sarif_file_name: 2021-12-09/results.sarif # Path to sarif file
scan_start_date: '2021-12-09' # Beginning date/time of scan
scan_stop_date: '2021-12-10' # End date/time of scan
tool_name: codeql
tool_version: v1.27
- creation_date: '2022-02-25'
primary_language: javascript
...
At run time,
cd ~/local/sarif-cli/data/treeio
sarif-extract-multi multi-sarif-01.json test-multi-table
will load the specified sarif files and put them in place of
`sarif_content`, then build tables against the new signature found in
sarif_cli/signature_multi.py, and merge those into 6 larger tables. The
exported tables are
artifacts.csv path-problem.csv project.csv
codeflows.csv problem.csv related-locations.csv
and they have join keys for further operations.
The new typegraph is rendered in
notes/typegraph-multi.pdf
using the instructions in
sarif_cli/signature_multi.py
With --related-locations,
../../bin/sarif-results-summary -r results.sarif
produces the details
RESULT: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:
72:722:73: Character ''' is repeated [here](1) in the same character class.
Character ''' is repeated [here](2) in the same character class.
Character ''' is repeated [here](3) in the same character class.
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:74:722:75: here
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:76:722:77: here
REFERENCE: static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js:722:78:722:79: here
Via
../../bin/sarif-extract-tables results.sarif tables
sarif-extract-tables now produces two output tables,
tables/
├── messages.csv
└── relatedLocations.csv
that contain the relevant information and can be joined or otherwise combined on
the struct_id_4055 key.
For example, adding to the end of sarif-extract-tables:
import IPython
IPython.embed()
msg = d2[d2.message.str.startswith("Character ''' is repeated [here]")]
dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
In [24]: msg
Out[24]:
struct_id_4055 ... message
180 4796917312 ... Character ''' is repeated [here](1) in the sam...
[1 rows x 7 columns]
In [25]: dr3[dr3.struct_id_4055 == msg.struct_id_4055.values[0]]
Out[25]:
struct_id_4055 uri startLine startColumn endLine endColumn message
180 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 74 722 75 here
181 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 76 722 77 here
182 4796917312 static/js/tinymce/jscripts/tiny_mce/plugins/pa... 722 78 722 79 here
or manually from the shell:
# pick up the struct_id_4055:
0:$ grep "static.*Character ''' is repeated \[here\]" tables/messages.csv
180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,72,722,73,"Character ''' is repeated [here](1) in the same character class.
# and find relatedLocations:
0:$ grep 4927448704 tables/relatedLocations.csv
180,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,74,722,75,here
181,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,76,722,77,here
182,4927448704,static/js/tinymce/jscripts/tiny_mce/plugins/paste/editor_plugin_src.js,722,78,722,79,here
Changes:
- Introduce scli-dyys, a random id string for later identification and removal of
dummy table rows.
- Keep the struct_id_4055 column to join tables as needed.
- Output is now written to a directory as there are always multiple files.
Internal destructuring and array aggregration run, but need to be tested.
Tables need to be formed, and pandas selections/joins/etc. used for custom table output.
Common to all:
| ('locations', 'Array008') |
| ('message', 'Struct009') |
| ('partialFingerprints', 'Struct010') |
| ('rule', 'Struct011') |
| ('ruleId', 'String'), |
| ('ruleIndex', 'Int'))) |
Only some problems and flow problems have
| ('relatedLocations', 'Array014') |
Add dummy value for relatedLocations to reduce to two result categories,
@kind flow problem and @kind problem.
An occasional output from LGTM is
{"code":404,"error":"The specified analysis could not be found"}
With this patch, the csv output is now
"ERROR","invalid json contents %s","some-file.json"
and the plain text output becomes
ERROR: invalid json contents in some-file.json
Problem:
The
artifact = get(related_location, 'physicalLocation', 'artifactLocation')
requested by
message, artifact, region = S.get_relatedlocation_message_info(location)
is incomplete:
ipdb> p related_location
{'message': {'text': 'request'}}
Fix:
Introduce the NoFile class to propagate this and handle it where needed.
Now simply report <NoFile> as appropriate.
For plain text output:
RESULT: src/optionsparser/ ..
FLOW STEP 0: <NoFile>: request
FLOW STEP 1: <NoFile>: request_mp
FLOW STEP 2: src/....
For csv output:
"result","src/optionsparser/...","116","26","116","34","`& ...` used as ..."
"flow_step","0","<NoFile>","-1","-1","-1","-1","request"
"flow_step","1","<NoFile>","-1","-1","-1","-1","request_mp"
"flow_step","2","src/foo.cpp","119","97","119","104","request"
Region / line / column information are present in most messages. The one that
caused this error refers to the whole file:
ipdb> p sarif_struct
{'ruleId': 'com.lgtm/cpp-queries:cpp/missing-header-guard', 'ruleIndex': 12,
'message': {'text': 'This header file should contain a header guard to prevent
multiple inclusion.'}, 'locations': [{'physicalLocation': {'artifactLocation':
{'uri': 'diff/cmpbuf.h', 'uriBaseId': '%SRCROOT%', 'index': 13}}}],
'partialFingerprints': {'primaryLocationLineHash': 'd04cb834fa64727d:1',
'primaryLocationStartColumnFingerprint': '0'}}
The goal is fixed-structure output formatting, so whole-file output uses
-1,-1,-1,-1 for line, column information.
When using
: with open(fname, 'r') as file:
hits the accented letter á in Vrána in the file
: data/wxWidgets-small/src/stc/scintilla/lexers/LexCSS.cxx
it results in a
: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 119: invalid continuation byte
We are reading source code, so we likely don't care about dropping non-ascii; using
: with codecs.open(fname, 'r', encoding="latin-1") as file:
ignores this problem.