Merge pull request #3953 from RasmusWL/python-more-call-graph-tracing

Approved by tausbn
This commit is contained in:
CodeQL CI
2020-09-07 17:34:14 +01:00
committed by GitHub
48 changed files with 2092 additions and 308 deletions

View File

@@ -0,0 +1,6 @@
# As described in https://github.com/psf/black/blob/master/docs/compatible_configs.md#flake8
# and https://black.readthedocs.io/en/stable/the_black_code_style.html#line-length
[flake8]
max-line-length = 88
select = C,E,F,W,B,B950
ignore = E203, E501, W503

View File

@@ -0,0 +1,13 @@
# Example DB
cg-trace-example-db/
# Test artifacts
tests/python-traces/
tests/cg-trace-test-db
# Artifact from building `pip install -e .`
src/cg_trace.egg-info/
projects/
venv/

View File

@@ -0,0 +1,6 @@
[settings]
multi_line_output = 3
include_trailing_comma = True
force_grid_wrap = 0
use_parentheses = True
line_length = 88

View File

@@ -4,14 +4,113 @@ also known as _call graph tracing_.
Execute a python program and for each call being made, record the call and callee. This allows us to compare call graph resolution from static analysis with actual data -- that is, can we statically determine the target of each actual call correctly.
This is still in the early stages, and currently only supports a very minimal working example (to show that this approach might work).
Using the call graph tracer incurs a heavy performance cost. Expect the program to take around 10x longer to execute.
The number of calls recorded varies a little from run to run. I have not been able to pinpoint why.
The next hurdle is being able to handle multiple calls on the same line, such as
- `foo(); bar()`
- `foo(bar())`
- `foo().bar()`
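To see why a line number alone cannot tell such calls apart, here is a small sketch (not part of the tracer) that uses the standard `dis` module to list the call instructions CPython emits for one of the one-line examples. Each call gets its own bytecode offset, which is the kind of value the `inst_index` field records. Opcode names differ between CPython versions (`CALL_FUNCTION` in older releases, `CALL` in newer ones), so the `startswith("CALL")` filter is an assumption that may need adjusting.

```python
import dis


def call_offsets(source):
    """Return the bytecode offsets of call instructions in a small snippet."""
    code = compile(source, "<example>", "exec")
    return [
        instr.offset
        for instr in dis.get_instructions(code)
        # Assumption: call opcodes start with "CALL" (CALL_FUNCTION, CALL_METHOD, CALL).
        if instr.opname.startswith("CALL")
    ]


offsets = call_offsets("foo(bar())")
print(offsets)  # one offset per call, both belonging to the same source line
```

The exact offsets depend on the interpreter version, so only their count and distinctness should be relied on.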
## How do I give it a spin?
Run the `recreate-db.sh` script to create the database `cg-trace-example-db`, which will include the `example/simple.xml` trace from executing the `example/simple.py` code. Then run the queries inside the `ql/` directory.
## Running against real projects
Currently it's possible to gather metrics from traced runs of the standard test suite of a few projects (defined in [projects.json](./projects.json)): `youtube-dl`, `wcwidth`, and `flask`.
To run against all projects, use
```bash
$ ./helper.sh all $(./helper.sh projects)
```
To view the results, use
```
$ head -n 100 projects/*/Metrics.txt
```
### Expanding set of projects
It should be fairly straightforward to expand the set of projects. Most projects use `tox` for running their tests against multiple python versions. I didn't look into any kind of integration, but have manually picked out the instructions required to get going.
As an example, compare the [`tox.ini`](https://github.com/pallets/flask/blob/21c3df31de4bc2f838c945bd37d185210d9bab1a/tox.ini) file from flask with the configuration
```json
"flask": {
    "repo": "https://github.com/pallets/flask.git",
    "sha": "21c3df31de4bc2f838c945bd37d185210d9bab1a",
    "module_command": "pytest -c /dev/null tests examples",
    "setup": [
        "pip install -r requirements/tests.txt",
        "pip install -q -e examples/tutorial[test]",
        "pip install -q -e examples/javascript[test]"
    ]
}
```
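As a rough illustration of how an entry like this is consumed, the sketch below turns it into the `cg-trace` invocation that `helper.sh` runs for its `trace` step. The `build_trace_command` helper and its `trace_dir` argument are hypothetical names introduced here; the real logic lives in the shell script and uses `jq`.

```python
import json
import shlex

# A trimmed copy of the flask entry shown above.
PROJECTS_JSON = """
{
    "flask": {
        "repo": "https://github.com/pallets/flask.git",
        "sha": "21c3df31de4bc2f838c945bd37d185210d9bab1a",
        "module_command": "pytest -c /dev/null tests examples",
        "setup": ["pip install -r requirements/tests.txt"]
    }
}
"""


def build_trace_command(projects, name, trace_dir):
    """Hypothetical helper: build the cg-trace argument list for one project."""
    entry = projects[name]
    return [
        "cg-trace",
        "--xml", f"{trace_dir}/trace.xml",
        # The module command is split into words, as the shell would do.
        "--module", *shlex.split(entry["module_command"]),
    ]


projects = json.loads(PROJECTS_JSON)
command = build_trace_command(projects, "flask", "projects/flask/traces")
print(" ".join(command))
```

Adding a project then amounts to finding the test command in its `tox.ini` and listing the `pip install` steps it needs under `setup`.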
## Local development
### Setup
1. Ensure you have at least Python 3.7
2. Create virtual environment `python3 -m venv venv` and activate it
3. Install dependencies `pip install --upgrade -r requirements.txt`
4. Install this codebase as an editable package `pip install -e .`
5. Set up your editor. If you're using VS Code, create a new project for this folder, and
   use these settings for correct autoformatting of code on save:
   ```
   {
       "python.pythonPath": "venv/bin/python",
       "python.linting.enabled": true,
       "python.linting.flake8Enabled": true,
       "python.formatting.provider": "black",
       "editor.formatOnSave": true,
       "[python]": {
           "editor.codeActionsOnSave": {
               "source.organizeImports": true
           }
       },
       "python.autoComplete.extraPaths": [
           "src"
       ]
   }
   ```
6. Enjoy writing code, and being able to run `cg-trace` on your command line :tada:
### Using it
After following setup instructions above, you should be able to reproduce the example trace by running
```
cg-trace --xml example/simple.xml example/simple.py
```
You can also run traces for all tests and build a database by running `tests/create-test-db.sh`. Then run the queries inside the `ql/` directory.
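The resulting XML can also be inspected without CodeQL. The sketch below assumes the trace layout seen in `example/simple.xml`: a `recorded_calls` element holding `recorded_call` entries, each with a `Call` and either a `PythonCallee` or an `ExternalCallee`. The `summarize` helper is a hypothetical name, and the embedded sample is abbreviated, with file paths shortened.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modelled on example/simple.xml (paths shortened).
SAMPLE = """
<root>
  <recorded_calls>
    <recorded_call>
      <Call><filename>example/simple.py</filename><linenum>7</linenum><inst_index>18</inst_index></Call>
      <PythonCallee><filename>example/simple.py</filename><linenum>1</linenum><funcname>foo</funcname></PythonCallee>
    </recorded_call>
    <recorded_call>
      <Call><filename>example/simple.py</filename><linenum>2</linenum><inst_index>4</inst_index></Call>
      <ExternalCallee><module>builtins</module><qualname>print</qualname><is_builtin>True</is_builtin></ExternalCallee>
    </recorded_call>
  </recorded_calls>
</root>
"""


def summarize(xml_text):
    """Hypothetical helper: yield (call site, callee) pairs for each recorded call."""
    root = ET.fromstring(xml_text.strip())
    for rc in root.iterfind("recorded_calls/recorded_call"):
        call = rc.find("Call")
        site = f"{call.findtext('filename')}:{call.findtext('linenum')}"
        python_callee = rc.find("PythonCallee")
        if python_callee is not None:
            callee = python_callee.findtext("funcname")
        else:
            external = rc.find("ExternalCallee")
            callee = f"{external.findtext('module')}.{external.findtext('qualname')}"
        yield site, callee


summary = list(summarize(SAMPLE))
for site, callee in summary:
    print(f"{site} --> {callee}")
```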
## Tracing Limitations
### Multi-threading
Should be possible by using [`threading.setprofile`](https://docs.python.org/3.8/library/threading.html#threading.setprofile), but that hasn't been done yet.
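A minimal sketch of what that could look like, and not something the tracer does today: `threading.setprofile` installs a profile function in every thread started after the call. The `hook` and `work` names are made up for the illustration.

```python
import threading

seen = []


def hook(frame, event, arg):
    # Record Python-level calls together with the thread they happen in.
    if event == "call":
        seen.append((threading.current_thread().name, frame.f_code.co_name))


def work():
    pass


threading.setprofile(hook)  # applies to threads started after this call
worker = threading.Thread(target=work, name="worker")
worker.start()
worker.join()
threading.setprofile(None)

print(("worker", "work") in seen)
```

The main thread is not affected by `threading.setprofile`, so it would still need `sys.setprofile`.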
### Code that uses `sys.setprofile`
Since that is our mechanism for recording calls, any code that uses `sys.setprofile` will not work together with the call-graph tracer.
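The conflict comes from there being only one profile function per thread, so whoever calls `sys.setprofile` last wins. A small sketch with two placeholder hooks, where `tracer_hook` stands in for the call-graph tracer and `program_hook` for the traced program (both names are hypothetical):

```python
import sys


def tracer_hook(frame, event, arg):
    """Stand-in for the profile function installed by the call-graph tracer."""


def program_hook(frame, event, arg):
    """Stand-in for a profile function installed by the traced program."""


sys.setprofile(tracer_hook)
before = sys.getprofile()

# The traced program installs its own hook, silently replacing the tracer's.
sys.setprofile(program_hook)
after = sys.getprofile()

sys.setprofile(None)  # clean up
print(before is tracer_hook, after is program_hook)
```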
### Class instantiation
Class instantiation does not always fire an event in the `sys.setprofile` function (nor in `sys.settrace`), so it is not always recorded. Example:
```
r = range(10)
```
when disassembled (`python -m dis <file>`):
```
9 48 LOAD_NAME 7 (range)
50 LOAD_CONST 5 (10)
52 CALL_FUNCTION 1
54 STORE_NAME 8 (r)
```
but no event :disappointed:
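This can be checked with a few lines of Python. The sketch below is an illustration rather than part of the tracer, and `c_calls_seen` is a made-up helper name: it collects the `c_call` profile events raised while a snippet runs. On the CPython versions I am aware of, calling the builtin function `len` is reported, while instantiating the `range` class is not.

```python
import sys


def c_calls_seen(func):
    """Return the names of C callables reported through `c_call` profile events."""
    names = []

    def hook(frame, event, arg):
        if event == "c_call":
            # For c_call events, `arg` is the C function object being called.
            names.append(getattr(arg, "__name__", repr(arg)))

    sys.setprofile(hook)
    try:
        func()
    finally:
        sys.setprofile(None)
    return names


def snippet():
    n = len("abc")  # builtin function call
    r = range(10)   # class instantiation, as in the example above
    return n, r


names = c_calls_seen(snippet)
print("len" in names, "range" in names)
```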

View File

@@ -1,222 +0,0 @@
#!/usr/bin/env python3
"""Call Graph tracing.
Execute a python program and for each call being made, record the call and callee. This
allows us to compare call graph resolution from static analysis with actual data -- that
is, can we statically determine the target of each actual call correctly.
If there is 100% code coverage from the Python execution, it would also be possible to
look at the precision of the call graph resolutions -- that is, do we expect a function to
be able to be called in a place where it is not? Currently not something we're looking at.
"""
# read: https://eli.thegreenplace.net/2012/03/23/python-internals-how-callables-work/
# TODO: Know that a call to a C-function was made. See
# https://docs.python.org/3/library/bdb.html#bdb.Bdb.trace_dispatch. Maybe use `lxml` as
# test
# For inspiration, look at these projects:
# - https://github.com/joerick/pyinstrument (capture call-stack every <n> ms for profiling)
# - https://github.com/gak/pycallgraph (display call-graph with graphviz after python execution)
import argparse
import bdb
from io import StringIO
import sys
import os
import dis
import dataclasses
import csv
import xml.etree.ElementTree as ET
# Copy-Paste and uncomment for interactive ipython sessions
# import IPython; IPython.embed(); sys.exit()
@dataclasses.dataclass(frozen=True)
class Call():
    """A call
    """
    filename: str
    linenum: int
    inst_index: int

    @classmethod
    def from_frame(cls, frame, debugger: bdb.Bdb):
        code = frame.f_code
        # Uncomment to see the bytecode
        # b = dis.Bytecode(frame.f_code, current_offset=frame.f_lasti)
        # print(b.dis(), file=sys.__stderr__)
        return cls(
            filename = debugger.canonic(code.co_filename),
            linenum = frame.f_lineno,
            inst_index = frame.f_lasti,
        )
@dataclasses.dataclass(frozen=True)
class Callee():
    """A callee (Function/Lambda/???)
    should (hopefully) be uniquely identified by its name and location (filename+line
    number)
    """
    funcname: str
    filename: str
    linenum: int

    @classmethod
    def from_frame(cls, frame, debugger: bdb.Bdb):
        code = frame.f_code
        return cls(
            funcname = code.co_name,
            filename = debugger.canonic(code.co_filename),
            linenum = frame.f_lineno,
        )
class CallGraphTracer(bdb.Bdb):
    """Tracer that records calls being made
    It would seem obvious that this should have extended `trace` library
    (https://docs.python.org/3/library/trace.html), but that part is not extensible --
    however, the basic debugger (bdb) is, and provides maybe a bit more help than just
    using `sys.settrace` directly.
    """
    recorded_calls: set

    def __init__(self):
        self.recorded_calls = set()
        super().__init__()

    def user_call(self, frame, argument_list):
        call = Call.from_frame(frame.f_back, self)
        callee = Callee.from_frame(frame, self)
        # _print(f'{call} -> {callee}')
        self.recorded_calls.add((call, callee))
################################################################################
# Export
################################################################################
class Exporter:
    @staticmethod
    def export(recorded_calls, outfile_path):
        raise NotImplementedError()

    @staticmethod
    def dataclass_to_dict(obj):
        d = dataclasses.asdict(obj)
        prefix = obj.__class__.__name__.lower()
        return {f"{prefix}_{key}": val for (key, val) in d.items()}


class CSVExporter(Exporter):
    @staticmethod
    def export(recorded_calls, outfile_path):
        with open(outfile_path, 'w', newline='') as csv_file:
            writer = None
            for (call, callee) in recorded_calls:
                data = {
                    **Exporter.dataclass_to_dict(call),
                    **Exporter.dataclass_to_dict(callee)
                }
                if writer is None:
                    writer = csv.DictWriter(csv_file, fieldnames=data.keys())
                    writer.writeheader()
                writer.writerow(data)
        print(f'output written to {outfile_path}')
        # embed(); sys.exit()


class XMLExporter(Exporter):
    @staticmethod
    def export(recorded_calls, outfile_path):
        root = ET.Element('root')
        for (call, callee) in recorded_calls:
            data = {
                **Exporter.dataclass_to_dict(call),
                **Exporter.dataclass_to_dict(callee)
            }
            rc = ET.SubElement(root, 'recorded_call')
            # this xml library only supports serializing attributes that have string values
            rc.attrib = {k: str(v) for k, v in data.items()}
        tree = ET.ElementTree(root)
        tree.write(outfile_path, encoding='utf-8')
################################################################################
# __main__
################################################################################
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--csv')
    parser.add_argument('--xml')
    parser.add_argument('progname', help='file to run as main program')
    parser.add_argument('arguments', nargs=argparse.REMAINDER,
                        help='arguments to the program')
    opts = parser.parse_args()

    # These details of setting up the program to be run are very much inspired by `trace`
    # from the standard library
    sys.argv = [opts.progname, *opts.arguments]
    sys.path[0] = os.path.dirname(opts.progname)
    with open(opts.progname) as fp:
        code = compile(fp.read(), opts.progname, 'exec')

    # try to emulate __main__ namespace as much as possible
    globs = {
        '__file__': opts.progname,
        '__name__': '__main__',
        '__package__': None,
        '__cached__': None,
    }

    real_stdout = sys.stdout
    real_stderr = sys.stderr
    captured_stdout = StringIO()
    sys.stdout = captured_stdout
    cgt = CallGraphTracer()
    cgt.run(code, globs, globs)
    sys.stdout = real_stdout

    if opts.csv:
        CSVExporter.export(cgt.recorded_calls, opts.csv)
    elif opts.xml:
        XMLExporter.export(cgt.recorded_calls, opts.xml)
    else:
        for (call, callee) in cgt.recorded_calls:
            print(f'{call} -> {callee}')

    print('--- captured stdout ---')
    print(captured_stdout.getvalue(), end='')

View File

@@ -1,6 +1,137 @@
<root>
<recorded_call call_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" call_linenum="7" call_inst_index="18" callee_funcname="foo" callee_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" callee_linenum="1" />
<recorded_call call_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" call_linenum="8" call_inst_index="24" callee_funcname="bar" callee_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" callee_linenum="4" />
<recorded_call call_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" call_linenum="10" call_inst_index="30" callee_funcname="foo" callee_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" callee_linenum="1" />
<recorded_call call_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" call_linenum="10" call_inst_index="36" callee_funcname="bar" callee_filename="/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py" callee_linenum="4" />
<info>
<cg_trace_version>0.0.2</cg_trace_version>
<args>--xml example/simple.xml example/simple.py</args>
<exit_status>completed</exit_status>
<elapsed>0.00 seconds</elapsed>
<utctimestamp>2020-07-22T12:14:02</utctimestamp>
</info>
<recorded_calls>
<recorded_call>
<Call>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>2</linenum>
<inst_index>4</inst_index>
<bytecode_expr>
<BytecodeCall>
<function>
<BytecodeVariableName>
<name>print</name>
</BytecodeVariableName>
</function>
</BytecodeCall>
</bytecode_expr>
</Call>
<ExternalCallee>
<module>builtins</module>
<qualname>print</qualname>
<is_builtin>True</is_builtin>
</ExternalCallee>
</recorded_call>
<recorded_call>
<Call>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>5</linenum>
<inst_index>4</inst_index>
<bytecode_expr>
<BytecodeCall>
<function>
<BytecodeVariableName>
<name>print</name>
</BytecodeVariableName>
</function>
</BytecodeCall>
</bytecode_expr>
</Call>
<ExternalCallee>
<module>builtins</module>
<qualname>print</qualname>
<is_builtin>True</is_builtin>
</ExternalCallee>
</recorded_call>
<recorded_call>
<Call>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>7</linenum>
<inst_index>18</inst_index>
<bytecode_expr>
<BytecodeCall>
<function>
<BytecodeVariableName>
<name>foo</name>
</BytecodeVariableName>
</function>
</BytecodeCall>
</bytecode_expr>
</Call>
<PythonCallee>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>1</linenum>
<funcname>foo</funcname>
</PythonCallee>
</recorded_call>
<recorded_call>
<Call>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>8</linenum>
<inst_index>24</inst_index>
<bytecode_expr>
<BytecodeCall>
<function>
<BytecodeVariableName>
<name>bar</name>
</BytecodeVariableName>
</function>
</BytecodeCall>
</bytecode_expr>
</Call>
<PythonCallee>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>4</linenum>
<funcname>bar</funcname>
</PythonCallee>
</recorded_call>
<recorded_call>
<Call>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>10</linenum>
<inst_index>30</inst_index>
<bytecode_expr>
<BytecodeCall>
<function>
<BytecodeVariableName>
<name>foo</name>
</BytecodeVariableName>
</function>
</BytecodeCall>
</bytecode_expr>
</Call>
<PythonCallee>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>1</linenum>
<funcname>foo</funcname>
</PythonCallee>
</recorded_call>
<recorded_call>
<Call>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>10</linenum>
<inst_index>36</inst_index>
<bytecode_expr>
<BytecodeCall>
<function>
<BytecodeVariableName>
<name>bar</name>
</BytecodeVariableName>
</function>
</BytecodeCall>
</bytecode_expr>
</Call>
<PythonCallee>
<filename>/home/rasmus/code/ql/python/tools/recorded-call-graph-metrics/example/simple.py</filename>
<linenum>4</linenum>
<funcname>bar</funcname>
</PythonCallee>
</recorded_call>
</recorded_calls>
</root>

View File

@@ -0,0 +1,191 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
PROJECTS_FILE="$SCRIPTDIR/projects.json"
METRICS_QUERY="ql/query/Metrics.ql"
PROJECTS_BASE_DIR="$SCRIPTDIR/projects"
repo_dir() {
echo "$PROJECTS_BASE_DIR/$1/repo"
}
venv_dir() {
echo "$PROJECTS_BASE_DIR/$1/venv"
}
trace_dir() {
echo "$PROJECTS_BASE_DIR/$1/traces"
}
db_path() {
echo "$PROJECTS_BASE_DIR/$1/$1-db"
}
query_result_base_path() {
echo "$PROJECTS_BASE_DIR/$1/$2"
}
help() {
echo -n """\
$0 help This message
$0 projects List projects
$0 repo <projects> Fetch repo for projects
$0 setup <projects> Perform setup steps for projects (install dependencies)
$0 trace <projects> Trace projects
$0 db <projects> Build databases for projects
$0 metrics <projects> Run $METRICS_QUERY on projects
$0 all <projects> Perform all the above steps for projects
"""
}
projects() {
jq -r 'keys[]' "$PROJECTS_FILE"
}
check_project_exists() {
if ! jq -e ".\"$1\"" "$PROJECTS_FILE" &>/dev/null; then
echo "ERROR: '$1' not a known project, see '$0 projects'"
exit 1
fi
}
repo() {
for project in $@; do
check_project_exists $project
echo "Cloning repo for '$project'"
REPO_DIR=$(repo_dir $project)
if [[ -d "$REPO_DIR" ]]; then
echo "Repo already cloned in $REPO_DIR"
continue;
fi
REPO_URL=$(jq -e -r ".\"$project\".repo" "$PROJECTS_FILE")
SHA=$(jq -e -r ".\"$project\".sha" "$PROJECTS_FILE")
mkdir -p "$REPO_DIR"
cd "$REPO_DIR"
git init
git remote add origin $REPO_URL
git fetch --depth 1 origin $SHA
git -c advice.detachedHead=False checkout FETCH_HEAD
done
}
setup() {
for project in $@; do
check_project_exists $project
echo "Setting up '$project'"
python3 -m venv $(venv_dir $project)
source $(venv_dir $project)/bin/activate
cd $(repo_dir $project)
pip install -e "$SCRIPTDIR"
IFS=$'\n'
setup_commands=($(jq -r ".\"$project\".setup[]" $PROJECTS_FILE))
unset IFS
for setup_command in "${setup_commands[@]}"; do
echo "Running '$setup_command'"
$setup_command
done
# deactivate venv again
deactivate
done
}
trace() {
for project in $@; do
check_project_exists $project
echo "Tracing '$project'"
source $(venv_dir $project)/bin/activate
REPO_DIR=$(repo_dir $project)
cd "$REPO_DIR"
rm -rf $(trace_dir $project)
mkdir -p $(trace_dir $project)
MODULE_COMMAND=$(jq -r ".\"$project\".module_command" $PROJECTS_FILE)
cg-trace --xml $(trace_dir $project)/trace.xml --module $MODULE_COMMAND
# deactivate venv again
deactivate
done
}
db() {
for project in $@; do
check_project_exists $project
echo "Creating CodeQL database for '$project'"
DB=$(db_path $project)
SRC=$(repo_dir $project)
PYTHON_EXTRACTOR=$(codeql resolve extractor --language=python)
# Source venv so we can extract dependencies
source $(venv_dir $project)/bin/activate
rm -rf "$DB"
codeql database init --source-root="$SRC" --language=python "$DB"
codeql database trace-command --working-dir="$SRC" "$DB" "$PYTHON_EXTRACTOR/tools/autobuild.sh"
codeql database index-files --language xml --include-extension .xml --working-dir="$(trace_dir $project)" "$DB"
codeql database finalize "$DB"
echo "Created database in '$DB'"
# deactivate venv again
deactivate
done
}
metrics() {
for project in $@; do
check_project_exists $project
echo "Running $METRICS_QUERY on '$project'"
RESULTS_BASE=$(query_result_base_path $project Metrics)
DB=$(db_path $project)
codeql query run "$SCRIPTDIR/$METRICS_QUERY" --database "$DB" --output "${RESULTS_BASE}.bqrs"
codeql bqrs decode "${RESULTS_BASE}.bqrs" --format text --output "${RESULTS_BASE}.txt"
echo "Results available in '${RESULTS_BASE}.txt'"
done
}
all() {
for project in $@; do
check_project_exists $project
repo $project
setup $project
trace $project
db $project
metrics $project
done
}
COMMAND=${1:-"help"}
if [[ $# -ge 2 ]]; then
shift
fi
$COMMAND $@

View File

@@ -0,0 +1,28 @@
{
"wcwidth": {
"repo": "https://github.com/jquast/wcwidth.git",
"sha": "b29897e5a1b403a0e36f7fc991614981cbc42475",
"module_command": "pytest -c /dev/null",
"setup": [
"pip install pytest"
]
},
"youtube-dl": {
"repo": "https://github.com/ytdl-org/youtube-dl.git",
"sha": "a115e07594ccb7749ca108c889978510c7df126e",
"module_command": "nose -v test --exclude test_download.py --exclude test_age_restriction.py --exclude test_subtitles.py --exclude test_write_annotations.py --exclude test_youtube_lists.py --exclude test_iqiyi_sdk_interpreter.py --exclude test_socks.py",
"setup": [
"pip install nose"
]
},
"flask": {
"repo": "https://github.com/pallets/flask.git",
"sha": "21c3df31de4bc2f838c945bd37d185210d9bab1a",
"module_command": "pytest -c /dev/null tests examples",
"setup": [
"pip install -r requirements/tests.txt",
"pip install -q -e examples/tutorial[test]",
"pip install -q -e examples/javascript[test]"
]
}
}

View File

@@ -1,9 +0,0 @@
import RecordedCalls
from ValidRecordedCall rc, Call call, Function callee, CallableValue calleeValue
where
call = rc.getCall() and
callee = rc.getCallee() and
calleeValue.getScope() = callee and
calleeValue.getACall() = call.getAFlowNode()
select call, "-->", callee

View File

@@ -1,36 +0,0 @@
import python
class RecordedCall extends XMLElement {
RecordedCall() { this.hasName("recorded_call") }
string call_filename() { result = this.getAttributeValue("call_filename") }
int call_linenum() { result = this.getAttributeValue("call_linenum").toInt() }
int call_inst_index() { result = this.getAttributeValue("call_inst_index").toInt() }
Call getCall() {
// TODO: handle calls spanning multiple lines
result.getLocation().hasLocationInfo(this.call_filename(), this.call_linenum(), _, _, _)
}
string callee_filename() { result = this.getAttributeValue("callee_filename") }
int callee_linenum() { result = this.getAttributeValue("callee_linenum").toInt() }
string callee_funcname() { result = this.getAttributeValue("callee_funcname") }
Function getCallee() {
result.getLocation().hasLocationInfo(this.callee_filename(), this.callee_linenum(), _, _, _)
}
}
/**
* Class of recorded calls where we can uniquely identify both the `call` and the `callee`.
*/
class ValidRecordedCall extends RecordedCall {
ValidRecordedCall() {
strictcount(this.getCall()) = 1 and
strictcount(this.getCallee()) = 1
}
}

View File

@@ -1,7 +0,0 @@
import RecordedCalls
from RecordedCall rc
where not rc instanceof ValidRecordedCall
select "Could not uniquely identify this recorded call (either call or callee was not uniquely identified)",
rc.call_filename(), rc.call_linenum(), rc.call_inst_index(), "-->", rc.callee_filename(),
rc.callee_linenum(), rc.callee_funcname()

View File

@@ -0,0 +1,73 @@
import python
abstract class XMLBytecodeExpr extends XMLElement { }
class XMLBytecodeConst extends XMLBytecodeExpr {
XMLBytecodeConst() { this.hasName("BytecodeConst") }
string get_value_data_raw() { result = this.getAChild("value").getTextValue() }
}
class XMLBytecodeVariableName extends XMLBytecodeExpr {
XMLBytecodeVariableName() { this.hasName("BytecodeVariableName") }
string get_name_data() { result = this.getAChild("name").getTextValue() }
}
class XMLBytecodeAttribute extends XMLBytecodeExpr {
XMLBytecodeAttribute() { this.hasName("BytecodeAttribute") }
string get_attr_name_data() { result = this.getAChild("attr_name").getTextValue() }
XMLBytecodeExpr get_object_data() { result.getParent() = this.getAChild("object") }
}
class XMLBytecodeSubscript extends XMLBytecodeExpr {
XMLBytecodeSubscript() { this.hasName("BytecodeSubscript") }
XMLBytecodeExpr get_key_data() { result.getParent() = this.getAChild("key") }
XMLBytecodeExpr get_object_data() { result.getParent() = this.getAChild("object") }
}
class XMLBytecodeTuple extends XMLBytecodeExpr {
XMLBytecodeTuple() { this.hasName("BytecodeTuple") }
XMLBytecodeExpr get_elements_data(int index) {
result = this.getAChild("elements").getChild(index)
}
}
class XMLBytecodeList extends XMLBytecodeExpr {
XMLBytecodeList() { this.hasName("BytecodeList") }
XMLBytecodeExpr get_elements_data(int index) {
result = this.getAChild("elements").getChild(index)
}
}
class XMLBytecodeCall extends XMLBytecodeExpr {
XMLBytecodeCall() { this.hasName("BytecodeCall") }
XMLBytecodeExpr get_function_data() { result.getParent() = this.getAChild("function") }
}
class XMLBytecodeUnknown extends XMLBytecodeExpr {
XMLBytecodeUnknown() { this.hasName("BytecodeUnknown") }
string get_opname_data() { result = this.getAChild("opname").getTextValue() }
}
class XMLBytecodeMakeFunction extends XMLBytecodeExpr {
XMLBytecodeMakeFunction() { this.hasName("BytecodeMakeFunction") }
XMLBytecodeExpr get_qualified_name_data() {
result.getParent() = this.getAChild("qualified_name")
}
}
class XMLSomethingInvolvingScaryBytecodeJump extends XMLBytecodeExpr {
XMLSomethingInvolvingScaryBytecodeJump() { this.hasName("SomethingInvolvingScaryBytecodeJump") }
string get_opname_data() { result = this.getAChild("opname").getTextValue() }
}

View File

@@ -0,0 +1,269 @@
import python
import semmle.python.types.Builtins
import semmle.python.objects.Callables
import lib.BytecodeExpr
/** The XML data for a recorded call (includes all data). */
class XMLRecordedCall extends XMLElement {
XMLRecordedCall() { this.hasName("recorded_call") }
/** Gets the XML data for the call. */
XMLCall getXMLCall() { result.getParent() = this }
/** Gets a call matching the recorded information. */
Call getACall() { result = this.getXMLCall().getACall() }
/** Gets the XML data for the callee. */
XMLCallee getXMLCallee() { result.getParent() = this }
/** Gets a python function matching the recorded information of the callee. */
Function getAPythonCallee() { result = this.getXMLCallee().(XMLPythonCallee).getACallee() }
/** Gets a builtin function matching the recorded information of the callee. */
Builtin getABuiltinCallee() { result = this.getXMLCallee().(XMLExternalCallee).getACallee() }
/** Get a different `XMLRecordedCall` with the same result-set for `getACall`. */
XMLRecordedCall getOtherWithSameSetOfCalls() {
// `result` is for a different bytecode instruction on the same line
not result.getXMLCall().get_inst_index_data() = this.getXMLCall().get_inst_index_data() and
result.getXMLCall().get_filename_data() = this.getXMLCall().get_filename_data() and
result.getXMLCall().get_linenum_data() = this.getXMLCall().get_linenum_data() and
// set of calls are equal
forall(Call call | call = this.getACall() or call = result.getACall() |
call = this.getACall() and call = result.getACall()
)
}
override string toString() {
exists(string path |
path =
any(File file | file.getAbsolutePath() = this.getXMLCall().get_filename_data())
.getRelativePath()
or
not exists(File file |
file.getAbsolutePath() = this.getXMLCall().get_filename_data() and
exists(file.getRelativePath())
) and
path = this.getXMLCall().get_filename_data()
|
result = this.getName() + ": " + path + ":" + this.getXMLCall().get_linenum_data()
)
}
}
/** The XML data for the call part of a recorded call. */
class XMLCall extends XMLElement {
XMLCall() { this.hasName("Call") }
string get_filename_data() { result = this.getAChild("filename").getTextValue() }
int get_linenum_data() { result = this.getAChild("linenum").getTextValue().toInt() }
int get_inst_index_data() { result = this.getAChild("inst_index").getTextValue().toInt() }
/** Gets a call that matches the recorded information. */
Call getACall() {
// TODO: do we handle calls spanning multiple lines?
this.matchBytecodeExpr(result, this.getAChild("bytecode_expr").getAChild())
}
/** Holds if `expr` can be fully matched with `bytecode`. */
private predicate matchBytecodeExpr(Expr expr, XMLBytecodeExpr bytecode) {
exists(Call parent_call, XMLBytecodeCall parent_bytecode_call |
parent_call
.getLocation()
.hasLocationInfo(this.get_filename_data(), this.get_linenum_data(), _, _, _) and
parent_call.getAChildNode*() = expr and
parent_bytecode_call.getParent() = this.getAChild("bytecode_expr") and
parent_bytecode_call.getAChild*() = bytecode
) and
(
expr.(Name).getId() = bytecode.(XMLBytecodeVariableName).get_name_data()
or
expr.(Attribute).getName() = bytecode.(XMLBytecodeAttribute).get_attr_name_data() and
matchBytecodeExpr(expr.(Attribute).getObject(),
bytecode.(XMLBytecodeAttribute).get_object_data())
or
matchBytecodeExpr(expr.(Call).getFunc(), bytecode.(XMLBytecodeCall).get_function_data())
//
// I considered allowing a partial match as well. That is, if the bytecode
// expression information only tells us `<unknown>.foo()`, and we find an AST
// expression that matches on `.foo()`, that is good enough.
//
// However, we cannot assume that all calls are recorded (such as `range(10)`),
// and we cannot assume that for all recorded calls there exists a corresponding
// AST call (such as for list-comprehensions).
//
// So allowing partial matches is not safe, since we might end up matching a
// recorded call not in the AST together with an unrecorded call visible in the
// AST.
)
}
}
/** The XML data for the callee part of a recorded call. */
abstract class XMLCallee extends XMLElement { }
/** The XML data for the callee part of a recorded call, when the callee is a Python function. */
class XMLPythonCallee extends XMLCallee {
XMLPythonCallee() { this.hasName("PythonCallee") }
string get_filename_data() { result = this.getAChild("filename").getTextValue() }
int get_linenum_data() { result = this.getAChild("linenum").getTextValue().toInt() }
string get_funcname_data() { result = this.getAChild("funcname").getTextValue() }
Function getACallee() {
result.getLocation().hasLocationInfo(this.get_filename_data(), this.get_linenum_data(), _, _, _)
or
// if the function has decorators, the call will be recorded as going to the line of the first decorator
result
.getADecorator()
.getLocation()
.hasLocationInfo(this.get_filename_data(), this.get_linenum_data(), _, _, _)
}
}
/** The XML data for the callee part of a recorded call, when the callee is a C function or builtin. */
class XMLExternalCallee extends XMLCallee {
XMLExternalCallee() { this.hasName("ExternalCallee") }
string get_module_data() { result = this.getAChild("module").getTextValue() }
string get_qualname_data() { result = this.getAChild("qualname").getTextValue() }
Builtin getACallee() {
exists(Builtin mod |
mod.isModule() and
mod.getName() = this.get_module_data()
|
result = traverse_qualname(mod, this.get_qualname_data())
)
}
}
/**
* Helper predicate. If parent = `builtins` and qualname = `list.append`, it will
* return the result of `builtins.list.append`.
*/
private Builtin traverse_qualname(Builtin parent, string qualname) {
not qualname = "__objclass__" and
not qualname.matches("%.%") and
result = parent.getMember(qualname)
or
qualname.matches("%.%") and
exists(string before_dot, string after_dot, Builtin intermediate_parent |
qualname = before_dot + "." + after_dot and
not before_dot = "__objclass__" and
intermediate_parent = parent.getMember(before_dot) and
result = traverse_qualname(intermediate_parent, after_dot)
)
}
/**
* Class of recorded calls where we can identify both the `call` and the `callee` uniquely.
*/
class IdentifiedRecordedCall extends XMLRecordedCall {
IdentifiedRecordedCall() {
strictcount(this.getACall()) = 1 and
(
strictcount(this.getAPythonCallee()) = 1
or
strictcount(this.getABuiltinCallee()) = 1
)
or
// Handle case where the same function is called multiple times in one line, for
// example `func(); func()`. This only works if:
// - all the callees for these calls are the same
// - all these calls were recorded
//
// without this `strictcount`, in the case `func(); func(); func()`, if 1 of the calls
// is not recorded, we would still mark the other two recorded calls as valid
// (which is not following the rules above). + 1 to count `this` as well.
strictcount(this.getACall()) = strictcount(this.getOtherWithSameSetOfCalls()) + 1 and
forex(XMLRecordedCall rc | rc = this.getOtherWithSameSetOfCalls() |
unique(Function f | f = this.getAPythonCallee()) =
unique(Function f | f = rc.getAPythonCallee())
or
unique(Builtin b | b = this.getABuiltinCallee()) =
unique(Builtin b | b = rc.getABuiltinCallee())
)
}
override string toString() {
exists(string callee_str |
exists(Function callee, string path | callee = this.getAPythonCallee() |
(
path = callee.getLocation().getFile().getRelativePath()
or
not exists(callee.getLocation().getFile().getRelativePath()) and
path = callee.getLocation().getFile().getAbsolutePath()
) and
callee_str =
callee.toString() + " (" + path + ":" + callee.getLocation().getStartLine() + ")"
)
or
callee_str = this.getABuiltinCallee().toString()
|
result = super.toString() + " --> " + callee_str
)
}
}
/**
* Class of recorded calls where we cannot identify both the `call` and the `callee` uniquely.
*/
class UnidentifiedRecordedCall extends XMLRecordedCall {
UnidentifiedRecordedCall() { not this instanceof IdentifiedRecordedCall }
}
/**
* Recorded calls made from outside the project folder, which can be ignored when evaluating
* call-graph quality.
*/
class IgnoredRecordedCall extends XMLRecordedCall {
IgnoredRecordedCall() {
not exists(
any(File file | file.getAbsolutePath() = this.getXMLCall().get_filename_data())
.getRelativePath()
)
}
}
/** Provides classes for call-graph resolution by using points-to. */
module PointsToBasedCallGraph {
/** An IdentifiedRecordedCall that can be resolved with points-to */
class ResolvableRecordedCall extends IdentifiedRecordedCall {
Value calleeValue;
ResolvableRecordedCall() {
exists(Call call, XMLCallee xmlCallee |
call = this.getACall() and
calleeValue.getACall() = call.getAFlowNode() and
xmlCallee = this.getXMLCallee() and
(
xmlCallee instanceof XMLPythonCallee and
(
// normal function
calleeValue.(PythonFunctionValue).getScope() = xmlCallee.(XMLPythonCallee).getACallee()
or
// class instantiation -- points-to says the call goes to the class
calleeValue.(ClassValue).lookup("__init__").(PythonFunctionValue).getScope() =
xmlCallee.(XMLPythonCallee).getACallee()
)
or
xmlCallee instanceof XMLExternalCallee and
calleeValue.(BuiltinFunctionObjectInternal).getBuiltin() =
xmlCallee.(XMLExternalCallee).getACallee()
or
xmlCallee instanceof XMLExternalCallee and
calleeValue.(BuiltinMethodObjectInternal).getBuiltin() =
xmlCallee.(XMLExternalCallee).getACallee()
)
)
}
Value getCalleeValue() { result = calleeValue }
}
}

View File

@@ -0,0 +1,17 @@
/**
* Metrics for evaluating how good we are at interpreting results from the cg_trace program.
* See Metrics.ql for call-graph quality metrics.
*/
import lib.RecordedCalls
from string text, float number, float ratio
where
exists(int all_rcs | all_rcs = count(XMLRecordedCall rc) and ratio = number / all_rcs |
text = "XMLRecordedCall" and number = all_rcs
or
text = "IdentifiedRecordedCall" and number = count(IdentifiedRecordedCall rc)
or
text = "UnidentifiedRecordedCall" and number = count(UnidentifiedRecordedCall rc)
)
select text, number, ratio * 100 + "%" as percent

View File

@@ -0,0 +1,56 @@
import lib.RecordedCalls
// column i is just used for sorting
from string text, float number, float ratio, int i
where
exists(int all_rcs | all_rcs = count(XMLRecordedCall rc) and ratio = number / all_rcs |
text = "XMLRecordedCall" and number = all_rcs and i = 0
or
text = "IgnoredRecordedCall" and number = count(IgnoredRecordedCall rc) and i = 1
or
text = "not IgnoredRecordedCall" and number = all_rcs - count(IgnoredRecordedCall rc) and i = 2
)
or
text = "----------" and
number = 0 and
ratio = 0 and
i = 10
or
exists(int all_not_ignored_rcs |
all_not_ignored_rcs = count(XMLRecordedCall rc | not rc instanceof IgnoredRecordedCall) and
ratio = number / all_not_ignored_rcs
|
text = "IdentifiedRecordedCall" and
number = count(IdentifiedRecordedCall rc | not rc instanceof IgnoredRecordedCall) and
i = 11
or
text = "UnidentifiedRecordedCall" and
number = count(UnidentifiedRecordedCall rc | not rc instanceof IgnoredRecordedCall) and
i = 12
)
or
text = "----------" and
number = 0 and
ratio = 0 and
i = 20
or
exists(int all_identified_rcs |
all_identified_rcs = count(IdentifiedRecordedCall rc | not rc instanceof IgnoredRecordedCall) and
ratio = number / all_identified_rcs
|
text = "points-to ResolvableRecordedCall" and
number =
count(PointsToBasedCallGraph::ResolvableRecordedCall rc |
not rc instanceof IgnoredRecordedCall
) and
i = 21
or
text = "points-to not ResolvableRecordedCall" and
number =
all_identified_rcs -
count(PointsToBasedCallGraph::ResolvableRecordedCall rc |
not rc instanceof IgnoredRecordedCall
) and
i = 22
)
select i, text, number, ratio * 100 + "%" as percent order by i

View File

@@ -0,0 +1,4 @@
import lib.RecordedCalls
from PointsToBasedCallGraph::ResolvableRecordedCall rc
select rc.getACall(), "-->", rc.getCalleeValue()

View File

@@ -0,0 +1,5 @@
import lib.RecordedCalls
from IdentifiedRecordedCall rc
where not rc instanceof PointsToBasedCallGraph::ResolvableRecordedCall
select rc, rc.getACall()

View File

@@ -0,0 +1,23 @@
import lib.RecordedCalls
from UnidentifiedRecordedCall rc, string reason
where
not rc instanceof IgnoredRecordedCall and
(
not exists(rc.getACall()) and
reason = "no call"
or
count(rc.getACall()) > 1 and
reason = "more than 1 call"
or
not exists(rc.getAPythonCallee()) and
not exists(rc.getABuiltinCallee()) and
reason = "no callee"
or
count(rc.getAPythonCallee()) > 1 and
reason = "more than 1 Python callee"
or
count(rc.getABuiltinCallee()) > 1 and
reason = "more than 1 Builtin callee"
)
select rc, reason

View File

@@ -0,0 +1,9 @@
import python
import lib.RecordedCalls
// Could be useful for deciding which new opcodes to support
from string op_name, int c
where
exists(XMLBytecodeUnknown unknown | unknown.get_opname_data() = op_name) and
c = count(XMLBytecodeUnknown unknown | unknown.get_opname_data() = op_name | 1)
select op_name, c order by c

View File

@@ -1,23 +0,0 @@
#!/bin/bash
set -e
set -x
DB="cg-trace-example-db"
SRC="example/"
XMLDIR="$SRC"
PYTHON_EXTRACTOR=$(codeql resolve extractor --language=python)
./cg_trace.py --xml example/simple.xml example/simple.py
rm -rf "$DB"
codeql database init --source-root="$SRC" --language=python "$DB"
codeql database trace-command --working-dir="$SRC" "$DB" "$PYTHON_EXTRACTOR/tools/autobuild.sh"
codeql database index-files --language xml --include-extension .xml --working-dir="$XMLDIR" "$DB"
codeql database finalize "$DB"
set +x
echo "Created database '$DB'"

View File

@@ -0,0 +1,5 @@
lxml
# dev
black
flake8
flake8-bugbear

View File

@@ -0,0 +1,14 @@
from setuptools import find_packages, setup
# using src/ folder as recommended in: https://blog.ionelmc.ro/2014/05/25/python-packaging/
setup(
name="cg_trace",
version="0.0.2", # Remember to update src/cg_trace/__init__.py
description="Call graph tracing",
packages=find_packages("src"),
package_dir={"": "src"},
install_requires=["lxml"],
entry_points={"console_scripts": ["cg-trace = cg_trace.main:main"]},
python_requires=">=3.7",
)

View File

@@ -0,0 +1,15 @@
import sys
__version__ = "0.0.2" # remember to update setup.py
# Since the virtual machine opcodes changed in 3.6, not going to attempt to support
# anything before that. Using dataclasses, which is a new feature in Python 3.7
MIN_PYTHON_VERSION = (3, 7)
MIN_PYTHON_VERSION_FORMATTED = ".".join(str(i) for i in MIN_PYTHON_VERSION)
if not sys.version_info[:2] >= MIN_PYTHON_VERSION:
sys.exit(
"You need at least Python {} to use 'cg_trace'".format(
MIN_PYTHON_VERSION_FORMATTED
)
)

View File

@@ -0,0 +1,5 @@
import sys
from cg_trace.main import main
sys.exit(main())

View File

@@ -0,0 +1,275 @@
import dataclasses
import dis
import logging
from dis import Instruction
from types import FrameType
from typing import Any, List
from cg_trace.settings import DEBUG, FAIL_ON_UNKNOWN_BYTECODE
from cg_trace.utils import better_compare_for_dataclass
LOGGER = logging.getLogger(__name__)
# See https://docs.python.org/3/library/dis.html#python-bytecode-instructions for
# details on the bytecode instructions
# TODO: read https://opensource.com/article/18/4/introduction-python-bytecode
class BytecodeExpr:
"""An expression reconstructed from Python bytecode
"""
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeConst(BytecodeExpr):
"""For LOAD_CONST"""
value: Any
def __str__(self):
return repr(self.value)
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeVariableName(BytecodeExpr):
name: str
def __str__(self):
return self.name
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeAttribute(BytecodeExpr):
attr_name: str
object: BytecodeExpr
def __str__(self):
return f"{self.object}.{self.attr_name}"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeSubscript(BytecodeExpr):
key: BytecodeExpr
object: BytecodeExpr
def __str__(self):
return f"{self.object}[{self.key}]"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeTuple(BytecodeExpr):
elements: List[BytecodeExpr]
def __str__(self):
elements_formatted = (
", ".join(str(e) for e in self.elements)
if len(self.elements) > 1
else f"{self.elements[0]},"
)
return f"({elements_formatted})"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeList(BytecodeExpr):
elements: List[BytecodeExpr]
def __str__(self):
elements_formatted = (
", ".join(str(e) for e in self.elements)
if len(self.elements) > 1
else f"{self.elements[0]},"
)
return f"[{elements_formatted}]"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeCall(BytecodeExpr):
function: BytecodeExpr
def __str__(self):
return f"{self.function}()"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeUnknown(BytecodeExpr):
opname: str
def __str__(self):
return f"<{self.opname}>"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class BytecodeMakeFunction(BytecodeExpr):
"""For MAKE_FUNCTION opcode"""
qualified_name: BytecodeExpr
def __str__(self):
return f"<MAKE_FUNCTION(qualified_name={self.qualified_name})>"
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class SomethingInvolvingScaryBytecodeJump(BytecodeExpr):
opname: str
def __str__(self):
return "<SomethingInvolvingScaryBytecodeJump>"
def expr_that_added_elem_to_stack(
instructions: List[Instruction], start_index: int, stack_pos: int
):
"""Backwards traverse instructions
Backwards traverse the instructions starting at `start_index` until we find the
instruction that added the element at stack position `stack_pos` (where 0 means top
of stack). For example, if the instructions are:
```
0: LOAD_GLOBAL 0 (func)
1: LOAD_CONST 1 (42)
2: CALL_FUNCTION 1
```
We can look for the function that is called by invoking this function with
`start_index = 1` and `stack_pos = 1`. It will see that `LOAD_CONST` added the top
element to the stack, and find that `LOAD_GLOBAL` was the instruction to add element
in stack position 1 to the stack -- so `expr_from_instruction(instructions, 0)` is
returned.
It is assumed that if `stack_pos == 0` then the instruction you are looking for is
the one at `instructions[start_index]`. This might not hold, in case of using `NOP`
instructions.
If any jump instruction is found, `SomethingInvolvingScaryBytecodeJump` is returned
immediately (since correctly processing the bytecode when faced with jumps is not
straightforward).
"""
if DEBUG:
LOGGER.debug(
f"expr_that_added_elem_to_stack start_index={start_index} stack_pos={stack_pos}"
)
assert stack_pos >= 0
for inst in reversed(instructions[: start_index + 1]):
# Return immediately if faced with a jump
if inst.opcode in dis.hasjabs or inst.opcode in dis.hasjrel:
return SomethingInvolvingScaryBytecodeJump(inst.opname)
if stack_pos == 0:
if DEBUG:
LOGGER.debug(f"Found it: {inst}")
found_index = instructions.index(inst)
break
old = stack_pos
stack_pos -= dis.stack_effect(inst.opcode, inst.arg)
new = stack_pos
if DEBUG:
LOGGER.debug(f"Skipping ({old} -> {new}) {inst}")
else:
raise Exception("expr_that_added_elem_to_stack failed")
return expr_from_instruction(instructions, found_index)
def expr_from_instruction(instructions: List[Instruction], index: int) -> BytecodeExpr:
inst = instructions[index]
if DEBUG:
LOGGER.debug(f"expr_from_instruction: {inst} index={index}")
if inst.opname in ["LOAD_GLOBAL", "LOAD_FAST", "LOAD_NAME", "LOAD_DEREF"]:
return BytecodeVariableName(inst.argval)
# elif inst.opname in ["LOAD_CONST"]:
# return BytecodeConst(inst.argval)
# https://docs.python.org/3/library/dis.html#opcode-LOAD_METHOD
# https://docs.python.org/3/library/dis.html#opcode-LOAD_ATTR
elif inst.opname in ["LOAD_METHOD", "LOAD_ATTR"]:
attr_name = inst.argval
obj_expr = expr_that_added_elem_to_stack(instructions, index - 1, 0)
return BytecodeAttribute(attr_name=attr_name, object=obj_expr)
# elif inst.opname in ["BINARY_SUBSCR"]:
# key_expr = expr_that_added_elem_to_stack(instructions, index - 1, 0)
# obj_expr = expr_that_added_elem_to_stack(instructions, index - 1, 1)
# return BytecodeSubscript(key=key_expr, object=obj_expr)
# elif inst.opname in ["BUILD_TUPLE", "BUILD_LIST"]:
# elements = []
# for i in range(inst.arg):
# element_expr = expr_that_added_elem_to_stack(instructions, index - 1, i)
# elements.append(element_expr)
# elements.reverse()
# klass = {"BUILD_TUPLE": BytecodeTuple, "BUILD_LIST": BytecodeList}[inst.opname]
# return klass(elements=elements)
# https://docs.python.org/3/library/dis.html#opcode-CALL_FUNCTION
elif inst.opname in [
"CALL_FUNCTION",
"CALL_METHOD",
"CALL_FUNCTION_KW",
"CALL_FUNCTION_EX",
]:
assert index > 0
assert isinstance(inst.arg, int)
if inst.opname in ["CALL_FUNCTION", "CALL_METHOD"]:
num_stack_elems = inst.arg
elif inst.opname == "CALL_FUNCTION_KW":
num_stack_elems = inst.arg + 1
elif inst.opname == "CALL_FUNCTION_EX":
# top of stack _can_ be keyword argument dictionary (indicated by lowest bit
# set), always followed by the positional arguments (also if there are not
# any).
num_stack_elems = (1 if inst.arg & 1 == 1 else 0) + 1
func_expr = expr_that_added_elem_to_stack(
instructions, index - 1, num_stack_elems
)
return BytecodeCall(function=func_expr)
# elif inst.opname in ["MAKE_FUNCTION"]:
# name_expr = expr_that_added_elem_to_stack(instructions, index - 1, 0)
# assert isinstance(name_expr, BytecodeConst)
# return BytecodeMakeFunction(qualified_name=name_expr)
# TODO: handle with statements (https://docs.python.org/3/library/dis.html#opcode-SETUP_WITH)
WITH_OPNAMES = ["SETUP_WITH", "WITH_CLEANUP_START", "WITH_CLEANUP_FINISH"]
# Special cases ignored for now:
#
# - LOAD_BUILD_CLASS: Called when constructing a class.
# - IMPORT_NAME: Observed to result in a call to filename='<frozen
# importlib._bootstrap>', linenum=389, funcname='parent'
if FAIL_ON_UNKNOWN_BYTECODE:
if inst.opname not in ["LOAD_BUILD_CLASS", "IMPORT_NAME"] + WITH_OPNAMES:
LOGGER.warning(
f"Don't know how to handle this type of instruction: {inst.opname}"
)
raise BaseException()
return BytecodeUnknown(inst.opname)
def expr_from_frame(frame: FrameType) -> BytecodeExpr:
bytecode = dis.Bytecode(frame.f_code, current_offset=frame.f_lasti)
if DEBUG:
LOGGER.debug(
f"{frame.f_code.co_filename}:{frame.f_lineno}: bytecode: \n{bytecode.dis()}"
)
instructions = list(iter(bytecode))
last_instruction_index = [inst.offset for inst in instructions].index(frame.f_lasti)
return expr_from_instruction(instructions, last_instruction_index)
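
The backwards stack walk done by `expr_that_added_elem_to_stack` can be illustrated with a minimal sketch. This is not the code from this change: the instruction list is a hand-written list of `(opname, stack_effect)` pairs (an assumption, so that the example does not depend on the bytecode of any particular Python version), and `index_of_producer` is a hypothetical helper name.

```python
from typing import List, Tuple

# Hypothetical stand-in for dis.Instruction: (opname, net stack effect).
# The effects are written out by hand instead of calling dis.stack_effect,
# so the sketch behaves the same on every Python version.
Instr = Tuple[str, int]


def index_of_producer(instructions: List[Instr], start_index: int, stack_pos: int) -> int:
    """Return the index of the instruction that pushed the element at `stack_pos`.

    `stack_pos` is counted from the top of the stack (0 is the top) as it looks
    right after `instructions[start_index]` has executed.
    """
    for index in range(start_index, -1, -1):
        if stack_pos == 0:
            return index
        _opname, effect = instructions[index]
        # Undo this instruction's effect on the stack depth and keep walking back.
        stack_pos -= effect
    raise ValueError("ran out of instructions before finding the producer")


# func(42): the callee sits one slot below the single argument.
simple_call = [("LOAD_GLOBAL func", +1), ("LOAD_CONST 42", +1), ("CALL_FUNCTION 1", -1)]
callee_index = index_of_producer(simple_call, start_index=1, stack_pos=1)

# foo(bar()): the inner call replaces `bar` with its result (net effect 0).
nested_call = [
    ("LOAD_GLOBAL foo", +1),
    ("LOAD_GLOBAL bar", +1),
    ("CALL_FUNCTION 0", 0),
    ("CALL_FUNCTION 1", -1),
]
outer_callee_index = index_of_producer(nested_call, start_index=2, stack_pos=1)
inner_callee_index = index_of_producer(nested_call, start_index=1, stack_pos=0)
```

In this sketch `callee_index` and `outer_callee_index` both point at the first instruction, while `inner_callee_index` points at the `LOAD_GLOBAL bar` entry. Jumps are deliberately left out, mirroring the early return on jump instructions in the real function.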

View File

@@ -0,0 +1,22 @@
import argparse
def parse(args):
parser = argparse.ArgumentParser()
parser.add_argument(
"--debug", action="store_true", default=False, help="Enable debug logging"
)
parser.add_argument("--xml")
parser.add_argument(
"--module", action="store_true", default=False, help="Trace a module"
)
parser.add_argument("progname", help="file to run as main program")
parser.add_argument(
"arguments", nargs=argparse.REMAINDER, help="arguments to the program"
)
return parser.parse_args(args)
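
As a hedged illustration of why `arguments` uses `argparse.REMAINDER`: everything after the program name is handed to the traced program untouched, even if it looks like an option. The sketch below is a reduced parser, with a hypothetical `build_parser` helper and made-up argument values, not the full `parse` function above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Reduced version of the parser above: one option, the program name,
    # and the remaining arguments that belong to the traced program.
    parser = argparse.ArgumentParser()
    parser.add_argument("--xml")
    parser.add_argument("progname")
    parser.add_argument("arguments", nargs=argparse.REMAINDER)
    return parser


# `--verbose` comes after the program name, so it is not parsed as an option
# of the tracer but collected into `arguments`.
opts = build_parser().parse_args(
    ["--xml", "out.xml", "prog.py", "--verbose", "input.txt"]
)
```

With this input, `opts.progname` is `"prog.py"` and `opts.arguments` is `["--verbose", "input.txt"]`, which is the list that `main` later splices into `sys.argv`.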

View File

@@ -0,0 +1,46 @@
import dataclasses
from typing import Dict
from lxml import etree
def dataclass_to_xml(obj, parent):
obj_elem = etree.SubElement(parent, obj.__class__.__name__)
for field in dataclasses.fields(obj):
field_elem = etree.SubElement(obj_elem, field.name)
value = getattr(obj, field.name)
if isinstance(value, (str, int)) or value is None:
field_elem.text = str(value)
elif isinstance(value, list):
for list_elem in value:
assert dataclasses.is_dataclass(list_elem)
dataclass_to_xml(list_elem, field_elem)
elif dataclasses.is_dataclass(value):
dataclass_to_xml(value, field_elem)
else:
raise ValueError(
f"Can't export key {field.name!r} with value {value!r} (type {type(value)})"
)
class XMLExporter:
@staticmethod
def export(outfile_path, recorded_calls, info: Dict[str, str]):
root = etree.Element("root")
info_elem = etree.SubElement(root, "info")
for k, v in info.items():
etree.SubElement(info_elem, k).text = v
rcs = etree.SubElement(root, "recorded_calls")
for (call, callee) in sorted(recorded_calls):
rc = etree.SubElement(rcs, "recorded_call")
dataclass_to_xml(call, rc)
dataclass_to_xml(callee, rc)
tree = etree.ElementTree(root)
tree.write(outfile_path, encoding="utf-8", pretty_print=True)
print(f"Wrote {len(recorded_calls)} recorded calls to {outfile_path}")
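
To show the XML shape that `dataclass_to_xml` produces, here is a small sketch under stated assumptions: it swaps `lxml` for the standard library's `xml.etree.ElementTree` so it runs without extra dependencies, and the `Name` and `Attr` dataclasses are hypothetical stand-ins for classes such as `BytecodeVariableName` and `BytecodeAttribute`.

```python
import dataclasses
import xml.etree.ElementTree as ET


@dataclasses.dataclass
class Name:  # hypothetical stand-in for BytecodeVariableName
    name: str


@dataclasses.dataclass
class Attr:  # hypothetical stand-in for BytecodeAttribute
    attr_name: str
    object: Name


def to_xml(obj, parent):
    # One element per dataclass, named after the class ...
    elem = ET.SubElement(parent, type(obj).__name__)
    for field in dataclasses.fields(obj):
        # ... and one child element per field, named after the field.
        field_elem = ET.SubElement(elem, field.name)
        value = getattr(obj, field.name)
        if dataclasses.is_dataclass(value):
            to_xml(value, field_elem)
        else:
            field_elem.text = str(value)
    return elem


root = ET.Element("root")
to_xml(Attr(attr_name="append", object=Name(name="l")), root)
xml_text = ET.tostring(root, encoding="unicode")
```

The nesting (class element, then field element, then nested class element) is what the generated QL accessors rely on when they call `getAChild("...")` on a class element.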

View File

@@ -0,0 +1,46 @@
import dataclasses
from typing import Any, List
from cg_trace.bytecode_reconstructor import BytecodeExpr
PREAMBLE = """\
import python
abstract class XMLBytecodeExpr extends XMLElement { }
"""
CLASS_PREAMBLE = """\
class XML{class_name} extends XMLBytecodeExpr {{
XML{class_name}() {{ this.hasName("{class_name}") }}
"""
CLASS_AFTER = """\
}
"""
ATTR_TEMPLATES = {
str: 'string get_{name}_data() {{ result = this.getAChild("{name}").getTextValue() }}',
int: 'int get_{name}_data() {{ result = this.getAChild("{name}").getTextValue().toInt() }}',
BytecodeExpr: 'XMLBytecodeExpr get_{name}_data() {{ result.getParent() = this.getAChild("{name}") }}',
List[
BytecodeExpr
]: 'XMLBytecodeExpr get_{name}_data(int index) {{ result = this.getAChild("{name}").getChild(index) }}',
Any: 'string get_{name}_data_raw() {{ result = this.getAChild("{name}").getTextValue() }}',
}
if __name__ == "__main__":
print(PREAMBLE)
for sc in BytecodeExpr.__subclasses__():
print(CLASS_PREAMBLE.format(class_name=sc.__name__))
for f in dataclasses.fields(sc):
field_template = ATTR_TEMPLATES.get(f.type)
if field_template:
generated = field_template.format(name=f.name)
print(f" {generated}")
else:
raise Exception("no template for", f.type)
print(CLASS_AFTER)
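
The generator above picks a QL accessor template based on each dataclass field's annotated type. A minimal sketch of that lookup, using a hypothetical `Example` dataclass and only the `str` and `int` templates, could look like this:

```python
import dataclasses

# Subset of the templates above; doubled braces are literal braces for str.format.
TEMPLATES = {
    str: 'string get_{name}_data() {{ result = this.getAChild("{name}").getTextValue() }}',
    int: 'int get_{name}_data() {{ result = this.getAChild("{name}").getTextValue().toInt() }}',
}


@dataclasses.dataclass
class Example:  # hypothetical class, only here to have two typed fields
    opname: str
    offset: int


def accessors(cls):
    # field.type is the annotation object itself (str, int, ...), which is why
    # it can be used directly as a dictionary key.
    return [TEMPLATES[f.type].format(name=f.name) for f in dataclasses.fields(cls)]


lines = accessors(Example)
```

Note that this lookup only works while annotations are real type objects; with `from __future__ import annotations` the field types would be strings and the dictionary keys would need to change accordingly.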

View File

@@ -0,0 +1,118 @@
import itertools
import logging
import os
import sys
import time
from datetime import datetime
from io import StringIO
from cg_trace import __version__, cmdline, settings, tracer
from cg_trace.exporter import XMLExporter
def record_calls(code, globals):
real_stdout = sys.stdout
real_stderr = sys.stderr
captured_stdout = StringIO()
captured_stderr = StringIO()
sys.stdout = captured_stdout
sys.stderr = captured_stderr
cgt = tracer.CallGraphTracer()
exit_status = cgt.run(code, globals, globals)
sys.stdout = real_stdout
sys.stderr = real_stderr
all_calls_sorted = sorted(
itertools.chain(cgt.python_calls.values(), cgt.external_calls.values())
)
return all_calls_sorted, captured_stdout, captured_stderr, exit_status
def setup_logging(debug):
# code we run can also set up logging, so we need to set the level directly on our
# own package
sh = logging.StreamHandler(stream=sys.stderr)
pkg_logger = logging.getLogger("cg_trace")
pkg_logger.addHandler(sh)
pkg_logger.setLevel(logging.DEBUG if debug else logging.INFO)
def main(args=None) -> int:
# from . import bytecode_reconstructor
# logging.getLogger(bytecode_reconstructor.__name__).setLevel(logging.INFO)
if args is None:
# first element in argv is program name
args = sys.argv[1:]
opts = cmdline.parse(args)
settings.DEBUG = opts.debug
setup_logging(opts.debug)
# These details of setting up the program to be run are very much inspired by `trace`
# from the standard library
if opts.module:
import runpy
module_name = opts.progname
_mod_name, mod_spec, code = runpy._get_module_details(module_name)
sys.argv = [code.co_filename, *opts.arguments]
globs = {
"__name__": "__main__",
"__file__": code.co_filename,
"__package__": mod_spec.parent,
"__loader__": mod_spec.loader,
"__spec__": mod_spec,
"__cached__": None,
}
else:
sys.argv = [opts.progname, *opts.arguments]
sys.path[0] = os.path.dirname(opts.progname)
with open(opts.progname) as fp:
code = compile(fp.read(), opts.progname, "exec")
# try to emulate __main__ namespace as much as possible
globs = {
"__file__": opts.progname,
"__name__": "__main__",
"__package__": None,
"__cached__": None,
}
start = time.time()
recorded_calls, captured_stdout, captured_stderr, exit_status = record_calls(
code, globs
)
end = time.time()
elapsed_formatted = f"{end-start:.2f} seconds"
if opts.xml:
XMLExporter.export(
opts.xml,
recorded_calls,
info={
"cg_trace_version": __version__,
"args": " ".join(args),
"exit_status": exit_status,
"elapsed": elapsed_formatted,
"utctimestamp": datetime.utcnow().replace(microsecond=0).isoformat(),
},
)
else:
print(f"--- Recorded calls (in {elapsed_formatted}) ---")
for (call, callee) in recorded_calls:
print(f"{call} --> {callee}")
print("--- captured stdout ---")
print(captured_stdout.getvalue(), end="")
print("--- captured stderr ---")
print(captured_stderr.getvalue(), end="")
return 0

View File

@@ -0,0 +1,6 @@
# Whether to run the call graph tracer with debugging enabled. Turning off
# `if DEBUG: LOGGER.debug()` code completely yielded massive performance improvements.
DEBUG = False
FAIL_ON_UNKNOWN_BYTECODE = False

View File

@@ -0,0 +1,333 @@
import dataclasses
import inspect
import logging
import os
import sys
from types import FrameType
from typing import Any, Optional, Tuple
from cg_trace.bytecode_reconstructor import BytecodeExpr, expr_from_frame
from cg_trace.settings import DEBUG
from cg_trace.utils import better_compare_for_dataclass
LOGGER = logging.getLogger(__name__)
# Copy-paste for interactive IPython sessions
# import IPython; sys.stdout = sys.__stdout__; IPython.embed(); sys.exit()
_canonic_filename_cache = dict()
def canonic_filename(filename):
"""Return canonical form of filename. (same as Bdb.canonic)
For real filenames, the canonical form is a case-normalized (on
case insensitive filesystems) absolute path. 'Filenames' with
angle brackets, such as "<stdin>", generated in interactive
mode, are returned unchanged.
"""
if filename == "<" + filename[1:-1] + ">":
return filename
canonic = _canonic_filename_cache.get(filename)
if not canonic:
canonic = os.path.abspath(filename)
canonic = os.path.normcase(canonic)
_canonic_filename_cache[filename] = canonic
return canonic
_call_cache = dict()
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class Call:
"""A call
"""
filename: str
linenum: int
inst_index: int
bytecode_expr: BytecodeExpr
def __str__(self):
d = dataclasses.asdict(self)
del d["bytecode_expr"]
normal_fields = ", ".join(f"{k}={v!r}" for k, v in d.items())
return f"{type(self).__name__}({normal_fields}, bytecode_expr≈{str(self.bytecode_expr)})"
@classmethod
def from_frame(cls, frame: FrameType):
global _call_cache
key = cls.hash_key(frame)
if key in _call_cache:
return _call_cache[key]
code = frame.f_code
bytecode_expr = expr_from_frame(frame)
call = cls(
filename=canonic_filename(code.co_filename),
linenum=frame.f_lineno,
inst_index=frame.f_lasti,
bytecode_expr=bytecode_expr,
)
_call_cache[key] = call
return call
@staticmethod
def hash_key(frame: FrameType) -> Tuple[str, int, int]:
code = frame.f_code
return (
canonic_filename(code.co_filename),
frame.f_lineno,
frame.f_lasti,
)
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class Callee:
pass
BUILTIN_FUNCTION_OR_METHOD = type(print)
METHOD_DESCRIPTOR_TYPE = type(dict.get)
_unknown_module_fixup_cache = dict()
def _unkown_module_fixup(func):
# TODO: Doesn't work for everything (for example: `OrderedDict.fromkeys`, `object.__new__`)
# TODO: Can make this logic easier by using `func.__self__`. For `f = dict().get`, `f.__self__.__class__ == dict`
# and `dict.__new__.__self__ == dict`
module = func.__module__
qualname = func.__qualname__
cls_name, method_name = qualname.split(".")
key = (module, qualname)
if key in _unknown_module_fixup_cache:
return _unknown_module_fixup_cache[key]
matching_classes = list()
for klass in object.__subclasses__():
if inspect.isabstract(klass):
continue
try:
# type(dict.get) == METHOD_DESCRIPTOR_TYPE
# type(dict.__new__) == BUILTIN_FUNCTION_OR_METHOD
if klass.__qualname__ == cls_name and type(
getattr(klass, method_name, None)
) in [BUILTIN_FUNCTION_OR_METHOD, METHOD_DESCRIPTOR_TYPE]:
matching_classes.append(klass)
# For flask, observed to give `ValueError: Namespace class is abstract`, even with the isabstract above
except ValueError:
pass
if len(matching_classes) == 1:
klass = matching_classes[0]
ret = klass.__module__
else:
if DEBUG:
LOGGER.debug(f"Did not find exactly one matching class for {module} {qualname}")
ret = None
_unknown_module_fixup_cache[key] = ret
return ret
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True)
class ExternalCallee(Callee):
# Some bound methods might not have __module__ attribute: for example,
# `list().append.__module__ is None`
module: Optional[str]
qualname: str
#
is_builtin: bool
@classmethod
def from_arg(cls, func):
# builtin bound methods seem to always return `None` for __module__, but we
# might be able to recover the lost information by looking through all classes.
# For example, `dict().get.__module__ is None` and `dict().get.__qualname__ ==
# "dict.get"`
module = func.__module__
qualname = func.__qualname__
if module is None and qualname.count(".") == 1:
module = _unkown_module_fixup(func)
return cls(
module=module,
qualname=qualname,
is_builtin=type(func) == BUILTIN_FUNCTION_OR_METHOD,
)
def __lt__(self, other):
if not isinstance(other, ExternalCallee):
raise TypeError()
for field in dataclasses.fields(self):
s_a = getattr(self, field.name)
o_a = getattr(other, field.name)
# `None < None` gives TypeError
if s_a is None and o_a is None:
return False
if type(s_a) != type(o_a):
return type(s_a).__name__ < type(o_a).__name__
if not s_a < o_a:
return False
return True
def __gt__(self, other):
return other < self
def __ge__(self, other):
return self > other or self == other
def __le__(self, other):
return self < other or self == other
@better_compare_for_dataclass
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class PythonCallee(Callee):
"""A callee (Function/Lambda/???)
should (hopefully) be uniquely identified by its name and location (filename+line
number)
"""
filename: str
linenum: int
funcname: str
@classmethod
def from_frame(cls, frame: FrameType):
code = frame.f_code
return cls(
filename=canonic_filename(code.co_filename),
linenum=frame.f_lineno,
funcname=code.co_name,
)
class CallGraphTracer:
"""Tracer that records calls being made
It would seem obvious that this should have extended `trace` library
(https://docs.python.org/3/library/trace.html), but that part is not extensible.
You might think that we can just use `sys.settrace`
(https://docs.python.org/3.8/library/sys.html#sys.settrace) like the basic debugger
(bdb) does, but that isn't invoked on calls to C code, which we need in general, and
need for handling builtins specifically.
Luckily, `sys.setprofile`
(https://docs.python.org/3.8/library/sys.html#sys.setprofile) provides all that we
need. You might be scared by reading the following bit of the documentation
> The function is thread-specific, but there is no way for the profiler to know about
> context switches between threads, so it does not make sense to use this in the
> presence of multiple threads.
but that is to be understood in the context of making a profiler (you can't reliably
measure function execution time if you don't know about context switches). For our
use-case, this is not a problem.
"""
def __init__(self):
# Performing `Call.from_frame` can be expensive, so we cache (call, callee)
# pairs we have already seen to avoid double processing.
self.python_calls = dict()
self.external_calls = dict()
def run(self, code, globals, locals):
self.exec_call_seen = False
self.ignore_rest = False
try:
sys.setprofile(self.profilefunc)
exec(code, globals, locals)
return "completed"
except SystemExit:
return "completed (SystemExit)"
except Exception:
sys.setprofile(None)
LOGGER.info("Exception occurred while running program:", exc_info=True)
return "exception occurred"
finally:
sys.setprofile(None)
def profilefunc(self, frame: FrameType, event: str, arg):
# ignore everything until the first call, since that is `exec` from the `run`
# method above
if not self.exec_call_seen:
if event == "call":
self.exec_call_seen = True
return
# if we're going out of the exec, we should ignore anything else (for example the
# call to `sys.setprofile(None)`)
if event == "c_return":
if arg == exec and frame.f_code.co_filename == __file__:
self.ignore_rest = True
if self.ignore_rest:
return
if event not in ["call", "c_call"]:
return
if DEBUG:
LOGGER.debug(f"profilefunc event={event}")
if event == "call":
# in call, the `frame` argument is the new frame for entering the callee
assert frame.f_back is not None
callee = PythonCallee.from_frame(frame)
key = (Call.hash_key(frame.f_back), callee)
if key in self.python_calls:
if DEBUG:
LOGGER.debug(f"ignoring already seen call {key[0]} --> {callee}")
return
if DEBUG:
LOGGER.debug(f"callee={callee}")
call = Call.from_frame(frame.f_back)
self.python_calls[key] = (call, callee)
if event == "c_call":
# in c_call, the `frame` argument is frame where the call happens, and the
# `arg` argument is the C function object.
callee = ExternalCallee.from_arg(arg)
key = (Call.hash_key(frame), callee)
if key in self.external_calls:
if DEBUG:
LOGGER.debug(f"ignoring already seen call {key[0]} --> {callee}")
return
if DEBUG:
LOGGER.debug(f"callee={callee}")
call = Call.from_frame(frame)
self.external_calls[key] = (call, callee)
if DEBUG:
LOGGER.debug(f"{call} --> {callee}")
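
The difference between the `call` and `c_call` events handled in `profilefunc` can be seen in a small standalone sketch. The names `record_events`, `helper`, and `program` are hypothetical and only serve the illustration; the real tracer records much more per event.

```python
import sys


def record_events(func):
    """Run `func` under sys.setprofile and collect (event, name) pairs."""
    events = []

    def profiler(frame, event, arg):
        if event == "call":
            # `frame` is the new frame of the Python function being entered.
            events.append(("call", frame.f_code.co_name))
        elif event == "c_call":
            # `frame` is the caller's frame, `arg` is the C function object.
            events.append(("c_call", arg.__name__))

    sys.setprofile(profiler)
    try:
        func()
    finally:
        sys.setprofile(None)
    return events


def helper():
    return len("abc")  # calling a builtin gives a c_call event


def program():
    helper()  # calling a Python function gives a call event


events = record_events(program)
```

The collected list may also contain a trailing `c_call` entry for the `sys.setprofile(None)` call itself, which is the kind of noise the `ignore_rest` flag in `CallGraphTracer` is there to filter out.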

View File

@@ -0,0 +1,20 @@
def better_compare_for_dataclass(cls):
"""When dataclass is used with `order=True`, the comparison methods are only implemented for
objects of the same class. This decorator extends the functionality to compare class
names when used against objects of other classes.
"""
for op in [
"__lt__",
"__le__",
"__gt__",
"__ge__",
]:
old = getattr(cls, op)
def new(self, other, op=op, old=old):
if type(self) == type(other):
return old(self, other)
return getattr(str, op)(self.__class__.__name__, other.__class__.__name__)
setattr(cls, op, new)
return cls
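
A short sketch of what this decorator buys us: sorting a list that mixes instances of different dataclasses. The decorator is reproduced under the hypothetical name `compare_by_class_name_fallback` so the example is self-contained, and `Apple` and `Banana` are made-up classes.

```python
import dataclasses


def compare_by_class_name_fallback(cls):
    # Same idea as better_compare_for_dataclass: keep the generated comparison
    # for same-class operands, otherwise compare the class names as strings.
    for op in ["__lt__", "__le__", "__gt__", "__ge__"]:
        old = getattr(cls, op)

        def new(self, other, op=op, old=old):
            if type(self) == type(other):
                return old(self, other)
            return getattr(str, op)(type(self).__name__, type(other).__name__)

        setattr(cls, op, new)
    return cls


@compare_by_class_name_fallback
@dataclasses.dataclass(frozen=True, order=True)
class Apple:
    weight: int


@compare_by_class_name_fallback
@dataclasses.dataclass(frozen=True, order=True)
class Banana:
    length: int


mixed = [Banana(2), Apple(5), Banana(1), Apple(3)]
ordered = sorted(mixed)  # without the decorator this raises TypeError
```

The result groups instances by class name first and orders by field values within each class, which is what lets the exporter call `sorted(recorded_calls)` on a mix of callee types.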

View File

@@ -0,0 +1,32 @@
#!/bin/bash
set -Eeuo pipefail # see https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
if ! pip show cg_trace &>/dev/null; then
echo "You need to follow setup instructions in README"
exit 1
fi
DB="$SCRIPTDIR/cg-trace-test-db"
SRC="$SCRIPTDIR/python-src/"
XMLDIR="$SCRIPTDIR/python-traces/"
PYTHON_EXTRACTOR=$(codeql resolve extractor --language=python)
rm -rf "$DB"
rm -rf "$XMLDIR"
mkdir -p "$XMLDIR"
for f in $(ls $SRC); do
echo "Tracing $f"
cg-trace --xml "$XMLDIR/${f%.py}.xml" "$SRC/$f"
done
codeql database init --source-root="$SRC" --language=python "$DB"
codeql database trace-command --working-dir="$SRC" "$DB" "$PYTHON_EXTRACTOR/tools/autobuild.sh"
codeql database index-files --language xml --include-extension .xml --working-dir="$XMLDIR" "$DB"
codeql database finalize "$DB"
echo "Created database '$DB'"

View File

@@ -0,0 +1,9 @@
def foo():
print("foo")
def bar():
print("bar")
[foo, bar][0]()

View File

@@ -0,0 +1,9 @@
def foo():
print("foo")
def bar():
print("bar")
(foo, bar)[0]()

View File

@@ -0,0 +1,15 @@
def func(*args, **kwargs):
print("func", args, kwargs)
args = [1, 2, 3]
kwargs = {"a": 1, "b": 2}
# These give rise to a CALL_FUNCTION_EX
func(*args)
func(**kwargs)
func(*args, **kwargs)
func(*args, foo="foo")
func(*args, foo="foo", **kwargs)

View File

@@ -0,0 +1,7 @@
class Foo:
def __getitem__(self, key):
print("__getitem__")
foo = Foo()
foo["key"] # this is recorded as a call :)

View File

@@ -0,0 +1,9 @@
print("builtins test")
len("bar")
l = list()
l.append(42)
import sys
sys.getdefaultencoding()
r = range(10)

View File

@@ -0,0 +1,44 @@
def func(self, arg):
print("func", self, arg)
class Foo(object):
def __init__(self, arg):
print("Foo.__init__", self, arg)
def some_method(self):
print("Foo.some_method", self)
return self
f = func
@staticmethod
def some_staticmethod():
print("Foo.some_staticmethod")
@classmethod
def some_classmethod(cls):
print("Foo.some_classmethod", cls)
foo = Foo(42)
foo.some_method()
foo.f(10)
foo.some_staticmethod()
foo.some_classmethod()
foo.some_method().some_method().some_method()
Foo.some_staticmethod()
Foo.some_classmethod()
class Bar(object):
def wat(self):
print("Bar.wat")
# these calls to Bar() are not recorded (since no __init__ function)
bar = Bar()
bar.wat()
Bar().wat()

View File

@@ -0,0 +1,3 @@
d = dict()
d.get("foo") or d.get("bar")

View File

@@ -0,0 +1,4 @@
import socket
sock = socket.socket()
print(sock.getsockname())

View File

@@ -0,0 +1,4 @@
import io
# the `io.open` is just an alias for `_io.open`, but we record the external callee as `io.open` :|
io.open("foo")

View File

@@ -0,0 +1,7 @@
for i in range(10):
print(i)
[i + 1 for i in range(10)]
l = list(range(10))
[i + 1 for i in l]
[i + 1 for i in l]

View File

@@ -0,0 +1,37 @@
def one(*args, **kwargs):
print("one")
return 1
def two(*args, **kwargs):
print("two")
return 2
def three(*args, **kwargs):
print("three")
return 3
one(); two()
print("---")
one(); one()
print("---")
alias_one = one
alias_one(); two()
print("---")
three(one(), two())
print("---")
three(one(), two=two())
print("---")
def f():
print("f")
def g():
print("g")
return g
f()()

View File

@@ -0,0 +1,26 @@
class Foo:
def __init__(self):
self.list = []
def func(self, kwargs=None, result_callback=None):
self.list.append((kwargs or {}, result_callback))
foo = Foo()
foo.func()
"""
Has problematic bytecode, since to find out what method is called from instruction 16, we need
to traverse the JUMP_IF_TRUE_OR_POP which requires some more sophistication.
Disassembly of <code object func at 0x7f98f64ee030, file "example/problem-1.py", line 5>:
6 0 LOAD_FAST 0 (self)
2 LOAD_ATTR 0 (list)
4 LOAD_METHOD 1 (append)
6 LOAD_FAST 1 (kwargs)
8 JUMP_IF_TRUE_OR_POP 12
10 BUILD_MAP 0
>> 12 LOAD_FAST 2 (result_callback)
14 BUILD_TUPLE 2
16 CALL_METHOD 1
"""

View File

@@ -0,0 +1,25 @@
def func(func_arg):
print("func")
def func2():
print("func2")
return func_arg()
func2()
def nop():
print("nop")
pass
func(nop)
"""
Needs handling of LOAD_DEREF. Disassembled bytecode looks like:
6 8 LOAD_DEREF 0 (func_arg)
10 CALL_FUNCTION 0
12 RETURN_VALUE
"""

View File

@@ -0,0 +1,10 @@
def foo():
print('foo')
def bar():
print('bar')
foo()
bar()
foo(); bar()

View File

@@ -0,0 +1,5 @@
import sys
print("will exit now")
sys.exit()