* codeql-lab: Centralized Git Repository for CodeQL Development

** Overview
codeql-lab is a consolidated Git repository that collects all relevant
CodeQL components, resources, and tooling into a single
version-controlled location.

** Purpose
The goal of this repository is to provide an integrated development
environment (“lab”) for CodeQL research, experimentation, and custom
query development. It simplifies setup by maintaining all required
submodules, configuration files, and datasets in one place.

** Repository Location
The primary repository is hosted at:
https://github.com/hohn/codeql-lab

** Intended Use Cases
- Local experimentation with CodeQL queries and libraries.
- End-to-end testing of custom model data and query logic.
  This includes writing and validating custom data flow models,
  adjusting model coverage, and confirming that query results behave
  as expected across controlled datasets. The lab setup supports rapid
  iteration on QL logic, helping detect unintended changes and enabling
  reproducible evaluations of taint tracking, control flow, or API usage
  patterns.
- Structured collaboration and controlled updates across all
  CodeQL-related artifacts.
- Simplified onboarding and reproducible setup for new contributors or
  analysis environments.

* Prerequisites

Working with this repository assumes prior experience with:

- *Git, Bash, and standard Unix command-line tools*. These are used
  throughout and are required for setup and day-to-day tasks.
  Tools such as [[https://man.archlinux.org/man/rg.1][ripgrep]], [[https://www.gnu.org/software/bash/][GNU Bash]], and [[https://en.wikipedia.org/wiki/Grep][grep/regex workflows]] are assumed.

- *At least one supported programming language*, such as C, C++, Java,
  Python, Go, or Ruby. A solid understanding of the target language is
  necessary to interpret analysis results and write effective queries.
  See general background on [[https://en.wikipedia.org/wiki/Programming_language][programming languages]] if needed.

- *Basic familiarity with program structure concepts*, including
  [[https://en.wikipedia.org/wiki/Abstract_syntax_tree][abstract syntax trees (ASTs)]], [[https://en.wikipedia.org/wiki/Control-flow_graph][control-flow graphs (CFGs)]], and
  [[https://en.wikipedia.org/wiki/Data-flow_analysis][data-flow graphs (DFGs)]]. These are core to how CodeQL models code behavior.

- *Optional but helpful*: familiarity with structural or functional
  programming languages (e.g. [[https://en.wikipedia.org/wiki/Lisp_(programming_language)][Lisp]] or [[https://en.wikipedia.org/wiki/OCaml][OCaml]]) can make working with
  CodeQL’s query language and type system more intuitive.
  See overview of [[https://en.wikipedia.org/wiki/Functional_programming][functional programming]] for related context.

* Repository Layout
** Core Structure
- The repository is based on: https://github.com/github/vscode-codeql-starter.git
- All development work is done on the branch: =qllab=
- The CodeQL version is pinned via the =ql/= submodule:
  : commit 4d681f05bd671f8b5e31624f16a2b4d75e61c071 (tag: codeql-cli/v2.22.0)
- A prebuilt CodeQL CLI binary is included:
  : 1104625939 assets/codeql-osx64.zip
- Project-specific repositories can be added directly under the root.
  Example: the C dataflow workshop in =./codeql-dataflow-sql-injection-c=

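A minimal sketch of obtaining this layout, assuming =git= is installed and
that the lab should live in =~/codeql-lab= (the location used under Tool
Setup). The function name =setup_lab= and its optional target-directory
argument are illustrative and not part of the repository.

#+BEGIN_SRC sh
# Hypothetical helper: clone the lab on the development branch together
# with its submodules. The default target directory is an assumption.
setup_lab() {
    target="${1:-$HOME/codeql-lab}"

    # Clone the qllab branch and fetch all submodules (including ql/)
    # at the commits recorded on that branch.
    git clone --branch qllab --recurse-submodules \
        https://github.com/hohn/codeql-lab.git "$target" || return 1

    # Show which commit the ql/ submodule is pinned to, for comparison
    # with the commit listed above.
    git -C "$target" submodule status ql
}
#+END_SRC

Calling =setup_lab= with no argument clones into =~/codeql-lab=; the =ql/=
submodule is large, so the initial clone may take a while.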
** Additional Structure Notes
- The original upstream README.md is preserved at [[./README-vscode-codeql-starter.md]]

* Possible Reading Orders

** Data Flow
*** Debugging data flow config (instead of taint flow), Java
Taint-flow debugging is illustrated with the Java SQL injection sample:
- [[./codeql-sqlite-java/TaintFlowDebugging.ql]]
- [[./codeql-sqlite-java/TaintFlowDebugging.md]]

*** TODO Debugging data flow config (instead of taint flow), C
A corresponding example for C is planned, using a simplified query to trace
value propagation in [[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/add-user.c]].
Unlike Java, C may require manual modeling even to visualize basic flows.

** Modeling
There are two primary approaches to modeling: direct use of CodeQL predicates
and the models-as-data system. The models-as-data system is implemented in QL
but relies on external YAML files that are interpreted at query evaluation
time.

The model editor provides a GUI for managing YAML-based models, but the
underlying format is identical to that used by the models-as-data system. In C
and other cases where GUI support is limited or unavailable, we write these
YAML models manually and use them directly from queries.

When YAML models are written directly, GPT-based tooling is a natural fit.
GPTs can extract function signatures, parameter semantics, and flow
annotations from documentation or code examples, then generate YAML model
entries automatically.

As a diagram:
#+BEGIN_SRC text
                                 +----------------------+
                                 |     Modeling in      |
                                 |       CodeQL         |
                                 +----------+-----------+
                                            |
             +------------------------------+------------------------------+
             |                                                             |
    +--------v--------+                                          +---------v---------+
    | Direct CodeQL   |                                          | Models-as-Data    |
    | (QL predicates) |                                          | (YAML + QL eval)  |
    +--------+--------+                                          +---------+---------+
             |                                                             |
             |                                                             |
+------------v-----------+                                 +---------------v---------------+
| Manual customization   |                                 | YAML models via GUI           |
| via Customizations.qll |                                 | (Model Editor frontend)       |
+------------+-----------+                                 +---------------+---------------+
             |                                                             |
             |                                                             |
   +---------v----------+                                      +-----------v-----------+
   | Java: built-in     |                                      | Java: Jedis + Console |
   | includes .qll hook |                                      | GUI modeling examples |
   +--------------------+                                      +-----------------------+
             |
             |  Manual setup needed for:
             v
+------------------------+
| C / C++: requires      |
| cpp.qll patch +        |
| Customizations.qll     |
+------------------------+
             |
             v
+-------------------------------+
| Use models-as-data directly   |
| (YAML only, no editor)        |
+-------------------------------+
             |
             v
+-------------------------------+
| GPT-assisted YAML generation  |
| from docs, code, or examples  |
+-------------------------------+
#+END_SRC

*** Review: SQLite Injection Workshop, Java
We begin with a recap of the Java-based injection example, focusing on the
vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
tainted data through the program. Next, we explore the out-of-the-box query
[[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
inspection of the relevant base classes and framework modeling in
[[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].

*** Customizations via codeql (Java)
To customize CodeQL for Java, we identify and extend base classes to add
custom flow sources and sinks. A general explanation of this approach is
available in the file [[./codeql-dataflow-sql-injection-c/README.org][README.org]], particularly
the section [[file:codeql-dataflow-sql-injection-c/README.org::*supplement codeql: Add to FlowSource or a subclass][supplement codeql: Add to FlowSource or a subclass]]. For Java,
[[./ql/java/ql/lib/java.qll][java.qll]] includes [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], which provides extension points for
custom flow modeling -- this structure is common across most CodeQL-supported
languages, with the notable exception of C. Further details on this
customization process can be found in
[[./codeql-dataflow-sql-injection-c/incoming.codeql-customizations-workshop.md][incoming.codeql-customizations-workshop.md]].

*** Customizations via Model Editor: Jedis Example (Java Redis client)
The Jedis example is a straightforward case with no unexpected
behavior. Although the library contains many functions, they follow a simple
and repetitive pattern, making it ideal for large-scale modeling. The CodeQL
model editor can be used to efficiently define sources and sinks for such
cases. A detailed explanation is provided
in [[file:~/work-gh/codeql-lab/codeql-jedis-java/README.org::*Modeling Jedis as a Dependency in Model Editor][Modeling Jedis as a Dependency in Model Editor]], while validation of
the modeled sink is discussed in [[file:~/work-gh/codeql-lab/codeql-jedis-java/README.org::*Verifying the Modeled Sink][Verifying the Modeled Sink]].
Finally, the query-level usage of these models can be seen
in [[file:~/work-gh/codeql-lab/codeql-jedis-java/README.org::*Identify usage of injection-related models in existing queries][Identify usage of injection-related models in existing queries]].

*** Customizations via Model Editor: Single-function case (Java SQLite sample)
We extend the Java SQLite example using the model editor, with both the
necessary data and specification already available. This example highlights a
subtle issue with the model editor: the method =java.io.Console.readLine()= is
already modeled as a taint *step* and therefore does not appear in the editor
interface, even though we need it modeled as a *source*. This requires special
handling. The extension pack is defined in
[[./.github/codeql/extensions/sqlite-db/codeql-pack.yml]], and the extension data
is provided in
[[./.github/codeql/extensions/sqlite-db/models/sqlite.model.yml]]. A detailed
explanation is available in [[file:~/work-gh/codeql-lab/codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]].

To support this, we explain how the "models-as-data" system works
internally. A diagnostic query can be used to enumerate currently recognized
sources and sinks. From there, the relevant entry points -- such as QL classes
and predicates -- can be identified by inspecting representative queries like
[[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].

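To make the shape of such an extension concrete, the following sketch writes
a pack file and a one-row source model for =java.io.Console.readLine()= into
a scratch directory. It is not a copy of the repository's files: the pack
name =sample/sqlite-db=, the =EXT_DIR= variable, and the source kind
=remote= are assumptions chosen for illustration, and should be compared
with the actual files referenced in this section.

#+BEGIN_SRC sh
# Sketch only: writes to a scratch directory, not to the repository.
ext_dir="${EXT_DIR:-$(mktemp -d)}/sqlite-db"
mkdir -p "$ext_dir/models"

# Pack definition. The pack name is a placeholder (assumption).
cat > "$ext_dir/codeql-pack.yml" <<'EOF'
name: sample/sqlite-db
version: 0.0.1
library: true
extensionTargets:
  codeql/java-all: "*"
dataExtensions:
  - models/**/*.yml
EOF

# Source model. Columns: package, type, subtypes, name, signature, ext,
# output, kind, provenance. The kind "remote" is an assumption.
cat > "$ext_dir/models/sqlite.model.yml" <<'EOF'
extensions:
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      - ["java.io", "Console", false, "readLine", "()", "", "ReturnValue", "remote", "manual"]
EOF

echo "wrote extension pack to $ext_dir"
#+END_SRC

The row marks the return value of the no-argument =readLine()= as a source,
which is the modeling the editor cannot offer for this method.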
*** Review: SQLite Injection Workshop (C)
This is the C version of the injection workshop, based on
[[./codeql-dataflow-sql-injection-c/add-user.c]]. It
serves as the basis for both the "models-as-data" manual modeling and the
extension via =Customizations.qll=.

*** (PARTIAL) Use models-as-data QL code directly (no graphical editor)
This section focuses on using the models-as-data system *without* the
graphical model editor. While model definition files and supporting data
already exist, we manually write YAML files to add or override flow
behavior. This approach is especially relevant for C, where graphical tooling
is limited or nonexistent.

As reinforcement, we reuse the C version of the SQLite injection workshop:
- The code sample is at
  [[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/add-user.c]].
- The accompanying query is
  [[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/SqlInjection.ql]].

For structural reference, see the Java version’s documentation (not the editor
interface): [[file:~/work-gh/codeql-lab/codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]]. There is no separate
C-specific walkthrough because the YAML structure and logic are nearly
identical.

For workshop use, we extend the example by modeling key functions manually:
- Add a source model for: =count = read(STDIN_FILENO, buf, BUFSIZE);=
- Add a sink model for: =rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);=

We demonstrate how to define YAML-based models for standard functions like
=read()= and verify their effect using the out-of-the-box query:
[[./ql/cpp/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].

As an additional teaching case, we introduce the higher-level, redundant
function =char* get_user_info()= as a custom source, even though it internally
calls a function already modeled as a source, to illustrate how user-defined
extensions affect propagation logic.

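The three models named above could look roughly like the following sketch,
which writes a single YAML file into a scratch directory. The file name
=sqlite-c.model.yml=, the =MODELS_DIR= variable, and the kind strings
=remote= and =sql-injection= are assumptions; which kinds the pinned
=SqlTainted.ql= actually consumes should be checked against the =ql/=
submodule before relying on them.

#+BEGIN_SRC sh
# Sketch only: writes to a scratch directory, not to the repository.
models_dir="${MODELS_DIR:-$(mktemp -d)}"

cat > "$models_dir/sqlite-c.model.yml" <<'EOF'
extensions:
  # Sources. Columns: namespace, type, subtypes, name, signature, ext,
  # output, kind, provenance.
  - addsTo:
      pack: codeql/cpp-all
      extensible: sourceModel
    data:
      # read(fd, buf, count): the bytes written into buf (argument 1,
      # one level of indirection) are treated as tainted.
      - ["", "", false, "read", "", "", "Argument[*1]", "remote", "manual"]
      # get_user_info(): the string that the returned pointer refers to.
      - ["", "", false, "get_user_info", "", "", "ReturnValue[*]", "remote", "manual"]
  # Sinks. Same columns, with an input specification instead of output.
  - addsTo:
      pack: codeql/cpp-all
      extensible: sinkModel
    data:
      # sqlite3_exec(db, sql, ...): the SQL text is argument 1.
      - ["", "", false, "sqlite3_exec", "", "", "Argument[*1]", "sql-injection", "manual"]
EOF

echo "wrote $models_dir/sqlite-c.model.yml"
#+END_SRC

As in the Java case, the file only takes effect once it is listed under
=dataExtensions= in a pack whose =extensionTargets= include =codeql/cpp-all=.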
*** (PARTIAL) Extending Queries with Customizations.qll for C
The manual YAML modeling approach described earlier works well for isolated or
prototype cases. However, for idiomatic, large-scale, or reusable CodeQL
analysis, it is often preferable to define custom dataflow logic directly in
QL, using =Customizations.qll=.

Most CodeQL-supported languages (e.g., Java, Python) include built-in support
for this mechanism. For example, Java’s primary entry point [[./ql/java/ql/lib/java.qll][java.qll]]
automatically imports [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], exposing extension points for
user-defined sources, sinks, and flow steps.

In contrast, C and C++ do *not* support this out of the box. To enable it, you
must manually patch the language pack and (optionally) rebuild the CodeQL
bundle.

This section is *partially complete*: we document the required source-level QL
changes, but the bundling process is still pending.

To enable =Customizations.qll= support for C/C++, perform the following:

1. Modify =ql/cpp/ql/lib/cpp.qll= to import your =Customizations.qll= module.
2. Create and populate =ql/cpp/ql/lib/Customizations.qll= with new
   source/sink/flow logic.
3. *For full deployment:* Rebuild the CodeQL bundle to include the updated
   QL files.
   - This allows portable use in CLI runs and IDE workflows.
   - Once bundled, C/C++ customization behaves like any other supported
     language.
4. *For workshops and local development:* No bundling is needed.
   - If you run queries directly from the modified source tree, the changes
     take effect immediately.

A working demonstration of this modification (without bundling) is provided
in:
[[./codeql-dataflow-sql-injection-c/README.org]]

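Steps 1 and 2 can be sketched as a small shell helper. The function name
=enable_cpp_customizations= is hypothetical, and the stub it writes simply
mirrors the Java layout (a documentation comment followed by an import of
the language's entry module); the linked workshop README remains the
reference for the actual change.

#+BEGIN_SRC sh
# Hypothetical helper: wire a Customizations.qll into the C/C++ library
# of a CodeQL source checkout. Argument: path to the checkout, e.g. ./ql
enable_cpp_customizations() {
    lib_dir="$1/cpp/ql/lib"
    if [ ! -f "$lib_dir/cpp.qll" ]; then
        echo "cpp.qll not found under $lib_dir" >&2
        return 1
    fi

    # Step 1: import the module from cpp.qll (only if not already there).
    if ! grep -q '^import Customizations$' "$lib_dir/cpp.qll"; then
        printf '\nimport Customizations\n' >> "$lib_dir/cpp.qll"
    fi

    # Step 2: create a stub to hold custom sources, sinks, and flow steps.
    if [ ! -f "$lib_dir/Customizations.qll" ]; then
        cat > "$lib_dir/Customizations.qll" <<'EOF'
/**
 * Local customizations for the C/C++ libraries.
 * Imported by cpp.qll, so definitions here apply to all queries.
 */

import cpp
EOF
    fi
}
#+END_SRC

Run from the repository root as =enable_cpp_customizations ./ql=. The helper
is safe to run repeatedly, and an existing =Customizations.qll= is left
untouched.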
** TODO CodeQL Bundling
This section will provide a detailed walkthrough of the CodeQL bundling process
using the CLI tool at https://github.com/advanced-security/codeql-bundle. This
tool enables custom pack composition and is necessary when extending language
libraries (e.g., adding =Customizations.qll= support for C/C++).

While the official tool is somewhat of a black box, we will demystify the
underlying structure and show how to build, inspect, and deploy custom bundles
from source. Notes and scripts will be collected in
[[file:codeql-bundling/README.org::XX: continue]].

* Tool Setup
This lab uses several helper scripts, found in [[./bin/]]. To ensure the ones
written in Python have access to their prerequisites, set up a virtual
environment:
#+BEGIN_SRC sh
# 1. Create the virtualenv
python3 -m venv ~/codeql-lab/venv

# 2. Activate it and install the required packages
source ~/codeql-lab/venv/bin/activate
pip install pyyaml
#+END_SRC

For these scripts to be found, add their directory to the =PATH=:
#+BEGIN_SRC sh
export PATH="$HOME/codeql-lab/bin:$PATH"
#+END_SRC