Verified all prompt-injection framework models against the real Python
SDK sources:
- OpenRouter: the official openrouter SDK uses client.chat.send(messages=)
(not chat.completions.create), client.embeddings.generate(input=) (not
embeddings.create), and client.responses.send(input=, instructions=).
Corrected the framework qll and model, and fixed the test files that
used the wrong API.
- Anthropic: added the managed-agents system prompt sink
(beta.agents.create/update Argument[system:]).
- Google GenAI: added models.edit_image Argument[prompt:] as user content.
OpenAI, agents and LangChain models were confirmed correct against their
SDK sources.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the JavaScript layout from PR #21953:
- Move SystemPromptInjection.ql / UserPromptInjection.ql to src/Security/CWE-1427
- Move customizations, query and framework libs to python/ql/lib
- Move the AIPrompt concept to the production Concepts.qll
- Drop the experimental tag; py/system-prompt-injection (high precision) now
joins the code-scanning, security-extended and security-and-quality suites,
while py/user-prompt-injection (low precision) stays out of the default suites
- Move query tests to python/ql/test/query-tests/Security
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
We now find an alert on this line as we hope to
It is not an alert for _full_ SSRF, though, since that configuration cannot handle multiple substitutions.
- remove `tupleStoreStep` and `dictStoreStep` from `containerStep`
These are imprecise compared to the content being precise.
- add implicit reads to recover taint at sinks
- add implicit read steps for decoders
to supplement the `AdditionalTaintStep`
that now only covers when the full container is tainted.
This one is potentially a bit iffy -- it checks for a very powerful
property (that implies many of the other queries), but as the test
results show, it can produce false positives when there is in fact no
problem. We may want to get rid of it entirely, if it becomes too noisy.
This looks for nodes annotated with `t[never]` in the test that are
reachable in the CFG. This should not happen (it messes with various
queries, e.g. the "mixed returns" query), but the test shows that in a
few particular cases (involving the `match` statement where all cases
contain `return`s), we _do_ have reachable nodes that shouldn't be.
This one demonstrates a bug in the current CFG. In a dictionary
comprehension `{k: v for k, v in d.items()}`, we evaluate the value
before the key, which is incorrect. (A fix for this bug has been
implemented in a separate PR.)
These use the annotated, self-verifying test files to check various
consistency requirements.
Some of these may be expressing the same thing in different ways, but
it's fairly cheap to keep them around, so I have not attempted to
produce a minimal set of queries for this.
These tests consist of various Python constructions (hopefully a
somewhat comprehensive set) with specific timestamp annotations
scattered throughout. When the tests are run using the Python 3
interpreter, these annotations are checked and compared to the "current
timestamp" to see that they are in agreement. This is what makes the
tests "self-validating".
There are a few different kinds of annotations: the basic `t[4]` style
(meaning this is executed at timestamp 4), the `t[dead(4)]` variant
(meaning this _would_ happen at timestamp 4, but it is in a dead
branch), and `t[never]` (meaning this is never executed at all).
In addition to this, there is a query, MissingAnnotations, which checks
whether we have applied these annotations maximally. Many expression
nodes are not actually annotatable, so there is a sizeable list of
excluded nodes for that query.
We won't be able to run these tests until Python 3.15 is actually out
(and our CI is using it), so it seemed easiest to just put them in their
own test directory.