Absent features are now represented implicitly by the absence of a row
in the `tokenFeatures` relation, rather than explicitly by an empty
string. This leads to improved runtime performance. To enable this
implicit representation, we pass the set of supported token features to
the `scoreEndpoints` HOP. Requires CodeQL CLI v2.7.4.
Previously heuristic sinks were always included, to avoid us filtering
them out due to not being an argument to an external library call.
In this commit we move the argument to an external library call
filtering to the query-specific endpoint filters.
This lets us filter out heuristic sinks if they match one of the other
endpoint filters, reducing FPs.
0e31439 introduces some occasional duplicate tokens due to duplicate AST
node attributes. The long-term fix is to update `CodeToFeatures.qll`,
but for the short-term, we update the concatenation to concatenate
unique (location, token) pairs.
Use a `strictcount` to identify whether there is exactly one feature or not.
If so, we use it. If not, we use the empty string.
Add context to ensure we filter the set of data flow nodes down to only
the set of endpoint nodes.
This performance optimisation avoids calculating the Cartesian product
of data flow nodes and feature names, but it does not avoid calculating
the (slightly smaller) Cartesian product of endpoint nodes and feature names.
Product size = number of endpoint nodes * number of feature names.
At time of writing there are 8 feature names.
Change the cutoff logic from `count` to `strictcount`, since we know it only applies
to a non-empty set of results.
Use a single `strictconcat` aggregate to combine tokens in order of location,
instead of computing a `rank` followed by a `concat`.
Strictness introduces a slight change of behaviour because missing tokens will now result
in no results from the predicate rather than an empty feature string.