Dataflow: Refactor pruning stages.#4668
Conversation
This column is functionally determined from the access path, and was merely included to help with some join-orders that no longer appear problematic.
This commit is a little bit risky, as it allows for some potentially bad join-orders. The best order starts with the delta and proceeds with the then functional `mid.getEnclosingCallable()` and `getLocalCallContext`. In this order `localFlowEntry` becomes superfluous. The standard order is however somewhat unwilling to choose this. If it picks `getLocalCallContext` and `getEnclosingCallable` as the first join, the result is really bad, but it appears that the existence of `localFlowEntry` at least means that it'll do `localFlowEntry`, `getEnclosingCallable`, `getLocalCallContext` in that order, which appears to be acceptable, although it isn't optimal. Without the `localFlowEntry` conjunct we end up with the worst case. We'll need to watch this particular join-ordering until we get better join-ordering directives.
This constructs a few more tuples in Stage3::fwdFlow0, which are then filtered in Stage3::fwdFlow. This is cleaner and appears faster.
Post-recursion we can filter the forward cons-candidates to only include those that met a read step, and similarly restrict the reverse flow cons-candidates to those that met a store step.
This actually provides a decent pruning improvement in stages 3 and 4.
|
Retrying Java: https://jenkins.internal.semmle.com/job/Changes/job/Java-Differences/998/ |
|
All jobs done and looking good. |
hvitved
left a comment
There was a problem hiding this comment.
This is really awesome, thanks for taking on this task ✨
| } | ||
|
|
||
| private module Stage2 { | ||
| class ApApprox = Stage1::Ap; |
There was a problem hiding this comment.
It looks like this is always PrevStage::Ap, so I would either call it class PrevStageAp = PrevStage::Ap or just use PrevStage::Ap directly.
There was a problem hiding this comment.
It ends up as class ApApprox = PrevStage::Ap; in a later commit (I tried minimising commits with changes that were obsoleted by later commits, but it was hard to avoid completely).
There was a problem hiding this comment.
Right, but what I meant was I am not a fan of the ApApprox name. I would rather have PrevStageAp or just PrevStage::Ap.
|
|
||
| class Cc = boolean; | ||
|
|
||
| /* Begin: Stage 1 logic. */ |
There was a problem hiding this comment.
I think I would prefer a module instead of begin/end comments.
There was a problem hiding this comment.
It would just be a marker-module to be immediately imported, so I prefer the comments.
| pragma[nomagic] | ||
| private predicate fwdFlowOutFromArg( | ||
| DataFlowCall call, Node out, boolean argAp, Ap ap, Configuration config | ||
| DataFlowCall call, Node node, Ap argAp, Ap ap, Configuration config |
There was a problem hiding this comment.
What was wrong with calling it out?
There was a problem hiding this comment.
Nothing, I guess. IIRC it was inconsistently named so I just chose one - do you prefer out over node here?
There was a problem hiding this comment.
I would prefer out because that is the name used by fwdFlowOut.
| private Ap apCons(TypedContent tc, Ap tail) { result = true and exists(tc) and exists(tail) } | ||
|
|
||
| pragma[inline] | ||
| private Content getHeadContent(Ap ap) { exists(result) and ap = true } |
There was a problem hiding this comment.
What happens if you remove these predicates, and instead replace their single use with
ap = apCons(any(TypedContent tc | c = tc.getContent()), _)There was a problem hiding this comment.
Stage2::apCons will have to be changed to
pragma[inline]
private Ap apCons(TypedContent tc, Ap tail) { result = true and exists(tc) and tail in [false, true] }There was a problem hiding this comment.
That would correspond to a pop of an Ap just to get the head, so we'd introduce a needless fan-out, which we would then hope that the compiler could eliminate in all stages. Besides this risk I think it would be more confusing, so I don't think it's worth it just to eliminate one predicate-parameter.
| not fullBarrier(node, config) and | ||
| not inBarrier(node, config) and | ||
| not outBarrier(node, config) |
There was a problem hiding this comment.
Not quite. But we don't have a lot of use for in- and out-barriers, so they are certainly very corner-case. (We'd need something like a parameter that's simultaneously part of through-flow and a source node and an in-barrier). Arguably these barriers could have been checked a little earlier, but they aren't.
There was a problem hiding this comment.
The fullBarrier check is done in fwdFlow.
There was a problem hiding this comment.
Ah, right. Good catch.
| // we don't expect a parameter to return stored in itself | ||
| not exists(int pos | | ||
| kind.(ParamUpdateReturnKind).getPosition() = pos and p.isParameterOf(_, pos) | ||
| ) |
There was a problem hiding this comment.
[Not related to your PR] I am not sure that this restriction makes sense anymore, now that flow summaries are content-sensitive. It is not unrealistic for a function to transfer data from one component of a parameter to another.
There was a problem hiding this comment.
That's potentially annoying to deal with as this restriction actually cuts off significant amounts of nonsense flow IIRC.
| bindingset[node, ap] | ||
| private predicate filter(Node node, Ap ap) { any() } | ||
|
|
||
| bindingset[ap, contentType] | ||
| private predicate typecheckStore(Ap ap, DataFlowType contentType) { any() } |
There was a problem hiding this comment.
Perhaps add comments that these are trivial because filtering/type checking happens in previous stage.
| /** The unit type. */ | ||
| private newtype TUnit = TMkUnit() | ||
|
|
||
| /** The trivial type with a single element. */ | ||
| class Unit extends TUnit { | ||
| /** Gets a textual representation of this element. */ | ||
| string toString() { result = "unit" } | ||
| } | ||
|
|
There was a problem hiding this comment.
This should be in DataFlowImplCommon.qll instead.
There was a problem hiding this comment.
It can't, because for Java the Unit type is in the common library.
There was a problem hiding this comment.
I don't see why that is a problem, as DataFlowImplCommon.qll is not exposed.
There was a problem hiding this comment.
Right, that ought to work if no instance of import java leaks through. Currently there is such a leak (otherwise I couldn't refer to the Java version of Unit), so it wouldn't work without fixing that (but that should probably be separate from this PR as it could break things if someone has referenced a non-data-flow Java QL class through a DataFlow:: prefix). Another angle on this is that we should possibly push towards all the languages adopting a common Unit.qll, and then use that instead of introducing such a fundamental class in DataFlowImplCommon.qll - it has other uses. In summary, I think the definition should still come from DataFlowPrivate.qll, but it should be in the form of import semmle.code.Unit.
There was a problem hiding this comment.
Currently there is such a leak
:-(
This is a rather large refactor of the pruning stages of the data-flow library. The end result is that stages 2, 3, and 4 are almost bitwise identical (except for 2 instances of
unbindBoolin stage 3, and 2 missing lines infwdFlow0/fwdFlowin stage 2). The idea is that this is now the desugared version of an imagined use of parameterised modules. It would be nice to also make stage 1 identical to the other stages, but I've postponed that for now.Commit-by-commit review is strongly encouraged.
I've tested for bad join-orders and general performance issues between each commit for Java, but the end result should definitely be checked for bad join-orders in all languages.
There are also some pruning improvements, in particular since filtering on potential parameter-flow-through wasn't done consistently. Making the stages share this logic ensures that this additional pruning is applied in all stages.