Skip to content

Dataflow: Refactor pruning stages.#4668

Merged
aschackmull merged 44 commits intogithub:mainfrom
aschackmull:dataflow/refactor-pruning
Nov 30, 2020
Merged

Dataflow: Refactor pruning stages.#4668
aschackmull merged 44 commits intogithub:mainfrom
aschackmull:dataflow/refactor-pruning

Conversation

@aschackmull
Copy link
Contributor

This is a rather large refactor of the pruning stages of the data-flow library. The end result is that stages 2, 3, and 4 are almost bitwise identical (except for 2 instances of unbindBool in stage 3, and 2 missing lines in fwdFlow0/fwdFlow in stage 2). The idea is that this is now the desugared version of an imagined use of parameterised modules. It would be nice to also make stage 1 identical to the other stages, but I've postponed that for now.

Commit-by-commit review is strongly encouraged.

I've tested for bad join-orders and general performance issues between each commit for Java, but the end result should definitely be checked for bad join-orders in all languages.

There are also some pruning improvements, in particular since filtering on potential parameter-flow-through wasn't done consistently. Making the stages share this logic ensures that this additional pruning is applied in all stages.

This column is functionally determined from the access path, and was
merely included to help with some join-orders that no longer appear
problematic.
This commit is a little bit risky, as it allows for some potentially bad
join-orders. The best order starts with the delta and proceeds with the
then functional `mid.getEnclosingCallable()` and `getLocalCallContext`.
In this order `localFlowEntry` becomes superfluous. The standard order
is however somewhat unwilling to choose this. If it picks
`getLocalCallContext` and `getEnclosingCallable` as the first join, the
result is really bad, but it appears that the existence of
`localFlowEntry` at least means that it'll do `localFlowEntry`,
`getEnclosingCallable`, `getLocalCallContext` in that order, which
appears to be acceptable, although it isn't optimal. Without the
`localFlowEntry` conjunct we end up with the worst case. We'll need to
watch this particular join-ordering until we get better join-ordering
directives.
This constructs a few more tuples in Stage3::fwdFlow0, which are then
filtered in Stage3::fwdFlow. This is cleaner and appears faster.
Post-recursion we can filter the forward cons-candidates to only include
those that met a read step, and similarly restrict the reverse flow
cons-candidates to those that met a store step.
@aschackmull
Copy link
Contributor Author

Retrying Java: https://jenkins.internal.semmle.com/job/Changes/job/Java-Differences/998/
C++ and Python look good, but it looks like C# has hit a bad join-order in a standard order (at least that's the most likely conclusion given that the perf problem only hits one project).

@aschackmull
Copy link
Contributor Author

aschackmull commented Nov 17, 2020

@aschackmull
Copy link
Contributor Author

All jobs done and looking good.

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really awesome, thanks for taking on this task ✨

}

private module Stage2 {
class ApApprox = Stage1::Ap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is always PrevStage::Ap, so I would either call it class PrevStageAp = PrevStage::Ap or just use PrevStage::Ap directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ends up as class ApApprox = PrevStage::Ap; in a later commit (I tried minimising commits with changes that were obsoleted by later commits, but it was hard to avoid completely).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but what I meant was I am not a fan of the ApApprox name. I would rather have PrevStageAp or just PrevStage::Ap.


class Cc = boolean;

/* Begin: Stage 1 logic. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer a module instead of begin/end comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would just be a marker-module to be immediately imported, so I prefer the comments.

pragma[nomagic]
private predicate fwdFlowOutFromArg(
DataFlowCall call, Node out, boolean argAp, Ap ap, Configuration config
DataFlowCall call, Node node, Ap argAp, Ap ap, Configuration config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was wrong with calling it out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing, I guess. IIRC it was inconsistently named so I just chose one - do you prefer out over node here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer out because that is the name used by fwdFlowOut.

private Ap apCons(TypedContent tc, Ap tail) { result = true and exists(tc) and exists(tail) }

pragma[inline]
private Content getHeadContent(Ap ap) { exists(result) and ap = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you remove these predicates, and instead replace their single use with

ap = apCons(any(TypedContent tc | c = tc.getContent()), _)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stage2::apCons will have to be changed to

pragma[inline]
private Ap apCons(TypedContent tc, Ap tail) { result = true and exists(tc) and tail in [false, true] }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would correspond to a pop of an Ap just to get the head, so we'd introduce a needless fan-out, which we would then hope that the compiler could eliminate in all stages. Besides this risk I think it would be more confusing, so I don't think it's worth it just to eliminate one predicate-parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Comment on lines 590 to 592
not fullBarrier(node, config) and
not inBarrier(node, config) and
not outBarrier(node, config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. But we don't have a lot of use for in- and out-barriers, so they are certainly very corner-case. (We'd need something like a parameter that's simultaneously part of through-flow and a source node and an in-barrier). Arguably these barriers could have been checked a little earlier, but they aren't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fullBarrier check is done in fwdFlow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Good catch.

Comment on lines +617 to +620
// we don't expect a parameter to return stored in itself
not exists(int pos |
kind.(ParamUpdateReturnKind).getPosition() = pos and p.isParameterOf(_, pos)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Not related to your PR] I am not sure that this restriction makes sense anymore, now that flow summaries are content-sensitive. It is not unrealistic for a function to transfer data from one component of a parameter to another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's potentially annoying to deal with as this restriction actually cuts off significant amounts of nonsense flow IIRC.

Comment on lines 2133 to 2137
bindingset[node, ap]
private predicate filter(Node node, Ap ap) { any() }

bindingset[ap, contentType]
private predicate typecheckStore(Ap ap, DataFlowType contentType) { any() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add comments that these are trivial because filtering/type checking happens in previous stage.

Comment on lines +283 to +291
/** The unit type. */
private newtype TUnit = TMkUnit()

/** The trivial type with a single element. */
class Unit extends TUnit {
/** Gets a textual representation of this element. */
string toString() { result = "unit" }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in DataFlowImplCommon.qll instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't, because for Java the Unit type is in the common library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why that is a problem, as DataFlowImplCommon.qll is not exposed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that ought to work if no instance of import java leaks through. Currently there is such a leak (otherwise I couldn't refer to the Java version of Unit), so it wouldn't work without fixing that (but that should probably be separate from this PR as it could break things if someone has referenced a non-data-flow Java QL class through a DataFlow:: prefix). Another angle on this is that we should possibly push towards all the languages adopting a common Unit.qll, and then use that instead of introducing such a fundamental class in DataFlowImplCommon.qll - it has other uses. In summary, I think the definition should still come from DataFlowPrivate.qll, but it should be in the form of import semmle.code.Unit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently there is such a leak

:-(

@aschackmull aschackmull merged commit 931322e into github:main Nov 30, 2020
@aschackmull aschackmull deleted the dataflow/refactor-pruning branch November 30, 2020 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments