-
Notifications
You must be signed in to change notification settings - Fork 858
[NFC] Improve SimplifyLocals compile-time performance #8604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Changqing-JING
wants to merge
1
commit into
WebAssembly:main
Choose a base branch
from
Changqing-JING:opt/compile-speed-4
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+115
−21
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this tries to be faster by guessing that most interactions are local?
If that's the case, then I don't think that is true in general. We would need to measure on more codebases to check that. If it isn't generally true, then it could make us slower sometimes. It also adds chances for bugs to crop up here, as this code must remain in sync with
effects.h.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kripken thank you for review
Updated the PR: the fast path now classifies the current expression (not sinkables) using only existing
EffectAnalyzerfields — no new methods in effects.h, nothing to keep in sync. When the expression only touches locals, we use reverse indices (localReadBySinkable,localWrittenBySinkable) to look up exactly the conflicting sinkables in O(|locals touched|) instead of scanning all sinkables; otherwise we fall back to the exact original loop, so worst case is identical tomain. Benchmark still shows ~1.9x speedup.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, there is still something to keep in sync: the code
if (!effects.transfersControlFlow() && !effects.writesGlobalState() && !effects.readsMutableGlobalState() && !effects.danglingPop && !effects.trap && !effects.hasSynchronization() && !effects.mayNotReturnIf, say, we added some new type of effect there, we'd need to add it to this list of all possible effects. So this PR adds complexity and increases the risk of bugs, in my opinion.
If this is a huge speedup it might be worth those downsides. However, I still have my concern from before, which is that this is a speedup for the case where local interactions are the most common thing, but that is not generally true. There could be codebases where e.g. GC struct/array interactions are more common. That is, there is nothing fundamental about local interactions that justifies focusing on them, AFAICT - this pass happens to optimize locals, but that's a coincidence?
Taking a step back, if this pattern of local effects is very common, there may be a better way to optimize it. That they are local effects and nothing else suggests things like copies,
or trivial, effect-less operations
I would like to understand which of those (or something else) is very common in the benchmark you are looking at. Perhaps we can add a tailored optimization for it. E.g. if copies are the issue, then running coalesce-locals first (which can remove many copies) might be even faster than this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I instrumented
checkInvalidationson the benchmark. The 236MorderedAftercalls break down as:Local-only expressions dominate because every wasm function is full of local.get/set regardless of what higher-level ops it uses. The fast path reduces those 147M to 100K via reverse index (99.9% reduction).
Re sync concern: happy to move the check into
EffectAnalyzeras ahasOnlyLocalEffects()method so there's one place to maintain. Or open to other approaches.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you share the code you used to measure those figures? I'm curious to check this on a variety of other wasm files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I have pushed the instrumentation code in this pr.
To use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kripken
What's your opinion to continue with this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not getting back to this sooner.
I did some measurements now, because my intuition is still that this is going to depend on the codebase. I tried it on a large Java testcase:
(this is just your first commit, of course - not with the instrumentation)
In this code, there are fewer local interactions, I suppose, and more other things (likely GC load/store overlaps). So all the extra bookkeeping this PR adds just add overhead.
As the patch adds a lot of complexity and maintenance burden, I don't think it makes sense to land. But I hope we can find general ways of speeding this up, that work on all or at least most wasms. To do that, we need to have a good rationale, I think, a reason why a fast path would be common, and local operations don't have a good reason to be exceptional here (and are not in practice, as this data shows).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kripken
I changed to another concept
This keeps the optimization inside SimplifyLocals: it batches invalidating effects and only flushes when the full sinkable set must be exact. For common local.get/ local.set cases, it checks just that one local, reducing repeated full scans without adding effects.h coupling.
Perf on test3.wasm
Regression test on kotlin case
Main
This PR