New and Improved MapFusion #1629

philip-paul-mueller · 2024-08-22T13:54:43Z

This PR introduces a new and improved version of MapFusion.

This PR fixes several bugs and lifts several limitations of the previous version.
This is a summary of all the changes:

The subsets (not the .subset member of the Memlet; I mean the concept) of the new intermediate data descriptor were not computed correctly in some cases, especially in the presence of offsets. See the test_offset_correction_range_read(), test_offset_correction_scalar_read() and test_offset_correction_empty() tests.
The propagation of the subsets, which is needed because of the changed intermediate, was not handled properly. Essentially, the transformation only updated .subset and ignored .other_subset, which is correct in most cases but not always. See test_fusion_intrinsic_memlet_direction() for more.
The check if an intermediate could be fully removed or had to be recreated as a new Map output was not done properly. For this, the entire SDFG has to be scanned to determine if the intermediate is used somewhere else. To speed this up, a cache was introduced that scans the SDFG once and then reuses this information. This is not a perfect solution, as there is no way to check if the cache has to be rebuilt. The use in auto_optimizer() is such that it takes advantage of it. See also the comment about the assume_always_shared flag.
During the check if two maps could be fused, the .dynamic property of the Memlets was fully ignored, leading to wrong code.
The read-write conflict checks were refined. Before, all arrays needed to be accessed the same way, i.e. a fusion was rejected when one map accessed A[i, j] and the other map accessed B[i + 1, j]. Now this is possible as long as every access is pointwise. See the test_fusion_different_global_accesses() test for an example.
The shape of the reduced intermediate is cleaned, i.e. unnecessary dimensions of size 1 are removed, unless they were present in the original shape. For example, if the intermediate array T had shape (10, 1, 20) and was accessed inside the map as T[__i, 0, __j], then the old transformation would have created a reduced intermediate of shape (1, 1, 1); now its shape is (1). Note that if the intermediate had shape (10, 20) instead and were accessed as T[__i, __j], then a Scalar would have been created. See also the strict_dataflow flag below.
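The shape cleaning described in the last point can be illustrated with a small, self-contained sketch. This is not the code of the transformation; `clean_reduced_shape` is a hypothetical helper that only reproduces the two examples given above, under the assumption that a dimension is kept when it is still larger than 1 after the reduction or when it already had size 1 in the original shape.

```python
from typing import Sequence, Tuple


def clean_reduced_shape(original_shape: Sequence[int],
                        reduced_shape: Sequence[int]) -> Tuple[int, ...]:
    """Hypothetical helper: drop size-1 dimensions created by the reduction.

    A dimension survives if it is still larger than 1 after the reduction,
    or if it already had size 1 in the original shape. An empty result
    stands for a Scalar.
    """
    if len(original_shape) != len(reduced_shape):
        raise ValueError("shapes must have the same number of dimensions")
    return tuple(red for orig, red in zip(original_shape, reduced_shape)
                 if red != 1 or orig == 1)


# T has shape (10, 1, 20) and is accessed as T[__i, 0, __j].
print(clean_reduced_shape((10, 1, 20), (1, 1, 1)))  # (1,)
# T has shape (10, 20) and is accessed as T[__i, __j]: a Scalar.
print(clean_reduced_shape((10, 20), (1, 1)))        # ()
```

In this reading, only the dimensions that collapsed because of the fusion are removed, while a unit dimension that the user declared is preserved.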

In addition some new flags were introduced:

only_toplevel_maps: If True the transformation will only fuse maps that are located at the top level, i.e. maps inside maps will not be merged.
only_inner_maps: If True then the transformation will only fuse maps that are inside other maps.
assume_always_shared: If True then the transformation will assume that every intermediate is shared, i.e. the referenced data is used somewhere else in the SDFG and has to become an output of the fused maps. This will create dead data flow, but avoids a scan of the full SDFG.
strict_dataflow: This flag is enabled by default. It has two effects. First, it disables the cleaning of reduced intermediate storage. The second effect is more important: it preserves a much stricter data flow. Most importantly, if the intermediate array is used downstream (this is not limited to the case that the array is the output of the second map), then the maps will not be fused together. This is mostly to work around some other bugs in DaCe, where other transformations failed to pick up the dependency. Note that the fused map would be correct; the problem lies in the other transformations.

Collection of known issues in other transformations:

Before it was going to look for the memlet of the consumer or producer. However, one should actually only look at the memlets that are adjacent to the scope node. At least this is how the original worked. I noticed this because of the `buffer_tiling_test.py::test_basic()` test. I was not yet focused on maps that were nested and not multidimensional. It seems that the transformation has some problems there.

When it now checks for covering (i.e. if the information to exchange is enough) it will no longer descend into the maps, but only inspect the first outgoing/incoming edges of the map entry and exit. I noticed that the other way was too restrictive, especially for map tiling.

Before, it was replacing the eliminated variables by zero, which actually worked pretty well, but I have now changed that such that `offset()` is used. I am not sure why I used `replace` in the first place, but I think that there was an issue. However, I am not sure.

…ured. Before, the output edges were set to dynamic. However, this was not true, as the output was always set; thus the new map fusion did not fuse them. My first attempt was to just disable the `dynamic` property, but now the SDFG is generated manually. It is almost the same, but uses fewer symbols, as it was simpler to implement it this way, and we are now using float.

Using `nodes()` on an SDFG will only give us the control flow regions, but using `state` will give us also the nested states. I looked through my code and this seems to be the only place where they appear. This fixes the correlation test, but the heat test still fails.

The issue was similar to the one before. When I computed the name of the intermediate transient, I used `sdfg.node_id(state)` to get the state ID. However, if the state is part of these recursive control flow regions, then this may not work, because the state is not a direct node of the SDFG. If I use `self.state_id` instead, then it works; this is what the old MapFusion was doing.

philip-paul-mueller · 2024-12-17T12:03:42Z

Thanks for reviewing and the wall of text.

To give you some context.
Initially I started with JaCe (JAX frontend for DaCe) and applied it to stencils from ICON4Py.
However, DaCe's auto_optimizer was unable to handle some of them: either the fusion was not performed, the resulting SDFG was invalid, or the computation was wrong.
I traced these problems down to MapFusion. At first I tried to fix the original implementation, but I had trouble understanding the code at all, so I started to rewrite the transformation.

The main issues I found (not limited to ICON4Py) were:

The subsets (not the .subset member of the Memlet; I mean the concept of where we write to it and from where we read) of the new intermediate array were not computed correctly.
The transformation did not distinguish between .subset and .other_subset of a Memlet and in most cases just accessed .subset, which might be wrong. In fact, my general impression is that a lot of code simply accesses .subset (which happens to be the right choice in most but not all cases) and does not care about the intrinsic direction of the Memlet.
The check if an intermediate can be removed or must be recreated afterwards was wrong. For this the whole SDFG has to be scanned; there is no way around it, but it was not done.
The .dynamic property of the Memlets was fully ignored.
As a side note, the check for WCR is on line 427
The code that propagates the change (removed intermediate) into the scope was wrong; again .subset was not handled correctly. (Although, I have to say that the current code should also be improved, but just a little.)
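The .subset versus .other_subset point can be made concrete with a toy model. The sketch below does not use DaCe; `ToyMemlet` and `subset_of` are hypothetical stand-ins, and the only assumption is the one stated above: .subset describes the container named by the Memlet, while .other_subset describes the container on the other end of the edge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMemlet:
    """Hypothetical stand-in for a Memlet; this is NOT the DaCe class."""
    data: str                            # container that `subset` refers to
    subset: str                          # subset of `data`
    other_subset: Optional[str] = None   # subset on the other end of the edge


def subset_of(memlet: ToyMemlet, container: str) -> Optional[str]:
    """Return the subset that describes `container` on this edge."""
    if memlet.data == container:
        return memlet.subset
    return memlet.other_subset


# Edge from intermediate T to B, where the Memlet happens to name B.
edge = ToyMemlet(data="B", subset="0:10", other_subset="5:15")
print(subset_of(edge, "T"))  # 5:15
print(edge.subset)           # 0:10, which is the subset of B, not of T
```

Code that always reads `.subset` would pick `0:10` for T here, which is exactly the kind of mistake described in the list above.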

I want to point out that this PR adds a lot of tests for MapFusion (approximately 40% of the edits) and the previous version is not able to pass them; roughly 1/3 of them fail.

Regarding the description, I agree that the doc string of the class is not that good. However, the code itself is, in my view, better documented than before. I have also updated the description of the transformation to give a better high-level overview, which points to the functions that perform the tests.

I do not know OTFMapFusion and SubgraphFusion very well, however, I have seen that SubgraphFusion is much more general, for example, instead of reducing the intermediate it will move the intermediate data access inside the Map.
The only capability I know SubgraphFusion has is that it is able to handle Maps that are parallel.
This is a capability that my MapFusion currently lacks (it was originally included in the PR, but removed afterwards).

I think the best way to see my MapFusion is not as something new but as a new iteration of what was already there; it just performs more analysis, which allows it to handle more cases than before. However, there are still some open todos.

I have to admit that I have not performed any testing of the runtime, but I do not have the impression that it takes much more time than before. The reason is that MapFusion is, besides two exceptions, a very local operation.
The first exception is the check if an intermediate can be removed or not. However, this information is more or less static, so the transformation computes this set at the beginning and then caches it. The downside is that it is hard to tell if the cache should be renewed. However, the cache remains valid as long as no AccessNodes are added. I checked that the use in auto_optimizer is fine. Further, to avoid this I added the assume_always_shared flag. This tells the transformation that every intermediate is shared. Thus no scan is ever needed; however, it will lead to dead dataflow.
The second exception is where we have to ensure that no cycles are created, however, this will only explore the dataflow graph locally (everything downstream).
Furthermore, when I wrote the thing I tried to order the checks in such a way that the ones that are either cheap or very likely to fail come first.
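The caching idea for the first exception can be sketched as follows. This is a simplified toy, not the implementation in the PR: the SDFG is modelled as a plain dictionary from state names to the data names of their AccessNodes, and a container is assumed to be shared when more than one AccessNode refers to it. The class name `SharedDataCache` and its members are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


class SharedDataCache:
    """Toy model of the 'scan once, then reuse' strategy."""

    def __init__(self, assume_always_shared: bool = False) -> None:
        self.assume_always_shared = assume_always_shared
        self._shared: Optional[Set[str]] = None
        self.scans = 0  # counts full scans, for illustration only

    def _scan(self, states: Dict[str, List[str]]) -> Set[str]:
        # Full scan: count how many AccessNodes refer to each container.
        self.scans += 1
        counts = Counter(name for nodes in states.values() for name in nodes)
        return {name for name, n in counts.items() if n > 1}

    def is_shared(self, name: str, states: Dict[str, List[str]]) -> bool:
        if self.assume_always_shared:
            return True  # no scan at all, at the price of dead dataflow
        if self._shared is None:
            # The cache is only filled once; it becomes stale as soon as
            # AccessNodes are added, which this toy does not detect.
            self._shared = self._scan(states)
        return name in self._shared


toy_sdfg = {"state0": ["A", "T", "B"], "state1": ["B", "C"]}
cache = SharedDataCache()
print(cache.is_shared("T", toy_sdfg))  # False
print(cache.is_shared("B", toy_sdfg))  # True
print(cache.scans)                     # 1
```

The trade-off is the one described above: the second query costs nothing, but nothing invalidates the cached set when the graph changes.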

phschaad

Thank you for addressing the questions and concerns. This LGTM now in general. Please update the PR description to be in line with the actual changes after revisions (ideally also with some of the details from the docstring of the transformation itself). After that I am happy to approve the PR.

The transformation checks if the first map satisfies the data dependencies of the second map. For this it looks at the writes and reads of the intermediate. If a data container is used as input of the first map and as output of the second map, it also checks that the access is pointwise and can be fused. Furthermore, it was allowed that the intermediate is also used as input to the first map. However, in that particular case, it was not checked whether the reads and writes of the first map alone to the intermediate are valid, i.e. it could read `A[i]` but write `A[i+1]`, which would cause problems (note that this usage is borderline legal anyway). This commit adds a check to make sure that this is not the case, by enforcing that if a data container is used as input and output of the first map and also as intermediate node, then its read must be pointwise. Note that if it is not an intermediate node, i.e. not also read by the second map, then this rule does not apply. NOTE: It is forbidden that the intermediate is used as intermediate and output of the second map.
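A minimal sketch of what "pointwise" means in this rule, assuming that accesses are given as tuples of index expressions written as plain strings. The helper `is_pointwise` is hypothetical and compares the strings literally, whereas a real check would have to compare symbolic subsets.

```python
from typing import Sequence, Tuple

Index = Tuple[str, ...]


def is_pointwise(reads: Sequence[Index], writes: Sequence[Index]) -> bool:
    """Hypothetical check: all reads and writes use one and the same index."""
    return len(set(reads) | set(writes)) <= 1


# Reads A[i] and writes A[i]: every iteration touches only its own element.
print(is_pointwise(reads=[("i",)], writes=[("i",)]))      # True
# Reads A[i] but writes A[i+1]: rejected by the rule described above.
print(is_pointwise(reads=[("i",)], writes=[("i+1",)]))    # False
```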

Started with a first version of the map fusion stuff.

aa433fe

philip-paul-mueller changed the title ~~Started with a first version of the map fusion stuff.~~ New and Improved MapFusion Aug 22, 2024

philip-paul-mueller marked this pull request as draft August 22, 2024 13:55

philip-paul-mueller added 27 commits August 23, 2024 08:32

Made some stylistic modifications to the code.

71a88a1

Now using the 3.9 type hints.

Added a function for estimating if something is pointwise.

bc87ddb

But it is too restrictive.

Now there is an error in the actual rewiring stuff.

497a2d6

Fixed a bug in the map fusion.

9e36447

When the function was fixing the interior of the second map, it did not remove the reading.

Made some formatting changes.

7a48e0d

Updated the tests of the map fusion.

d609045

It passes almost all functions. However, the ones that need renaming are not yet done.

WIP: Started with a renamer function.

52c4542

Continued with the parallel fusion stuff.

3b758bf

The fusion transformation now also checks if there is a write conflic…

377b428

…t in the input and output set. However, it is very simple.

Updated some tests.

db4864b

Fixed an error. I should refactor that damn loop.

f395acd

Some improvements to the tests.

b1ab95e

Removed some debugging stuff.

945ca8f

Fixed some typing stuff.

940b9b6

Started with a better implementation for the data dependency test.

ecae361

First version of the pointwise checker in the map fusion.

64d07fd

Updated some test cases.

33a0edf

The shared data cache can not be dumped.

ff018f4

Otherwise we can end up in recursion.

Buffer tiling now finally works.

9267ea9

The Mapreduce now also works.

fc2db8a

Added a test to the map fusion stuff that ensures that the shared blo…

4d9f11d

…ck is taken.

Added a test for the indirect accesses case.

2b91465

Updated the heat 3d test. It now ensures that the fusion is now done.

73f4415

Fixed an error in the parallel map fusion.

94ecd19

philip-paul-mueller added 9 commits December 13, 2024 09:44

WIP: Started with implementing Phil's suggestions.

d8da3c6

Made some modification, time to save.

e2285f0

Updated the map fusion test a little bit.

4832c3c

Added a new special case.

Added a test for the next generation of MapFusion.

fd3b48a

Added a new test.

8fa7cb2

Allowed consumer edges in MapFusion to be dynamic.

0a5aeaf

For such edges we are sure that the data exists, so it is just a conditional read, which is fine.

Changed the doc string to the Sphinx one.

e2bc10d

Fixed some missing test.

aa3619f

philip-paul-mueller force-pushed the new-map-fusion branch from 25f5a73 to aa3619f Compare December 16, 2024 13:03

philip-paul-mueller added 10 commits December 16, 2024 15:43

Updated the description of the transformation.

a740d16

Added a new test.

d07e2c5

Merge remote-tracking branch 'spcl/main' into new-map-fusion

fbc8469

Added a new test.

2b17111

This tests dynamic Memlets inside producers; the original transformation fails on it.

Merge remote-tracking branch 'spcl/main' into new-map-fusion

1e8f66f

Added a flag to MapFusion that allows to consider everything as shared.

3ab46d8

Updated how the memlet adjustment works, this should be a bit more li…

b1fc9d1

…beral.

Added a new test to check the memlet update.

fdc6424

philip-paul-mueller requested a review from phschaad December 17, 2024 12:03

Merge branch 'main' into new-map-fusion

e2c41b5

phschaad requested changes Jan 13, 2025

philip-paul-mueller requested a review from phschaad January 14, 2025 11:55

phschaad approved these changes Jan 15, 2025

philip-paul-mueller added 4 commits January 17, 2025 08:35

Centralized the map fusion call in the testing.

bcaed23

Added a test that ensures that no cycles would be created.

243611d

Added more tests to the map fusion and refined some others.

a3842f9
