Opened 8 years ago
Last modified 3 years ago
#4398 reopened defect
comSubExp scales as O(N^2)
Reported by: | Francesco Casella | Owned by: | Volker Waurich |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Backend | Version: | |
Keywords: | | Cc: | Volker Waurich |
Description
These are the times spent by the comSubExp (simulation) backend function when compiling the ScalableTestSuite.Thermal.Advection.ScaledExperiments.SimpleAdvection_N_XXX models, see https://test.openmodelica.org/libraries/ScalableTestSuite_Experimental/BuildModelRecursive.html
N | time [s] |
---|---|
3200 | 30 |
6400 | 213 |
12800 | 1444 |
The model is a very simple one:
```modelica
model SimpleAdvection "Basic thermal advection model with uniform speed"
  parameter Integer N = 2 "Number of nodes";
  parameter Modelica.SIunits.Temperature Tstart[N] = ones(N)*300 "Start value of the temperature distribution";
  parameter Modelica.SIunits.Length L = 10 "Pipe length";
  final parameter Modelica.SIunits.Length l = L/N "Length of one volume";
  Modelica.SIunits.Velocity u = 1 "Fluid speed";
  Modelica.SIunits.Temperature Tin = 300 "Inlet temperature";
  Modelica.SIunits.Temperature T[N] "Node temperatures";
  Modelica.SIunits.Temperature Ttilde[N-1](start = Tstart[2:N], each fixed = true) "Temperature states";
  Modelica.SIunits.Temperature Tout;
equation
  for j in 1:N-1 loop
    der(Ttilde[j]) = u/l*(T[j]-T[j+1]);
  end for;
  T[1] = Tin;
  T[N] = Tout;
  Ttilde = T[2:N];
end SimpleAdvection;
```
I'm not sure what common subexpressions OMC is looking for here, but even if there were any, spending half an hour to find them doesn't really make sense. Anyway, the complexity of the algorithm shouldn't be this bad.
The problem can be traced back to a commit on either Jan 16 or Jan 17, since, according to the history reports, it wasn't present in v1.12.0-dev.109+g6700d3e and it was present in v1.12.0-dev.116+g6f8bb14.
Change History (37)
comment:1 by , 8 years ago
Cc: | added |
---|
comment:2 by , 8 years ago
Yes, I know. The comSubExp module is pretty old and I did not consider scaling at the time. One could also consider using wrapFunctionCalls for it, but it needs some enhancements. comSubExp basically finds binary CSEs rather than function calls.
comment:3 by , 8 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 8 years ago
I looked up the commits on Jan 16 and 17, but none of them refers explicitly to that module. Before those commits, comSubExp took a perfectly reasonable amount of time. What happened in those two days?
comment:6 by , 7 years ago
@vwaurich, is there any chance you could try to fix this? @hkiel added a comment two months ago but then removed it.
Thanks!
comment:7 by , 7 years ago
I did not check the scaling, but I improved that code in PR1719, at the cost of quite some RAM (0.8 GB -> 6 GB).
comment:9 by , 7 years ago
I have just started a new build (# 182) of the ScalableTestSuite on Hudson. You can check the results here https://test.openmodelica.org/libraries/ScalableTestSuite_Experimental/BuildModelRecursive.html as soon as the task is finished.
follow-up: 11 comment:10 by , 7 years ago
Summary: | comSubExp scales as O(N^3) → comSubExp scales as O(N^2) |
---|
After the last improvements by @hkiel, Hudson reported the following performance:
N | time [s] |
---|---|
3200 | 1.08 |
6400 | 4.96 |
12800 | 17.71 |
The situation has very much improved, as the time now scales as O(N^2). However, for significantly larger models it could still be a problem. I'm not sure whether O(N^2) is the theoretical complexity of the underlying algorithm, or whether it is just some non-optimal implementation issue.
I'm going to leave the ticket open with a modified summary. Feel free to have a further look at this issue.
follow-up: 12 comment:11 by , 7 years ago
Update, the situation as of 21 Sep 2017 is
N | time [s] |
---|---|
3200 | 0.19 |
6400 | 0.32 |
12800 | 1.03 |
25600 | 3.34 |
The situation has improved further after @hkiel's interventions, but I still see bad scaling for larger sizes.
@vwaurich, I'm not sure how this is implemented, but if the idea is to collect an (ordered) list of expressions and check for each new one whether it has already been encountered, then it should be O(N log(N)), not O(N^2).
follow-up: 14 comment:12 by , 7 years ago
The bottleneck for SimpleAdvection is no longer comSubExp, but collectInitialBindings (initialization):
time /68.58, allocations: 59.69 MB / 5.811 GB, free: 13.04 MB / 0.8218 GB
N | time [s] |
---|---|
3200 | 0.34 |
6400 | 3.15 |
12800 | 22.02 |
comment:13 by , 7 years ago
From a process trace I see that the culprit is in Initialization.collectInitialVarsEqnsSystem: Initialization.collectInitialVars is called n times, and each time it calls BackendVariable.areAllCrefsInVarList, which is O(n^2).
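For illustration only, a minimal Python sketch of the pattern described above, with made-up names (the actual MetaModelica code will differ): n calls that each scan an n-element list add up to O(n^2), while building a lookup set once makes the same check near-linear.

```python
# Illustration only (Python, made-up names), not the actual MetaModelica code.
# Quadratic pattern: every call scans the whole variable list.
def are_all_crefs_in_var_list(crefs, var_list):
    return all(any(c == v for v in var_list) for c in crefs)   # O(len(var_list)) per cref

def collect_initial_vars_quadratic(eq_crefs, var_list):
    # n equations, each doing an O(n) scan  ->  O(n^2) overall
    return [crefs for crefs in eq_crefs
            if are_all_crefs_in_var_list(crefs, var_list)]

# Near-linear alternative: build the lookup set once, then each check is O(#crefs).
def collect_initial_vars_linear(eq_crefs, var_list):
    var_set = set(var_list)                                     # built once, O(n)
    return [crefs for crefs in eq_crefs
            if all(c in var_set for c in crefs)]
```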
comment:14 by , 7 years ago
Replying to sjoelund.se:
The bottleneck for SimpleAdvection is no longer comSubExp, but collectInitialBindings (initialization):
This is also a known issue, see #4388. Please continue the discussion about it there.
In any case, comSubExp still scales badly and it should be fixed, even if it is not the bottleneck for the specific models reported here.
comment:15 by , 7 years ago
So the problem is that comSubExp does things in 3 stages, where each stage is more complex than the previous one. The last stage is getCSE3 ("traverses the partitions and checks for CSE3, i.e. a=b+c+const.; d=b+c+const. --> a=d"), which takes the lion's share of the execution time and, as far as I know, doesn't find any expression to eliminate here.
Perhaps we should have a --cseLevel=[0|1|2|3] flag? Or we improve the performance of resolveLoops_findLoops.
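For illustration: if the only goal of this stage were to detect a=b+c+const.; d=b+c+const. pairs, that could in principle be done in near-linear time by hashing a canonical form of each right-hand side. A minimal Python sketch under that assumption, with made-up names, not the getCSE3 code:

```python
# Illustration only: detecting "a = RHS; d = RHS" aliases by hashing a canonical RHS key.
def find_rhs_aliases(equations):
    """equations: list of (lhs, rhs_key), where rhs_key is a canonical,
    hashable representation of the right-hand side."""
    seen = {}       # rhs_key -> first lhs with this right-hand side
    aliases = []    # (earlier_lhs, later_lhs) pairs, i.e. "a = d" candidates
    for lhs, rhs_key in equations:
        if rhs_key in seen:
            aliases.append((seen[rhs_key], lhs))
        else:
            seen[rhs_key] = lhs
    return aliases

# a = b + c + 1; d = b + c + 1  -->  a = d
print(find_rhs_aliases([("a", ("+", "b", "c", 1)),
                        ("d", ("+", "b", "c", 1))]))            # [('a', 'd')]
```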
comment:16 by , 7 years ago
I have a fix for one of the parts of findLoops, using sets instead of List.unique. The second part is more complicated code, but it looks like it can also be improved dramatically.
comment:17 by , 7 years ago
If some phases cannot be brought down to O(N log(N)) please introduce a flag to disable them. It would be a shame not to be able to do any CSE at all on large models because of a tough phase that is hardly ever useful.
follow-up: 22 comment:18 by , 7 years ago
For ScalableTestSuite.Thermal.Advection.ScaledExperiments.SimpleAdvection_N_XXXX (not the model in the opening post, which was not affected):
comSubExp before my changes (the nightly build; optimized omc, should run faster):
N | time [s] | memory |
---|---|---|
3200 | 0.58 | 197MB |
6400 | 2.10 | 0.73GB |
12800 | 8.53 | 2.77GB |
comSubExp after c21c29d (my local copy without clang optimizations):
N | time [s] | memory |
---|---|---|
3200 | 0.82 | 39MB |
6400 | 3.51 | 80MB |
12800 | 14.87 | 164MB |
So memory allocation is at least linear now, but it is still O(N^2) in time.
follow-up: 20 comment:19 by , 7 years ago
It is still ResolveLoops.getShortPathsBetweenEqCrossNodes that is the bottleneck. It takes as input a list of equations and it does:
- for each equation, get its variables; for each of those variables, get its related equations and check whether, for any of these equations, the list of variables is the same.
Maybe I can have a look tomorrow...
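For illustration: the pairwise comparison described above could, at least in principle, be replaced by grouping equations on a hashable key built from their variable sets, which is close to linear in the size of the incidence matrix. A small Python sketch with made-up names, not the ResolveLoops code:

```python
# Illustration only: group equations by their variable set instead of comparing
# each equation with every equation reachable through its variables.
from collections import defaultdict

def equations_with_same_vars(incidence):
    """incidence: dict mapping equation index -> iterable of variable indices."""
    groups = defaultdict(list)
    for eq, variables in incidence.items():
        groups[frozenset(variables)].append(eq)   # one hash + insert per equation
    return [eqs for eqs in groups.values() if len(eqs) > 1]

# Equations 1 and 3 use exactly the same variables {2, 5}:
print(equations_with_same_vars({1: [2, 5], 2: [5, 7], 3: [5, 2]}))   # [[1, 3]]
```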
comment:20 by , 7 years ago
Replying to sjoelund.se:
It is still ResolveLoops.getShortPathsBetweenEqCrossNodes that is the bottleneck. It takes as input a list of equations and it does:
- for each equation, get its variables; for each of those variables, get its related equations and check if for any of these equations the list of variables is the same.
I'm not sure I understand why this should be useful for removing common subexpressions.
I thought that CSE would basically require building a binary tree or hash table of sub-expressions as they are encountered while traversing the DAEs, replacing any instance of an already encountered expression with an auxiliary variable. For a model with N equations, each containing at most K sub-expressions, this would be O(N log(N)) if a binary tree is used, or O(N) if a hash table is used. Am I missing something?
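For illustration, a minimal Python sketch of the scheme described in this comment: one traversal that hash-counts sub-expressions and replaces repeated ones with auxiliary variables. Names and data structures are made up and have nothing to do with OMC's internals.

```python
# Illustration only: bottom-up hashing of sub-expressions in one traversal.
# Expressions are nested tuples like ("+", "b", "c"); leaves are names/constants.
from collections import Counter

def subtrees(expr):
    """Yield every compound sub-expression of a nested-tuple expression."""
    if isinstance(expr, tuple):
        yield expr
        for arg in expr[1:]:
            yield from subtrees(arg)

def cse_scan(expressions):
    # Pass 1: count structurally equal sub-expressions, O(total expression size).
    counts = Counter(s for e in expressions for s in subtrees(e))
    aux, bindings = {}, []
    # Pass 2: replace every repeated sub-expression with an auxiliary variable.
    def rewrite(expr):
        if not isinstance(expr, tuple):
            return expr
        if counts[expr] > 1:
            if expr not in aux:
                aux[expr] = "$cse%d" % len(aux)
                bindings.append((aux[expr], (expr[0],) + tuple(rewrite(a) for a in expr[1:])))
            return aux[expr]
        return (expr[0],) + tuple(rewrite(a) for a in expr[1:])
    return [rewrite(e) for e in expressions], bindings

# sin(theta) occurs twice, so it is named once and reused:
exprs = [("*", ("sin", "theta"), "a"), ("+", ("sin", "theta"), "b")]
print(cse_scan(exprs))
# ([('*', '$cse0', 'a'), ('+', '$cse0', 'b')], [('$cse0', ('sin', 'theta'))])
```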
follow-up: 27 comment:21 by , 7 years ago
As far as I can tell, it is looking at the incidence matrix to figure out which equations use the same variables, in order to speed up subsequent checks.
And it would be O(N*K*(log(N)+log(K))) for a binary tree; the hash table would have to be quite large and consume a lot of memory, and would essentially be at least O(N*K).
follow-up: 25 comment:22 by , 7 years ago
So it seems to me like the algorithm is also a bit wrong.
First of all, it finds all possible places where there could be common subexpressions, BUT it only works if there is exactly one match (there are 3198 matches for N=3200...). Secondly, the found paths do not consider that der(x) and x are different (this is found at a later stage).
If I just add an extra flag to findLoops to stop when it finds a second loop, performance is better, but overall comSubExp still doesn't scale very well:
N | time [s] |
---|---|
3200 | 0.15 |
6400 | 0.47 |
12800 | 1.80 |
comment:24 by , 7 years ago
Note that part of the complexity is that OpenModelica will re-order the internal sub-expressions. For example, we may have a = b + c + d and e = b + d + c, and the algorithm does find it. So it's not just searching for subexpressions in a table.
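For illustration: one way to keep a table-based lookup despite such reordering is to sort the operands of commutative operators before hashing, so that b + c + d and b + d + c map to the same key. A small, purely hypothetical Python sketch:

```python
# Illustration only: canonicalize commutative operators by sorting their operands,
# so that reordered sums/products produce the same lookup key.
COMMUTATIVE = {"+", "*"}

def canonical(expr):
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)             # any stable, structure-based order will do
    return (op, *args)

print(canonical(("+", "b", "c", "d")) == canonical(("+", "b", "d", "c")))   # True
```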
comment:25 by , 7 years ago
PR1946 makes comSubExp for the model in the ScalableTestSuite perform even better (N=25600 causes a stack overflow with my debugging flags). For smaller N it looks almost linear; there is probably something that will blow up for very large N, but other modules are a bottleneck for debugging those, as the turn-around time is huge:
N | time [s] |
---|---|
3200 | 0.08 |
6400 | 0.17 |
12800 | 0.53 |
comment:26 by , 7 years ago
Stack overflow happens in SynchronousFeatures.partitionIndependentBlocksEq
[bt] #1 libOpenModelicaRuntimeC.dylib(mmc_setStacktraceMessages_threadData)
[bt] #2...1024 libOpenModelicaCompiler.dylib(SynchronousFeatures.partitionIndependentBlocksEq)
[bt] #1025 [...]
comment:27 by , 7 years ago
Replying to sjoelund.se:
As far as I can tell, it is looking at the incidence matrix to figure out which equations use the same variables, in order to speed up subsequent checks.
Later phases may be sped up, but if this becomes the bottleneck it may not be a good idea.
And it would be O(N*K*(log(N)+log(K))) for a binary tree;
No matter how large K is, O(N*K*(log(N)+log(K))) is always better than O(N^2) for sufficiently large N.
comment:28 by , 7 years ago
In fact, in many cases the single most effective use of CSE is to avoid re-computing functions multiple times with the same argument, e.g. sin(theta) or some heavyweight thermodynamic function in Modelica.Media.
I understand that the wrapFunctionCalls function already does this, so maybe one could just switch off CSE for very large models, if the time spent finding other common subexpressions outweighs the time spent computing them multiple times.
comment:29 by , 7 years ago
Milestone: | 1.12.0 → 1.13.0 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
I see that the improvement for N = 12800 has been about 50X compared to two days ago, and a lot more compared to the initial situation, so I guess we can now focus on other more important bottlenecks.
We may reopen this later on if there are some larger models where this is still problematic.
comment:30 by , 7 years ago
Thanks a lot for the improvements.
Just for the record: the comSubExp implementation here is not about declaring CSE variables for frequent function calls as in Media. It is a kind of removeSimpleEquations that handles terms instead of single variables, and I implemented it to get some models of ThermoSysPro running. E.g., for the equations:
centrifugalPump1.Wm = centrifugalPump1.Cm * centrifugalPump1.w;
centrifugalPump1.Wm = centrifugalPump1.Cr * centrifugalPump1.w;
comSubExp introduces:
centrifugalPump1.Cm = centrifugalPump1.Cr;
If this is not done, we end up with an equation which leads to a division by zero:
centrifugalPump1.Cm=DIVISION(centrifugalPump1.Wm, centrifugalPump1.w)
comment:31 by , 7 years ago
OK. I think we should write a short chapter in the User's Guide explaining what the various optimizations in the back-end do, see #4596.
With respect to the optimization @vwaurich mentions, I think it could also be useful for our big grid models, because you may have connected components that compute quantities such as apparent power flow on each side, which are the same save for the sign. So, it would be good if this did not take too much time, though we have to see what the trade-off is between compilation and simulation times.
In the specific case mentioned in comment:30, I understand the spirit of what you are doing and I understand it can be useful to avoid singularities when centrifugalPump1.w approaches zero. On the other hand, it could be slightly questionable from a mathematical point of view: when centrifugalPump1.w is zero, centrifugalPump1.Cm is not necessarily equal to centrifugalPump1.Cr, so the transformation you are introducing somewhat arbitrarily picks one solution.
I think this is a particular case of a more general question for Modelica models: what is their intended semantics when the system becomes singular but has infinitely many solutions? I guess picking the one that avoids discontinuities is a good idea, but we should probably discuss this in the context of the Modelica Association.
comment:32 by , 6 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I am currently running a suite of scalable test cases of large power grid models. Unfortunately, the method still scales as O(N^2) and becomes the main bottleneck in the back-end when the size goes above 1e5 equations, which is required for that class of problems.
As long as wrapFunctionCalls takes care of CSE involving function calls, I guess I can just disable this module for the time being.
@vwaurich, I would be grateful if you could comment on the expected complexity of your algorithm: is it really O(N^2), or is it just because of an inefficient implementation?
comment:34 by , 5 years ago
Milestone: | 1.14.0 → 1.16.0 |
---|
Releasing 1.14.0, which is stable and has many improvements w.r.t. 1.13.2. This issue is rescheduled to 1.16.0.
comment:36 by , 4 years ago
Milestone: | 1.17.0 → 1.18.0 |
---|
Retargeted to 1.18.0 because of 1.17.0 timed release.
Volker, can you have a look?