Opened 8 years ago

Last modified 3 years ago

#4398 reopened defect

comSubExp scales as O(N^2)

Reported by: Francesco Casella Owned by: Volker Waurich
Priority: high Milestone:
Component: Backend Version:
Keywords: Cc: Volker Waurich

Description

These are the times spent by the comSubExp (simulation) backend function when compiling the ScalableTestSuite.Thermal.Advection.ScaledExperiments.SimpleAdvection_N_XXX models, see https://test.openmodelica.org/libraries/ScalableTestSuite_Experimental/BuildModelRecursive.html

N time
3200 30
6400 213
12800 1444

The model is a very simple one

    model SimpleAdvection "Basic thermal advection model with uniform speed"
      parameter Integer N = 2 "Number of nodes";
      parameter Modelica.SIunits.Temperature Tstart[N]=ones(N)*300
        "Start value of the temperature distribution";
      parameter Modelica.SIunits.Length L = 10 "Pipe length";
      final parameter Modelica.SIunits.Length l = L/N "Length of one volume";
      Modelica.SIunits.Velocity u = 1 "Fluid speed";
      Modelica.SIunits.Temperature Tin = 300 "Inlet temperature";
      Modelica.SIunits.Temperature T[N] "Node temperatures";
      Modelica.SIunits.Temperature Ttilde[N-1](start = Tstart[2:N], each fixed = true)
        "Temperature states";
      Modelica.SIunits.Temperature Tout;
    equation
      for j in 1:N-1 loop
        der(Ttilde[j]) = u/l*(T[j]-T[j+1]);
      end for;
      T[1] = Tin;
      T[N] = Tout;
      Ttilde = T[2:N];
end SimpleAdvection;

I'm not sure what are the common subexpressions that OMC is looking for here, but even if there were any, spending half an hour to find them doesn't really make sense. Anyway, the complexity of the algorithm shouldn't be so bad.

The problem can be traced back to a commit on either Jan 16 or Jan 17, since from the history reports, it wasn't present with v1.12.0-dev.109+g6700d3e and it was present with v1.12.0-dev.116+g6f8bb14.

Change History (37)

comment:1 by Lennart Ochel, 8 years ago

Cc: Volker Waurich added

Volker, can you have a look?

comment:2 by Volker Waurich, 8 years ago

Yes, I know. The comSubExp-module is pretty old and I did not consider scaling during that time. One could also consider to use wrapFunctionCalls for it but it needs some enhancements. comSubExp is basically to find binary-cse instead of functionCalls.

comment:3 by Volker Waurich, 8 years ago

Owner: changed from Lennart Ochel to Volker Waurich
Status: newassigned

comment:4 by Francesco Casella, 8 years ago

I looked up the commits on Jan 16 and 17, but none of them refers explicitly to that module. Before those commits, comSubExp took absolutely reasonable amounts of time. What happened in those two days?

comment:5 by Henning Kiel, 8 years ago

Last edited 8 years ago by Henning Kiel (previous) (diff)

comment:6 by Francesco Casella, 8 years ago

@vwaurich, is there any chance you could try to fix this? @hkiel added a comment two months ago but then removed it.

Thanks!

comment:7 by Henning Kiel, 8 years ago

I did not check the scaling, but I improved that code in PR1719 at the cost of quite some RAM (0.8GB -> 6GB)

comment:8 by Henning Kiel, 8 years ago

I could improve RAM usage and speed further with PR1723

comment:9 by Francesco Casella, 8 years ago

I have just started a new build (# 182) of the ScalableTestSuite on Hudson. You can check the results here https://test.openmodelica.org/libraries/ScalableTestSuite_Experimental/BuildModelRecursive.html as soon as the task is finished.

in reply to:  description ; comment:10 by Francesco Casella, 8 years ago

Summary: comSubExp scales as O(N^3)comSubExp scales as O(N^2)

After the last improvements by @hkiel, Hudson reported the following performance:

N time
3200 1.08
6400 4.96
12800 17.71

The situation has very much improved, as the time now scales as O(N2). However, for significantly larger models it could still be a problem. I'm not sure whether O(N2) is the theoretical complexity of the underlying algorithm, or it is just some other non-optimal implementation issue.

I'm going to leave the ticket open with a modified summary. Feel free to have a further look at this issue.

in reply to:  10 ; comment:11 by Francesco Casella, 7 years ago

Update, the situation as of 21 Sep 2017 is

N time
3200 0.19
6400 0.32
12800 1.03
25600 3.34

The situation has improved further after @hkiel's interventions, but I still see bad scaling for larger sizes.

@vwaurich, I'm not sure how this is implemented, but if the idea is to collect an (ordered) list of expressions and check with each new one if it's already been encountered, then it should be O(N log(N)), not O(N2).

in reply to:  11 ; comment:12 by Martin Sjölund, 7 years ago

The bottleneck for SimpleAdvection is no longer comSubExp, but collectInitialBindings (initialization):
time /68.58, allocations: 59.69 MB / 5.811 GB, free: 13.04 MB / 0.8218 GB

N time
3200 0.34
6400 3.15
12800 22.02

comment:13 by Henning Kiel, 7 years ago

From a process trace I see that the cuplprit is in Initialization.collectInitialVarsEqnsSystem:
Initialization.collectInitialVars is called n times and each time it is calling BackendVariable.areAllCrefsInVarList which is O(n2)

in reply to:  12 comment:14 by Francesco Casella, 7 years ago

Replying to sjoelund.se:

The bottleneck for SimpleAdvection is no longer comSubExp, but collectInitialBindings (initialization):

This is also a known issue, see #4388. Please continue the discussion about it there

In any case, comSubExp still scales badly and it should be fixed, even if it is not the bottleneck for the specific models reported here.

comment:15 by Martin Sjölund, 7 years ago

So the problem is that comSubExp does things in 3 stages; where each stage is more complex than the previous ones. The last stage is getCSE3 "traverses the partitions and checks for CSE3 i.e a=b+c+const. ; d = b+c+const. --> a=d", which takes the lion's share of execution time and as far as I know it doesn't find any expression to eliminate.

Perhaps we should have a --cseLevel=[0|1|2|3] flag? Or we improve performance of resolveLoops_findLoops

comment:16 by Martin Sjölund, 7 years ago

I have a fix for one of the parts of findLoops, using sets instead of List.unique. The second part is more complicated code, but it looks like it can also be improved dramatically.

comment:17 by Francesco Casella, 7 years ago

If some phases cannot be brought down to O(N log(N)) please introduce a flag to disable them. It would be a shame not to be able to do any CSE at all on large models because of a tough phase that is hardly ever useful.

comment:18 by Martin Sjölund, 7 years ago

For ScalableTestSuite.Thermal.Advection.ScaledExperiments.SimpleAdvection_N_XXXX (not the model in the opening post, which was not affected):

comSubExp before my changes (the nightly build; optimized omc, should run faster):

N timememory
3200 0.58 197MB
6400 2.10 0.73GB
12800 8.53 2.77GB

comSubExp after c21c29d (my local copy without clang optimizations):

N timememory
3200 0.82 39MB
6400 3.51 80MB
12800 14.87 164MB

So at least linear in memory allocations now, but it is still still O(N2) in time

comment:19 by Martin Sjölund, 7 years ago

It is still ResolveLoops.getShortPathsBetweenEqCrossNodes that is the bottleneck. It takes as input a list of equations and it does:

  • for each equation, get its variables; for each of those variables, get its related equations and check if for any of these equations the list of variables is the same.

Maybe I can have a look tomorrow...

in reply to:  19 comment:20 by Francesco Casella, 7 years ago

Replying to sjoelund.se:

It is still ResolveLoops.getShortPathsBetweenEqCrossNodes that is the bottleneck. It takes as input a list of equations and it does:

  • for each equation, get its variables; for each of those variables, get its related equations and check if for any of these equations the list of variables is the same.

I'm not sure I understand why this should be useful to remove common subexpressions.

I thought that CSE would basically require to build a binary tree or hash table of sub-expressions as they are encountered while parsing the DAEs, replacing any instance of an already encountered expression with an auxiliary variable. For a model with N equations each containing at most K sub-expressions this would be O(N log(N)) if a binary tree is used, or O(N) if a hash table is used. Do I miss something?

Last edited 7 years ago by Francesco Casella (previous) (diff)

comment:21 by Martin Sjölund, 7 years ago

As far as I can tell, it is looking at the incidence matrix to do figure out which equations use the same variables to speed up subsequent checks.

And it would be O(N*K*(log(N)+log(K))) for a binary tree; the hashtable would have to be quite large and consume a lot of memory, and essentially be at least O(N*K).

in reply to:  18 ; comment:22 by Martin Sjölund, 7 years ago

So it seems to me like the algorithm is also a bit wrong.

First of all, it finds all possible places where there could be common subexpressions. BUT it only works if there is exactly one match (there is 3198 matches for N=3200...). Secondly, the found paths do not consider that der(x) and x are different (this is found at a later stage).

If I just add an extra flag to findLoops to stop when it finds a second loop... Performance is better, but overall comSubExp still doesn't scale very well:

N time
3200 0.15
6400 0.47
12800 1.80

comment:24 by Martin Sjölund, 7 years ago

Note that part of the complexity is that OpenModelica will re-order the internal sub-expressions. Like we have a = b + c + d and e = b + d +c and the algorithm does find it. So it's not just searching for subexpressions in a table.

in reply to:  22 comment:25 by Martin Sjölund, 7 years ago

PR1946 makes the comSubExp linear for the model in ScalableTestSuite (N=25600 causes stack overflow with my debugging flags) perform even better (for smaller N, it looks almost linear; there is probably somewhere that will blow up for very large N, but other modules are a bottleneck for debugging those as the turn-around time is huge):

N time
3200 0.08
6400 0.17
12800 0.53

comment:26 by Henning Kiel, 7 years ago

Stack overflow happens in SynchronousFeatures.partitionIndependentBlocksEq

[bt] #1 libOpenModelicaRuntimeC.dylib(mmc_setStacktraceMessages_threadData)
[bt] #2...1024 libOpenModelicaCompiler.dylib(SynchronousFeatures.partitionIndependentBlocksEq)
[bt] #1025 [...]

in reply to:  21 comment:27 by Francesco Casella, 7 years ago

Replying to sjoelund.se:

As far as I can tell, it is looking at the incidence matrix to do figure out which equations use the same variables to speed up subsequent checks.

Later phases may be sped up, but if this becomes the bottleneck it may not be a good idea.

And it would be O(N*K*(log(N)+log(K))) for a binary tree;

No matter how large K is, O(N*K*(log(N)+log(K))) is always better than O(N2) for sufficiently large N.

Last edited 7 years ago by Francesco Casella (previous) (diff)

comment:28 by Francesco Casella, 7 years ago

In fact, in many cases the single most effective use of CSE is to avoid re-computing functions multiple times with the same argument, e.g. sin(theta) or some heavyweight thermodynamic function in Modelica.Media.

I understand that the wrapFunctionCalls function already does this, so maybe one could just switch off CSE for very large models, if the time spent finding other common subexpressions outweights the time spent computing them multiple times.

comment:29 by Francesco Casella, 7 years ago

Milestone: 1.12.01.13.0
Resolution: fixed
Status: assignedclosed

I see that the improvement for N = 12800 has been about 50X compared to two days ago, and a lot more compared to the initial situation, so I guess we can now focus on other more important bottlenecks.

We may reopen this later on if there are some larger models where this is still problematic.

comment:30 by Volker Waurich, 7 years ago

Thanks a lot for the improvements.
Just for the records: The comsubExp implementation here, is not about declaring cse-variables for frequent function calls as in media. It is kind of a removeSimpleEquations which handles terms instead of single variables and I implemented it to get some models of ThermoSysPro running. E.g. for the equations:

centrifugalPump1.Wm = centrifugalPump1.Cm * centrifugalPump1.w;
centrifugalPump1.Wm = centrifugalPump1.Cr * centrifugalPump1.w;

comSubExp introduces:

centrifugalPump1.Cm = centrifugalPump1.Cr;

If this is not the case, we end up with an equation, which leads to a division by zero:

centrifugalPump1.Cm=DIVISION(centrifugalPump1.Wm, centrifugalPump1.w)

comment:31 by Francesco Casella, 7 years ago

OK. I think we should write a short chapter in the User's Guide explaining what the various optimizations in the back-end do, see #4596.

With respect to the optimization @vwaurich mentions, I think it could also be useful for our big grid models, because you may have connected components that compute quantities such as apparent power flow on each side, which are the same save for the sign. So, it would be good if this did not take too much time, though we have to see what the trade-off is between compilation and simulation times.

In the specific case mentioned in comment:30, I understand the spirit of what you are doing and I understand it can be useful to avoid singularities when centrifugalPump1.w approaches zero. On the other hand, it could be slighly questionable from a mathematical point of view: when centrifugalPump1.w is zero, centrifugalPump1.Cm is not necessarily equal to centrifugalPump1.Cr, so the transformation you are introducing is somewhat arbitrarily picking one solution.

I think this is a particular case of a more general question for Modelica models: what is their intended semantics when the system becomes singular but has infinitely many solutions? I guess picking one that avoid discontinuities is a good idea, but we should probably discuss this in the context of the Modelica Association.

comment:32 by Francesco Casella, 7 years ago

Resolution: fixed
Status: closedreopened

I am currently running a suite of scalable test cases of large power grid models. Unfortunately, the method still scales as O(N2) and becomes the main bottleneck in the back-end when the size goes above 1e5 equations, which is required for that class of problems.

As long as wrapFunctionCalls takes care of CSE involving function calls, I guess I can just disable this module for the time being.

@wvaurich, I would be grateful if you could comment on the expected complexity of your algorithm: is it really O(N2), or is it just because of an inefficient implementation?

comment:33 by Francesco Casella, 6 years ago

Milestone: 1.13.01.14.0

Rescheduled to 1.14.0 after 1.13.0 releasee

comment:34 by Francesco Casella, 5 years ago

Milestone: 1.14.01.16.0

Releasing 1.14.0 which is stable and has many improvements w.r.t. 1.13.2. This issue is rescheduled to 1.16.0

comment:35 by Francesco Casella, 4 years ago

Milestone: 1.16.01.17.0

Retargeted to 1.17.0 after 1.16.0 release

comment:36 by Francesco Casella, 4 years ago

Milestone: 1.17.01.18.0

Retargeted to 1.18.0 because of 1.17.0 timed release.

comment:37 by Francesco Casella, 3 years ago

Milestone: 1.18.0

Ticket retargeted after milestone closed

Note: See TracTickets for help on using tickets.