Opened 9 years ago

Last modified 3 years ago

#3781 assigned defect

Fix scaling of detectJacobianSparsePattern

Reported by: sjoelund.se
Owned by: Karim.Abdelhak
Priority: high
Milestone: (none)
Component: Backend
Version: v1.10.0-dev-nightly
Keywords: (none)
Cc: AnHeuermann, lochel

Description

loadString("model M
  constant Integer n=10000;
  Real r[n](each start=0, each fixed=true);
equation
  for i in 1:n loop
    der(r[i]) = i*time;
  end for;
end M;");getErrorString();
translateModel(M);getErrorString();

n=1000

Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 0.1987/1.124, memory: 185.4 MB / 0.5414 GB
Total time: 1.576, memory: 0.7237 GB

n=2000

Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 0.7653/2.696, memory: 0.6772 GB / 1.388 GB
Total time: 3.595, memory: 1.816 GB

n=10000

Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 18.73/28.87, memory: 16.02 GB / 19.54 GB
Total time: 34.7, memory: 24 GB

detectJacobianSparsePattern seems to be O(N^2) in both time and memory, and it takes more than 50% of the translation time for larger models.
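
As a quick sanity check of the O(N^2) claim, the exponent can be estimated from the timings reported above (a minimal Python sketch; the numbers are simply the ones quoted in this description):

# Estimate the exponent k in t ~ n^k from the reported
# detectJacobianSparsePattern timings.
import math

runs = [(1000, 0.1987), (2000, 0.7653), (10000, 18.73)]  # (n, time in s)
for (n1, t1), (n2, t2) in zip(runs, runs[1:]):
    k = math.log(t2 / t1) / math.log(n2 / n1)
    print(f"n={n1} -> n={n2}: exponent ~ {k:.2f}")
# Both steps give an exponent close to 2, consistent with O(N^2) time.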

Change History (32)

comment:1 Changed 9 years ago by sjoelund.se

The slow part seems to be BackendDAEOptimize.collapseIndependentBlocks, which is bad for a couple of reasons:

  1. It should do almost no actual work
  2. The work it does should not be needed at all, since we would rather keep the sparse Jacobians separate (we know there is no dependency between the blocks); see the sketch after the timings below.
Notification: Performance of detectSparsePatternODE -> copy dae : time 0.1039/7.903, memory: 159.6 MB / 3.681 GB
Notification: Performance of detectSparsePatternODE -> collapse blocks : time 17.9/25.8, memory: 15.8 GB / 19.48 GB
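
To illustrate point 2 above, a minimal sketch (illustrative only, not OMC code): if the blocks are known to be independent, their per-block sparse patterns can simply be offset and concatenated into a block-diagonal global pattern, without first collapsing the blocks into one system.

# Combine independent per-block sparse patterns into one block-diagonal pattern.
def combine_block_patterns(block_patterns):
    """block_patterns: list of (n_states, pattern), where pattern is a list of
    (row, col) pairs local to that block. Returns the global pattern."""
    combined, offset = [], 0
    for n_states, pattern in block_patterns:
        combined.extend((r + offset, c + offset) for r, c in pattern)
        offset += n_states
    return combined

# Two independent 2-state blocks, each with a full local 2x2 pattern:
blocks = [(2, [(0, 0), (0, 1), (1, 0), (1, 1)])] * 2
print(combine_block_patterns(blocks))
# -> [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]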

comment:2 Changed 9 years ago by sjoelund.se

  • Owner changed from lochel to sjoelund.se
  • Status changed from new to accepted

comment:3 Changed 9 years ago by sjoelund.se

With my changes: https://github.com/OpenModelica/OMCompiler/pull/548

n=10000

Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 0.3142/8.106, memory: 229.1 MB / 3.748 GB
Notification: Performance of simCode: created modelInfo and variables: time 4.118/12.86, memory: 3.294 GB / 7.199 GB
Total time: 15.82, memory: 8.385 GB

n=80000

Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 2.192/72.66, memory: 1.789 GB / 29.92 GB
Notification: Performance of simCode: created modelInfo and variables: time 226.6/304.2, memory: 198.8 GB / 230 GB
Total time: 327.9, memory: 239.6 GB

Which means that SimCode is the next culprit.

comment:4 follow-ups: Changed 9 years ago by sjoelund.se

@casella: Should this stupid test case be added to the ScalableTestSuite as well? It's not a useful physical model, but it does push the compiler to its limits.

comment:5 Changed 9 years ago by sjoelund.se

  • Cc vwaurich added

The poor scaling in SimCode is from:

function getHighestDerivation"computes the highest derivative among all states. this includes derivatives of derivatives as well
author: waurich TUD 2015-05"
  input BackendDAE.BackendDAE inDAE;
  output Integer highestDerivation;

comment:6 Changed 9 years ago by sjoelund.se

Using dynamic programming for getHighestDerivation:

n=10000

Notification: Performance of simCode: created modelInfo and variables: time 0.2758/9.092, memory: 190.6 MB / 4.091 GB
Total time: 11.62, memory: 5.277 GB

n=70000 (the computer ran out of RAM at n=80000; there was a slight change in where memory was allocated)

Notification: Performance of simCode: created modelInfo and variables: time 7.335/77.9, memory: 1.358 GB / 28.65 GB
Total time: 91.42, memory: 37.01 GB

Most of the memory/time now comes from other parts of createModelInfo. getHighestDerivation now just iterates over all variables, finds out that they are not derivatives of other variables, and returns. Previously it tried all combinations and created list ranges 1:2, 1:3, 1:4, 1:5, ..., 1:10000, which was very expensive. Now n=70000 uses less memory for createModelInfo than n=10000 did before.
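
For illustration, a minimal sketch of the dynamic-programming idea (hypothetical representation: the real code works on BackendDAE variables, not on a dict): each variable's derivation level is computed once and reused, so the whole pass is linear in the number of states.

# Memoized "highest derivation" over a map from each state variable to the
# variable it is the derivative of (None if it is not a derivative).
def highest_derivation(der_of):
    depth = {}

    def level(v):
        if v in depth:                      # already computed: reuse it
            return depth[v]
        parent = der_of.get(v)
        depth[v] = 0 if parent is None else 1 + level(parent)
        return depth[v]

    return max((level(v) for v in der_of), default=0)

# Example: x is a plain state, v = der(x), a = der(v)  ->  highest level is 2
print(highest_derivation({"x": None, "v": "x", "a": "v"}))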

Note: Almost half of the total 37 GB allocated is actually in use, since n=80000 fails with a 20 GB virtual-memory ulimit. The following stages use a lot of memory:

FrontEnd: time 29.2/29.2, memory: 7.466 GB / 7.471 GB
preOpt comSubExp (simulation): time 0.8942/35.2, memory: 0.8501 GB / 9.901 GB
matching and sorting (n=70000): time 3.431/38.97, memory: 3.892 GB / 13.89 GB
prepare postOptimizeDAE: time 8.277/51.09, memory: 7.758 GB / 23.36 GB
postOpt detectJacobianSparsePattern (simulation): time 4.387/65.02, memory: 1.566 GB / 26.2 GB (still)
simCode: created modelInfo and variables: time 7.335/77.9, memory: 1.358 GB / 28.65 GB
simCode: created linear, non-linear and system jacobian parts: time 1.351/79.25, memory: 0.5233 GB / 29.17 GB
Templates: time 14.2/94.23, memory: 7.523 GB / 37.01 GB

comment:7 Changed 9 years ago by vwaurich

Thanks for finding this out. I will remove the call anyway; there is no need for this information anymore.

comment:8 in reply to: ↑ 4 Changed 9 years ago by casella

Replying to sjoelund.se:

@casella: Should this stupid test case be added to the ScalableTestSuite as well? It's not a useful physical model, but it does push the compiler to its limits.

In fact it is not stupid at all. I tried something similar when writing my Modelica conference paper, and it was useful for finding out interesting things (e.g., that a lot of memory was allocated for the dense Jacobians of f(x,t), which doesn't really work for N > 10000).

However, this model lacks one of the requirements of the library, i.e., that the models should be meaningful and representative of real-life problems - this one obviously isn't, as it doesn't make sense to simulate so many independent systems together.

The question is: are there meaningful models with (many) totally independent Jacobian blocks? For pure continuous-time systems, I would rather say no, because of the above-mentioned reason. If you have many independent systems, you'd better simulate them separately (maybe using parallel threads). The only case where this would make sense is if the dependency between the various sub-systems is given by discrete variables and when-equations, or when it is given by delay operators, such as in systems using TLM for coupling. Seems to me quite a corner case, but if you have some example in mind, I'd add it gladly.

For the SimCode part, you probably don't need the Jacobian to be block diagonal, but only very sparse, in order to get an improvement. For this purpose, you can check the performance on the SimpleAdvection models, which contain

      for j in 1:N-1 loop
        der(Ttilde[j]) = u/l*(T[j]-T[j+1]);
      end for;
      T[1] = Tin;
      T[N] = Tout;
      Ttilde = T[2:N];

In this case the Jacobian has nonzero elements only on the main diagonal and on the diagonal immediately below it.

comment:9 Changed 9 years ago by sjoelund.se

@casella: I just started with this stupid model to see where the compiler does not scale. That is its only purpose. detectJacobianSparsePattern was just the first module that did not scale.

comment:10 Changed 9 years ago by casella

I'd recommend using ScalableTestSuite.Thermal.Advection.ScaledExperiments.SimpleAdvection_N_XXX. That's a good starting point for compiler scaling analysis.

comment:11 Changed 9 years ago by wbraun

  • Cc wbraun added

comment:12 in reply to: ↑ 4 Changed 8 years ago by casella

Some further remarks on the topic of this ticket.

Replying to sjoelund.se:

@casella: Should this stupid test case be added to the ScalableTestSuite as well? It's not a useful physical model, but it does push the compiler to its limits.

This was done a while ago; see the CascadedFirstOrder_N_XXX test cases. They are slightly less stupid than the original test case of this ticket, because the Jacobian is not the zero matrix, but rather a matrix with an upper pseudo-diagonal of ones. It has a structure similar to SimpleAdvection_N_XXX, but with even less code. Either way, only O(N) nonzero elements.

Looking at the time taken by detectJacobianSparsePattern on CascadedFirstOrder_N_12800 (0.40 s) and CascadedFirstOrder_N_25600 (1.15 s), it seems that the situation has much improved, at least for systems with a very sparse Jacobian structure.

However, a recently added battery of test cases, DistributionSystemAC.ScaledExperiments.DistributionSystemLinear_N_XX_M_XX (sim), shows a case where the execution time of detectJacobianSparsePattern blows up nicely as O(N^3).

In this case, the reason is probably that the Jacobian of the system in ODE form is dense, with O(N^2) nonzero elements. Maybe we could try to reduce the computational effort to O(N^2 log(N)) in this case, but I guess the right way to go here is to not compute the ODE Jacobian at all, and use a direct implicit DAE solver instead, whose Jacobian only has O(N) nonzero entries. Besides requiring less work by the back-end, a direct sparse DAE solver will also be O(N) times faster during runtime.

Unfortunately, all the analyses of the causalized form are still performed when -daeMode is used, so a lot of time is wasted there. We could switch this module off for these tests using the custom annotation.

comment:13 Changed 8 years ago by sjoelund.se

  • Milestone changed from 1.10.0 to 1.11.0

Ticket retargeted after milestone closed

comment:14 Changed 8 years ago by sjoelund.se

  • Milestone changed from 1.11.0 to 1.12.0

Milestone moved to 1.12.0 due to 1.11.0 already being released.

comment:15 follow-up: Changed 7 years ago by casella

I've done some more analysis of yesterday's run of the ScalableTestSuite on Hudson.

I noticed something really weird. The execution times of detectJacobianSparsePattern on the BreakerNetwork_N_XXX model are the following:

N       Time [s]
 320        3.7
 640       27.5
1280        203

The time nicely scales as O(N^3), where N is proportional to the number of equations in the model. But the most interesting thing is that all these models have only one state variable, which belongs to a simple first-order linear system.

Most of the model is a large system of implicit algebraic equations and a lot of events. Everything is triggered by the output of the first order system, but the Jacobian of all these models is always the 1x1 matrix [-1].

Please investigate why so much time is spent there, and possibly update the documentation in case this function also does something more than "Detects the sparse pattern for jacobian in the causalized representation".

comment:16 in reply to: ↑ 15 Changed 7 years ago by casella

Replying to casella:

I noticed something really weird. The execution times of detectJacobianSparsePattern on the BreakerNetwork_N_XXX model are the following:

N       Time [s]
 320        3.7
 640       27.5
1280        203

The time nicely scales as O(N^3), where N is proportional to the number of equations in the model. But the most interesting thing is that all these models have only one state variable, which belongs to a simple first-order linear system.

Most of the model is a large system of implicit algebraic equations and a lot of events. Everything is triggered by the output of the first order system, but the Jacobian of all these models is always the 1x1 matrix [-1].

It seems that this problem is now solved. The execution time now scales linearly and is just 1.9 seconds for N = 1280.

The problem is still there with the DistributionSystemAC.ScaledExperiments.DistributionSystemLinear_N_XX_M_XX models.

comment:17 follow-up: Changed 7 years ago by hkiel

I fixed a bug in Graph.mo which re-created a constant option SOME({inNode}) in a for loop, wasting GBs of memory. The fix is in PR1743.

Please check if scaling is fine after my fix.

comment:18 in reply to: ↑ 17 Changed 7 years ago by casella

Replying to hkiel:

I fixed a bug in Graph.mo which re-created a constant option SOME({inNode}) in a for loop, wasting GBs of memory. The fix is in PR1743.

Please check if scaling is fine after my fix.

I noticed that the total time spent by the backend on the DistributionSystemLinear test cases has been drastically reduced with the last run of the Hudson job; compare the 19 July run with the 21 July run.

In particular, the largest figures went down from 1800+ seconds to about 300. This also reduced the total time to run the whole library testing from 7250 s to 4588 s, which is remarkable. In fact, if you compare the running times of the Hudson task, it is apparent that the improvement took place on 21 July between 10:09 and 21:01.

I am not sure where these huge savings (about 1500 s) come from; unfortunately, the detailed log is only available for the latest results, not for the historical ones.

However, the scaling of detectJacobianSparsePattern is still not satisfactory. If you check DistributionSystemLinearIndividual_N_XX_M_XX, on the latest run you can see:

N    M    Time [s]
28   28       7.5
40   40      37.7
56   56     260.1
Last edited 7 years ago by casella

comment:19 follow-up: Changed 7 years ago by hkiel

Most of the time was spent allocating 150 GB of memory (for the 56x56 model). That was significantly improved with PR 1743.

I guess the overall time savings are due to SteamPipe not being compiled. I assume that was due to the flags not being set correctly (daeMode); however, starting with the proper flags this morning failed completely: https://test.openmodelica.org/hudson/job/LibraryExperimental/187/

@casella Maybe you can restart the job properly?

comment:20 in reply to: ↑ 19 ; follow-up: Changed 7 years ago by casella

Replying to hkiel:

I guess the overall time savings are due to SteamPipe not being compiled. I assume that was due to the flags not being set correctly (daeMode)

You are right, sorry, I messed up.

@casella Maybe you can restart the job properly?

It also failed for me. I've asked @sjoelund.se to check why.

comment:21 in reply to: ↑ 20 Changed 7 years ago by hkiel

@casella Maybe you can restart the job properly?

It also failed for me. I've asked @sjoelund.se to check why.

https://svn.modelica.org had trouble with their SSL. I have now restarted the job and everything seems to be fine.

comment:22 Changed 7 years ago by casella

Summing up, comparing the 19 July run to the 23 July run, there has indeed been a huge reduction in the backend times on the DistributionSystemLinear models, which also reduced the overall time to run the testsuite from 7200 to 6800 seconds.

The complexity of detectJacobianSparsePattern is still O(N^3) on these models, which I think should not be the case, since there are only O(N) nonzero elements in the Jacobian; but at least the time taken was reduced by a factor of about 5, which is good anyway.

comment:23 Changed 7 years ago by hkiel

detectJacobianSparsePattern uses Graph.partialDistance2colorInt ("A greedy partial distance-2 coloring algorithm."), which has three nested for-loops, so I expect something like O(n^3) for this.
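
For reference, a minimal sketch of a greedy partial distance-2 coloring on a column-wise sparse pattern (this is not Graph.partialDistance2colorInt, just the general technique with illustrative names): two columns may share a color, i.e. be seeded together when evaluating the Jacobian, only if they have no row in common.

# Greedy partial distance-2 coloring of Jacobian columns.
# col_rows[c] is the set of row indices with a nonzero in column c.
def color_columns(col_rows, n_rows):
    row_cols = [[] for _ in range(n_rows)]      # row -> columns with a nonzero there
    for c, rows in enumerate(col_rows):
        for r in rows:
            row_cols[r].append(c)

    color = [None] * len(col_rows)
    for c, rows in enumerate(col_rows):         # greedy: pick the first admissible color
        forbidden = {color[c2] for r in rows for c2 in row_cols[r] if color[c2] is not None}
        k = 0
        while k in forbidden:
            k += 1
        color[c] = k
    return color

# Bidiagonal pattern (main diagonal + first subdiagonal), as in SimpleAdvection:
n = 6
pattern = [{c} | ({c + 1} if c + 1 < n else set()) for c in range(n)]
print(color_columns(pattern, n))   # -> [0, 1, 0, 1, 0, 1]: two colors suffice

In this form the work per column is bounded by the nonzeros in the rows it touches, so for patterns with bounded row degree the cost stays close to the number of nonzeros rather than growing as n^3 over the full variable set.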

comment:24 Changed 7 years ago by casella

I'm not much into the details of this, but if a system is sparse to begin with, is it really necessary to have three nested for-loops, each spanning the entire set of variables, to figure out its structure?

comment:25 follow-up: Changed 7 years ago by hkiel

Another question is why for the 28x28 and bigger systems the sparse solver is selected, so that the simulation finishes, whereas among the smaller systems only 10x10 can be solved in adequate time; 14x14 and 20x20 just fail (simulation run too long?).

comment:26 in reply to: ↑ 25 Changed 7 years ago by casella

Update: time spent by detectJacobianSparsePattern in DistributionSystemLinearIndividual_N_XX_M_XX as of yesterday's test run

N    M    Time [s]
28   28       7.4
40   40      26.0
56   56     209.3

Results have not changed significantly.

Replying to hkiel:

Another question is why for the 28x28 and bigger systems the sparse solver is selected, so that the simulation finishes, whereas among the smaller systems only 10x10 can be solved in adequate time; 14x14 and 20x20 just fail (simulation run too long?).

This depends on the default minimum size for selecting the sparse solver, which is currently 4001 (see here). From the error messages, it seems that the default dense linear solver has issues; see the simulation log.

I just opened ticket #4542 on this topic.

comment:27 Changed 7 years ago by casella

  • Milestone changed from 1.12.0 to Future

The milestone of this ticket has been reassigned to "Future".

If you think the issue is still valid and relevant for you, please select milestone 1.13.0 for back-end, code generation and run-time issues, or 2.0.0 for front-end issues.

If you are aware that the problem is no longer present, please select the milestone corresponding to the version of OMC you used to check that, and set the status to "worksforme".

In both cases, a short informative comment would be welcome.

comment:28 Changed 4 years ago by casella

The module is now called symbolicJacobian, but the problem is still there as of 13/05/2020; see e.g. the DistributionSystemLinearIndividual tests in the report.

What is interesting is that this scaling issue is not present when daeMode is used; see the report. Is that because daeMode currently computes the Jacobians numerically?

Last edited 4 years ago by casella

comment:29 Changed 4 years ago by casella

  • Cc AnHeuermann lochel added; casella vwaurich wbraun removed
  • Milestone changed from Future to 1.16.0
  • Owner changed from sjoelund.se to Karim.Abdelhak
  • Status changed from accepted to assigned

comment:30 Changed 4 years ago by casella

  • Milestone changed from 1.16.0 to 1.17.0

Retargeted to 1.17.0 after 1.16.0 release

comment:31 Changed 4 years ago by casella

  • Milestone changed from 1.17.0 to 1.18.0

Retargeted to 1.18.0 because of 1.17.0 timed release.

comment:32 Changed 3 years ago by casella

  • Milestone 1.18.0 deleted

Ticket retargeted after milestone closed
