Opened 9 years ago
Last modified 3 years ago
#3781 assigned defect
Fix scaling of detectJacobianSparsePattern
Reported by: | Martin Sjölund | Owned by: | Karim Abdelhak
---|---|---|---
Priority: | high | Milestone: |
Component: | Backend | Version: | v1.10.0-dev-nightly
Keywords: | | Cc: | Andreas Heuermann, Lennart Ochel
Description
loadString("
  model M
    constant Integer n = 10000;
    Real r[n](each start = 0, each fixed = true);
  equation
    for i in 1:n loop
      der(r[i]) = i*time;
    end for;
  end M;
"); getErrorString();
translateModel(M); getErrorString();
n=1000
Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 0.1987/1.124, memory: 185.4 MB / 0.5414 GB
Total time: 1.576, memory: 0.7237 GB
n=2000
Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 0.7653/2.696, memory: 0.6772 GB / 1.388 GB
Total time: 3.595, memory: 1.816 GB
n=10000
Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 18.73/28.87, memory: 16.02 GB / 19.54 GB
Total time: 34.7, memory: 24 GB
Time and memory in detectJacobianSparsePattern seem to scale as O(N²) (roughly 4× when n doubles, roughly 100× from n=1000 to n=10000), and the module takes more than 50% of the translation time for larger models.
Change History (32)
comment:1 by , 9 years ago
comment:2 by , 9 years ago
Owner: | changed from | to
---|---|
Status: | new → accepted |
comment:3 by , 9 years ago
With my changes: https://github.com/OpenModelica/OMCompiler/pull/548
n=10000
Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 0.3142/8.106, memory: 229.1 MB / 3.748 GB
Notification: Performance of simCode: created modelInfo and variables: time 4.118/12.86, memory: 3.294 GB / 7.199 GB
Total time: 15.82, memory: 8.385 GB
n=80000
Notification: Performance of postOpt detectJacobianSparsePattern (simulation): time 2.192/72.66, memory: 1.789 GB / 29.92 GB
Notification: Performance of simCode: created modelInfo and variables: time 226.6/304.2, memory: 198.8 GB / 230 GB
Total time: 327.9, memory: 239.6 GB
Which means that SimCode is the next culprit.
follow-ups: 8, 12
comment:4 by , 9 years ago
@casella: Should this stupid testcase be added to ScalableTestsuite as well? It's not a useful physical model, but it does push a compiler to its limits.
comment:5 by , 9 years ago
Cc: | added |
---|
The poor scaling in SimCode is from:
function getHighestDerivation
  "computes the highest derivative among all states.
   this includes derivatives of derivatives as well
   author: waurich TUD 2015-05"
  input BackendDAE.BackendDAE inDAE;
  output Integer highestDerivation;
comment:6 by , 9 years ago
Using dynamic programming for getHighestDerivation:
n=10000
Notification: Performance of simCode: created modelInfo and variables: time 0.2758/9.092, memory: 190.6 MB / 4.091 GB
Total time: 11.62, memory: 5.277 GB
n=70000 (computer ran out of RAM at n=80000; slight change where memory was allocated)
Notification: Performance of simCode: created modelInfo and variables: time 7.335/77.9, memory: 1.358 GB / 28.65 GB
Total time: 91.42, memory: 37.01 GB
Most of the memory/time now comes from other parts of createModelInfo. Basically, it now just iterates over all variables, finds out that they are not derivatives of other variables, and returns. Previously, it tried all combinations and created list ranges for 1:2, 1:3, 1:4, 1:5, …, 1:10000, which was very expensive. Now, n=70000 uses less memory than n=10000 did before for createModelInfo.
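To illustrate the dynamic-programming idea, here is a minimal sketch in plain Modelica (not the actual MetaModelica code; the derivativeOf encoding is an assumption made for illustration). Each variable's chain length is cached on the first walk, so the total work is linear in the number of variables instead of re-walking every chain:

function highestDerivationSketch
  "Sketch only: derivativeOf[i] is assumed to hold the index of the variable that
   variable i is the derivative of, or 0 if variable i is not a derivative"
  input Integer derivativeOf[:];
  output Integer highest;
protected
  Integer n = size(derivativeOf, 1);
  Integer depth[size(derivativeOf, 1)] = fill(-1, size(derivativeOf, 1)) "memoised chain lengths";
  Integer path[size(derivativeOf, 1)] "variables visited on the current walk";
  Integer i, k, d;
algorithm
  highest := 0;
  for v in 1:n loop
    // walk up the derivative chain until a root or an already-computed depth is found
    k := 0;
    i := v;
    while derivativeOf[i] <> 0 and depth[i] < 0 loop
      k := k + 1;
      path[k] := i;
      i := derivativeOf[i];
    end while;
    d := max(depth[i], 0);
    // unwind the walk, caching one additional derivation level per step
    for j in k:-1:1 loop
      d := d + 1;
      depth[path[j]] := d;
    end for;
    highest := max(highest, d);
  end for;
end highestDerivationSketch;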
Note: Almost half of the total 37 GB allocated is actually in use, since n=80000 fails with a 20 GB virtual memory ulimit. Of this, we have the following stages using a lot of memory:
FrontEnd: time 29.2/29.2, memory: 7.466 GB / 7.471 GB
preOpt comSubExp (simulation): time 0.8942/35.2, memory: 0.8501 GB / 9.901 GB
matching and sorting (n=70000): time 3.431/38.97, memory: 3.892 GB / 13.89 GB
prepare postOptimizeDAE: time 8.277/51.09, memory: 7.758 GB / 23.36 GB
postOpt detectJacobianSparsePattern (simulation): time 4.387/65.02, memory: 1.566 GB / 26.2 GB (still)
simCode: created modelInfo and variables: time 7.335/77.9, memory: 1.358 GB / 28.65 GB
simCode: created linear, non-linear and system jacobian parts: time 1.351/79.25, memory: 0.5233 GB / 29.17 GB
Templates: time 14.2/94.23, memory: 7.523 GB / 37.01 GB
comment:7 by , 9 years ago
Thanks for finding out. I will remove the call anyway. There is no need for this information anymore.
comment:8 by , 9 years ago
Replying to sjoelund.se:
@casella: Should this stupid testcase be added to ScalableTestsuite as well? It's not a useful physical model, but it does push a compiler to its limits.
In fact it is not stupid at all. I tried something similar when writing my Modelica conference paper, and it was useful for finding out interesting things (e.g., that a lot of memory was allocated for the dense Jacobians of f(x,t), which doesn't really work for N > 10000).
However, this model does not meet one of the requirements of the library, i.e., that the models should be meaningful and representative of real-life problems. This one obviously isn't, as it doesn't make sense to simulate so many independent systems together.
The question is: are there meaningful models with (many) totally independent Jacobian blocks? For pure continuous-time systems, I would rather say no, for the above-mentioned reason: if you have many independent systems, you'd better simulate them separately (maybe using parallel threads). The only case where this would make sense is if the dependency between the various sub-systems is given by discrete variables and when-equations, or by delay operators, such as in systems using TLM for coupling. That seems to me quite a corner case, but if you have some example in mind, I'd gladly add it.
For the SimCode part, you probably don't need the Jacobian to be block diagonal, but only very sparse, in order to get an improvement. For this purpose, you can check the performance on the SimpleAdvection models, which contain
for j in 1:N-1 loop
  der(Ttilde[j]) = u/l*(T[j]-T[j+1]);
end for;
T[1] = Tin;
T[N] = Tout;
Ttilde = T[2:N];
In this case the Jacobian has nonzero elements only on the main diagonal and on the diagonal immediately below it.
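For reference, a self-contained sketch with this structure (declarations and parameter values are placeholders chosen for illustration, not the actual ScalableTestSuite model); the resulting ODE Jacobian is bidiagonal, i.e. it has only O(N) structural nonzeros:

model SimpleAdvectionSketch
  "Sketch only: same Jacobian structure as the SimpleAdvection tests"
  parameter Integer N = 1000 "number of nodes";
  parameter Real u = 1 "fluid speed";
  parameter Real l = 10 "pipe length";
  Real Tin = 300 + 10*sin(time) "prescribed inlet temperature";
  Real Tout "outlet temperature";
  Real T[N] "node temperatures";
  Real Ttilde[N - 1](each start = 300, each fixed = true) "state variables";
equation
  for j in 1:N - 1 loop
    der(Ttilde[j]) = u/l*(T[j] - T[j + 1]);
  end for;
  T[1] = Tin;
  T[N] = Tout;
  Ttilde = T[2:N];
end SimpleAdvectionSketch;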
comment:9 by , 9 years ago
@casella: I just started with this stupid model to see where the compiler does not scale. That is its only purpose. detectJacobianSparsePattern was just the first module that did not scale.
comment:10 by , 9 years ago
I'd recommend using ScalableTestSuite.Thermal.Advection.ScaledExperiments.SimpleAdvection_N_XXX. That's a good starting point for compiler scaling analysis.
comment:11 by , 9 years ago
Cc: | added |
---|
comment:12 by , 8 years ago
Some further remarks on the topic of this ticket.
Replying to sjoelund.se:
@casella: Should this stupid testcase be added to ScalableTestsuite as well? It's not a useful physical model, but it does push a compiler to its limits.
This was done a while ago, see the CascadedFirstOrder_N_XXX test cases. They are slightly less stupid than the original test case of this ticket, because the Jacobian is not the zero matrix, but rather a matrix with an upper pseudo-diagonal of ones. It has a structure similar to SimpleAdvection_N_XXX, but with even less code. Anyway, there are only O(N) nonzero elements.
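A sketch of such a cascade (not the exact library model; the coefficients are placeholders) shows the structure: each state depends only on itself and its predecessor, so the Jacobian is bidiagonal with O(N) structural nonzeros:

model CascadedFirstOrderSketch
  "Sketch only: cascade of N first-order systems with a bidiagonal Jacobian"
  parameter Integer N = 1000 "number of cascaded first-order systems";
  parameter Real T = 1 "overall time constant";
  Real x[N](each start = 0, each fixed = true);
equation
  T/N*der(x[1]) = 1 - x[1];
  for i in 2:N loop
    T/N*der(x[i]) = x[i - 1] - x[i];
  end for;
end CascadedFirstOrderSketch;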
Looking at the time taken by detectJacobianSparsePattern on CascadedFirstOrder_N_12800 (0.40 s) and CascadedFirstOrder_N_25600 (1.15 s), it seems that the situation has much improved, at least for systems with a very sparse Jacobian structure.
However, a recently added battery of new test cases, DistributionSystemAC.ScaledExperiments.DistributionSystemLinear_N_XX_M_XX (sim), shows a case where the execution time of detectJacobianSparsePattern blows up nicely as O(N³).
In this case, the reason is probably that the Jacobian of the system in ODE form is dense, with O(N²) nonzero elements. Maybe we could try to reduce the computational effort to O(N² log(N)) in this case, but I guess the right way to go here is to not compute the ODE Jacobian at all and use a direct implicit DAE solver instead, whose Jacobian only has O(N) nonzero entries. Besides requiring less work from the back-end, a direct sparse DAE solver will also be O(N) times faster during runtime.
Unfortunately, all the analyses of the causalized form are still performed when -daeMode is used, so a lot of time is wasted there. We could switch this module off for these tests using the custom annotation.
comment:14 by , 8 years ago
Milestone: | 1.11.0 → 1.12.0 |
---|
Milestone moved to 1.12.0 due to 1.11.0 already being released.
follow-up: 16
comment:15 by , 7 years ago
I've done some more analysis of yesterday's run of the ScalableTestSuite on Hudson.
I noticed something really weird. The execution times of detectJacobianSparsePattern on the BreakerNetwork_N_XXX models are the following:
N | Time [s]
---|---
320 | 3.7
640 | 27.5
1280 | 203
The time nicely scales as O(N³), where N is proportional to the number of equations in the model. But the most interesting thing is that all these models have only one state variable, which belongs to a simple first-order linear system.
Most of the model is a large system of implicit algebraic equations and a lot of events. Everything is triggered by the output of the first order system, but the Jacobian of all these models is always the 1x1 matrix [-1].
Please investigate why so much time is spent there, and possibly update the documentation in case this function also does something more than "Detects the sparse pattern for jacobian in the causalized representation".
comment:16 by , 7 years ago
Replying to casella:
I noticed something really weird. The execution times of detectJacobianSparsePattern on the BreakerNetwork_N_XXX models are the following:
N | Time [s]
---|---
320 | 3.7
640 | 27.5
1280 | 203
The time nicely scales as O(N³), where N is proportional to the number of equations in the model. But the most interesting thing is that all these models have only one state variable, which belongs to a simple first-order linear system.
Most of the model is a large system of implicit algebraic equations and a lot of events. Everything is triggered by the output of the first order system, but the Jacobian of all these models is always the 1x1 matrix [-1].
It seems that this problem is now solved. The execution time now scales linearly and is just 1.9 seconds for N = 1280.
The problem is still there with the DistributionSystemAC.ScaledExperiments.DistributionSystemLinear_N_XX_M_XX models.
follow-up: 18
comment:17 by , 7 years ago
I fixed a bug in Graph.mo which re-created a constant option SOME({inNode}) in a for loop, wasting GBs of memory. The fix is in PR1743.
Please check if scaling is fine after my fix.
comment:18 by , 7 years ago
Replying to hkiel:
I fixed a bug in Graph.mo which re-created a constant option SOME({inNode}) in a for loop, wasting GBs of memory. The fix is in PR1743.
Please check if scaling is fine after my fix.
I noticed that the total time spent by the backend on the DistributionSystemLinear test cases has been drastically reduced with the last run of the Hudson job; compare the 19 July run with the 21 July run.
In particular, the largest figures went down from 1800+ seconds to about 300. This also reduced the total time to run the whole library testing from 7250 s to 4588 s, which is remarkable. In fact, if you compare the running times of the Hudson task, it is apparent that the improvement took place on 21 July between 10:09 and 21:01.
I am not sure where these huge savings (about 1500 s) come from; unfortunately, the detailed log is only available for the latest results, not for the historical ones.
However, the scaling of detectJacobianSparsePattern is still not satisfactory. If you check DistributionSystemLinearIndividual_N_XX_M_XX in the latest run, you can see:
N | M | Time [s]
---|---|---
28 | 28 | 7.5
40 | 40 | 37.7
56 | 56 | 260.1
follow-up: 20
comment:19 by , 7 years ago
Most of the time was spent allocating 150 GB of memory (for the 56x56 model). That was significantly improved with PR 1743.
I guess the overall time savings are due to SteamPipe not being compiled. I assume that was due to the flags not being set correctly (daeMode); however, starting with the proper flags this morning failed completely... https://test.openmodelica.org/hudson/job/LibraryExperimental/187/
@casella Maybe you can restart the job properly?
follow-up: 21
comment:20 by , 7 years ago
Replying to hkiel:
I guess the overall time savings are due to SteamPipe not being compiled. I assume that was due to the flags not being set correctly (daeMode)
You are right, sorry, I messed up.
@casella Maybe you can restart the job properly?
It also failed for me. I've asked @sjoelund.se to check why.
comment:21 by , 7 years ago
@casella Maybe you can restart the job properly?
It also failed for me. I've asked @sjoelund.se to check why.
https://svn.modelica.org had trouble with their SSL. I restarted the job and everything seems to be fine now.
comment:22 by , 7 years ago
Summing up, comparing the 19 July run to the 23 July run, there has indeed been a huge reduction in the backend times on the DistributionSystemLinear models, which also reduced the overall time to run the testsuite from 7200 to 6800 seconds.
The complexity of detectJacobianSparsePattern is still O(N³) on these models, which I think should not be the case, since there are only O(N) elements in the Jacobian, but at least the time taken was reduced by a factor of about 5, which is good anyway.
comment:23 by , 7 years ago
detectJacobianSparsePattern is using Graph.partialDistance2colorInt ("A greedy partial distance-2 coloring algorithm."), which has three nested for-loops, so I expect something like O(n³) for this.
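For illustration (this is a sketch of the general technique, not the actual Graph.partialDistance2colorInt code), a greedy partial distance-2 coloring driven by a dense incidence pattern looks as follows; the three nested loops each range over the full pattern, which is where the O(n³) behaviour comes from:

function denseDistance2ColoringSketch
  "Sketch only: greedy partial distance-2 coloring of the columns of a dense pattern J"
  input Boolean J[:, :] "true where the Jacobian is structurally nonzero";
  output Integer color[size(J, 2)] "color assigned to each column";
protected
  Integer n = size(J, 2);
  Boolean forbidden[size(J, 2)];
  Integer c;
algorithm
  color := fill(0, n);
  for v in 1:n loop                      // loop 1: every column
    forbidden := fill(false, n);
    for i in 1:size(J, 1) loop           // loop 2: every row
      if J[i, v] then
        for w in 1:n loop                // loop 3: every other column sharing row i with v
          if w <> v and J[i, w] and color[w] > 0 then
            forbidden[color[w]] := true;
          end if;
        end for;
      end if;
    end for;
    c := 1;
    while forbidden[c] loop              // smallest color not used by a distance-2 neighbour
      c := c + 1;
    end while;
    color[v] := c;
  end for;
end denseDistance2ColoringSketch;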
comment:24 by , 7 years ago
I'm not much into the details of this, but if a system is sparse to begin with, is it really necessary to have three nested for-loops, each spanning the entire set of variables, to figure out its structure?
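For comparison, a sketch of the same greedy coloring driven by index lists instead of a dense pattern (the CRS/CCS-style storage used here is an assumption for illustration, not OMC's actual data structure); it only touches the structural nonzeros, so the work is roughly the number of nonzeros times the maximum row degree rather than n³:

function sparseDistance2ColoringSketch
  "Sketch only: the same greedy coloring on an index-list representation of the pattern"
  input Integer rowStart[:] "for each row, start offset into colIndex (size m+1)";
  input Integer colIndex[:] "column indices of the structural nonzeros, row by row";
  input Integer colStart[:] "for each column, start offset into rowIndex (size n+1)";
  input Integer rowIndex[:] "row indices of the structural nonzeros, column by column";
  output Integer color[size(colStart, 1) - 1];
protected
  Integer n = size(colStart, 1) - 1;
  Integer mark[size(colStart, 1) - 1] "mark[c] == v means color c is already taken near column v";
  Integer i, w, c;
algorithm
  color := fill(0, n);
  mark := fill(0, n);
  for v in 1:n loop
    for k in colStart[v]:colStart[v + 1] - 1 loop
      i := rowIndex[k];                  // a row incident to column v
      for kk in rowStart[i]:rowStart[i + 1] - 1 loop
        w := colIndex[kk];               // a column sharing row i with v
        if w <> v and color[w] > 0 then
          mark[color[w]] := v;           // that color is forbidden for v
        end if;
      end for;
    end for;
    c := 1;
    while mark[c] == v loop              // smallest color not marked for v
      c := c + 1;
    end while;
    color[v] := c;
  end for;
end sparseDistance2ColoringSketch;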
follow-up: 26
comment:25 by , 7 years ago
Another question is why, for the 28x28 and bigger systems, the sparse solver is selected such that simulation finishes, whereas for the smaller systems only 10x10 can be solved in adequate time; 14x14 and 20x20 just fail (simulation run too long?).
comment:26 by , 7 years ago
Update: time spent by detectJacobianSparsePattern in DistributionSystemLinearIndividual_N_XX_M_XX as of yesterday's test run:
N | M | Time [s]
---|---|---
28 | 28 | 7.4
40 | 40 | 26.0
56 | 56 | 209.3
Results have not changed significantly.
Replying to hkiel:
Another question is why, for the 28x28 and bigger systems, the sparse solver is selected such that simulation finishes, whereas for the smaller systems only 10x10 can be solved in adequate time; 14x14 and 20x20 just fail (simulation run too long?).
This depends on the default minimum size for selecting the sparse solver, which is currently 4001 (see here). From the error messages, it seems that the default dense linear solver has issues; see the simulation log.
I just opened ticket #4542 on this topic.
comment:27 by , 7 years ago
Milestone: | 1.12.0 → Future |
---|
The milestone of this ticket has been reassigned to "Future".
If you think the issue is still valid and relevant for you, please select milestone 1.13.0 for back-end, code generation and run-time issues, or 2.0.0 for front-end issues.
If you are aware that the problem is no longer present, please select the milestone corresponding to the version of OMC you used to check that, and set the status to "worksforme".
In both cases, a short informative comment would be welcome.
comment:28 by , 5 years ago
The module is now called symbolicJacobian, but the problem is still there as of 13/05/2020; see e.g. the DistributionSystemLinearIndividual tests in the report.
What is interesting is that this scaling issue is not present when daeMode is used, see the report. Is that because daeMode currently computes the Jacobians numerically?
comment:29 by , 5 years ago
Cc: | added; removed |
---|---|
Milestone: | Future → 1.16.0 |
Owner: | changed from | to
Status: | accepted → assigned |
comment:31 by , 4 years ago
Milestone: | 1.17.0 → 1.18.0 |
---|
Retargeted to 1.18.0 because of 1.17.0 timed release.
The slow part seems to be BackendDAEOptimize.collapseIndependentBlocks, which is bad for a couple of reasons: