#4071 closed discussion (fixed)
Performance of new front end
Reported by: | Francesco Casella | Owned by: | Per Östlund |
---|---|---|---|
Priority: | high | Milestone: | 2.0.0 |
Component: | New Instantiation | Version: | |
Keywords: | | Cc: | |
Description
I have tried to get some preliminary results on the performance of the new front-end with a simple test model, reported here:

```modelica
package P
  model M0
    Real x, y, z;
    Real a, b, c;
    Real d, e, f, g;
  equation
    x = b;
    y = 2*x + a;
    z - x = 0;
    a = 10;
    b - a = 15;
    c + a - 2 = 14;
    d = 40;
    e = 0;
    f = 50;
    c + d + e + x + a = 0;
  end M0;

  model M
    M0 m01;
    M0 m02;
    M0 m03;
    M0 m04;
    M0 m05;
  end M;

  model S
    M m1;
    M m2;
  end S;
end P;

model S
  P.M m1;
  P.M m2;
  ...
  P.M m10000;
end S;
```
Model M0 has 10 variables and equations. Model M contains 5 instances of model M0, so it has 50 variables and equations. Model S contains 10000 instances of M, so it has half a million variables and equations. This test case has no physical meaning, but its structure is representative of many large-scale models, which contain large numbers of instances of the same basic models.
I have run instantiateModel(S) with and without -d=newInst on my i5 4200 @ 1.6 GHz, using OpenModelica-v1.11.0-dev-127-g043a756-64bit on Windows 10.
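For reference, a minimal sketch of such a test script (.mos), assuming the models above are saved in a file TestNewInst.mo (the file name is my assumption):

```modelica
// Sketch of a timing script; omit setCommandLineOptions to use the old front end.
setCommandLineOptions("-d=newInst");
loadFile("TestNewInst.mo");
instantiateModel(S);
```

These are the results: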
 | Time [s] | Memory [MB] |
---|---|---|
oldInst | 120 | 4500 |
newInst | 25 | 2700 |
The good news is that the performance has improved significantly: about a factor 5 in terms of speed, and about a factor 2 in terms of memory, which is not bad.
On the other hand, I was hoping for a much more drastic improvement. The performance with newInst is around 20000 equations/s, which still seems quite slow to me, and about 5 kB/equation, which still seems an excessive amount of memory. I have tried a test case with half the instances in model S, and both time and memory are halved, so it makes sense to compute these kinds of average quantities.
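For reference, these averages follow directly from the table above:

$$\frac{500\,000~\text{equations}}{25~\text{s}} = 20\,000~\text{equations/s}, \qquad \frac{2700~\text{MB}}{500\,000~\text{equations}} \approx 5.4~\text{kB/equation}$$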
Note that the model doesn't have connections, doesn't use inheritance, and only uses basic Real types. At the end of the day, it is just re-instantiating the same model over and over again.
Q1: are there further optimizations possible, to exploit the repeated instances of the same class?
Q2: can you explain the memory usage of 5 kB for each variable/equation?
Attachments (1)
Change History (25)
by , 8 years ago
Attachment: | TestNewInst.zip added |
---|
comment:1 by , 8 years ago
Description: | modified (diff) |
---|
follow-up: 14 comment:2 by , 8 years ago
Actually, it takes ~20s to run your script for me with the new front end. Out of that, only 2.5s is taken up by the actual front end. The rest of the time is spent by the dumping of the flat model, which is known to be slow and a big memory hog. It's written in Susan, dumps everything to a buffer before printing, etc. It's not an issue when simulating models, since the dumper is never called then.
The compiler uses up about 500 MB of memory until the front end is finished and the dumper takes over, but since the compiler uses a garbage collector I don't know how much of that memory is actually used for something.
Also, note that the flat modelica dumper is used even when the -q flag is used to suppress the output. The only thing the -q flag does is disable the actual printing of the result from the dumper, which is stupid and something I will fix.
I removed some stuff from the main function that doesn't concern the front end, such as the mentioned dumper and some back end transformation stuff that's always done for some reason. These are the results (with an i7 5820K):
 | Time [s] | Memory [MB] |
---|---|---|
oldInst | 38 | 2700 |
newInst | 2.7 | 800 |
The numbers are a bit bigger than the 2.5s and 500MB I mentioned above, because those numbers are only for the front end and nothing else. Also note that I measured the memory consumption just by looking at the output from top, so the numbers are not very accurate.
As for your questions:
A1: Maybe, I know there are things that can be optimized, but there are also things missing. It's far too early to say anything about the final performance.
A2: Yes, badly implemented dumper, see above.
follow-up: 4 comment:3 by , 8 years ago
I've now made a pull request that will disable the dumping of the flat model when -q is used. This does not apply when using instantiateModel in a script though, only when flattening a model in a file directly with the compiler. There's also an overhead of about 2s from the back end transformation stuff which I don't dare touch.
follow-ups: 5 8 comment:4 by , 8 years ago
Thanks Per for the analysis! The results look a lot more encouraging to me now.
A1: Maybe, I know there are things that can be optimized, but there are also things missing. It's far too early to say anything about the final performance.
I fully agree that in general it doesn't make sense to talk about optimization before the implementation is complete. Premature optimization is the root of all evil.
That said, better performance is one of the reasons why we are moving to a new front end (the other, more important one being increased coverage), so if the new design had turned out not to perform substantially better than the old one, I would have been worried. In fact, I was misled by the dumper overhead. The figures you show, which mean a factor 14 in speed and a factor 3.4 in memory occupation, are much more reassuring from this point of view.
I've now made a pull request that will disable the dumping of the flat model when -q is used. This does not apply when using instantiateModel in a script though, only when flattening a model in a file directly with the compiler.
That means running
time omc -d=newInst -q S.mo
in Linux, right?
There's also an overhead of about 2s from the back end transformation stuff which I
don't dare touch.
Can anyone else comment on what the back-end is doing when just flattening the model, and possibly remove that from the code if not strictly necessary?
comment:5 by , 8 years ago
Replying to casella:
That means running
time omc -d=newInst -q S.mo
in Linux, right?
Yes. I had to add the models from your script to S.mo first of course.
follow-up: 7 comment:6 by , 8 years ago
Sure, of course.
If there is only one model as a top-level entity in S.mo (besides package P), that is the one that gets flattened, right? This is not very clear from the omc online help, which just says
omc Model.mo will produce flattened Model on standard output
Or I could make two files and do
time omc -d=newInst -q S.mo P.mo
comment:7 by , 8 years ago
Replying to casella:
If there is only one model as a top-level entity in S.mo (besides package P), that is the one that gets flattened, right?
The compiler will flatten the last top-level class in the file, regardless of what type of class it is. You can also use -i to specify the class you wish to flatten.
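For example, to flatten P.M from the file above, the invocation would look something like this (my assumption on the exact syntax of -i):
omc -i=P.M -d=newInst -q S.mo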
comment:8 by , 8 years ago
Replying to casella:
There's also an overhead of about 2s from the back end transformation stuff which I
don't dare touch.
Can anyone else comment on what the back-end is doing when just flattening the model, and possibly remove that from the code if not strictly necessary?
```
Notification: Performance of stateMachineToDataFlow: time 1.207/3.948, allocations: 267 MB / 1.644 GB, free: 91.58 MB / 0.8872 GB
Notification: Performance of FCore.getEvaluatedParams: time 2.13e-06/3.948, allocations: 1.797 kB / 1.644 GB, free: 91.57 MB / 0.8872 GB
Notification: Performance of makeEvaluatedParamFinal: time 0.07548/4.023, allocations: 30.52 MB / 1.673 GB, free: 77.53 MB / 0.8872 GB
```
I really dislike the state machine code taking up a lot of time even when doing nothing... PR:1134 should fix it.
follow-up: 10 comment:9 by , 8 years ago
@perost: could you please check if Martin's PR got rid of the unnecessary back-end overhead?
Thanks!
comment:10 by , 8 years ago
Replying to casella:
@perost: could you please check if Martin's PR got rid of the unnecessary back-end overhead?
Thanks!
It does.
comment:11 by , 8 years ago
I re-ran the tests and I got very interesting results. A model with one million equations takes about 100 s to flatten on my PC with the old front-end, and about 5 seconds with the new one. That's a factor 20 speed-up, which is indeed very good.
I was also able to flatten a model with ten million equations in one minute, using less than 10 GB of RAM. With some patience and enough memory, we could actually tackle 100 million equations.
I hope this kind of performance holds when other features (such as modifier and connection handling) are introduced.
comment:12 by , 8 years ago
Test cases to evaluate the performance of the new front-end with large models are available on GitHub here: https://github.com/casella/TestNewInst
comment:13 by , 8 years ago
Milestone: | 1.11.0 → 1.12.0 |
---|
Milestone moved to 1.12.0 due to 1.11.0 already being released.
comment:14 by , 8 years ago
Update: I ran some new tests that include types with unit attributes, parameter propagation, nested models, and connections, see https://github.com/casella/TestNewInst.
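For illustration, a minimal sketch (with hypothetical names, not the actual models in the repository) of the kind of constructs these tests exercise:

```modelica
package TestSketch
  type Voltage = Real(unit = "V");    // type with unit attribute

  connector Pin
    Voltage v;
    flow Real i(unit = "A");
  end Pin;

  model Component
    parameter Real R = 1 "propagated parameter";
    Pin p, n;
  equation
    p.v - n.v = R*p.i;
    p.i + n.i = 0;
  end Component;

  model Nested
    parameter Real R = 2;
    Component c1(R = R), c2(R = R);   // parameter propagation
  equation
    connect(c1.n, c2.p);              // connection equations
  end Nested;
end TestSketch;
```

The actual models in the repository replicate such components many times over to reach the equation counts in the table below.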
These are the results on our Xeon E5-2650 v3 @2.30 GHz server, running 64-bit Linux:
# raw equations | Time new FE [s] | Memory new FE [GB] | Time old FE [s] | Memory old FE [GB] |
---|---|---|---|---|
50,000 | 0.97 | n.d. | 10.4 | n.d. |
500,000 | 7.12 | 2.7 | 123.9 | 7.5 |
1,000,000 | 16.42 | 4.6 | 285.4 | 18.0 |
5,000,000 | 95.56 | 19.8 | n.d. | n.d. |
10,000,000 | 181.81 | 37.8 | n.d. | n.d. |
The performance of the new front end scales linearly both in terms of time and memory. With these test cases, the new front end is 10 to 18 times faster than the old one, the factor 18 holding for the larger models. That corresponds to about 50000 equations/s. Memory consumption is around 4 kB/equation, which could possibly be further improved but starts to seem like a reasonable figure, four times smaller than with the old front-end.
Summing up, for systems up to one million equations the performance is now quite good, and the 20X improvement of earlier tests (see comment:11) seems to be confirmed also when slightly more involved Modelica constructs are used. Comparing these results with those of earlier experiments, it seems that equations stemming from connect statements are (obviously) more expensive than those coming from equation sections, but reasonably so.
I am convinced that for systems above one million equations, in which comparatively few models are instantiated many, many times, it is really necessary to radically change the approach, see #3678. This would drastically reduce the instantiation time, though the time to generate connection equations will always be there. The main advantage would be to avoid blowing up the generated code, which could lead to much faster performance due to a reduced chance of cache misses.
follow-ups: 16 17 comment:15 by , 7 years ago
Update: I ran the ScalableTestSuite Hudson job with the new FE. At the moment only three models are working, yet the results are interesting.
The table reports the time spent by the old and new front end, in seconds:
Model | Time old FE | Time new FE | Speed-up |
---|---|---|---|
CascadedFirstOrder_N_6400 | 2.99 | 0.69 | 4.3 |
CascadedFirstOrder_N_12800 | 5.23 | 0.91 | 5.7 |
CascadedFirstOrder_N_25600 | 10.47 | 1.24 | 8.4 |
ManyEvents_N_2000 | 1.51 | 0.52 | 2.9 |
ManyEvents_N_4000 | 2.60 | 0.43 | 6.0 |
ManyEvents_N_8000 | 3.94 | 0.72 | 5.8 |
AdvectionReaction_N_3200 | 3.63 | 0.47 | 7.7 |
AdvectionReaction_N_6400 | 5.48 | 0.73 | 7.5 |
AdvectionReaction_N_12800 | 12.65 | 0.86 | 14.7 |
The speed-up ratios range from about 3 to about 15. More importantly, while the old FE running times scale roughly linearly with the number of equations, the new FE running times grow much more slowly than O(N). As these tests have a repetitive structure (which is typical of large-scale systems), it seems that the new FE is rather efficient at reusing already processed information, rather than restarting every time from scratch.
Note that these tests are in fact not particularly large, as they contain fewer than 30000 equations. Significantly larger systems (one million equations or more) could thus show much larger speed-up ratios.
comment:16 by , 7 years ago
Replying to casella:
As these tests have a repetitive structure (which is typical of large-scale systems), it seems that the new FE is rather efficient at reusing already processed information, rather than restarting every time from scratch.
I was going to say that it's because they use large arrays, which the new frontend doesn't scalarize until the very end when it generates a DAE for the backend. But the instantiation and typing take an insignificant amount of time compared to the generation of the DAE (you can use -d=execstat for statistics), and that phase should scale linearly. I've seen that it does scale linearly at small sizes, only to then make a big jump in time when crossing a certain size. So I suspect it might actually be the GC kicking in and messing up the timing, but I'll need to do some more testing to verify that.
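For example, per-phase timing and allocation statistics should be obtainable with something like the following (debug flags can be combined with commas; the model file name is hypothetical):
omc -d=newInst,execstat CascadedFirstOrder.mo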
follow-up: 18 comment:17 by , 7 years ago
I did a bit of benchmarking of the CascadedFirstOrder model. The numbers below are only for the last phase of the new instantiation, the generation of the DAE, since the other phases don't scale with N and take an insignificant amount of time. These are also single runs, with and without garbage collection (the largest sizes couldn't be run without GC, due to running out of memory).
N | Time [s] | N/Time | Time no GC [s] | N/Time (no GC) |
---|---|---|---|---|
100 | 0.003 | 33333 | 0.003 | 33333 |
200 | 0.007 | 28571 | 0.006 | 33333 |
400 | 0.012 | 33333 | 0.011 | 36363 |
800 | 0.35 | 2285 | 0.021 | 38095 |
1600 | 0.40 | 4000 | 0.042 | 38095 |
3200 | 0.43 | 7442 | 0.077 | 41558 |
6400 | 0.52 | 12307 | 0.14 | 45714 |
12800 | 1.1 | 11636 | 0.27 | 47407 |
25600 | 1.5 | 17066 | 0.53 | 48301 |
51200 | 2.5 | 20480 | 1.0 | 51200
102400 | 4.3 | 23814 | 1.9 | 53894 |
204800 | 9.4 | 21787 | 3.9 | 52512 |
409600 | 17.4 | 23540 | 7.897 | 51867 |
819200 | 34.9 | 23473 | --- | --- |
1638400 | 64.8 | 25284 | --- | --- |
From this I think it's fairly obvious that the GC is to blame for the weird scaling. The scaling without GC is still a bit better than linear, though it seems to level off for larger sizes. Of course, running without GC is not feasible for really large models (409600 was the limit for me with 32GB memory). But it does show that the GC introduces significant overhead, so that's something we should target in the future.
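For reference, a sketch of how the no-GC runs can be reproduced, assuming omc uses the Boehm GC and honors its GC_DONT_GC environment variable (an assumption on my part; model file name is hypothetical):
GC_DONT_GC=1 omc -d=newInst CascadedFirstOrder.mo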
follow-up: 19 comment:18 by , 7 years ago
Replying to perost:
I did a bit of benchmarking of the CascadedFirstOrder model.
@perost, were you using a single thread or allowed multiple threads in parallel?
I've seen that having a large number of CPUs (say, 20) can bring a very significant speed-up, because the GC task is spread over all of them.
comment:19 by , 7 years ago
Replying to casella:
Replying to perost:
I did a bit of benchmarking of the CascadedFirstOrder model.
@perost, were you using a single thread or allowed multiple threads in parallel?
I've seen that having a large number of CPUs (say, 20) can bring a very significant speed-up, because the GC task is spread over all of them.
I was using a 5820K (6 cores/12 threads), with default GC settings.
comment:21 by , 7 years ago
Milestone: | 1.12.0 → 2.0.0 |
---|
comment:22 by , 6 years ago
Almost all of the ScalableTestSuite now runs with the NF. Comparing the performance of the old and new FE consistently shows one order of magnitude speed-up on all but the smallest models, for which speed doesn't matter anyway.
After achieving 100% coverage (v. 2.0.0), we may consider further optimizations to address the current bottlenecks.
comment:23 by , 6 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Closing this ticket: the performance of the NF is more than one order of magnitude better than the OF, and further optimizations for large systems require avoiding the flattening of arrays, which is a different topic.
comment:24 by , 6 years ago
Component: | Frontend → New Instantiation |
---|