#4071 closed discussion (fixed)
Performance of new front end
Reported by: | Francesco Casella | Owned by: | Per Östlund |
---|---|---|---|
Priority: | high | Milestone: | 2.0.0 |
Component: | New Instantiation | Version: | |
Keywords: | | Cc: | |
Description
I have tried to get some preliminary results on the performance of the new front-end with a simple test model, reported here:

```modelica
package P
  model M0
    Real x, y, z;
    Real a, b, c;
    Real d, e, f, g;
  equation
    x = b;
    y = 2*x + a;
    z - x = 0;
    a = 10;
    b - a = 15;
    c + a - 2 = 14;
    d = 40;
    e = 0;
    f = 50;
    c + d + e + x + a = 0;
  end M0;

  model M
    M0 m01;
    M0 m02;
    M0 m03;
    M0 m04;
    M0 m05;
  end M;

  model S
    M m1;
    M m2;
  end S;
end P;

model S
  P.M m1;
  P.M m2;
  ...
  P.M m10000;
end S;
```
Model M0 has 10 variables and equations. Model M contains 5 instances of model M0, so it has 50 variables and equations. Model S contains 10000 instances of M, so it has half a million variables and equations. This test case has no physical meaning, but its structure is representative of many large-scale models, which contain large numbers of instances of the same basic models.
I have run instantiateModel(S) with and without -d=newInst on my i5 4200 @ 1.6 GHz, using OpenModelica-v1.11.0-dev-127-g043a756-64bit on Windows 10.
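For reference, a minimal sketch of such a test script (.mos), assuming the models above are saved in a file TestNewInst.mo (the file name is my assumption):

```modelica
// Sketch of a timing script; omit setCommandLineOptions to use the old front end.
setCommandLineOptions("-d=newInst");
loadFile("TestNewInst.mo");
instantiateModel(S);
```

These are the results: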
 | Time [s] | Memory [MB] |
---|---|---|
oldInst | 120 | 4500 |
newInst | 25 | 2700 |
The good news is that the performance has improved significantly: about a factor 5 in terms of speed, and about a factor 2 in terms of memory, which is not bad.
On the other hand, I was hoping for a much more drastic improvement. The performance with newInst is around 20000 equations/s, which still seems quite slow to me, and about 5 kB/equation, which still seems an excessive amount of memory. I have tried a test case with half the instances in model S, and both time and memory are halved, so it makes sense to compute these kinds of average quantities.
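For reference, these averages follow directly from the table above:

$$\frac{500\,000~\text{equations}}{25~\text{s}} = 20\,000~\text{equations/s}, \qquad \frac{2700~\text{MB}}{500\,000~\text{equations}} \approx 5.4~\text{kB/equation}$$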
Note that the model doesn't have connections, doesn't use inheritance, and only uses basic Real types. At the end of the day, it is just re-instantiating the same model over and over again.
Q1: are there further optimizations possible, to exploit the repeated instances of the same class?
Q2: can you explain the memory usage of 5 kB for each variable/equation?
Attachments (1)
Change History (25)
by , 8 years ago
Attachment: | TestNewInst.zip added |
---|
comment:1 by , 8 years ago
Description: | modified (diff) |
---|
follow-up: 14 comment:2 by , 8 years ago
Actually, it takes ~20s to run your script for me with the new front end. Out of that, only 2.5s is taken up by the actual front end. The rest of the time is spent by the dumping of the flat model, which is known to be slow and a big memory hog. It's written in Susan, dumps everything to a buffer before printing, etc. It's not an issue when simulating models, since the dumper is never called then.
The compiler uses up about 500 MB of memory until the front end is finished and the dumper takes over, but since the compiler uses a garbage collector I don't know how much of that memory is actually used for something.
Also, note that the flat modelica dumper is used even when the -q flag is used to suppress the output. The only thing the -q flag does is disable the actual printing of the result from the dumper, which is stupid and something I will fix.
I removed some stuff from the main function that doesn't concern the front end, such as the mentioned dumper and some back end transformation stuff that's always done for some reason. These are the results (with an i7 5820K):
 | Time [s] | Memory [MB] |
---|---|---|
oldInst | 38 | 2700 |
newInst | 2.7 | 800 |
The numbers are a bit bigger than the 2.5s and 500MB I mentioned above, because those numbers are only for the front end and nothing else. Also note that I measured the memory consumption just by looking at the output from top, so the numbers are not very accurate.
As for your questions:
A1: Maybe, I know there are things that can be optimized, but there are also things missing. It's far too early to say anything about the final performance.
A2: Yes, badly implemented dumper, see above.
follow-up: 4 comment:3 by , 8 years ago
I've now made a pull request that will disable the dumping of the flat model when -q is used. This does not apply when using instantiateModel in a script though, only when flattening a model in a file directly with the compiler. There's also an overhead of about 2s from the back end transformation stuff which I don't dare touch.
follow-ups: 5 8 comment:4 by , 8 years ago
Thanks Per for the analysis! The results look a lot more encouraging to me now.
A1: Maybe, I know there are things that can be optimized, but there are also things missing. It's far too early to say anything about the final performance.
I fully agree that in general it doesn't make sense to talk about optimization before the implementation is complete. Premature optimization is the root of all evil.
That said, better performance is one of the reasons why we are moving to a new front end (the other, more important one being increased coverage), so if the new design had turned out not to perform substantially better than the old one, I would have been worried. In fact, I was misled by the dumper overhead. The figures you show, which mean a factor 14 in speed and a factor 3.4 in memory occupation, are much more reassuring from this point of view.
I've now made a pull request that will disable the dumping of the flat model when -q is used. This does not apply when using instantiateModel in a script though, only when flattening a model in a file directly with the compiler.
That means running
time omc -d=newInst -q S.mo
in Linux, right?
There's also an overhead of about 2s from the back end transformation stuff which I
don't dare touch.
Can anyone else comment on what the back-end is doing when just flattening the model, and possibly remove that from the code if not strictly necessary?
comment:5 by , 8 years ago
Replying to casella:
That means running
time omc -d=newInst -q S.mo
in Linux, right?
Yes. I had to add the models from your script to S.mo first of course.
follow-up: 7 comment:6 by , 8 years ago
Sure, of course.
If there is only one model as a top-level entity in S.mo (besides package P), that is the one that gets flattened, right? This is not very clear from the omc online help, which just says
omc Model.mo will produce flattened Model on standard output
Or I could make two files and do
time omc -d=newInst -q S.mo P.mo
comment:7 by , 8 years ago
Replying to casella:
If there is only one model as a top-level entity in S.mo (besides package P), that is the one that gets flattened, right?
The compiler will flatten the last top-level class in the file, regardless of what type of class it is. You can also use -i to specify the class you wish to flatten.
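For example, to flatten P.M from the file above, the invocation would look something like this (my assumption on the exact syntax of -i):
omc -i=P.M -d=newInst -q S.mo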
comment:8 by , 8 years ago
Replying to casella:
There's also an overhead of about 2s from the back end transformation stuff which I
don't dare touch.
Can anyone else comment on what the back-end is doing when just flattening the model, and possibly remove that from the code if not strictly necessary?
```
Notification: Performance of stateMachineToDataFlow: time 1.207/3.948, allocations: 267 MB / 1.644 GB, free: 91.58 MB / 0.8872 GB
Notification: Performance of FCore.getEvaluatedParams: time 2.13e-06/3.948, allocations: 1.797 kB / 1.644 GB, free: 91.57 MB / 0.8872 GB
Notification: Performance of makeEvaluatedParamFinal: time 0.07548/4.023, allocations: 30.52 MB / 1.673 GB, free: 77.53 MB / 0.8872 GB
```
I really dislike the state machine code taking up a lot of time even when doing nothing... PR:1134 should fix it.
follow-up: 10 comment:9 by , 8 years ago
@perost: could you please check if Martin's PR got rid of the unnecessary back-end overhead?
Thanks!
comment:10 by , 8 years ago
Replying to casella:
@perost: could you please check if Martin's PR got rid of the unnecessary back-end overhead?
Thanks!
It does.
comment:11 by , 8 years ago
I re-ran the tests and I got very interesting results. A model with one million equations takes about 100 s to flatten on my PC with the old front-end, and about 5 seconds with the new one. That's a factor 20 speed-up, which is indeed very good.
I was also able to flatten a model with ten million equations in one minute, using less than 10 GB of RAM. With some patience and enough memory, we could actually tackle 100 million equations.
I hope this kind of performance holds when other features (such as modifier and connection handling) are introduced.
comment:12 by , 8 years ago
Test cases to evaluate the performance of the new front-end with large models are available on GitHub here: https://github.com/casella/TestNewInst
comment:13 by , 8 years ago
Milestone: | 1.11.0 → 1.12.0 |
---|
Milestone moved to 1.12.0 due to 1.11.0 already being released.
comment:14 by , 8 years ago
Update: I ran some new tests that include types with unit attributes, parameter propagation, nested models, and connections, see https://github.com/casella/TestNewInst.
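For illustration, a minimal sketch (with hypothetical names, not the actual models in the repository) of the kind of constructs these tests exercise:

```modelica
package TestSketch
  type Voltage = Real(unit = "V");    // type with unit attribute

  connector Pin
    Voltage v;
    flow Real i(unit = "A");
  end Pin;

  model Component
    parameter Real R = 1 "propagated parameter";
    Pin p, n;
  equation
    p.v - n.v = R*p.i;
    p.i + n.i = 0;
  end Component;

  model Nested
    parameter Real R = 2;
    Component c1(R = R), c2(R = R);   // parameter propagation
  equation
    connect(c1.n, c2.p);              // connection equations
  end Nested;
end TestSketch;
```

The actual models in the repository replicate such components many times over to reach the equation counts in the table below.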
These are the results on our Xeon E5-2650 v3 @2.30 GHz server, running 64-bit Linux:
# raw equations | Time new FE [s] | Memory new FE [GB] | Time old FE [s] | Memory old FE [GB] |
---|---|---|---|---|
50,000 | 0.97 | n.d. | 10.4 | n.d. |
500,000 | 7.12 | 2.7 | 123.9 | 7.5 |
1,000,000 | 16.42 | 4.6 | 285.4 | 18.0 |
5,000,000 | 95.56 | 19.8 | n.d. | n.d. |
10,000,000 | 181.81 | 37.8 | n.d. | n.d. |
The performance of the new front end scales linearly both in terms of time and memory. With these test cases, the new front end is 10 to 18 times faster than the old one, the factor 18 holding for the larger models. That corresponds to about 50000 equations/s. Memory consumption is around 4 kB/equation, which could possibly be further improved but starts to seem like a reasonable figure, four times smaller than with the old front-end.
Summing up, for systems up to one million equations the performance is now quite good, and the 20X improvement of earlier tests (see comment:11) seems to be confirmed also when slightly more involved Modelica constructs are used. Comparing these results with those of earlier experiments, it seems that equations stemming from connect statements are (obviously) more expensive than those coming from equation sections, but reasonably so.
I am convinced that for systems above one million equations, in which comparatively few models are instantiated many, many times, it is really necessary to radically change the approach, see #3678. This would drastically reduce the instantiation time, though the time to generate connection equations will always be there. The main advantage would be to avoid blowing up the generated code, which could lead to much faster performance due to a reduced chance of cache misses.
follow-ups: 16 17 comment:15 by , 7 years ago
Update: I ran the ScalableTestSuite Hudson job with the new FE. At the moment only three models are working, yet the results are interesting.
The table reports the time spent by the old and new front end, in seconds:
Model | Time old FE | Time new FE | Speed-up |
---|---|---|---|
CascadedFirstOrder_N_6400 | 2.99 | 0.69 | 4.3 |
CascadedFirstOrder_N_12800 | 5.23 | 0.91 | 5.7 |
CascadedFirstOrder_N_25600 | 10.47 | 1.24 | 8.4 |
ManyEvents_N_2000 | 1.51 | 0.52 | 2.9 |
ManyEvents_N_4000 | 2.60 | 0.43 | 6.0 |
ManyEvents_N_8000 | 3.94 | 0.72 | 5.8 |
AdvectionReaction_N_3200 | 3.63 | 0.47 | 7.7 |
AdvectionReaction_N_6400 | 5.48 | 0.73 | 7.5 |
AdvectionReaction_N_12800 | 12.65 | 0.86 | 14.7 |
The speed-up ratios range from about 3 to about 15. More importantly, while the old FE running times scale roughly linearly with the number of equations, the new FE running times grow much more slowly than O(N). As these tests have a repetitive structure (which is typical of large-scale systems), it seems that the new FE is rather efficient at reusing already processed information, rather than restarting every time from scratch.
Note that these tests are in fact not particularly large, as they contain fewer than 30000 equations. Significantly larger systems (one million equations or more) could thus show much larger speed-up ratios.
comment:16 by , 7 years ago
Replying to casella:
As these tests have a repetitive structure (which is typical of large-scale systems), it seems that the new FE is rather efficient at reusing already processed information, rather than restarting every time from scratch.
I was going to say that it's because they use large arrays, which the new frontend doesn't scalarize until the very end when it generates a DAE for the backend. But the instantiation and typing take an insignificant amount of time compared to the generation of the DAE (you can use -d=execstat for statistics), and that phase should scale linearly. I've seen that it does scale linearly at small sizes, only to then make a big jump in time when crossing a certain size. So I suspect it might actually be the GC kicking in and messing up the timing, but I'll need to do some more testing to verify that.
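For example, per-phase timing and allocation statistics should be obtainable with something like the following (debug flags can be combined with commas; the model file name is hypothetical):
omc -d=newInst,execstat CascadedFirstOrder.mo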
follow-up: 18 comment:17 by , 7 years ago
I did a bit of benchmarking of the CascadedFirstOrder model. The numbers below are only for the last phase of the new instantiation, the generation of the DAE, since the other phases don't scale with N and take an insignificant amount of time. These are also single runs, with and without garbage collection (the largest sizes couldn't be run without GC, due to running out of memory).
N | Time [s] | N/Time | Time no GC [s] | N/Time (no GC) |
---|---|---|---|---|
100 | 0.003 | 33333 | 0.003 | 33333 |
200 | 0.007 | 28571 | 0.006 | 33333 |
400 | 0.012 | 33333 | 0.011 | 36363 |
800 | 0.35 | 2285 | 0.021 | 38095 |
1600 | 0.40 | 4000 | 0.042 | 38095 |
3200 | 0.43 | 7442 | 0.077 | 41558 |
6400 | 0.52 | 12307 | 0.14 | 45714 |
12800 | 1.1 | 11636 | 0.27 | 47407 |
25600 | 1.5 | 17066 | 0.53 | 48301 |
51200 | 2.5 | 20480 | 1.0 | 51200
102400 | 4.3 | 23814 | 1.9 | 53894 |
204800 | 9.4 | 21787 | 3.9 | 52512 |
409600 | 17.4 | 23540 | 7.897 | 51867 |
819200 | 34.9 | 23473 | --- | --- |
1638400 | 64.8 | 25284 | --- | --- |
From this I think it's fairly obvious that the GC is to blame for the weird scaling. The scaling without GC is still a bit better than linear, though it seems to level off for larger sizes. Of course, running without GC is not feasible for really large models (409600 was the limit for me with 32GB memory). But it does show that the GC introduces significant overhead, so that's something we should target in the future.
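For reference, a sketch of how the no-GC runs can be reproduced, assuming omc uses the Boehm GC and honors its GC_DONT_GC environment variable (an assumption on my part; model file name is hypothetical):
GC_DONT_GC=1 omc -d=newInst CascadedFirstOrder.mo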
follow-up: 19 comment:18 by , 7 years ago
Replying to perost:
I did a bit of benchmarking of the CascadedFirstOrder model.
@perost, were you using a single thread or allowed multiple threads in parallel?
I've seen that having a large number of CPUs (say, 20) can bring a very significant speed-up, because the GC task is spread over all of them.
comment:19 by , 7 years ago
Replying to casella:
Replying to perost:
I did a bit of benchmarking of the CascadedFirstOrder model.
@perost, were you using a single thread or allowed multiple threads in parallel?
I've seen that having a large number of CPUs (say, 20) can bring a very significant speed-up, because the GC task is spread over all of them.
I was using a 5820K (6 cores/12 threads), with default GC settings.
comment:21 by , 7 years ago
Milestone: | 1.12.0 → 2.0.0 |
---|
comment:22 by , 6 years ago
Almost all of the ScalableTestSuite now runs with the NF. Comparing the performance of the old and new FE consistently shows one order of magnitude speed-up on all but the smallest models, for which speed doesn't matter anyway.
After achieving 100% coverage (v. 2.0.0), we may consider further optimizations to address the current bottlenecks.
comment:23 by , 6 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Closing this ticket: the performance of the NF is more than one order of magnitude better than the OF, and further optimizations for large systems require avoiding the flattening of arrays, which is a different topic.
comment:24 by , 6 years ago
Component: | Frontend → New Instantiation |
---|