Opened 6 years ago

Closed 6 years ago

#4879 closed enhancement (fixed)

clang optimizations for large models

Reported by: casella Owned by: sjoelund.se
Priority: high Milestone: 1.13.0
Component: Code Generation Version:
Keywords: Cc: wbraun

Description

A few weeks ago the default optimization level for clang compilation was changed from -O0 to -Os. This has a positive impact on simulation time, because the code is optimized for size and better fits the CPU cache, reducing the cache miss overhead. On the other hand, it increases the C-code compilation time.

Most of the models of the MSL only take a few seconds for C compilation also with -Os, and a few of them which have fairly large simulation times benefit a lot from that, e.g., EngineV6 or BranchingDynamicPipes.

On the other hand, the impact on many larger models, such as those in the ScalableTestSuite, in definitely unfavourable. In that library, the average effect is that the simulation time is rduced by 20-50%, but the C compilation time is increased 4X, and the latter is usually one or more orders of magnitude larger.

For example, if you compare the performance before and after the change,
the largest models DistributionSystemModelica_N_112_M_112 and DistributionSystemModelicaActiveLoads_N_80_M_80 jumped from 130 and 80 s to 590 and 260 s, while the reduction in simulation times was only a few seconds. That is clearly sub-optimal.

In general what is the best choice depends on the specific use case.

I guess the optimal solution would probably be to apply -Os to all parts that are executed during simulation, while applying -O0 to all initialization-related code, which usually takes a neglibile fraction of the simulation time.

It is also advisable to optimize the C code, e.g. avoiding unnecessary macro expansions (see #4871), and moving all structural information (incidence matrices and similar) from the C sources to XML files.

For the time being, I would suggest to also run the ScalableTestSuite with -O0, so we can actually check all the other performance figures of big models, without incurring in timeouts. As C-compilation times takes the lion's share and is reduced dramatically, the additional burden on the testsuite is probably going to be limited.

Change History (22)

comment:1 in reply to: ↑ description ; follow-ups: Changed 6 years ago by sjoelund.se

Replying to casella:

I guess the optimal solution would probably be to apply -Os to all parts that are executed during simulation, while applying -O0 to all initialization-related code, which usually takes a neglibile fraction of the simulation time.

This is done in most cases (some initialization code should be optimized; non-linear systems can take a long time to solve). There are some cases where initialization-code is executed in each time step though.

It is also advisable to optimize the C code, e.g. avoiding unnecessary macro expansions (see #4871), and moving all structural information (incidence matrices and similar) from the C sources to XML files.

Macro expansion here isn't the bad part: the bad part here is that there is a check being done on the denominator every time step even though it is a literal...

If structural information needs to be read back and is a simple array it can be added to the generated code as long as the function storing the information is marked for no optimization; that should be a rather reasonable solution (for FMUs this used to be required; perhaps we could treat some of our XML-files as resources in FMI 2.0 to avoid compiling the C-code.

For the time being, I would suggest to also run the ScalableTestSuite with -O0, so we can actually check all the other performance figures of big models, without incurring in timeouts. As C-compilation times takes the lion's share and is reduced dramatically, the additional burden on the testsuite is probably going to be limited.

Done

comment:2 in reply to: ↑ 1 ; follow-up: Changed 6 years ago by casella

Replying to sjoelund.se:

Replying to casella:

I guess the optimal solution would probably be to apply -Os to all parts that are executed during simulation, while applying -O0 to all initialization-related code, which usually takes a neglibile fraction of the simulation time.

This is done in most cases

Do you mean it's already implemented in this way? How does this interact with setCFlags?

(some initialization code should be optimized; non-linear systems can take a long time to solve).

Well, this is true in some cases (particularly if homotopy is used),

There are some cases where initialization-code is executed in each time step though.

I guess hear you mean "code that should belong to the initialization section but it's not".

It is also advisable to optimize the C code, e.g. avoiding unnecessary macro expansions (see #4871), and moving all structural information (incidence matrices and similar) from the C sources to XML files.

Macro expansion here isn't the bad part: the bad part here is that there is a check being done on the denominator every time step even though it is a literal...

We also compile the corresponding code. But it's true that if we simply move it elsewhere the runtime will improve, but the compilation time probably won't.

If structural information needs to be read back and is a simple array it can be added to the generated code as long as the function storing the information is marked for no optimization; that should be a rather reasonable solution (for FMUs this used to be required; perhaps we could treat some of our XML-files as resources in FMI 2.0 to avoid compiling the C-code.

@wbraun, please take note.

comment:3 in reply to: ↑ 2 Changed 6 years ago by sjoelund.se

Replying to casella:

Replying to sjoelund.se:

Replying to casella:

I guess the optimal solution would probably be to apply -Os to all parts that are executed during simulation, while applying -O0 to all initialization-related code, which usually takes a neglibile fraction of the simulation time.

This is done in most cases

Do you mean it's already implemented in this way? How does this interact with setCFlags?

There is a macro OMC_DISABLE_OPT that sets the function attribute telling GCC/clang to not optimize a function. If you want to force optimization also for functions that we deem to expensive to optimize, you use setCFlags("-Os -DOMC_DISABLE_OPT="). If you look at MSL 3.2.2, OMC 1.12.0 takes 25+14 minutes to compile+execute and the master takes 22+16 minutes. (With some additional models simulating in the master, so it's not a perfect comparison). 1.12.0 does not have the OMC_DISABLE_OPT macro, so it compiles everything with -Os.

comment:4 in reply to: ↑ 1 ; follow-up: Changed 6 years ago by casella

Replying to sjoelund.se:

Done

@sjoelund.se would you mind adding --daeMode=new to the settings of that test?

comment:5 in reply to: ↑ 4 ; follow-up: Changed 6 years ago by sjoelund.se

Replying to casella:

@sjoelund.se would you mind adding --daeMode=new to the settings of that test?

Should not be necessary. The daeMode tests run all libraries in the configuration files.

comment:6 in reply to: ↑ 1 ; follow-up: Changed 6 years ago by wbraun

Replying to sjoelund.se:

Replying to casella:

I guess the optimal solution would probably be to apply -Os to all parts that are executed during simulation, while applying -O0 to all initialization-related code, which usually takes a neglibile fraction of the simulation time.

This is done in most cases (some initialization code should be optimized; non-linear systems can take a long time to solve). There are some cases where initialization-code is executed in each time step though.

But we could and probably should mark all functions in <modelName>_06inz.c and <modelName>_08bnd.c with the non-optimization flags, since the non-linear parts are in the file <modelName>_02nls.c.


It is also advisable to optimize the C code, e.g. avoiding unnecessary macro expansions (see #4871), and moving all structural information (incidence matrices and similar) from the C sources to XML files.

Macro expansion here isn't the bad part: the bad part here is that there is a check being done on the denominator every time step even though it is a literal...

Ah, the issue with variable names we had to replace them in quite a lot of files, right?

If structural information needs to be read back and is a simple array it can be added to the generated code as long as the function storing the information is marked for no optimization; that should be a rather reasonable solution (for FMUs this used to be required; perhaps we could treat some of our XML-files as resources in FMI 2.0 to avoid compiling the C-code.

So you don't think we would be win a lot, if we would move static information like sparsity pattern into an xml file? Of course it basically just a moving of work from compile to runtime time, but at compile phase it probably more work than just read the file.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 6 years ago by sjoelund.se

Replying to wbraun:

Replying to sjoelund.se:
But we could and probably should mark all functions in <modelName>_06inz.c and <modelName>_08bnd.c with the non-optimization flags, since the non-linear parts are in the file <modelName>_02nls.c.

I am not fully certain. Most bindings are marked I believe. Better gains would come from merging identical equations from continuous-time and simulation-time (since those equations are already optimized).

It is also advisable to optimize the C code, e.g. avoiding unnecessary macro expansions (see #4871), and moving all structural information (incidence matrices and similar) from the C sources to XML files.

Macro expansion here isn't the bad part: the bad part here is that there is a check being done on the denominator every time step even though it is a literal...

Ah, the issue with variable names we had to replace them in quite a lot of files, right?

Not sure what issue you are referring to :)

If structural information needs to be read back and is a simple array it can be added to the generated code as long as the function storing the information is marked for no optimization; that should be a rather reasonable solution (for FMUs this used to be required; perhaps we could treat some of our XML-files as resources in FMI 2.0 to avoid compiling the C-code.

So you don't think we would be win a lot, if we would move static information like sparsity pattern into an xml file? Of course it basically just a moving of work from compile to runtime time, but at compile phase it probably more work than just read the file.

That depends on how you write it to the file. Arrays or matrices of known size that are just compiled into the executable should not be problematic.

Last edited 6 years ago by sjoelund.se (previous) (diff)

comment:8 in reply to: ↑ 7 ; follow-up: Changed 6 years ago by wbraun

Replying to sjoelund.se:

Replying to wbraun:

Replying to sjoelund.se:
But we could and probably should mark all functions in <modelName>_06inz.c and <modelName>_08bnd.c with the non-optimization flags, since the non-linear parts are in the file <modelName>_02nls.c.

I am not fully certain. Most bindings are marked I believe. Better gains would come from merging identical equations from continuous-time and simulation-time (since those equations are already optimized).

Yes, I agree we need to reduce the amount of equations we generate at all, in my option we are generating, most equations 3 times (initialization, continuous and discrete) evaluation.
The merge with the initialization might be not that easy, but the continuous and discrete equations can be merged straight forward, here I mean the functions: functionODE, functionAlgebraics and functionDAE.

It is also advisable to optimize the C code, e.g. avoiding unnecessary macro expansions (see #4871), and moving all structural information (incidence matrices and similar) from the C sources to XML files.

Macro expansion here isn't the bad part: the bad part here is that there is a check being done on the denominator every time step even though it is a literal...

Ah, the issue with variable names we had to replace them in quite a lot of files, right?

Not sure what issue you are referring to :)

Some days ago we removed all variable name macros for performance reason.
For some reasons (historical and it's quite easy) I still using macros for jacobians and in the daeMode, but should probably change that, too.

If structural information needs to be read back and is a simple array it can be added to the generated code as long as the function storing the information is marked for no optimization; that should be a rather reasonable solution (for FMUs this used to be required; perhaps we could treat some of our XML-files as resources in FMI 2.0 to avoid compiling the C-code.

So you don't think we would be win a lot, if we would move static information like sparsity pattern into an xml file? Of course it basically just a moving of work from compile to runtime time, but at compile phase it probably more work than just read the file.

That depends on how you write it to the file. Arrays or matrices of known size that are just compiled into the executable should not be problematic.

Basically the code looks like that:

const int someConstArray[size] = { constant integers };
memcpy(dynamicMemory, someConstArray, size*sizeof(int));
Last edited 6 years ago by wbraun (previous) (diff)

comment:9 in reply to: ↑ 8 ; follow-up: Changed 6 years ago by sjoelund.se

Replying to wbraun:

The merge with the initialization might be not that easy, but the continuous and discrete equations can be merged straight forward, here I mean the functions: functionODE, functionAlgebraics and functionDAE.

It's pretty much making a hash-function to look for identical equations and aliasing them in some way. It should be possible to write this in SimCode.

Some days ago we removed all variable name macros for performance reason.
For some reasons (historical and it's quite easy) I still using macros for jacobians and in the daeMode, but should probably change that, too.

Yes, but that's different. You introduce some variable number of names that might need to be included everywhere, etc. Simply expanding for example the DIVISION macro is not in itself a problem. The problem there is that we have some special DIVISION DAE.CALL in the compiler and don't handle the special cases in the code generation (such as constant divisior).

Basically the code looks like that:
const int someConstArray[size] = { constant integers };
memcpy(dynamicMemory, someConstArray, size*sizeof(int));

If this is part of a function marked no optimization, there should be no major bottleneck for compiling this into the executable.

comment:10 in reply to: ↑ 9 ; follow-up: Changed 6 years ago by wbraun

Replying to sjoelund.se:

Replying to wbraun:

The merge with the initialization might be not that easy, but the continuous and discrete equations can be merged straight forward, here I mean the functions: functionODE, functionAlgebraics and functionDAE.

It's pretty much making a hash-function to look for identical equations and aliasing them in some way. It should be possible to write this in SimCode.

Ah okay, yes sounds doable. But probably it would be even better to do such merge on Backend data to avoid double SimCode transformation of the same equation.

Some days ago we removed all variable name macros for performance reason.
For some reasons (historical and it's quite easy) I still using macros for jacobians and in the daeMode, but should probably change that, too.

Yes, but that's different. You introduce some variable number of names that might need to be included everywhere, etc. Simply expanding for example the DIVISION macro is not in itself a problem. The problem there is that we have some special DIVISION DAE.CALL in the compiler and don't handle the special cases in the code generation (such as constant divisior).

True, we should get rid of this DIVISION call replacements, since this should be handled by the code generator. However, it's just an unnecessary operation, but probably not really a bottleneck.

comment:11 in reply to: ↑ 5 ; follow-up: Changed 6 years ago by casella

Replying to sjoelund.se:

Replying to casella:

@sjoelund.se would you mind adding --daeMode=new to the settings of that test?

Should not be necessary. The daeMode tests run all libraries in the configuration files.

I'm not sure I understand what you mean. Is there any report of the results obtained with --daeMode=new, -d=newInst, and -O0? I can't find it...

comment:12 in reply to: ↑ 11 Changed 6 years ago by sjoelund.se

Replying to casella:

I'm not sure I understand what you mean. Is there any report of the results obtained with --daeMode=new, -d=newInst, and -O0? I can't find it...

That's because the daeMode tests only run every other day. The next scheduled run is tonight.

comment:13 follow-up: Changed 6 years ago by casella

I started the job manually and it completed successfully. I looked here but unfortunately, I still cannot find such a report.

comment:14 in reply to: ↑ 13 ; follow-up: Changed 6 years ago by sjoelund.se

Replying to casella:

I started the job manually and it completed successfully. I looked here but unfortunately, I still cannot find such a report.

Look closer :) Note that the reporting is a different job from the testing and that you may need to force your browser to refresh the cache (as usual).

comment:15 in reply to: ↑ 10 ; follow-up: Changed 6 years ago by sjoelund.se

Replying to wbraun:

It's pretty much making a hash-function to look for identical equations and aliasing them in some way. It should be possible to write this in SimCode.

Ah okay, yes sounds doable. But probably it would be even better to do such merge on Backend data to avoid double SimCode transformation of the same equation.

I started this by trying to alias equations based on the Backend equation (SimCode.createEquation), but some equations are converted to residuals even after this stage and the residuals cannot be shared since a = b might be a residual equation in initialization but not in the continuous equation system (0=a-b vs. b:=a).

While it's slower, I will try to alias some equations based on the final SimCode instead. That should be much more reliable.

comment:16 in reply to: ↑ 15 Changed 6 years ago by wbraun

Replying to sjoelund.se:

Replying to wbraun:

It's pretty much making a hash-function to look for identical equations and aliasing them in some way. It should be possible to write this in SimCode.

Ah okay, yes sounds doable. But probably it would be even better to do such merge on Backend data to avoid double SimCode transformation of the same equation.

I started this by trying to alias equations based on the Backend equation (SimCode.createEquation), but some equations are converted to residuals even after this stage and the residuals cannot be shared since a = b might be a residual equation in initialization but not in the continuous equation system (0=a-b vs. b:=a).

While it's slower, I will try to alias some equations based on the final SimCode instead. That should be much more reliable.

Please note that we also have the function SimCodeUtil.createEquations, which could be also used ....

comment:17 in reply to: ↑ 14 Changed 6 years ago by casella

Replying to sjoelund.se:

Look closer :) Note that the reporting is a different job from the testing and that you may need to force your browser to refresh the cache (as usual).

All right, I had to look for the noopt report there but didn't see that :)

comment:18 Changed 6 years ago by sjoelund.se

https://github.com/OpenModelica/OMCompiler/pull/2361 seems remove roughly 30% of equations from the code generation. I hope the performance gains are good; I suppose some linear or non-linear systems might be identical especially lambda0 and lambda1.

comment:19 follow-up: Changed 6 years ago by casella

I checked the corresponding Hudson report, but there is not much there.

I would have expected some significant improvement in C compile time, why isn't it showing up here?

Last edited 6 years ago by casella (previous) (diff)

comment:20 in reply to: ↑ 19 Changed 6 years ago by sjoelund.se

Replying to casella:

I checked the corresponding Hudson report, but there is not much there.

I would have expected some significant improvement in C compile time, why isn't it showing up here?

For some models, I suppose because the initial equations are usually marked no optimization, so the gain would be minimal. For things like the CombiTable, no alias equations are detected and the hashing seems broken for that one as it takes an enormous amount of time and memory for the alias elimination.

comment:21 Changed 6 years ago by sjoelund.se

The problem with the SimCode taking a long time should be fixed.

TransmissionLineModelica_N_1280 went from 51s to 38s compilation time. The no optimization version from 15.5s to 13s. I do wish I could get the total times for some reports right now (because the C-compilation times vary a lot for smaller models such as EngineV6). If linear/non-linear systems could be shared in a similar way, there should be greater gains. Or if array equations were vectorized back...

comment:22 Changed 6 years ago by casella

  • Resolution set to fixed
  • Status changed from new to closed

We are currently running the ScalableTestSuite both with -O0 and -Os flags, so we can evaluate the trade-off between compile time and simulation time.

I guess I can now close this ticket.

Note: See TracTickets for help on using tickets.