Opened 10 years ago
Last modified 3 years ago
#3111 accepted defect
Convergence issues on nonlinear equations involving der()
Reported by: | Francesco Casella | Owned by: | Willi Braun |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Backend | Version: | trunk |
Keywords: | | Cc: | Vitalij Ruge, Lennart Ochel, stefano.carli@…, roberto.bonifetto@…
Description
The nonlinear solver requires an excessively large number of iterations in some test cases where derivatives are involved as iteration variables (see, e.g., ThermoPower.Examples.CISE.CISESim120501 or ThermoPower.Test.DistributedParameterComponents.TestWaterFlow1DFEM_A). In some cases involving external medium models, the solver actually fails to converge after a large number of iterations (over 1700).
A workaround for this problem has been implemented, which adds alias algebraic variables equal to the derivatives and uses them as iteration variables. It can be turned on with the debug flag +d=addDerAliases. Unfortunately, it breaks some models in the testsuite, but it is very beneficial in the above-mentioned test cases.
The problem still needs to be thoroughly assessed and solved in general.
One interesting question is: what is the correct scaling factor for der(x)? Obviously, the nominal value of x is not necessarily good, because the magnitude of dx/dt also depends on how fast the variable changes.
One possible option during variable step size integration is to use x_nominal/current_step_length. If the step length is short, it is likely that the variables are changing fast (possibly very fast if initial state values are given far from equilibrium), so it is reasonable to use a larger scaling factor. When the step length gets longer because the system is close to equilibrium, smaller values are employed. The main limitation of this approach is that it considers a single time scale for the entire system, whereas an appropriate time scale should be selected for each individual derivative. Also, it is not clear which value to use at initialization.
Another idea could be to employ an adaptive approach: if one can detect that the convergence is hampered by the effect of very small changes of some der() variables, their scaling factor could be reduced by a factor of 10 or something like that.
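As a concrete illustration of the x_nominal/current_step_length idea, a minimal sketch follows; the function name, its arguments and the two lower bounds are hypothetical choices for illustration, not existing OMC runtime fields:

#include <math.h>

/* Illustrative sketch only: scale der(x) by nominal(x)/h, where h is the current
   integrator step size. The lower bound on h covers initialization, where no step
   size is available yet; the floor on the result keeps the factor from dropping
   below the plain nominal value during long steps near equilibrium. */
static double derScalingFactor(double xNominal, double currentStepSize, double minStepSize)
{
    double h = fmax(currentStepSize, minStepSize);
    return fmax(xNominal / h, xNominal);   /* short steps => larger scaling factor */
}

With a 1 ms step this would scale der(x) by 1000 times the nominal value of x; once the step grows beyond one second the factor settles at the nominal value itself.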
Attachments (2)
Change History (31)
comment:1 by , 10 years ago
Milestone: | 2.0.0 → 1.9.2 |
---|
comment:2 by , 10 years ago
Milestone: | 1.9.2 → 1.9.3 |
---|
Milestone changed to 1.9.3 since 1.9.2 was released.
comment:3 by , 10 years ago
Ongoing restructuring of the OMC backend in order to improve the symbolic treatment of the underlying DAEs. This should make it possible to replace the current workaround with a stable solution.
follow-up: 5 comment:4 by , 9 years ago
I have done some more analysis on the test case ThermoPower.Test.DistributedParameterComponents.TestWaterFlow1DFEM_D with default flags.
This model has a nonlinear system, which is torn to a system of 11 equations with derivatives as tearing variables. This system takes a large number of iterations to converge at each time step.
Based on the mathematical and physical structure of the underlying problem, I expect this system to be only mildly nonlinear. What I observed is that, once the residual of the system has gotten quite small but not small enough for the convergence to succeed, the solver takes dozens or even hundreds of iterations, whereby the smallest diagonal terms of the LU factorization of the Jacobian change wildly. This is not consistent with the mildly nonlinear nature of the problem, and strongly suggests that there is something wrong with the computation of the Jacobian.
Questions:
- is the Jacobian of this nonlinear system computed numerically? (I guess so, because of the IF97 functions from Modelica.Media being involved)
- in case the computation is numerical, how are the dx's selected for the tearing variables?
I suspect that too small dx's are selected, which makes the computation prone to bad roundoff errors or even singular at times, but I have no means to check that out.
Also, I noticed that at each new time step, the initial guess of these derivative tearing variables is always zero, instead of an extrapolation of previous variables. This sounds very much like a bug, which could also explain why addDerAliases has such a beneficial effect.
I would suggest taking a look at that test case with the standard settings of the testsuite and investigating the above-mentioned issues.
follow-up: 7 comment:5 by , 9 years ago
Replying to casella:
Questions:
- is the Jacobian of this nonlinear system computed numerically? (I guess so, because of the IF97 functions from Modelica.Media being involved)
- in case the computation is numerical, how are the dx's selected for the tearing variables?
Yes, you are right again ;) it is computed numerically.
This is actually how dx is calculated:
for(i = 0; i < solverData->n; i++) {
  xsave = x[i];
  delta_hh = delta_h * (fabs(xsave) + 1.0);
  if ((xsave + delta_hh >= solverData->maxValue[i]))
    delta_hh *= -1;
  x[i] += delta_hh;
  /* Calculate scaled difference quotient */
  delta_hh = 1. / delta_hh * solverData->xScaling[i];
  [...]
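For context, this is roughly how such a scaled difference quotient would then be used to build one Jacobian column; the sketch below is an assumption for illustration (fBase, fPert and jac are hypothetical names), not the actual OMC code:

  /* Illustrative continuation only: fBase[] holds the residuals evaluated at x,
     fPert[] those evaluated at the perturbed x; delta_hh now holds xScaling[i]/h. */
  for(j = 0; j < solverData->n; j++) {
    jac[i*solverData->n + j] = (fPert[j] - fBase[j]) * delta_hh;
  }
  x[i] = xsave;   /* restore the perturbed component before the next column */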
Replying to casella:
Also, I noticed that at each new time step, the initial guess of these derivative tearing
variables is always zero, instead of an extrapolation of previous variables. This sounds very
much like a bug, which could also explain why addDerAliases has such a beneficial effect.
Yes, that sounds very much like a bug, but one that should be easy to fix ;)
comment:6 by , 9 years ago
Status: | new → accepted |
---|
follow-up: 8 comment:7 by , 9 years ago
Replying to wbraun:
This is actually how dx is calculated:
for(i = 0; i < solverData->n; i++) {
  xsave = x[i];
  delta_hh = delta_h * (fabs(xsave) + 1.0);
  if ((xsave + delta_hh >= solverData->maxValue[i]))
    delta_hh *= -1;
  x[i] += delta_hh;
  /* Calculate scaled difference quotient */
  delta_hh = 1. / delta_hh * solverData->xScaling[i];
  [...]
}
Further questions:
- how is delta_h set? Can it be changed by the user?
- is xScaling[i] set to the nominal attribute? Also for derivatives?
comment:8 by , 9 years ago
Replying to casella:
Further questions:
- how is delta_h set? Can it be changed by the user?
delta_h = sqrt(DBL_EPSILON*2e1).
Currently it can't be changed by the user, but perhaps I could add a bypass.
Replying to casella:
- is xScaling[i] set to the nominal attribute? Also for derivatives?
solverData->xScaling[i] = fmax(systemData->nominal[i],fabs(solverData->x0[i]))
follow-up: 10 comment:9 by , 9 years ago
My educated guess is that for my kind of systems DBL_EPSILON*20, i.e. about 1e-14, is way, way too small. The reason is that a lot of properties in these systems (e.g., thermodynamic properties) are not computed with 15-digit precision, so if delta_hh is too small, numerical errors will dominate. I would rather set delta_h between 1e-7 and 1e-10.
Please make it possible to change delta_h with some flag, so I can do some experiments and report to you.
Thanks
comment:10 by , 9 years ago
We already played a lot with this epsilon:
delta_h = sqrt(DBL_EPSILON*2e1) = sqrt(2.22044604925e-16*2e1) = sqrt(4.4408920985e-15) = 6.66400187463e-08
However, I'll add such flags as simulation options.
comment:11 by , 9 years ago
Oops, sorry, I missed the sqrt() function. So, the value is already within the interval I was expecting to be correct.
Anyway, I'd like to see what happens to those Jacobians by changing this value, so I can learn something more about why they keep changing all the time.
comment:13 by , 9 years ago
I think I have a more definite clue about where the problem is. Instructions to reproduce and analyze it follow.
I have run the following .mos script, using the ThermoPower library and the attached .mo file
loadFile("D:/Lavoro/ThermoPower/ThermoPower_trunk/ThermoPower/package.mo"); getErrorString();
loadFile("TestWaterFlow1DFEM_D_small.mo");
cd("./results10");
setCommandLineOptions("+cseCall +cseEachCall +tearingMethod=omcTearing +d=dumpSimCode,addDerAliases,disablePartitioning,nogen,initialization,backenddaeinfo,discreteinfo,stateselection --profiling=blocks+html");
simulate(TestWaterFlow1DFEM_D_small, stopTime = 0.1, numberOfIntervals = 20, simflags = "-lv LOG_STATS,LOG_NLS_V,LOG_NLS_JAC,LOG_NLS_JAC_TEST -emit_protected"); getErrorString();
and redirected the console output to the attached test3.log file.
Identify the line in the log file which contains
LOG_NLS_V | info | SOLVING NON-LINEAR SYSTEM USING HOMOTOPY SOLVER
LOG_NLS_V | info | EQUATION NUMBER: 320
LOG_NLS_V | info | TIME: 3.5000000000e-005
LOG_NLS_V | info | number of function calls (so far!): 561
for me it's line 4273.
The nonlinear solver is trying to solve a nonlinear torn system whose tearing variables are:
hex.w[3] hex.w[2] $DERAliashex.h[4] $DERAliashex.h[1] $DERAliashex.h[3] $DERAliashex.h[2]
The first problem can be seen from line 4311, when the Newton solver starts. The extrapolation values used at the first iteration are {0.01, 0.01, 0, 0, 0, 0}, while the values found at the previous time step were {0.0099999999, 0.0099999997, -1.3075397, -0.78452387, 1.8305556, -0.78452384}. It seems that the solver just ignores the previous values of the unknown derivatives, which does not help convergence; please have this fixed.
The second, much more serious problem concerns the Jacobian of the system. Move on to line 4448, Iteration 4. The (4,4) element of the Jacobian is 0.89504501, and the (4,6) element is -1. If you go to the next iteration, (4,4) becomes 1, and (4,6) becomes -0.9300044. That's a 10% change at the fifth iteration of the solver, when the solution has already been found to four significant digits and the residuals are already quite small. In other similar but larger test cases, which I am not reporting here, I have noticed larger changes, possibly over 30%.
Residual 4 is:
316: hex.B[1,1] * hex.h[1] + hex.B[1,2] * hex.h[2] + hex.C[1,1] * hex.h[1] + hex.Y[1,2] * (1.0 - hex.ML) * $DERAliashex.h[2] * hex.A + (hex.Y[1,1] * (1.0 - hex.ML) + hex.YY[1,1] * hex.ML) * $DERAliashex.h[1] * hex.A + (-hex.K[1,1]) * hex.w[1] - hex.G[1] * $DERAliashex.p * hex.A (RESIDUAL)
see line 237 of the log.
If you compute these two Jacobian elements analytically, considering that hex.ML = 0, it turns out that
(4,4) = hex.Y[1,1] * hex.A = 0.11829*3.14e-4
(4,6) = hex.Y[1,2] * hex.A = -0.11829*3.14e-4
I am not sure if the Jacobian rows are rescaled with the residual scaling factor at each iteration, but even in that case, the scaling factors for the unknowns are constant, so there is no way we can get 0.89 and -1 at one iteration, then 1 and -0.93 at the next.
My understanding is that the excessive number of iterations (8 in this specific case, but I have larger examples where they might be 130 or more) is due to these very large fluctuations in the Jacobian from one iteration to the next, which are hampering the convergence.
In this specific case, given the expressions behind elements (4,4) and (4,6), I don't really understand why the numerical derivative should be so bad, as the residual is actually linear w.r.t. the unknowns, so the Jacobian elements should be constants!
I think there is really some serious problem with the Jacobian evaluation. Please have a closer look at it.
by , 9 years ago
Attachment: | TestWaterFlow1DFEM_D_small.mo added |
---|
.mo file of the test case (needs ThermoPower)
comment:14 by , 9 years ago
In 212791 I've changed the way starting values for the non-linear system are calculated. For the example here it reduces the number of iterations a bit, from 2535 to 1944, but unfortunately the overall impact on other coverage tests is smaller than expected.
comment:15 by , 9 years ago
In most cases, the large number of iterations does not depend on the initial guess - most iterations are spent very close to the solution, due to the large numerical fluctuations of the Jacobian.
I would suggest that you take the example I posted above, put some ad-hoc printf statements in the C code to monitor the (4,4) element and the intermediate results that are computed to obtain it, and see what happens during those iterations.
comment:16 by , 9 years ago
Cc: | added |
---|
After some analysis with Willi, we have probably understood the root cause of this problem.
We inspected both the code that computes the numerical Jacobian and the code that computes the residuals, and did some debugging by printing out values, but everything was apparently correct. It seems then that the reported issue is due to round-off errors.
The line of the Jacobian in question is computed numerically, by subtracting the values of the residuals computed at two nearby points. Close to the solution, the residual is close to zero. It turns out that the residual is the result of the summation of several terms which are opposite in sign, which results in a loss of precision due to round-off errors. As a consequence, the residuals are numerically very "noisy", and the numerical derivatives turn out to be extremely inaccurate.
As this Jacobian can definitely be computed analytically, the best solution to this problem is to ensure that in all cases where the analytical Jacobian can be computed (i.e., when record functions are not involved), it is actually used, with the numerical computation kept as a fall-back option when this is not possible.
Since the round-off errors in the residual computation remain there anyway, it might also be necessary to relax the convergence conditions a bit.
We will check the performance with this solution as soon as Willi implements it; currently, if one turns analytical jacobians on and one system cannot be handled, the code generation aborts.
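As a side note on the relaxation mentioned above, one possible shape for such a check is sketched here; residualConverged, maxTermMag and the factor of 10 are illustrative assumptions, not the actual OMC termination test:

#include <float.h>
#include <math.h>

/* Illustrative sketch only: accept a residual component as converged when it is
   below the usual relative tolerance OR below the round-off noise floor implied
   by the largest term summed while evaluating it (maxTermMag). */
static int residualConverged(double f, double resScale, double ftol, double maxTermMag)
{
    double roundoffFloor = 10.0 * DBL_EPSILON * maxTermMag;   /* cancellation noise level */
    return fabs(f) <= fmax(ftol * resScale, roundoffFloor);
}

For a residual like the one discussed in the next comment, where terms of magnitude 1.5e4 cancel, such a floor would be of the order of 3e-11.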
comment:17 by , 9 years ago
I confirm the analysis: in the specific test cases considered here, the above-mentioned residual boils down to
1.5e4 - 1.0142e-5*der(h[2]) + 1.0142e-5*der(h[1]) - 1.5e4
The first and the last term cancel out exactly at steady state, where both derivatives are by definition zero. Of course this way of computing the residual is not very robust numerically, in terms of round-off error. Note that the computation of delta_hh does not take into account the presence of these extra terms, which severely reduce the precision of the numerical derivative.
In order to get a more accurate evaluation of d_residual/d_der(h), it would be better to move the terms depending on der(h) to the end of the expression
1.5e4 - 1.5e4 - 1.0142e-5*der(h[2]) + 1.0142e-5*der(h[1])
This would make it possible to compute a more precise numerical Jacobian (which in this case can however be computed analytically without problems), and also to reduce the numerical errors in the evaluation of the residual, which can badly affect the termination criterion for the Newton solver, as seems to happen in this case. However, I'm not sure how to formulate rules for this that have general validity.
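The effect can be reproduced in isolation with a small experiment like the following (illustrative only: resOriginal/resReordered use the coefficients quoted above and the solver's default delta_h, but this is not the generated OMC residual code):

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Residual in the original term order: the two 1.5e4 terms only cancel at the end. */
static double resOriginal(double dh1, double dh2) {
  return 1.5e4 - 1.0142e-5*dh2 + 1.0142e-5*dh1 - 1.5e4;
}

/* Same residual with the large constant terms cancelled first. */
static double resReordered(double dh1, double dh2) {
  return (1.5e4 - 1.5e4) - 1.0142e-5*dh2 + 1.0142e-5*dh1;
}

int main(void) {
  double dh1 = -0.78, dh2 = 1.83;           /* magnitudes similar to those in the log */
  double delta_h = sqrt(DBL_EPSILON*2e1);   /* solver default, about 6.66e-8 */
  double h = delta_h * (fabs(dh1) + 1.0);   /* perturbation applied to dh1 */

  printf("analytic  : %g\n", 1.0142e-5);
  printf("original  : %g\n", (resOriginal(dh1 + h, dh2) - resOriginal(dh1, dh2)) / h);
  printf("reordered : %g\n", (resReordered(dh1 + h, dh2) - resReordered(dh1, dh2)) / h);
  return 0;
}

On an IEEE-double machine the residual change caused by the perturbation (about 1.2e-12) is smaller than the rounding granularity of the intermediate sums near 1.5e4 (about 1.8e-12), so with the original ordering the estimated derivative comes out as 0 or roughly 1.5e-5, depending on where the rounding falls, while the reordered expression reproduces the analytic 1.0142e-5.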
Besides computing analytical Jacobians to avoid the numerical oscillations in the Jacobian evaluation, we should also review the scaling criteria for the residuals and the termination criteria for the Newton iterations, in view of the lessons learnt with this test case.
comment:18 by , 9 years ago
The situation has improved with the most recent nightlies, though I'm not sure exactly why. Using these settings
setCommandLineOptions("--preOptModules-=clockPartitioning --postOptModules+=wrapFunctionCalls --tearingMethod=omcTearing -d=dumpSimCode,nogen,initialization,backenddaeinfo,discreteinfo,stateselection");
the attached example, as well as ThermoPower.Test.DistributedParameterComponents.TestWaterFlow1DFEM_A, no longer causes repeated iterations of the Newton solver close to the solution. The typical number of iterations is now two. This could probably be further improved, as during the steady-state part of the transient the solution is actually constant, so there should be no need for iterations at all. But this is no longer critical.
I'll investigate more and see if we can actually close the ticket.
comment:22 by , 8 years ago
Milestone: | 1.11.0 → 1.12.0 |
---|
Milestone moved to 1.12.0 due to 1.11.0 already being released.
comment:24 by , 7 years ago
Milestone: | 1.12.0 → 1.13.0 |
---|
Milestone moved to 1.13.0 due to 1.12.0 already being released.
comment:26 by , 5 years ago
Milestone: | 1.14.0 → 1.16.0 |
---|
Releasing 1.14.0 which is stable and has many improvements w.r.t. 1.13.2. This issue is rescheduled to 1.16.0
comment:28 by , 4 years ago
Milestone: | 1.17.0 → 1.18.0 |
---|
Retargeted to 1.18.0 because of 1.17.0 timed release.