#3804 closed defect (invalid)
Is there something wrong with UTF-8?
Reported by: | massimo ceraolo | Owned by: | Adeel Asghar |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | OMEdit | Version: | v1.9.4-v1.9.x |
Keywords: | Cc: |
Description
I enclose two files
1) TestUTF.mo. Contains characters outside latin-1. When saved by OM it is interpreted correctly as UTF-8 by Dymola and my text editor (SuperEdi).
2) UTFPkg.mo. This too contains characters outside latin-1. When this file is saved by OM it is NOT interpreted as UTF-8 by Dymola and SuperEdi.
What is the difference between the two?
Can something be done to make OM output fully compatible with Dymola? I.e., make OM create files that are recognised as UTF-8 by Dymola and other text editors?
Attachments (5)
Change History (29)
by , 9 years ago
Attachment: | TestUTF.mo added |
---|
by , 9 years ago
comment:1 by , 9 years ago
Summary: | Is there something wronng with TF-8? → Is there something wrong with UTF-8? |
---|
comment:2 by , 9 years ago
comment:4 by , 9 years ago
I made all my tests under Windows. I will send more details in the afternoon, because I want to send them exactly from the same computer and programs that I've used this morning (from my work place, now I'm at home).
BTW, I received this message this morning from Modelon:
The default behavior of Dymola is to only use UTF-8 when needed. In Dymola 2016 FD01 a flag was introduced to enforce UTF-8, I have attached the text from the release notes below.
Using this flag the customer model should be stored in UTF-8.
Improved handling of UniCode (UTF-8) versus iso-latin-1 coding
A flag Advanced.PreferWritingLatin1 is now available to set what coding to use. The
default value of the flag is true, meaning that iso-latin-1 will be preferred instead of UTF-8
when writing files.
However, if any non iso-latin-1 character is present, UTF-8 will always be used
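The rule quoted from the release notes can be sketched in a few lines of Python. This only illustrates the stated behaviour and is not Dymola's actual code; the function name `choose_encoding` and its `prefer_latin1` parameter are hypothetical stand-ins for the flag.

```python
def choose_encoding(text: str, prefer_latin1: bool = True) -> str:
    """Pick an output encoding following the rule quoted above (hypothetical helper)."""
    if prefer_latin1:
        try:
            # Succeeds only if every character fits in iso-latin-1.
            text.encode("latin-1")
            return "latin-1"
        except UnicodeEncodeError:
            # At least one character is outside latin-1: fall back to UTF-8.
            pass
    return "utf-8"
```

Under this rule the same model can end up in either encoding depending on its content, which is why a reader that guesses the encoding may rely on hints such as a byte-order mark.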
comment:5 by , 9 years ago
Well, Martin, now I get results similar to yours.
In the enclosed UTF report.pdf file I've put several screenshots with the corresponding legends.
All my tests were made in a Win 10 system using Dymola 2016 FD01 (64 bit) and OM 1.9.5-dev20
About the enclosed files:
- the file TestUTFByDym is saved by Dymola and then unchanged.
- the file TestUTFByOM is saved by OM and then unchanged.
Now it seems clear that something goes wrong in the OM UTF encoding.
At least both Dymola and SuperEdi (that is advertised as a "unicode editor") believe so.
by , 9 years ago
Attachment: | TestUTfByDym.mo added |
---|
by , 9 years ago
Attachment: | TestUTFByOM.mo added |
---|
by , 9 years ago
Attachment: | UTF report.pdf added |
---|
comment:6 by , 9 years ago
The difference between the three files is:
- Original file uses LF for line endings
- Dymola uses CRLF for line endings and uses a UTF-8 byte-order mark
- OMEdit uses LF for line endings
All of the files are valid UTF-8. OMEdit strips the byte-order mark when saving, since putting one in UTF-8 files is not recommended (unlike UTF-16, whose byte sequence depends on endianness, UTF-8 does not need a byte-order mark). Since this is all OMEdit does, I can only conclude that the Dymola implementation of UTF-8 reading is broken.
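For anyone who wants to check these differences on their own files, here is a minimal Python sketch that inspects raw bytes. The helper name `describe_bytes` is made up for this example; it only reports what is in the data and makes no claim about how OMEdit or Dymola read files.

```python
import codecs

def describe_bytes(data: bytes) -> dict:
    """Report BOM presence, line-ending counts and UTF-8 validity of raw file bytes."""
    has_bom = data.startswith(codecs.BOM_UTF8)  # the UTF-8 BOM is b"\xef\xbb\xbf"
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # LF not preceded by CR
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"bom": has_bom, "crlf": crlf, "lf": lf_only, "valid_utf8": valid}
```

Typical use would be `describe_bytes(open(path, "rb").read())`, where `path` is one of the attached files.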
comment:7 by , 9 years ago
Milestone: | 1.10.0 |
---|---|
Resolution: | → invalid |
Status: | new → closed |
I'll close this as invalid as it is a problem with another tool.
comment:8 by , 9 years ago
BTW,
I could not find any statement about line endings in Modelica specifications.
Maybe I'm wrong. Do you know whether in some place it is specified that Unix-style (only LF) and DOS-style (CRLF) are both correct under both operating systems?
follow-up: 10 comment:9 by , 9 years ago
Spec says:
The whitespace characters are space, tabulator, and line separators (carriage return and line feed)
comment:10 by , 9 years ago
Replying to sjoelund.se:
Spec says:
The whitespace characters are space, tabulator, and line separators (carriage return and line feed)
We are only doing line feed which is against the specs then.
comment:11 by , 9 years ago
Yeah,
we never learn enough from the specs!
I think that if switching to saving with CRLF makes our .mo files Dymola compliant (and possibly compliant with all tools), it is worthwhile doing so.
First of all we must be sure that this would solve the observed problem. I will make some checks myself.
comment:12 by , 9 years ago
No, it says line separators are whitespace. This means CR or LF. Which means LFCR and CRLF are both allowed. Or only one of them.
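A rough illustration of this reading: if CR and LF are both plain whitespace, a scanner sees the same tokens whichever line-ending style is used. The sketch below uses Python's whitespace splitting as a crude stand-in for a lexer; it is not a Modelica tokenizer, and the name `tokens` is made up for this example.

```python
def tokens(source: str) -> list[str]:
    """Split on any run of whitespace; CR and LF are treated like spaces and tabs."""
    # str.split() with no argument splits on runs of whitespace, including "\r" and "\n".
    return source.split()
```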
comment:13 by , 9 years ago
I also recently discovered that OM stores files with non-native line endings (i.e. LF on Windows, instead of the native CRLF). This is somewhat annoying (e.g. when using version control) if you also edit/create .mo files with other editors, which typically write platform-native line endings. It would be great if OM would follow the platform convention (LF on non-Windows, CRLF on Windows).
comment:14 by , 9 years ago
OM does follow the convention on all its platforms: the "Windows" platform is MinGW - a Unix platform which of course uses Unix line endings. If you use git, it is trivial to set it to store files as LF (and of course, to tell the repository to normalize line endings; setting .mo-files to text mode is essential).
comment:15 by , 9 years ago
OM does follow the convention on all its platforms: the "Windows" platform is MinGW - a Unix platform which of course uses Unix line endings.
Ah I was not aware of this.
If you use git, it is trivial to set it to store files as LF (and of course, to tell the repository to normalize line endings; setting .mo-files to text mode is essential).
This is not the whole story, though - if you use .gitattributes to do this, a simple
* text=auto
should normally be sufficient (and is sufficient for files created with other editors). If you want to be sure that Modelica files get recognized as text, you can also add
*.mo text
*.mos text
*.order text
but this does not work as expected, because OM creates LF files, which Git recognizes as not being native (and warns you about that). You have to set
# Use LF as eol to keep things uniform (OMEdit creates LF files)
*.mo text eol=lf
*.mos text eol=lf
*.order text eol=lf
for the warnings to not appear, and then instruct all your other editors to *also* create LF files (contrary to Windows convention).
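For an editor that cannot be told to write LF, a file can also be converted after the fact. A minimal Python sketch, with the hypothetical helper name `to_lf`, operating on raw bytes so the encoding of the content is left untouched:

```python
def to_lf(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```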
comment:16 by , 9 years ago
You can use *.mo text eol=native
as well. Or change the per-repository crlf settings, etc. Git knows how to deal with these things even if you get warnings when the line endings change.
follow-up: 18 comment:17 by , 9 years ago
You can use *.mo text eol=native as well.
If I do that, and create a new model with OM, and stage the new file, Git warns me with
Updating the Git index failed. A rescan will be automatically started to resynchronize git-gui.
warning: LF will be replaced by CRLF in MH_FC_system/CRLF_test.mo.
The file will have its original line endings in your working directory.
Or change the per-repository crlf settings, etc.
Setting this via .gitattributes as described above is a per-repo setting.
comment:18 by , 9 years ago
Replying to Christoph <buchner@…>:
Or change the per-repository crlf settings, etc.
Setting this via .gitattributes as described above is a per-repo setting.
That affects everyone else checking out the same code (who might be working on CRLF).
If I do that, and create a new model with OM, and stage the new file, Git warns me with
But it works as expected.
comment:19 by , 9 years ago
That affects everyone else checking out the same code (who might be working on CRLF).
Sure, that's the idea -- to have harmonious line-ending handling even if people using different OSes work together on a repo.
I previously handled the transition of a big-ish FOSS repo from the config hell of having to rely on individual devs (Win, Mac, Linux) to *all* have the respective correct setting in git config autocrlf, to having this specified in the repo, and it solved nearly all line-ending related problems that kept cropping up regularly.
* text=auto
was the correct (and sufficient for most files) setting in .gitattributes.
But it works as expected.
All right, if you consider source control issuing a warning for every newly added file as "works as expected", I don't have any arguments left to say.
follow-up: 24 comment:20 by , 9 years ago
I've checked.
If OM added the UTF-8 BOM (EF BB BF) as the first three bytes, then the resulting files are correctly read by Dymola!
I think it is highly advisable to add them, since we would gain good compatibility with very little effort.
At least the option to add them should be given (a checkbox in OMEdit, which I recommend be checked by default).
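As a workaround outside OMEdit, the BOM can be prepended by hand. Here is a minimal Python sketch; the helper name `with_utf8_bom` is made up for this example. It uses the standard `codecs.BOM_UTF8` constant, which is the byte sequence EF BB BF.

```python
import codecs

def with_utf8_bom(data: bytes) -> bytes:
    """Return the bytes with a UTF-8 BOM in front, without adding a second one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

When reading such files in Python, the `utf-8-sig` codec decodes the content and drops a leading BOM if one is present.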
comment:21 by , 9 years ago
@Christoph: I opened new enhancement ticket #3807 for the LF vs CRLF discussion. I guess this could be a setting in OMEdit.
Personally I prefer LF even on Windows, but yes, there are always issues with line endings and version control, so we should investigate anything we can do to make life easier for modelers.
comment:23 by , 9 years ago
Well, I will take up the BOM issue again in another ticket that I am going to open.
comment:24 by , 9 years ago
Replying to ceraolo:
I've checked.
If OM added the UTF-8 BOM (EF BB BF) as the first three bytes, then the resulting files are correctly read by Dymola!
I think it is highly advisable to add them, since we would gain good compatibility with very little effort.
At least the option to add them should be given (a checkbox in OMEdit, which I recommend be checked by default).
From the almighty Wikipedia:
The Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use. Byte order has no meaning in UTF-8 ... The IETF recommends that if a protocol either (a) always uses UTF-8 [like Modelica] or [...]
I tested both files on Linux, and OMEdit does not change any of the characters (adding a space, saving, removing the space and re-saving results in a file identical to the previous one). My installed version of Dymola (Dymola 2016, Linux) handles neither of the attached files. Does the Windows version of Dymola open the attached files properly?