- I downloaded the file interactively with Firefox and was not asked about text encoding.
- I copied and pasted the file name with Windows Explorer (right-click, rename, copy).
- Notepad++ says that both the pasted file name string and the file content, "test", are ANSI-encoded.
- Matlab's double says that the "é" of the file name is character 233.
- Here (Sweden), "é" is an "extended ASCII" character(?)
Working with unicode paths
The following is a followup to:
This question however is a bit more specific. I have a file which was created using a program on Windows. I can browse to the file in Windows Explorer (Win 7). I am however unable to:
1. Open the file in Matlab (using fopen).
2. Change to a directory with the same name: if I create such a directory, cd(directory) fails.
I have uploaded the file to a public folder on my dropbox account. https://www.dropbox.com/sh/d2mghr9xyb426lz/ZEM4DH8XTp
The files are: v. Békésy - 1957.txt v. Békésy - 1957.zip
I am currently unable to provide instructions for creating such a file in Matlab, hence the download. To preserve the name, I have also included the file in a zip, so that even if the zip is renamed on download, the file inside should keep the same name. Incidentally, extracting the zip to a folder with the same name is what created the folder that I cannot cd to with Matlab.
Thus, the question is how I get around issues #1 and #2 without renaming the files manually through a Windows interface. I assume this might mean using a custom library (mex and/or Java code).
The ideal solution is a generic piece of code that actually works for path/file manipulation, instead of needing manual intervention every time this problem is encountered.
Thanks, Jim
per isakson
on 4 Sep 2013
Edited: per isakson
on 4 Sep 2013
Now, I'm on a different computer (same installation: R2013a 64bit on Windows 7). I read your file without problems here too. "Standard Swedish" installation, I guess. And:
a = get(0, 'Language')
import java.nio.charset.Charset
b = Charset.defaultCharset()
c = feature('DefaultCharacterSet')
returns
a =
sv_se
b =
windows-1252
c =
windows-1252
Malcolm Lidierth
on 6 Sep 2013
Edited: Malcolm Lidierth
on 6 Sep 2013
Jim
MATLAB and Java both have to talk to the OS and the file system beneath them, so the behaviour is likely to vary across file systems (FAT12/16/32, NTFS, etc.) as well as across OS and MATLAB/Java versions.
From @per's comments: the windows-1252 charset is proprietary, not Unicode, and converting a Java String back to the byte[] it originated from requires knowing the Charset.
So, telling the difference between "v. Békésy" on this screen and the byte[] that it was created from requires information that the string does not contain and, AFAIK, neither does the directory entry of any file system.
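To make the charset point concrete, here is a minimal Java sketch (the class and method names are hypothetical, not from this thread) that prints the bytes of the same name under three encodings from `java.nio.charset.StandardCharsets`. ISO-8859-1 stands in for windows-1252 here, on the assumption that only "é" matters; both encode it as 0xE9.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows that one String maps to different byte
// sequences depending on the Charset used to encode it.
final class NameBytes {
    // Encode s with cs and format each byte as two upper-case hex digits.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "v. B\u00e9k\u00e9sy"; // \u00e9 is "é"
        System.out.println("ISO-8859-1: " + hex(name, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + hex(name, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE:   " + hex(name, StandardCharsets.UTF_16LE));
    }
}
```

The ISO-8859-1 line has one byte per character (E9 for "é"), while UTF-8 spends two bytes (C3 A9) on each "é", so the byte[] alone does not say which String it came from.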
On my Mac:
>> java.nio.charset.Charset.availableCharsets.size()
ans =
166
The answer then is that there is no answer beyond "don't use special characters in file names", as suggested by Jan on your first post. But, on the assumption that nobody is likely to have used anything but an 8-bit encoding:
>> java.lang.String('v. Békésy').getBytes()
ans =
118
46
32
66
-23
107
-23
115
121
But compare that with the MATLAB char array:
>> char(java.lang.String('v. Békésy').getBytes())
ans =
v .
B
k
s y
and with:
>> uint8(java.lang.String('v. Békésy').getBytes())
ans =
118
46
32
66
0
107
0
115
121
For this problem, MATLAB's saturating integer conversion is not helpful: casting to uint8 clamps the negative int8 values to 0, so the byte for "é" (-23 above) is lost.
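The -23 in the Java output and the 233 that double reported for "é" are the same byte: a Java byte is signed, so 0xE9 prints as 233 - 256 = -23. A small Java sketch (hypothetical class name) of the reinterpretation that a clamping cast does not perform:

```java
import java.util.Arrays;

// Hypothetical helper: reinterpret Java's signed bytes as 0..255 values
// instead of clamping negatives to 0.
final class UnsignedBytes {
    static int[] toUnsigned(byte[] signed) {
        int[] out = new int[signed.length];
        for (int i = 0; i < signed.length; i++) {
            out[i] = Byte.toUnsignedInt(signed[i]); // same as signed[i] & 0xFF
        }
        return out;
    }

    public static void main(String[] args) {
        // The byte values returned by getBytes() in the example above.
        byte[] raw = {118, 46, 32, 66, -23, 107, -23, 115, 121};
        System.out.println(Arrays.toString(toUnsigned(raw)));
        // prints [118, 46, 32, 66, 233, 107, 233, 115, 121]
    }
}
```

On the MATLAB side, typecast (rather than uint8) is the function that reinterprets the bits instead of converting the value; that suggestion is untested against this exact output.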
>> java.lang.String(java.lang.String('v. Békésy').getBytes(),java.nio.charset.Charset.defaultCharset())
ans =
v. Békésy
>> java.nio.charset.Charset.defaultCharset()
ans =
ISO-8859-1
but:
>> java.lang.String(java.lang.String('v. Békésy').getBytes(), 'US-ASCII')
ans =
v. Bks
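The US-ASCII result shows that a wrong charset loses characters without raising an error. As a sketch (hypothetical class and method names), Java can check up front whether a charset can represent a name at all, via `CharsetEncoder.canEncode`, instead of discovering the loss after a round trip:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: test whether a name survives a given charset.
final class RoundTripCheck {
    // True if every character of name has a representation in cs.
    static boolean survives(String name, Charset cs) {
        return cs.newEncoder().canEncode(name);
    }

    // Encode and decode with the same charset; unmappable characters
    // are replaced by the charset's replacement byte ('?' for US-ASCII).
    static String roundTrip(String name, Charset cs) {
        return new String(name.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String name = "v. B\u00e9k\u00e9sy"; // \u00e9 is "é"
        System.out.println(survives(name, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survives(name, StandardCharsets.US_ASCII));   // false
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII));  // v. B?k?sy
    }
}
```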
Regards ML
Malcolm Lidierth
on 6 Sep 2013
@Jim Not a Java issue. On my Mac e.g. with both a Mac HD and a FAT32 drive
>> java.io.File('v. Békésy')
ans =
v. Békésy
>> ans.exists()
ans =
1
It's a MATLAB issue.
P.S. Scilab works fine, while R gives the name as v. Be\314\201ke\314\201sy but works happily with that.
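A note on R's output: `\314\201` is the octal form of the UTF-8 bytes 0xCC 0x81, i.e. U+0301 COMBINING ACUTE ACCENT. The name apparently reached R in decomposed form (an "e" followed by the accent) rather than with the single character 233. Two such strings look identical on screen but do not compare equal, which may matter when names move between systems. A sketch using `java.text.Normalizer` (the class name is hypothetical) that compares names regardless of form:

```java
import java.text.Normalizer;

// Hypothetical helper: compare file names independent of whether
// accented letters are precomposed (NFC) or decomposed (NFD).
final class NameNormalization {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Normalize both sides to NFC before comparing.
    static boolean sameName(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        String composed = "v. B\u00e9k\u00e9sy";     // "é" as one code point
        String decomposed = "v. Be\u0301ke\u0301sy"; // "e" + combining accent
        System.out.println(composed.equals(decomposed));   // false
        System.out.println(sameName(composed, decomposed)); // true
        System.out.println(nfd(composed).length());         // 11
    }
}
```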
per isakson
on 12 Sep 2013
Edited: per isakson
on 13 Sep 2013
Googling taught me
- NTFS stores file names in Unicode.
- Not all zip-tools are Unicode-aware.
- The name of files transferred in zip-files between systems with different default character sets may be "corrupted".
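The zip point can be sketched in Java, whose `ZipOutputStream` and `ZipInputStream` accept a `Charset` for entry names (Java 7 and later). This is an illustration under assumptions, not a diagnosis of the zip in this thread: the class and method names are hypothetical, the archive is built in memory, and ISO-8859-1 stands in for whatever 8-bit charset a non-Unicode-aware tool might have used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: entry names in a zip are bytes, and the reader
// must decode them with the charset the writer used.
final class ZipNameCharset {
    // Build a one-entry zip in memory, encoding the entry name with cs.
    static byte[] writeZip(String entryName, Charset cs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf, cs)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("test".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read the first entry name, decoding it with cs.
    static String firstEntryName(byte[] zip, Charset cs) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            ZipEntry e = zis.getNextEntry();
            return e == null ? null : e.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "v. B\u00e9k\u00e9sy - 1957.txt"; // \u00e9 is "é"
        byte[] zip = writeZip(name, StandardCharsets.ISO_8859_1);
        // Decoding with the writer's charset recovers the name.
        System.out.println(firstEntryName(zip, StandardCharsets.ISO_8859_1));
    }
}
```

Reading the same archive with a different charset is where names get corrupted or rejected, so a tool has to know or guess the writer's charset unless the archive marks its names as UTF-8.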