How to access Unicode strings through the MEX/Engine C interfaces?
Short version
How can I access the underlying Unicode data of MATLAB strings through the MATLAB Engine or MEX interface?
Here's an example. Put some Unicode characters in a UTF-8 encoded file test.txt, then read it with:
fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')
If I first run feature('DefaultCharacterSet', 'UTF-8') and then engEvalString(ep, "s"), I get back the text from the file as UTF-8. This shows that MATLAB stores it as Unicode internally. However, if I call mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters replaced by code 26. What I need is access to the underlying Unicode data as a C string in any Unicode encoding (UTF-8, UTF-16, etc.). Is this possible?
- "[mxArrayToString] supports multibyte encoded characters."
So how can I get the multibyte non-Latin-1 characters then?
----
Original long version
What character encoding does MATLAB use internally, if any, and is there a way to control this? To be precise, I would like to know whether there is a way to guarantee that any character array I retrieve adheres to a particular encoding, preferably a Unicode one.
I am interfacing MATLAB with another library through the MATLAB Engine interface, and I need to guarantee a character encoding when sending strings to the other library. Is this possible at all, or are MATLAB's strings plain char arrays with no associated encoding?
Related things I found:
- This here says that it uses UTF-16, but that's not what I see when I retrieve strings in C code.
- I found references to feature('DefaultCharacterEncoding', 'UTF-8') on the web. This appears to control what encoding the input commands (engEvalString) are assumed to have, and how the output is encoded. If I supply a UTF-8 encoded á as s='á', then retrieve it in C, I get an ISO-Latin-1 encoded á. If I send something that's not in Latin-1, I get nonsense (actually character code 26). (At least this is my impression after a few simple tests, which are time-consuming to run.)
In light of this finding, I'd like to know: does MATLAB support Unicode for all its strings? If yes, how do I get access to these from the C interface? (Any Unicode encoding is acceptable: UTF-8, UTF-16, UTF-32, etc.) If it doesn't support Unicode, is ISO-Latin-1 its default? Can I assume that all strings I retrieve through the C interface can be interpreted as ISO-Latin-1?
Also, any pointers to the relevant documentation on the issue are most welcome.
(I should probably mention that I was testing this on OS X, as I'm aware that there are differences in the implementation of the MATLAB Engine interface between platforms.)
Accepted Answer
More Answers (1)
Walter Roberson
on 18 Feb 2013
MATLAB uses a 16-bit character internally, but it does not use UTF-anything. It simply uses the first 65536 Unicode code points (U+0000 through U+FFFF), one per character.
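If the characters really are plain 16-bit code points as described above, one way to get a UTF-8 C string is to read the raw 16-bit data yourself instead of going through mxArrayToString. Below is a minimal sketch of the conversion step only. The function name units16_to_utf8 is made up for this example, and it assumes every unit is a single code point below U+10000 (surrogate pairs are not combined). In a MEX or Engine program the buffer would presumably come from mxGetChars and its length from mxGetNumberOfElements; those calls are left out here so the snippet compiles without MATLAB.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the MATLAB API).
 * Converts n 16-bit code units to a newly allocated, NUL-terminated
 * UTF-8 string. Assumption: each unit is one code point in the range
 * U+0000..U+FFFF; surrogate pairs are NOT combined.
 * The caller frees the result. Returns NULL if allocation fails. */
char *units16_to_utf8(const uint16_t *units, size_t n)
{
    /* A code point below U+10000 needs at most 3 bytes in UTF-8. */
    char *out = malloc(3 * n + 1);
    size_t j = 0;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n; ++i) {
        uint16_t c = units[i];
        if (c < 0x80) {
            /* 1 byte: 0xxxxxxx */
            out[j++] = (char)c;
        } else if (c < 0x800) {
            /* 2 bytes: 110xxxxx 10xxxxxx */
            out[j++] = (char)(0xC0 | (c >> 6));
            out[j++] = (char)(0x80 | (c & 0x3F));
        } else {
            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[j++] = (char)(0xE0 | (c >> 12));
            out[j++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[j++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[j] = '\0';
    return out;
}
```

For example, passing the three units 0x0061, 0x00E1, 0x20AC (the characters a, á, €) should give the six bytes 61 C3 A1 E2 82 AC followed by the terminating NUL. If the other library wants UTF-16 instead, the 16-bit buffer could probably be handed over as-is, under the same assumption that it holds no surrogates.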