Discussion:
String with unicode 'digit-4' code
Jake Larrimore
2014-09-19 16:51:46 UTC
Hey Group--
I am being passed a string that may contain Unicode in its 4-digit format
-- u'\u4eb0'. I'm asked to be able to display said string in a Richtext
and need to decode the Unicode to display as well. Is there a simple way
to decode a string that may (or may not) contain Unicode so that when I
pass it into richtext.WriteText(text), it knows to display the unicode,
without parsing the string to do it manually?

I attached an example of what I'm dealing with. I can get it to display
the Unicode if I pass it in alone--

richtext.WriteText( u'\u4eb0')

-- but if it is in the middle of a string it doesn't recognize it as
Unicode. I'd rather not have to parse the whole thing to search for
unicode; I know there must be a simpler way.

Thanks,
Jake

Environment:
Python 2.7/ 32 bit
wxPython 2.8 (unicode version)
--
You received this message because you are subscribed to the Google Groups "wxPython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wxpython-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Werner
2014-09-19 17:22:36 UTC
Hi Jake,
Is the string really coming in as you show in the code? It would be a
bit odd to have "u'\u4eb0'" in the middle of some string you get.

If it is defined as u'' it works for me, e.g.:

testScript = u" Hello world! \u4eb0. Hello, again \n" # works
panel.ed.WriteText(testScript)
t2 = " Hello world! u'\u4eb0'. Hello, again \n" # does not work
panel.WriteText(t2)
panel.WriteText( u'\u4eb0') # works

Werner
Tim Roberts
2014-09-19 17:40:37 UTC
The difference here is not between "passing it in alone" and "in the
middle". The difference is in the datatypes. If you change your
example from this:
testScript = " Hello world! u'\u4eb0'. Hello, again \n"
to this:
testScript = u" Hello world! \u4eb0. Hello, again \n"
it works fine. Can you see the difference?

The u"xxx" and u'xxx' forms are part of Python syntax for string
constants, and are handled at compile time (for some value of
"compile"). It's not part of run-time string handling. The first case
is creating an 8-bit string. In an 8-bit string, the interpreter looks
for things like \x00 and converts it to a single byte. The second case
is creating a Unicode string. In a Unicode string, the interpreter
looks for things like \u4eb0 and converts it to a single character.
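
A small sketch of that difference, for anyone who wants to try it. The thread is about Python 2.7, but this is written so it also runs on Python 3: the b"" literal with a doubled backslash is an assumption standing in for what Python 2 stores for a plain "..." literal containing \u4eb0. The variable names are illustrative only.

```python
# Stand-in for the Python 2 literal " ... \u4eb0 ... " (no u prefix):
# the backslash, the "u" and the four hex digits stay in the data.
byte_text = b"Hello world! \\u4eb0. Hello, again \n"

# With the u prefix the interpreter turns \u4eb0 into one character.
unicode_text = u"Hello world! \u4eb0. Hello, again \n"

# The escape is still spelled out, character by character, in the byte string.
print(b"\\u4eb0" in byte_text)

# The unicode string has no backslash at all, only the code point U+4EB0.
print(u"\\" in unicode_text)
print(hex(ord(unicode_text[13])))

# Six ASCII characters versus one character: the lengths differ by five.
print(len(byte_text) - len(unicode_text))
```

Nothing here touches wxPython; the point is only that the conversion happens when the literal is parsed, before WriteText ever sees the value.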
--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.
Jake Larrimore
2014-09-19 17:45:46 UTC
Tim and Werner:
I see now. Thanks for the quick reply. Tim, your explanation was spot on
and very helpful. I completely understand now.

Best regards,
Jake
Jake Larrimore
2014-09-19 18:23:47 UTC
Sorry all, it looks like I spoke too soon...
I'm getting the following:

testScript = u"Hello world! \u4eb0. Hello, again \n"  # works fine

However:

testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"utf-8")
richText.WriteText(testUnicode)  # doesn't work?

Even though type(testUnicode) = <type 'unicode'>

What am I missing here?
Thanks again,
Jake
Tim Roberts
2014-09-19 19:37:40 UTC
Post by Jake Larrimore
Sorry all it looks like I spoke too soon...
testScript = u"Hello world! \u4eb0. Hello, again \n"  # works fine
testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"utf-8")
richText.WriteText(testUnicode)  # doesn't work?
Even though type(testUnicode) = <type 'unicode'>
What am I missing here?
It's because the \u syntax is not handled by the string module. It is
handled by the interpreter when it parses a Unicode string CONSTANT, not
when it converts to Unicode. When I do this:
sss = "abc\n\x33"
I am creating a string that contains 5 characters, with the values 0x61
0x62 0x63 0x0A 0x33. The string contains no backslashes, nor does it
contain the letter "x". Similarly, when I do this:
uuu = u"abc\n\u4eb0"
I am creating a string that contains 5 characters: 0x0061 0x0062 0x0063
0x000A 0x4EB0. Again, the string contains no backslashes, nor does it
contain the letter "u".

But when I say this:
sss = "abc\n\u4eb0"
I have created a 10-character string. It starts with 61 62 63 0A, but
then it actually contains a backslash and a "u". Those are valid ASCII
characters, and they are valid Unicode characters. So, when you convert
that to Unicode, it happily converts the backslash and the "u4eb0" to
their Unicode equivalents.
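
The character counts above can be checked with a short sketch. It is written to run on both Python 2.7 and Python 3, so the 8-bit strings are spelled as b"" literals, and the doubled backslash in raw is an assumption standing in for what Python 2 stores for the plain literal "abc\n\u4eb0".

```python
sss = b"abc\n\x33"       # 5 bytes: 61 62 63 0A 33
uuu = u"abc\n\u4eb0"     # 5 characters, the last one is U+4EB0
raw = b"abc\n\\u4eb0"    # 10 bytes: 61 62 63 0A, then \ u 4 e b 0

print(len(sss))
print(len(uuu))
print(len(raw))

# Decoding as UTF-8 changes the type, not the content: the backslash
# and the "u4eb0" are plain ASCII, so they come through unchanged.
decoded = raw.decode("utf-8")
print(len(decoded))
print(decoded.endswith(u"\\u4eb0"))
```

That is why type(testUnicode) reports unicode and the control still shows the escape text: the conversion succeeded, it just had no escape to interpret.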

If you are receiving 8-bit strings that contain these Unicode escapes,
then you are going to have to parse it by hand after you convert it to
Unicode. If you need to embed Unicode code points in an 8-bit string,
then you need to check into using UTF-8. The UTF-8 for U+4EB0 is E4 BA
B0. So, you could say this:

testScript = "Hello world! \xe4\xba\xb0. Hello, again \n"
testUnicode = testScript.decode('utf-8')
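
For what it's worth, the E4 BA B0 bytes need not be worked out by hand. A minimal sketch (runs on Python 2.7 and 3; the variable names are illustrative) that lets Python produce them and then round-trips the 8-bit string:

```python
char = u"\u4eb0"

# Encoding the character gives the three UTF-8 bytes quoted above.
encoded = char.encode("utf-8")
print(encoded == b"\xe4\xba\xb0")

# Build the 8-bit string from pieces, then decode the whole thing.
test_script = b"Hello world! " + encoded + b". Hello, again \n"
test_unicode = test_script.decode("utf-8")
print(test_unicode == u"Hello world! \u4eb0. Hello, again \n")
```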
--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.
Chris Barker
2014-09-23 16:40:42 UTC
Post by Tim Roberts
If you are receiving 8-bit strings that contain these Unicode escapes,
then you are going to have to parse it by hand after you convert it to
Unicode.
or use eval() -- much easier, though always potentially dangerous:

In [10]: print rs
this is a 'raw' string that has a unicode escape in it: \u00B0

In [11]: eval('u"%s"'%rs)
Out[11]: u"this is a 'raw' string that has a unicode escape in it: \xb0"

in this case, the eval() creates an actual unicode object.

There may be a way to invoke python's string parsing without the general
purpose eval -- I haven't looked.
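
One candidate for that, offered as a sketch rather than a tested recommendation for this use case: ast.literal_eval from the standard library parses literals only and refuses anything else. The rs value below is retyped from the session above. The wrapping still breaks if the text itself contains a double quote or a raw newline, so this assumes input without either.

```python
import ast

rs = "this is a 'raw' string that has a unicode escape in it: \\u00B0"

# Wrap the text in a unicode literal and let the literal parser handle it.
parsed = ast.literal_eval('u"%s"' % rs)
print(parsed == u"this is a 'raw' string that has a unicode escape in it: \xb0")

# Unlike eval(), an expression that is not a literal is rejected.
try:
    ast.literal_eval('__import__("os").getcwd()')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```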

-Chris


Post by Tim Roberts
If you need to embed Unicode code points in an 8-bit string, then you need
to check into using UTF-8. The UTF-8 for U+4EB0 is E4 BA B0. So, you
could say this:
testScript = "Hello world! \xe4\xba\xb0. Hello, again \n"
testUnicode = testScript.decode('utf-8')
In that case, I'd just use unicode, then encode it to utf-8 -- just like
the normal old way to do it. Why write what is essentially a utf-8 encoder
when python gives you one?

-CHB

----------------------------------
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov
Nathan McCorkle
2014-09-19 21:03:38 UTC
Post by Jake Larrimore
Sorry all it looks like I spoke too soon...
testScript = u"Hello world! \u4eb0. Hello, again \n"  # works fine
testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"utf-8")
richText.WriteText(testUnicode)  # doesn't work?
Even though type(testUnicode) = <type 'unicode'>
What am I missing here?
This works for me... don't try printing it to the console though, it
doesn't seem to know how to print that character (it shows up in the
RichText box as an Asian-looking character)

testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"unicode-escape")
print type(testUnicode)
panel.WriteText(testUnicode)
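
The same codec can be tried without a GUI. The sketch below runs on Python 2.7 and 3; the b"" literal with a doubled backslash is an assumption standing in for the incoming Python 2 str. It also shows one caveat worth knowing: the codec reads any non-ASCII byte as Latin-1, so input that mixes \uXXXX escapes with real UTF-8 bytes will come out mangled.

```python
raw = b"Hello world! \\u4eb0. Hello, again \n"

# The unicode_escape codec interprets \uXXXX sequences at run time.
text = raw.decode("unicode_escape")
print(text == u"Hello world! \u4eb0. Hello, again \n")

# Caveat: bytes that are already UTF-8 are not understood as UTF-8.
utf8_bytes = u"\u4eb0".encode("utf-8")
mangled = utf8_bytes.decode("unicode_escape")
print(len(mangled))
```

If the incoming strings are plain ASCII plus escapes, as in this thread, the caveat does not apply.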
Jake Larrimore
2014-09-23 13:40:55 UTC
Tim--
Thanks for the explanation. That makes sense.

Nathan--
That worked for me also. This is what I ended up using. Though Tim's
explanation was also very helpful as to WHY it wasn't working. I'm glad I
don't have to parse through by hand.

Thanks again guys,
Jake