String with unicode 'digit-4' code

Discussion:

Jake Larrimore

2014-09-19 16:51:46 UTC

Hey Group--
I am being passed a string that may contain Unicode in its 4-digit format
-- u'\u4eb0'. I'm asked to be able to display said string in a Richtext
and need to decode the Unicode to display as well. Is there a simple way
to decode a string that may (or may not) contain Unicode so that when I
pass it into richtext.WriteText(text), it knows to display the unicode,
without parsing the string to do it manually?

I attached an Example of what I'm dealing with. I can get it to display
the Unicode if I pass it in alone--

richtext.WriteText( u'\u4eb0')

-- but if it is in the middle of a string it doesn't recognize it as
Unicode. I'd rather not have to parse the whole thing to search for
Unicode; I know there must be a simpler way.

Thanks,
Jake

Environment:
Python 2.7/ 32 bit
wxPython 2.8 (unicode version)

--
You received this message because you are subscribed to the Google Groups "wxPython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wxpython-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Werner

2014-09-19 17:22:36 UTC

Hi Jake,

Is the string really coming in as you show in the code? It would be a
bit odd to have "u'\u4eb0'" in the middle of some string you get.

If it is defined as u'' it works for me, e.g.:

testScript = u" Hello world! \u4eb0' . Hello, again \n" # works
panel.ed.WriteText(testScript)
t2 = " Hello world! u'\u4eb0'. Hello, again \n" # does not work
panel.WriteText(t2)
panel.WriteText( u'\u4eb0') # works

Werner

Tim Roberts

2014-09-19 17:40:37 UTC

The difference here is not between "passing it in alone" and "in the
middle". The difference is in the datatypes. If you change your
example from this:
testScript = " Hello world! u'\u4eb0'. Hello, again \n"
to this:
testScript = u" Hello world! \u4eb0. Hello, again \n"
it works fine. Can you see the difference?

The u"xxx" and u'xxx' forms are part of Python syntax for string
constants, and are handled at compile time (for some value of
"compile"). It's not part of run-time string handling. The first case
is creating an 8-bit string. In an 8-bit string, the interpreter looks
for things like \x00 and converts it to a single byte. The second case
is creating a Unicode string. In a Unicode string, the interpreter
looks for things like \u4eb0 and converts it to a single character.
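
A minimal sketch of that distinction, written so that it runs on both Python 2.7 and Python 3. The variable names are illustrative, and the doubled backslash in the byte string is an assumption made to spell out what a plain Python 2 string literal stores when it meets \u4eb0.

```python
# Unicode literal: the parser replaces \u4eb0 with a single character.
as_unicode = u"Hello \u4eb0"

# Byte string: \u is not an escape here, so the six characters
# backslash, u, 4, e, b, 0 are stored as-is. The backslash is doubled
# to make that explicit and to avoid an invalid-escape warning on Python 3.
as_bytes = b"Hello \\u4eb0"

print(len(as_unicode))  # 7
print(len(as_bytes))    # 12
```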

--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.

Jake Larrimore

2014-09-19 17:45:46 UTC

Tim and Werner:
I see now. Thanks for the quick reply. Tim, your explanation was spot on
and very helpful. I completely understand now.

Best regards,
Jake

Jake Larrimore

2014-09-19 18:23:47 UTC

Sorry all, it looks like I spoke too soon...
I'm getting the following:

testScript = u"Hello world! \u4eb0. Hello, again \n"  # works fine

However:

testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"utf-8")
richText.WriteText(testUnicode)  # doesn't work?

Even though type(testUnicode) = <type 'unicode'>

What am I missing here?
Thanks again,
Jake

Tim Roberts

2014-09-19 19:37:40 UTC

It's because the \u syntax is not handled by the string module. It is
handled by the interpreter when it parses a Unicode string CONSTANT, not
when it converts to Unicode. When I do this:
sss = "abc\n\x33"
I am creating a string that contains 5 characters, with the values 0x61
0x62 0x63 0x0A 0x33. The string contains no backslashes, nor does it
contain the letter "x". Similarly, when I do this:
uuu = u"abc\n\u4eb0"
I am creating a string that contains 5 characters: 0x0061 0x0062 0x0063
0x000A 0x4EB0. Again, the string contains no backslashes, nor does it
contain the letter "u".

But when I say this:
sss = "abc\n\u4eb0"
I have created a 10-character string. It starts with 61 62 63 0A, but
then it actually contains a backslash and a "u". Those are valid ASCII
characters, and they are valid Unicode characters. So, when you convert
that to Unicode, it happily converts the backslash and the "u4eb0" to
their Unicode equivalents.
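
To make the point concrete, here is a small sketch (names are illustrative) showing that decoding such a byte string as UTF-8 changes its type but leaves the backslash sequence untouched. The doubled backslash is an assumption that reproduces what Python 2 stores for "abc\n\u4eb0" in a plain string.

```python
raw = b"abc\n\\u4eb0"          # 10 bytes: a b c newline \ u 4 e b 0
decoded = raw.decode("utf-8")  # now a Unicode string, same 10 characters

print(len(raw))               # 10
print(len(decoded))           # 10
print(u"\\u4eb0" in decoded)  # True: the escape is still literal text
print(u"\u4eb0" in decoded)   # False: there is no real U+4EB0 character
```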

If you are receiving 8-bit strings that contain these Unicode escapes,
then you are going to have to parse it by hand after you convert it to
Unicode. If you need to embed Unicode code points in an 8-bit string,
then you need to check into using UTF-8. The UTF-8 for U+4EB0 is E4 BA
B0. So, you could say this:

testScript = "Hello world! \xe4\xba\xb0. Hello, again \n"
testUnicode = testScript.decode('utf-8')
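
For the "parse it by hand" route, one possible sketch is a regular expression that rewrites only \uXXXX sequences in text that has already been decoded. The helper name expand_u_escapes is hypothetical, and the sketch deliberately ignores other escapes (such as a literal backslash-n or 8-digit \U sequences) as well as escaped backslashes.

```python
import re

try:
    unichr          # Python 2
except NameError:
    unichr = chr    # Python 3 has no unichr; chr does the same job

# Matches a backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(u"\\\\u([0-9a-fA-F]{4})")


def expand_u_escapes(text):
    """Replace literal backslash-u escapes in a Unicode string with their characters."""
    return _U_ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)


# Hypothetical input: the bytes as they might arrive, decoded first.
received = b"Hello world! \\u4eb0. Hello, again \n".decode("utf-8")
print(expand_u_escapes(received) == u"Hello world! \u4eb0. Hello, again \n")  # True
```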

Chris Barker

2014-09-23 16:40:42 UTC

Post by Tim Roberts
If you are receiving 8-bit strings that contain these Unicode escapes,
then you are going to have to parse it by hand after you convert it to
Unicode.

or use eval() -- much easier, though always potentially dangerous:

In [10]: print rs
this is a 'raw' string that has a unicode escape in it: \u00B0

In [11]: eval('u"%s"'%rs)
Out[11]: u"this is a 'raw' string that has a unicode escape in it: \xb0"

in this case, the eval() creates an actual unicode object.

There may be a way to invoke python's string parsing without the general
purpose eval -- I haven't looked.
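
One candidate, offered as a sketch rather than something confirmed in this thread, is ast.literal_eval from the standard library, which parses Python literals but refuses arbitrary expressions. The variable rs mirrors the session above; wrapping it in u"..." carries the same assumption as the eval() version, namely that the text contains no unescaped double quotes or raw newlines.

```python
import ast

# The same text as in the session above, with a literal backslash-u escape.
rs = "this is a 'raw' string that has a unicode escape in it: \\u00B0"

# Build the source text of a unicode literal and let the literal parser
# handle the escapes. Unlike eval(), this rejects function calls and
# other expressions.
parsed = ast.literal_eval('u"%s"' % rs)

print(parsed == u"this is a 'raw' string that has a unicode escape in it: \u00b0")  # True
```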

-Chris

Post by Tim Roberts
If you need to embed Unicode code points in an 8-bit string, then you need
to check into using UTF-8. The UTF-8 for U+4EB0 is E4 BA B0. So, you
could say this:
testScript = "Hello world! \xe4\xba\xb0. Hello, again \n"
testUnicode = testScript.decode('utf-8')

In that case, I'd just use Unicode, then encode it to UTF-8 -- just like
the normal old way to do it. Why write what is essentially a UTF-8 encoder
when Python gives you one?

-CHB

----------------------------------
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov

Nathan McCorkle

2014-09-19 21:03:38 UTC

Post by Jake Larrimore
Sorry all it looks like I spoke too soon...
testScript = u"Hello world! \u4eb0. Hello, again \n"  # works fine
testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"utf-8")
richText.WriteText(testUnicode)  # doesn't work?
Even though type(testUnicode) = <type 'unicode'>
What am I missing here?

This works for me... don't try printing it to the console though; it
doesn't seem to know how to print that character (it shows up in the
RichText box as an Asian-looking character).

testScript = "Hello world! \u4eb0. Hello, again \n"
testUnicode = unicode(testScript,"unicode-escape")
print type(testUnicode)
panel.WriteText(testUnicode)
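
A self-contained version of the same idea without the wx dependency, as a sketch that runs on Python 2.7 and Python 3. One caveat: the unicode_escape codec treats any non-ASCII bytes in the input as Latin-1, so it is only a safe fit when the incoming byte string is plain ASCII apart from the escapes. The doubled backslash is an assumption that stands in for the literal \u4eb0 text a plain Python 2 string would hold.

```python
raw = b"Hello world! \\u4eb0. Hello, again \n"  # literal backslash-u text, as received

text = raw.decode("unicode_escape")  # interprets \uXXXX and the other Python escapes

print(len(raw) - len(text))  # 5: six escape characters became one
print(text == u"Hello world! \u4eb0. Hello, again \n")  # True
```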

Jake Larrimore

2014-09-23 13:40:55 UTC

Tim--
Thanks for the explanation. That makes sense.

Nathan--
That worked for me also. This is what I ended up using. Though Tim's
explanation was also very helpful as to WHY it wasn't working. I'm glad I
don't have to parse through by hand.

Thanks again guys,
Jake
