Adding binary data to DOM tree in Python (charset encoding problems)

When I was reading some stuff from the database (Postgres) and adding it to a DOM tree in Python, I encountered a strange encoding problem (that’s got something to do with UTF-8).

Ultimately the solution was to set the client encoding to UTF-8 in Postgres:

alter user <myuser> set client_encoding to 'utf-8'

In the process I also discovered that if you add a byte string (not a unicode string) to a DOM tree, you should decode it first:

<utf8-variable>.decode("utf-8")

And encode it back to utf-8 when writing XML:

doc.toprettyxml(indent=" ", encoding="utf-8")
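
Put together, the two steps might look like the sketch below. It assumes Python 3 (where a byte string is a `bytes` object) and the standard library's `xml.dom.minidom`. The sample value and the element name `greeting` are made up for illustration and simply stand in for whatever comes back from the database.

```python
from xml.dom.minidom import Document

# Hypothetical value standing in for a UTF-8 byte string read from the database.
raw = "Gr\u00fc\u00dfe".encode("utf-8")

doc = Document()
root = doc.createElement("greeting")  # illustrative element name
doc.appendChild(root)

# Decode the bytes into a text string before handing them to the DOM.
root.appendChild(doc.createTextNode(raw.decode("utf-8")))

# Serialize; when an encoding is given, minidom returns encoded bytes
# and writes the encoding into the XML declaration.
xml_bytes = doc.toprettyxml(indent=" ", encoding="utf-8")
```

Because the result is already encoded, it should be written to a file opened in binary mode.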

Obviously, if the variable contains UTF-8 to begin with, there’s not much point in encoding it back and forth. However, if other encodings are involved, it should be done this way.
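
As a rough illustration of the "other encodings" case, here is a sketch that assumes Python 3 and a value that arrives as Latin-1 bytes. The sample text and the element name `item` are invented for the example. The bytes are decoded with their actual source encoding, and only the XML output is UTF-8.

```python
from xml.dom.minidom import Document

# Hypothetical input: text that arrives as Latin-1 bytes, not UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

doc = Document()
item = doc.createElement("item")  # illustrative element name
doc.appendChild(item)

# Decode using the encoding the data is really in.
item.appendChild(doc.createTextNode(latin1_bytes.decode("latin-1")))

# The DOM holds text, so the output encoding can be chosen independently.
utf8_xml = doc.toprettyxml(indent=" ", encoding="utf-8")
```

Decoding the same bytes as UTF-8 would raise a `UnicodeDecodeError` here, since a lone `0xE9` byte is not valid UTF-8.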
