Unicode is a standard for the universal representation of text in computing. It covers almost all characters across the world's writing systems, assigning each character a unique number called a code point. The Unicode standard also defines the UTF-8, UTF-16 and UTF-32 encodings.
This article discusses how to deal with Unicode characters in Python.
Python Source Encoding (Python 2)
By default, Python 2 source files are interpreted as ASCII. To write Unicode characters in source files, we must follow the standard defined in PEP 263 and declare the encoding in the first or second line of the file.
# coding=utf-8
or
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Unicode objects (Python 2)
In Python 2, the unicode string type represents Unicode characters and is defined by prepending u to the string literal.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
a = u"Chicken attack! 🐔"
print(a)
Unicode in Python 3
Python 3 introduced a breaking change when it comes to Unicode: strings are now Unicode, as opposed to Python 2's byte strings. By default, all source files are now interpreted as UTF-8, so declaring the source encoding is no longer needed.
#!/usr/bin/env python
a = "Chicken attack! 🐔"
print(a)
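Since strings are Unicode by default, converting between text and bytes is always explicit in Python 3. A minimal sketch of the round trip:

```python
#!/usr/bin/env python
text = "Chicken attack! \U0001f414"

# str.encode turns Unicode text into bytes using an explicit codec.
data = text.encode('utf-8')

# bytes.decode reverses it; the codec must match the one used to encode.
assert data.decode('utf-8') == text
print(type(text), type(data))  # <class 'str'> <class 'bytes'>
```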
Note that in Python 2 the default str type is a byte string, while in Python 3 byte strings must be defined explicitly by prepending b to the literal. Byte string literals cannot contain non-ASCII characters.
#!/usr/bin/env python
a = b'hello'
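Because bytes and str are distinct types in Python 3, they never compare equal and cannot be concatenated directly; the bytes must be decoded first. A short sketch:

```python
#!/usr/bin/env python
a = b'hello'

# bytes and str never compare equal in Python 3.
print(a == 'hello')  # False

# Concatenating them raises TypeError; decode the bytes first.
try:
    a + ' world'
except TypeError:
    print('cannot concatenate bytes and str')

print(a.decode('ascii') + ' world')  # hello world
```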
Reading and writing files
To read and write Unicode files, we can use the codecs library, which works on both Python 2 and Python 3.
#!/usr/bin/env python
import codecs
with codecs.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
In Python 3, it is recommended to use the io library instead, because the codecs library is planned to be deprecated.
#!/usr/bin/env python
import io
with io.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
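Writing works the same way: pass mode 'w' along with the encoding. A self-contained sketch that writes a file and reads it back (the file name myfile.txt is just for illustration):

```python
#!/usr/bin/env python
import io

# Write Unicode text as UTF-8, then read it back with the same codec.
with io.open('myfile.txt', 'w', encoding='utf-8') as f:
    f.write(u'Chicken attack! \U0001f414')

with io.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
```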
Applying this to CSV files, for example, we simply set the encoding when opening the file.
#!/usr/bin/env python
import csv
import io
with io.open('chicken.csv', 'w', encoding='utf-8', newline='') as csvfile:
    # newline='' stops the csv module from mistranslating row endings
    writer = csv.writer(csvfile)
    writer.writerow(['Chicken attack!', '🐔'])
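Reading the rows back is symmetric: open the file with the same encoding and let csv.reader parse the decoded text. A self-contained sketch (it rewrites chicken.csv first so it can run on its own):

```python
#!/usr/bin/env python
import csv
import io

# Write a row first so the example stands alone; newline='' is the
# csv module's recommended setting for files it writes and reads.
with io.open('chicken.csv', 'w', encoding='utf-8', newline='') as csvfile:
    csv.writer(csvfile).writerow(['Chicken attack!', '\U0001f414'])

with io.open('chicken.csv', encoding='utf-8', newline='') as csvfile:
    for row in csv.reader(csvfile):
        print(row)
```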
Writing JSON
The json standard library defaults to ASCII when writing. To disable this, pass ensure_ascii=False.
#!/usr/bin/env python
import json
import io
with io.open('chicken.jl', 'w', encoding='utf-8') as f:
    json.dump('Chicken attack! 🐔', f, ensure_ascii=False)
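json.load performs the reverse conversion and returns the same Unicode string. A self-contained sketch (writing first so it runs on its own):

```python
#!/usr/bin/env python
import io
import json

# Write the value without ASCII-escaping, then load it back.
with io.open('chicken.jl', 'w', encoding='utf-8') as f:
    json.dump(u'Chicken attack! \U0001f414', f, ensure_ascii=False)

with io.open('chicken.jl', encoding='utf-8') as f:
    print(json.load(f))
```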
Printing to console
When printing Unicode characters to a console whose encoding cannot represent them, a UnicodeEncodeError is raised.
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f414' in position 16: character maps to <undefined>
To properly print Unicode characters to the console, define the environment variable PYTHONIOENCODING=UTF-8. We can also simply set it inline when executing a Python script.
$ PYTHONIOENCODING=UTF-8 python unicode.py
Chicken attack! 🐔
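The failure can be reproduced by encoding the text with a codec that lacks the character. Since Python 3.7, sys.stdout.reconfigure also offers an in-process alternative to the environment variable (a sketch, assuming CPython 3.7+):

```python
#!/usr/bin/env python
import sys

# cp1252 (a common Windows 'charmap' codec) has no code point for U+1F414.
try:
    'Chicken attack! \U0001f414'.encode('cp1252')
except UnicodeEncodeError as error:
    print(error.reason)

# Python 3.7+: switch stdout to UTF-8 without touching the environment.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')
    print('Chicken attack! \U0001f414')
```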
Logging
The Python logging module does not use UTF-8 by default. To support Unicode characters, define the encoding when creating a FileHandler.
#!/usr/bin/env python
import logging
logger = logging.getLogger(__name__)
handler = logging.FileHandler('unicode.log', encoding='utf-8')
logger.addHandler(handler)
if __name__ == '__main__':
    logger.error("Chicken attack! 🐔")
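To confirm the handler really wrote UTF-8, the log file can be read back with the same encoding. A self-contained sketch using a throwaway logger name:

```python
#!/usr/bin/env python
import io
import logging

logger = logging.getLogger('unicode_demo')
handler = logging.FileHandler('unicode.log', encoding='utf-8')
logger.addHandler(handler)

logger.error(u'Chicken attack! \U0001f414')
handler.close()

# The message is stored as UTF-8 and decodes back intact.
with io.open('unicode.log', encoding='utf-8') as f:
    print(f.read())
```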
Redis
The redis-py module handles data as bytes. In Python 3, to work with Unicode characters without manually converting types, we can define the encoding when creating the StrictRedis instance, and set decode_responses=True so that values read back are decoded to strings automatically.
#!/usr/bin/env python
import os

from redis import StrictRedis

redis_client = StrictRedis(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    db=os.getenv('REDIS_DB'),
    password=os.getenv('REDIS_PASSWORD'),
    encoding='utf-8',
    decode_responses=True
)
Conclusion
Even though Unicode is supported by most libraries and languages, it is not always the default encoding. We must explicitly define the encoding when writing applications.
What are other unicode-related tips and tricks that I missed? Leave your comments below.
Edit: Added examples for CSV, JSON and Redis