Swift Tip: Decomposing Emoji
Now that emoji are everywhere, we need to be aware of Unicode, even without an international user base. For example, the emoji 👨‍👩‍👧‍👦 (a single glyph showing a family of four) has a very different length across string implementations:
"👨‍👩‍👧‍👦".count // 1
("👨‍👩‍👧‍👦" as NSString).length // 11
In JavaScript, the length is also 11. In Ruby, "👨‍👩‍👧‍👦".length evaluates to 7, and in Python 2, len("👨‍👩‍👧‍👦") evaluates to 25 (depending on your settings). One string, four different lengths.
Perhaps even more surprising: none of these implementations is wrong. They're all counting different things. In Swift, we get 1 as the answer because Swift counts characters, and 👨‍👩‍👧‍👦 is a single character. The NSString variant and JavaScript evaluate to 11 because they count UTF-16 code units. We can replicate this in Swift:
"👨‍👩‍👧‍👦".utf16.count // 11
We can also see how Python 2 gets to 25. In this case, it counts UTF-8 code units:
"👨‍👩‍👧‍👦".utf8.count // 25
And finally, Ruby and Python 3 evaluate to 7 because they count Unicode scalars, and 👨‍👩‍👧‍👦 consists of the following scalars: 👨 + zero width joiner + 👩 + zero width joiner + 👧 + zero width joiner + 👦.
"👨‍👩‍👧‍👦".unicodeScalars.count // 7
When you're dealing with strings where length is significant, keep this in mind. To learn more, watch last week's Swift Talk episode or read the transcript. If you'd like to learn more about Unicode and how it's implemented in Swift, read our book Advanced Swift.
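To see where 11, 25 and 7 come from, here is a minimal sketch that walks the scalars of the family emoji and records how many UTF-16 and UTF-8 code units each one needs. The names family, breakdown, utf16Total and utf8Total are only illustrative, and the emoji is written with Unicode escapes so that the seven scalars are visible in the source:

```swift
// The family emoji, spelled out scalar by scalar (U+200D is the zero width joiner).
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// One entry per Unicode scalar: its code point and the code units it needs.
let breakdown = family.unicodeScalars.map { scalar -> (codePoint: String, utf16: Int, utf8: Int) in
    let text = String(scalar)
    let codePoint = "U+" + String(scalar.value, radix: 16, uppercase: true)
    return (codePoint, text.utf16.count, text.utf8.count)
}

for entry in breakdown {
    print(entry.codePoint, "utf16:", entry.utf16, "utf8:", entry.utf8)
}

// Summing the per-scalar sizes gives the same totals as the utf16 and utf8 views.
let utf16Total = breakdown.reduce(0) { $0 + $1.utf16 } // 11
let utf8Total = breakdown.reduce(0) { $0 + $1.utf8 }   // 25
```

Each of the four person emoji takes two UTF-16 code units and four UTF-8 code units, while each zero width joiner takes one and three respectively. That gives 4 × 2 + 3 × 1 = 11 and 4 × 4 + 3 × 3 = 25.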