Christophe Pettus posted an interesting challenge here. Two strings which on the surface look equal, but if you ask PostgreSQL to compare them, they are not equal.
Now let me start with a note: Twitter totally screws this challenge up.
How so? Although the two strings are different in the original, when posting this to Twitter the strings are made equal. Where is the fun in that?
I asked Christophe for the original query:
INSERT INTO t VALUES (E'Zo\u0065\u0301', E'Zo\u00e9');
And you end up with the following texts in the table:
SELECT * FROM t;
a | b
-----+-----
Zoé | Zoé
(1 row)
If you translate the UTF-8 strings into hex, you get "0x5a 0x6f 0x65 0xcc 0x81" and "0x5a 0x6f 0xc3 0xa9". Clearly they are different.
However if you convert the two strings from the Tweet, you get "0x5a 0x6f 0xc3 0xa9" and "0x5a 0x6f 0xc3 0xa9". Same string. Poor Twitter.
Checking the hex values was actually one of my first ideas when I spotted this challenge. But nevertheless based on my experience from my "Data Types in PostgreSQL" and "Advanced Data Types in PostgreSQL" talks, I figured it should be possible to "solve" this puzzle even if the strings are in fact equal.
Buckle up! We are about to dive deep into how extendible PostgreSQL really is!
Continue reading "Not so equal texts in PostgreSQL - and how to compare texts in a more elaborate way"