Skip to content

Not so equal texts in PostgreSQL - and how to compare texts in a more elaborate way

Christophe Pettus posted an interesting challenge here. Two strings which on the surface look equal, but if you ask PostgreSQL to compare them, they are not equal.

Now let me start with a note: Twitter totally screws this challenge up.

How so? Although the two strings are different in the original, when posting this to Twitter the strings are made equal. Where is the fun in that?

I asked Christophe for the original query:

INSERT INTO t VALUES (E'Zo\u0065\u0301', E'Zo\u00e9');

And you end up with the following texts in the table:

SELECT * FROM t;
  a  |  b  
-----+-----
 Zoé | Zoé
(1 row)

If you translate the UTF-8 strings into hex, you get "0x5a 0x6f 0x65 0xcc 0x81" and "0x5a 0x6f 0xc3 0xa9". Clearly they are different.

However if you convert the two strings from the Tweet, you get "0x5a 0x6f 0xc3 0xa9" and "0x5a 0x6f 0xc3 0xa9". Same string. Poor Twitter.

Checking the hex values was actually one of my first ideas when I spotted this challenge. But nevertheless based on my experience from my "Data Types in PostgreSQL" and "Advanced Data Types in PostgreSQL" talks, I figured it should be possible to "solve" this puzzle even if the strings are in fact equal.

Buckle up! We are about to dive deep into how extendible PostgreSQL really is!

 

In my "Advanced Data Types in PostgreSQL" talk in the last chapter I explain that it is possible to extend PostgreSQL with your own data types, and corresponding functions, operators and operator classes. Let's put this knowledge to good use and make equal strings not equal.

Looking at the query it seems easy, all I need is:

  • A new data type (similar to string)
  • A new operator for "=", which in fact makes the strings not equal

How hard can it be ...

Preparations

In order to save my regular databases and not accidentally screw up the catalog, I create a new database where all the following tests take place:

\set ON_ERROR_STOP false
\c postgres
DROP DATABASE IF EXISTS "Zoé";
CREATE DATABASE "Zoé";
\c "Zoé"

\set ON_ERROR_STOP true
\set ECHO queries

CREATE SCHEMA "Zoé";
SET search_path TO "Zoé";

I name the new database "Zoé" - because I can. And to make it look nice, I also create a "Zoé" schema. Because why not. This also helps to show where I use my own functions and objects, because they are always schema-qualified with "Zoé".

New data type

A new data type in PostgreSQL needs a couple support functions, input and output, receive and send. That's the bare minimum.

Lucky me, I'm mimicking the TEXT type, so I can re-use all the text functions for this.

CREATE FUNCTION "Zoé".varchar2_in(cstring)
       RETURNS varchar2
            AS 'textin'
      LANGUAGE internal STRICT;

CREATE FUNCTION "Zoé".varchar2_out(varchar2)
       RETURNS cstring
            AS 'textout'
      LANGUAGE internal STRICT;

CREATE FUNCTION "Zoé".varchar2_recv(internal)
       RETURNS varchar2
            AS 'textrecv'
      LANGUAGE internal STRICT;

CREATE FUNCTION "Zoé".varchar2_send(varchar2)
       RETURNS bytea
            AS 'textsend'
      LANGUAGE internal STRICT;

Time to create the new type. Let me know if you recognize the name ...

CREATE TYPE "Zoé".varchar2 (
    INTERNALLENGTH = variable,
    INPUT = "Zoé".varchar2_in,
    OUTPUT = "Zoé".varchar2_out,
    RECEIVE = "Zoé".varchar2_recv,
    SEND = "Zoé".varchar2_send,
    COLLATABLE = true,
    CATEGORY = 'S',
    PREFERRED = true,
    ALIGNMENT = int4,
    STORAGE = extended
);

As you can see, this uses the newly created support functions, which I placed into my dedicated schema.

At this point I can already use this type, create the table, and insert data:

CREATE TABLE "Zoé".t (a "Zoé".varchar2,
                      b "Zoé".varchar2
                     );

INSERT INTO "Zoé".t VALUES ('Zoé', 'Zoé');

Checking the content:

SELECT * FROM "Zoé".t;
  a  |  b  
-----+-----
 Zoé | Zoé
(1 row)

Operator

However I can not yet compare the two columns:

SELECT * FROM "Zoé".t WHERE a = b;
SELECT * FROM "Zoé".t WHERE a = b;
ERROR:  operator does not exist: varchar2 = varchar2
LINE 1: SELECT * FROM "Zoé".t WHERE a = b;
                                      ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

PostgreSQL has no idea how to compare two strings of my new type. I need to create the "=" operator for my new type.

In fact I not only need to create the "=" operator, but also the "<>" operator, because the "=" operator needs a negator.

In order to create these two operators, I also need two functions which can compare two of my varchar2 values, and tell PostgreSQL if they are equal or not. This is where the change happens.

The original functions for TEXT:

CREATE FUNCTION pg_catalog.texteq (TEXT, TEXT)
        RETURNS boolean
             AS 'texteq'
       LANGUAGE internal STRICT;

CREATE FUNCTION pg_catalog.textne (TEXT, TEXT)
        RETURNS boolean
             AS 'textne'
       LANGUAGE internal STRICT;

My functions for the new data type:

CREATE FUNCTION "Zoé".texteq ("Zoé".varchar2, "Zoé".varchar2)
        RETURNS boolean
             AS 'textne'
       LANGUAGE internal STRICT;

CREATE FUNCTION "Zoé".textne ("Zoé".varchar2, "Zoé".varchar2)
        RETURNS boolean
             AS 'texteq'
       LANGUAGE internal STRICT;

Do you spot the difference?

Now that I have the support functions, I can create the operators:

CREATE OPERATOR "Zoé".<> (
    FUNCTION = textne,
    LEFTARG = "Zoé".varchar2,
    RIGHTARG = "Zoé".varchar2,
    COMMUTATOR = OPERATOR("Zoé".<>),
    NEGATOR = OPERATOR("Zoé".=),
    RESTRICT = neqsel,
    JOIN = neqjoinsel
);

CREATE OPERATOR "Zoé".= (
    FUNCTION = texteq,
    LEFTARG = "Zoé".varchar2,
    RIGHTARG = "Zoé".varchar2,
    COMMUTATOR = OPERATOR("Zoé".=),
    NEGATOR = OPERATOR("Zoé".<>),
    MERGES,
    HASHES,
    RESTRICT = eqsel,
    JOIN = eqjoinsel
);

Test it

SELECT * FROM "Zoé".t WHERE a = b;
 a | b 
---+---
(0 rows)

Since I just switched the operators, I can turn the query around and ask if the strings are not equal:

SELECT * FROM "Zoé".t WHERE a != b;
  a  |  b  
-----+-----
 Zoé | Zoé
(1 row)

Bottom line

Switching the texteq and textne functions for my new type tells PostgreSQL that the strings are not equal, when in fact they are. Another option which works equally "well" is that I create the texteq and nextne functions with the "correct" internal functions, but switch these functions when I create the operators. This will give the same result.

The extensibility of PostgreSQL gave me the freedom to design the results I wanted to see, switching true and false around, and making you believe that strings are not equal when in fact they are.

That of course is just a quick example developed as an answer to the challenge. However it shows that PostgreSQL can really be expanded to handle any kind of needs you have for your applications.

Trackbacks

No Trackbacks

Comments

Display comments as Linear | Threaded

Jeremy Schneider on :

Out of curiosity, do you know whether there are any Unicode-aware string comparison functions - in glibc or perhaps in ICU - which would consider the two strings at the beginning of your article to be equal to each other?
Comments ()

Jeremy Schneider on :

Ha nevermind. I was behind on twitter and just now realized that Christophe's tweet (which you linked) also mentioned the NORMALIZE() function which I think is the thing I'm looking for.
Comments ()

Add Comment

Enclosing asterisks marks text as bold (*word*), underscore are made via _word_.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
To leave a comment you must approve it via e-mail, which will be sent to your address after submission.
Form options