string_to_array behaviour with an empty string

So this caught me out the other day. We had a query which broke out multiple delimited strings each with one or more elements (but always the same number of elements in each) into multiple records using string_to_array and unnest. In each string any element could potentially have an empty value, but at least one string would always have values for all the elements. This meant there should always be at least one output row. So below worked as expected:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1,
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a|b'),('1|2')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
 a    | 1
 b    | 2
(2 rows)

Below worked entirely as expected as well:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1,
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a|b'),('|')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
 a    | 
 b    | 
(2 rows)

So did this when we had some data with just the minimum one element in the input:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1, 
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a'),('1')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
 a    | 1
(1 row)

But then we discovered that there was some data that was being completely lost in the query output. Turns out that this occurred when, like above there was just one element in the input, but one value was empty:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1,
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a'),('')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
(0 rows)

So after some investigation it turns out that this happens because string_to_array on an empty string returns an array with no elements rather than an array with one element which is an empty string. This does actually make sense because without any application specific context it is indeterminate whether an empty string should be treated as having no items or should itself be treated as one item. There is a very full and complete discussion of the reasoning behind reaching the decision to return an empty array:

Nevertheless we needed to fix the query. The approach I took was to force all input strings to be at least two elements by pre-pending the delimiter and then always discarding the first element:

SELECT
  UNNEST((STRING_TO_ARRAY('|' || test.s1, '|'))[2:100]) AS col1,
  UNNEST((STRING_TO_ARRAY('|' || test.s2, '|'))[2:100]) AS col2
FROM ( VALUES (('a'),('')) ) AS test(s1, s2);

producing the correct result:

 col1 | col2 
------+------
 a    | 
(1 row)

We are using PostgreSQL version 9.5 and we knew an upper bound on the number of elements so we could get away with doing [2:100] but in PostgreSQL version 9.6 you can now do [2:] to leave the upper bound unspecified and defaulting to the array length which is obviously nicer!

The behaviour of string_to_array on a zero length string is of course documented on https://www.postgresql.org/docs/9.5/static/functions-array.html but in a rather oblique way under a supplementary note documenting changes in behaviour from before PostgreSQL version 9.1 when NULL was returned instead of an empty array.

Leave a Reply

Your email address will not be published. Required fields are marked *