string_to_array behaviour with an empty string

So this caught me out the other day. We had a query which broke out multiple delimited strings each with one or more elements (but always the same number of elements in each) into multiple records using string_to_array and unnest. In each string any element could potentially have an empty value, but at least one string would always have values for all the elements. This meant there should always be at least one output row. So below worked as expected:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1,
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a|b'),('1|2')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
 a    | 1
 b    | 2
(2 rows)

Below worked entirely as expected as well:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1,
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a|b'),('|')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
 a    | 
 b    | 
(2 rows)

So did this when we had some data with just the minimum one element in the input:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1, 
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a'),('1')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
 a    | 1
(1 row)

But then we discovered that there was some data that was being completely lost in the query output. Turns out that this occurred when, like above there was just one element in the input, but one value was empty:

SELECT
  UNNEST(STRING_TO_ARRAY(test.s1, '|')) AS col1,
  UNNEST(STRING_TO_ARRAY(test.s2, '|')) AS col2
FROM ( VALUES (('a'),('')) ) AS test(s1, s2);

with the following result:

 col1 | col2 
------+------
(0 rows)

So after some investigation it turns out that this happens because string_to_array on an empty string returns an array with no elements rather than an array with one element which is an empty string. This does actually make sense because without any application specific context it is indeterminate whether an empty string should be treated as having no items or should itself be treated as one item. There is a very full and complete discussion of the reasoning behind reaching the decision to return an empty array:

Nevertheless we needed to fix the query. The approach I took was to force all input strings to be at least two elements by pre-pending the delimiter and then always discarding the first element:

SELECT
  UNNEST((STRING_TO_ARRAY('|' || test.s1, '|'))[2:100]) AS col1,
  UNNEST((STRING_TO_ARRAY('|' || test.s2, '|'))[2:100]) AS col2
FROM ( VALUES (('a'),('')) ) AS test(s1, s2);

producing the correct result:

 col1 | col2 
------+------
 a    | 
(1 row)

We are using PostgreSQL version 9.5 and we knew an upper bound on the number of elements so we could get away with doing [2:100] but in PostgreSQL version 9.6 you can now do [2:] to leave the upper bound unspecified and defaulting to the array length which is obviously nicer!

The behaviour of string_to_array on a zero length string is of course documented on https://www.postgresql.org/docs/9.5/static/functions-array.html but in a rather oblique way under a supplementary note documenting changes in behaviour from before PostgreSQL version 9.1 when NULL was returned instead of an empty array.

Hold the exact(ish) “CREATE VIEW” definition in the PostgreSQL System Catalog

It would be rather nice if the exact SQL definition (well almost, see below) supplied with the “create view” statement could be held in the PostgreSQL System Catalog, rather than the re-written optimised version that is currently held. This has come up before, see:

http://www.postgresql-archive.org/idea-storing-view-source-in-system-catalogs-td1987401.html
http://www.postgresql-archive.org/Preserving-the-source-code-of-views-td5775163.html

Back then this idea got stomped on for good reasons but I am not suggesting quite the same.

We would prefer PostgreSQL to store view definitions in the system catalog *AFTER* column expansion (unlike in the discussion above) but *BEFORE* internal parser rewrites. So, currently for example:

CREATE TABLE bar ( x INTEGER, y TEXT );

CREATE VIEW foo(x, y) AS
SELECT * FROM bar WHERE y IN ( ‘a’, ‘b’ ) AND x IS NOT DISTINCT FROM 0;

SELECT view_definition FROM information_schema.views
WHERE table_name = ‘foo’;

SELECT bar.x, bar.y
FROM bar
WHERE ((bar.y = ANY (ARRAY[‘a’::text, ‘b’::text])) AND (NOT (bar.x IS DISTINCT FROM 0)));
(1 row)

The column expansion and corresponding rewrite is mandatory at the time of creation (see below), but the other changes in the definition are not. They make comparing the current definition of a view as it is in a database against an externally maintained definition difficult, which is what we in particular want to be able to do.

The SQL standard (at least the ancient version I have access to) for “information_schema” seems rather open to interpretation on this issue. It says that the view definition column:

“contains a representation of the view descriptors”

What do we take “representation” to be – presumably a completely mangled internally optimised “blob” would be compliant (but otherwise useless). The spirit of the statement might be more reasonably taken to be “a human legible representation”. In that case how much rewrite should actually be allowed – to me the rewritten definition above is already getting less legible than the original, but maybe others feel different. I don’t know if it was a design decision that the parser rewrites were folded back into the stored definition or just a side effect of the implementation. Either way, if there is any additional rewriting in the future there must be a point at which the rewrites would be considered to make the stored view definition so sufficiently illegible that the spirit of the standard at least is no longer being followed – or would future rewrites be needlessly constrained by having to preserve the legibility of the stored definition?

The SQL standard is however completely explicit about column expansion:

“NOTE 13: Any implicit column references that were contained in the <query
expression> associated with the <view definition> are replaced by explicit
column references in VIEW_DEFINITION.”

The SQL standard also implies that the definition should be NULL when the length exceeds some limit (not sure what that is but I have a vague feeling from somewhere that it was 4K characters). PostgreSQL does not enforce that limit that I know of (or simply has a much higher limit) since we have some massive views that are returned entirely – and that seems very sensible to me, can’t see a point having a limit these days (except for backwards compatibility with external applications). Maybe this limit has been removed in more recent versions of the standard.

The re-written view definition is also what is output by pg_dump. The consequence of this is that any new/altered parser rewrites made in a newer version of PostgreSQL might not be automatically applied to the view when it is restored from the old versions dump file into that new version. They would however always be applied if the view was restored from a dump that used the original raw definition from the CREATE VIEW statement.

I looked at MariaDB for comparison. This does some rewriting as well – it quotes every table and column reference and adds superfluous bracketing, but does not replace IN (it does not have a comparable “NOT DISTINCT” syntax, just <=>, but I suspect it would shift an embedded NOT in similar constructs in the same way as PostgreSQL). I guess YMMV here depending on how much clever rewriting the database does. Don’t know about other databases like Oracle or SQL Server.

Arguably the same principle should be applied to any embedded SQL literal expression used in DDL operations where the parser rewrites it, e.g. in DEFAULT values for table columns or WITH clauses used in trigger conditions. No idea what the SQL standard says about these, and indeed if they are even included under the information_schema.

Storing the original CREATE VIEW statement post column expansion is obviously of no benefit internally to PostgreSQL itself (excepting the minor pg_dump point above) so I can understand the reluctance to change anything. Reading the earlier discussion on this issue also suggests that this might be a really tricky change to make, I guess due to the way the parser is currently written, so cost/benefit is poor.

Anyway, for the record at least, we have a use case where it would be better for us if the original CREATE VIEW statement was returned as-is in the “information_schema.view_definition” column, but in compliance with Note13 any column expansion re-writes will have been applied. Over and above standards compliance I believe that the column expansion is structurally necessary anyway to facilitate subsequent table/column re-naming and the automatic cascade of the re-naming to views that use them. The further parser rewritten definition could continue to be stored in the pg_catalog in addition if it is needed internally within PostgreSQL (i.e. pg_dump). However, I also suggest that it might be sensible for the original “column expansion only” definition to be used in the pg_dump output, so that the “intent” of the view when originally created is what should (or should not in the case of any incompatibilities) migrate into a future version (or be restored into the same version).