Saturday, 14 April 2018

The PostgreSQL Machine

PostgreSQL is a portable RDBMS targeting POSIX systems (and Windows).  It also makes some assumptions about the operating system and hardware it's running on that are not covered by POSIX, but hold on all typical systems.  For example:
  • we assume that 32 bit aligned integers can be read and written atomically; that is, without any kind of synchronisation, you might read an arbitrarily stale value but you won't see a "torn" value with a mixture of bits from the before and after values of a concurrent write
  • we assume that system calls (or at least IPC-related syscalls) synchronise memory; that is, if you write to shared memory and then signal another process, the other process will then be able to read the value
  • we assume that disk blocks of 512 bytes (or some multiple) are written atomically when preallocated; that is, if you lose power and then come back up, you'll either see the old or the new version of a 512-byte block, and not a mixture of bits from the two or other kinds of corruption; specifically, the pg_control file won't be corrupted by torn writes because its size is not allowed to exceed PG_CONTROL_MAX_SAFE_SIZE
On the other hand, it makes some assumptions that are based on (one reading of) POSIX (not to mention the POLA, at least from the vantage point of a humble userland muggle) that turned out not to be entirely reflected in reality on a certain extremely popular OS:
  • we assume that it is safe to write to a file, then close it, and then reopen it from any process and call fsync()
  • we assume that if fsync() reports a failure, then it's safe to retry by calling fsync() again; specifically, if a checkpoint fails because fsync() failed, then we can try to create another checkpoint (for example during shutdown) and a successful fsync() again and know that everything we wrote is on disk

While there seems to be plenty of evidence that calling fsync() from another process does the right thing under normal circumstances, it turns out that write-back errors from buffered IO are not always propagated to user space.  Whether earlier write-back errors are forgotten or even reported at all turns out to vary by operating system and version.  I'm not going to go into the gory details of that here... you can find some evolving notes on our wiki and in the fsyncgate thread and other threads linked from there.

The thread is, for practical end-user purposes, a bit of a storm in a teacup.  If your storage is failing, the most likely outcome is that your database is toast and it's probably time to fail over or replace hardware and restore from backups.  On the other hand, it is certainly possible to imagine scenarios where transient errors result in undetected data loss on some OSes.  Even if the errors are permanent, hiding them makes it impossible for the next level up to report the failure to its clients and possibly take defensive action.  That's not nice.

It'll be interesting to see if the thread spurs a major drive for direct IO support in PostgreSQL, which was already an elephant in the room for other reasons.  If the recent discussion on the linux-ext4 mailing list is anything to go by, it may also lead to some kernel changes which I guess should be of interest to a wider audience.  I don't believe for a second that these problem affect only PostgreSQL.  In any case, from where I'm sitting FreeBSD comes out of this looking like a pretty good implementation of the "PostgreSQL Machine".

4 comments: