Skip to main content

The PostgreSQL Machine

Bonsack's machine
PostgreSQL is a portable RDBMS targeting POSIX systems (and Windows).  It also makes some assumptions about the operating system and hardware it's running on that are not covered by POSIX, but hold on all typical systems.  For example:
  • we assume that 32 bit aligned integers can be read and written atomically; that is, without any kind of synchronisation, you might read an arbitrarily stale value but you won't see a "torn" value with a mixture of bits from the before and after values of a concurrent write
  • we assume that system calls (or at least IPC-related syscalls) synchronise memory; that is, if you write to shared memory and then signal another process, the other process will then be able to read the value
  • we assume that disk blocks of 512 bytes (or some multiple) are written atomically when preallocated; that is, if you lose power and then come back up, you'll either see the old or the new version of a 512-byte block, and not a mixture of bits from the two or other kinds of corruption; specifically, the pg_control file won't be corrupted by torn writes because its size is not allowed to exceed PG_CONTROL_MAX_SAFE_SIZE
On the other hand, it makes some assumptions that are based on (one reading of) POSIX (not to mention the POLA, at least from the vantage point of a humble userland muggle) that turned out not to be entirely reflected in reality on a certain extremely popular OS:
  • we assume that it is safe to write to a file, then close it, and then reopen it from any process and call fsync()
  • we assume that if fsync() reports a failure, then it's safe to retry by calling fsync() again; specifically, if a checkpoint fails because fsync() failed, then we can try to create another checkpoint (for example during shutdown) and a successful fsync() again and know that everything we wrote is on disk

While there seems to be plenty of evidence that calling fsync() from another process does the right thing under normal circumstances, it turns out that write-back errors from buffered IO are not always propagated to user space.  Whether earlier write-back errors are forgotten or even reported at all turns out to vary by operating system and version.  I'm not going to go into the gory details of that here... you can find some evolving notes on our wiki and in the fsyncgate thread and other threads linked from there.

The thread is, for practical end-user purposes, a bit of a storm in a teacup.  If your storage is failing, the most likely outcome is that your database is toast and it's probably time to fail over or replace hardware and restore from backups.  On the other hand, it is certainly possible to imagine scenarios where transient errors result in undetected data loss on some OSes.  Even if the errors are permanent, hiding them makes it impossible for the next level up to report the failure to its clients and possibly take defensive action.  That's not nice.

It'll be interesting to see if the thread spurs a major drive for direct IO support in PostgreSQL, which was already an elephant in the room for other reasons.  If the recent discussion on the linux-ext4 mailing list is anything to go by, it may also lead to some kernel changes which I guess should be of interest to a wider audience.  I don't believe for a second that these problem affect only PostgreSQL.  In any case, from where I'm sitting FreeBSD comes out of this looking like a pretty good implementation of the "PostgreSQL Machine".

Update: It looks like Linux 4.17 will probably include some changes here.

Comments

Popular posts from this blog

Parallel Hash for PostgreSQL

PostgreSQL 9.6 and 10 can use all three join strategies in parallel query plans, but they can only use a partial plan on the outer side of the join.  As of commit 18042840 , assuming nothing irreparably busted is discovered in the next few months, PostgreSQL 11 will ship with Parallel Hash.  Partial plans will be possible on both sides of a join for the first time. There will certainly be some adjustments  before it's released, but it seems like a good time to write a blog article to present Parallel Hash.  This is the biggest feature I've worked on in PostgreSQL so far, and I'm grateful to the reviewers, testers, committers and mentors of the PostgreSQL hacker community and EnterpriseDB for making this work possible. So what does this feature really do? A simple example Using the "orders" and "lineitem" tables from TPC-H scale 30GB, here is a very simple join query answering the (somewhat contrived) question "how many lineitems have there ev

SERIALIZABLE in PostgreSQL 11... and beyond

Thanks to the tireless work of Google Summer of Code student Shubham Barai with the support of reviewers, a mentor and a committer, PostgreSQL 11 will ship with predicate lock support for  hash indexes ,  gin indexes  and  gist indexes .  These will make SERIALIZABLE transaction isolation much more efficient with those indexes, filling in some of the feature combination gaps and quirks that exist in my favourite RDBMS. It seems like a good time to write a bit about SERIALIZABLE and how it interacts with other selected PostgreSQL features, including indexes. A bit of background If you want to read something a little less dry than the usual papers about transaction isolation, I recommend  ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications  which, among other things, discusses a transaction isolation-based attack that bankrupted a bitcoin exchange.  It also makes some interesting observations about some of PostgreSQL's rivals.  Even excluding malicious at