Deep Dive: PostgreSQL Storage Internals — From Heap Files on Disk to Shared Buffers
When you run EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE id = 42;, PostgreSQL prints something like Buffers: shared hit=3 read=1. What does that mean exactly? Why does the same UPDATE run in 2 ms one time and 80 ms the next, even though the table isn’t being locked? Why doesn’t your table get smaller after a big DELETE? And what does the shared buffer — the thing every tuning guide tells you to size — actually contain?
Answering those questions takes you one layer below SQL: the storage engine. This is a deep dive into how PostgreSQL organises a table on disk (heap files, pages, tuples) and how that data is pulled into RAM by the Shared Buffer Manager. It complements the Table Bloat deep dive (which focuses on the consequence — dead tuples and vacuum) and the PostgreSQL Partitioning deep dive (which focuses on splitting big tables) — both of those posts stand on top of the concepts this one dissects.
1. How a table is laid out on disk
Every PostgreSQL table eventually reduces to a set of files on the filesystem. The PGDATA environment variable points to the data directory (e.g. /var/lib/postgresql/16/main). Inside it, the heap file path for a given table follows this pattern:
$PGDATA/base/<database_oid>/<relfilenode>You can verify directly:
SELECT pg_relation_filepath('users');16384 is the database OID, 24576 is the relfilenode of the users table — relfilenode is the physical filename of a relation in PostgreSQL.
The oid and relfilenode are different things. When you run VACUUM FULL, CLUSTER, REINDEX, or TRUNCATE, PostgreSQL rewrites the data into a new file and changes the relfilenode — but the oid (how the table is referenced from the system catalog) stays the same. This matters for monitoring: if you cache a file path and reuse it 30 minutes later, it may point at a file that no longer exists.
The three forks of a relation
A table isn’t just one fork — it’s three. Fork is a term borrowed from the “file fork” concept in older filesystems; for practical purposes you can think of a fork as just a file.
- Main fork (
<relfilenode>): the actual heap data — the tuples you’veINSERTed. - FSM fork (
<relfilenode>_fsm): the Free Space Map — a small tree-like data structure that records how many free bytes each page has left. OnINSERT, PostgreSQL consults the FSM to find a page with enough room, instead of scanning sequentially. - VM fork (
<relfilenode>_vm): the Visibility Map — 2 bits per page, recording whether (a) the page contains only live tuples, and (b) the page has been frozen. Index-only scans rely on (a) to skip the heap-visibility check; VACUUM relies on it to skip pages that have no dead tuples.
When the table grows past 1 GB, the main fork is split into consecutive segment files: <relfilenode>, <relfilenode>.1, <relfilenode>.2, …
Historical reason: many older filesystems (HFS, FAT32, ext2 with small block sizes) had per-file size limits of 2 GB or 4 GB. PostgreSQL has kept the 1 GB segment ceiling ever since because it still works fine and there’s no reason to change.
2. Page anatomy — the 8 KB I/O unit
PostgreSQL never reads or writes single bytes, tuples, or rows. The fundamental I/O unit is a page (also called a block), default size 8 KB. Every disk read or write happens at this granularity. This is the reason the table-bloat post emphasises that “to fetch one tuple, you must haul up the entire 8 KB container.”
A page’s layout, from offset 0 to 8191:
-
PageHeaderData (24 bytes) — page-level metadata:
pd_lsn(8 B): the Log Sequence Number of the most recent WAL record that touched this page. Used for recovery and checkpoint coordination.pd_checksum(2 B): integrity checksum (whendata_checksumsis enabled).pd_flags(2 B): flags likePD_HAS_FREE_LINES,PD_PAGE_FULL,PD_ALL_VISIBLE.pd_lower(2 B),pd_upper(2 B): two pointers that delimit free space —pd_loweris the end of the line pointer array;pd_upperis the top of the tuple area.pd_special(2 B): offset to the start of the special area (used only by indexes).pd_pagesize_version(2 B),pd_prune_xid(4 B).
-
ItemIdData array — an array of 4-byte line pointers. Each line pointer =
(offset, length, flags)pointing at a tuple within the same page. Index entries reference a tuple not by raw offset but through this line pointer, calledCTID = [block_num, line_pointer_index]. Why: when VACUUM removes dead tuples and defragments the page, the tuples shift to new positions inside the page; the line pointer indirection lets that happen without having to update every index that points to those tuples. -
Free space — the gap between
pd_lowerandpd_upper. This is the unused area. New INSERTs consume it from both ends. -
Tuples — the actual data, stacked from the bottom of the page upward. Each tuple = HeapTupleHeader + (optional) null bitmap + user data.
-
Special space — reserved for indexes (B-tree, GIN, GiST, …). Heap pages don’t use it.
Why the two-direction layout?
The “line pointers grow down, tuples grow up” design has a specific goal: letting you add a new tuple without shifting any existing data. On INSERT, a line pointer is appended at the end of the array (advancing pd_lower), and the tuple is prepended to the top of the tuple area (decreasing pd_upper). The page is only “full” when pd_lower meets pd_upper.
3. Tuple anatomy — HeapTupleHeader
Each physical tuple is a byte blob made of three parts: the header (23 bytes + padding), an optional null bitmap, and the user data (aligned).
The header holds everything MVCC needs to decide whether the tuple is visible to a given transaction. The important fields:
t_xmin(4 B) — the Transaction ID that INSERTed this tuple. A fresh row hast_xmin = current_xid.t_xmax(4 B) — the XID that DELETEd or UPDATEd this tuple.0means the tuple is live, not yet touched by anyone. When youDELETEa row, PostgreSQL only setst_xmax = current_xidand leaves the tuple in place — this is the essence of MVCC (Multi-Version Concurrency Control — a mechanism that lets readers and writers operate concurrently without blocking, by keeping multiple versions of each row).t_cid/t_xvac(4 B) — Command ID within the same transaction; distinguishes statements inside one BEGIN/COMMIT.t_ctid(6 B = 4 B block_num + 2 B offset) — a pointer to this tuple itself (right after INSERT) or to the latest version after UPDATE. This is the mechanism that builds the HOT chain discussed in section 7.t_infomask2(2 B) — attribute count plus a handful of flags (e.g.HEAP_KEYS_UPDATED).t_infomask(2 B) — vital flags, including:HEAP_HASNULL: the tuple has a NULL column → there’s a null bitmap.HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE: HOT chain markers (section 7).- Hint bits —
HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_COMMITTED,HEAP_XMAX_INVALID: cached commit/abort status of xmin/xmax so visibility checks don’t need to consultpg_xactevery time.
t_hoff(1 B) — offset from the start of the tuple to the user data (past the header + null bitmap).
The null bitmap is only present when HEAP_HASNULL = 1. Each column gets one bit; bit = 0 means the column is NULL. This is much more efficient than storing a flag byte per column.
The user data holds the column values, aligned according to each type’s typalign. int8 aligns on 8 bytes, int4 on 4 bytes, bool on 1 byte. The consequence: the order of columns in CREATE TABLE affects storage size. A table with columns (a bool, b int8, c bool) is larger than (b int8, a bool, c bool) because padding has to be inserted to align b. The table-bloat post covers this optimization in depth.
Tuple size cap — where the 2 KB number comes from
Each tuple is capped at around 2 KB.
A common question: “Why 2 KB — is that a hard-coded constant?”
The answer is no. PostgreSQL derives the number from the formula TOAST_TUPLE_THRESHOLD ≈ BLCKSZ / 4, i.e. fit at least 4 tuples into every page. With the default BLCKSZ = 8 KB, the threshold lands around 2 KB.
When a tuple exceeds this threshold, TOAST (The Oversized-Attribute Storage Technique — a mechanism that compresses and/or moves large columns to a separate side table) kicks in: PostgreSQL compresses variable-length columns or relocates them to a side table, leaving only an inline pointer behind.
The threshold can be tuned per table via the toast_tuple_target storage parameter (PG 11+):
ALTER TABLE my_table SET (toast_tuple_target = 4080);4. The Shared Buffer Manager — the bridge between RAM and disk
Whenever PostgreSQL needs a tuple, the smallest unit it can pull from disk is an 8 KB page. Disk is roughly five orders of magnitude slower than RAM, so going to disk for every query is unthinkable. The answer is a buffer pool in shared memory, carved into 8 KB slots; each slot holds exactly one page. The shared_buffers parameter (default 128 MB, typically ~25% of RAM in production) sets the total number of slots.
The hard part shows up under load: hundreds of backends hammer the pool concurrently, some pages are touched thousands of times per second while others sit idle. The buffer manager has to answer three interlocking questions — look up fast “is this page already in the pool?”, decide who gets evicted when the pool fills up, and flush dirty pages back to disk without stalling transactions. The sections below peel off each one in turn.
A three-layer architecture
To support those three questions, the buffer manager is built from three parallel components:
- Buffer table — a hash table that maps
BufferTag → buffer_id. This is the index that tells you whether a given page is currently pooled. - Buffer descriptor array — a metadata array that runs parallel to the buffer pool; each entry describes one slot via
tag,buf_id,refcount(pin count),usage_count, and astateword holding flags such asBM_DIRTY,BM_VALID,BM_IO_IN_PROGRESS. - Buffer pool — the array of 8 KB slots that actually holds page bytes. The
buffer_idis the index into this array. Crucially, descriptor[i]and pool slot[i]always describe the same page — they march in lockstep.
A BufferTag is the cluster-wide unique key for a page, made of five fields:
BufferTag = { spcOid, dbOid, relNumber, forkNum, blockNum }For example, {16821, 16384, 37721, 0, 7} points to block 7 of fork main (0) of relation 37721 in database 16384 inside tablespace 16821. Why all five? A PostgreSQL cluster contains many databases, each spans potentially several tablespaces, and every relation has a few forks (main, FSM, VM); the tag has to carry that full context to be globally unique.
The life of a page access
Now we walk through the exact sequence when a backend needs to read a page, across three scenarios of increasing complexity. The same vocabulary — buffer_table, descriptor, slot — is used throughout.
Cache hit (page is already in the pool):
- The backend process determines which page it needs and computes the BufferTag.
- The buffer manager looks up
buffer_idin the Buffer table using the BufferTag. - The buffer manager updates the Buffer descriptor metadata to mark this slot as accessed:
refcount += 1,usage_count += 1. - Through the buffer manager, the backend process accesses pool slot
buffer_idin the Buffer pool and reads or writes the tuple. - Once done, the buffer manager updates the Buffer descriptor metadata again:
refcount -= 1. If a write just happened, setBM_DIRTY = 1.
This is the fast path — every step runs entirely in RAM.
Cache miss, a free slot exists:
- The backend process determines which page it needs and computes the BufferTag.
- The buffer manager fails to find a
buffer_idin the Buffer table. - The buffer manager picks an empty slot from the freelist.
- The buffer manager inserts the entry
(buffer_tag, buffer_id)into the Buffer table. - Set
BM_IO_IN_PROGRESS = 1, read 8 KB from disk into the slot. - Set
BM_IO_IN_PROGRESS = 0,BM_VALID = 1.
Once the page has been successfully loaded from disk into the Buffer pool, the rest of the work continues exactly as in the Cache hit case.
Cache miss, the pool is full (page replacement):
When every slot in the Buffer pool is occupied and the requested page hasn’t been loaded yet, the buffer manager has to decide which page gets replaced. A replacement algorithm runs, and the page it chooses is called the victim page.
- The buffer manager confirms the requested page isn’t in the Buffer pool and that no free slot exists.
- The buffer manager picks a victim page via the clock-sweep algorithm (next section).
- If the victim page is dirty: it must be flushed to storage before the slot can be reused, otherwise the modifications it carries would be lost.
- Swap the entries in the Buffer table: remove the old entry (pointing to the victim’s tag) and insert the new one (pointing to the requested tag) — both pointing at the same
buffer_id. - Load the new page from disk into the slot, reset
BM_DIRTY = 0,BM_VALID = 1.
Clock-sweep — picking a victim without a global LRU lock
A textbook LRU forces every access to promote a slot to the head of a single list — and with hundreds of backends, that global lock becomes the bottleneck. PostgreSQL accepts a cheaper approximation of LRU instead: clock-sweep.
A nextVictimBuffer pointer walks circularly through the descriptor array. On each step, it inspects the current descriptor under this logic:
loop:
d = descriptors[nextVictimBuffer]
if d.refcount == 0:
if d.usage_count == 0:
return d
else:
d.usage_count -= 1
nextVictimBuffer = (nextVictimBuffer + 1) mod NCombined with backends incrementing usage_count on every access, this produces the behaviour we want:
- Hot pages get touched constantly →
usage_countstays high → they survive many sweep rotations before being reconsidered. - Cold pages never get touched → each sweep decrements
usage_countfurther → they hit 0 quickly and are chosen as victims. - Pinned buffers (
refcount > 0) are always skipped → a page currently in use is never evicted.
The hot path is O(1) work on a single descriptor, no global lock required.
Flushing dirty pages — bgwriter, checkpointer, and the backend itself
When a backend writes into a buffer, it just sets BM_DIRTY and moves on — pushing the bytes back to disk is somebody else’s job. There are three “somebody elses”, with clearly divided responsibilities:
- bgwriter — runs continuously; every
bgwriter_delay(default 200 ms) it flushes up tobgwriter_lru_maxpagespages (default 100). The key detail: bgwriter preferentially targets pages withusage_count = 0— exactly the candidates clock-sweep is about to pick as victims. Cleaning them ahead of time means that when a backend needs to evict one, the slot is already clean and no one blocks on I/O. - checkpointer — periodically flushes every dirty page at checkpoint time and writes the checkpoint record into the WAL. Triggers:
checkpoint_timeout(default 5 minutes) or when WAL hitsmax_wal_size. The point is to bound how much WAL has to be replayed after a crash; sparser checkpoints mean longer recovery. - Backend self-writes — the fallback when clock-sweep lands on a dirty victim that bgwriter hasn’t reached yet. The backend has to flush the page itself before it can reuse the slot, which becomes a direct latency spike for the query. A high rate of backend writes is a strong signal bgwriter is mistuned — usually its delay is too large or its max-pages too small.
Putting it together: bgwriter does steady-rate hygiene and anticipates eviction; checkpointer does periodic bulk cleanup; the backend is the emergency fallback when neither has caught up.
5. Conclusion
You can learn the PostgreSQL storage engine top-down (SQL → query plan → I/O) or bottom-up (bytes in a file → pages → buffers → tuples → MVCC → SQL). The second path is harder but gives you more durable understanding — instead of relying on “best practices I once heard,” you can read any documentation and reason about the trade-offs from first principles.