普通视图

发现新文章,点击刷新页面。
今天 — 2026年4月19日首页

PostgreSQL MVCC 深度解析

作者 hudson2022
2026年4月19日 10:45

PostgreSQL MVCC 深度解析

摘要: 本文通过每条元组头部的 t_xmin 和 t_xmax 字段,解释 PostgreSQL 的多版本并发控制(Multi-Version Concurrency Control)在存储层的工作原理。展示了快照如何在并发会话之间确定可见性,为什么 READ COMMITTED 和 REPEATABLE READ 隔离级别表现不同,以及非阻塞读取与磁盘空间使用之间的权衡。

原文链接


你在一个 psql 会话中执行 SELECT * FROM orders,看到 5000 万行数据。另一个会话中的同事在同一时刻执行相同查询,却看到 49,999,999 行。你们都没有错,也没有看到过期数据。你们读取的是相同的 8KB 堆页面,相同的磁盘字节。

这就是 PostgreSQL MVCC(多版本并发控制)的承诺,也是读操作永远不会阻塞写操作、写操作也永远不会阻塞读操作的原因。这是存储引擎中最容易被误解的部分。人们知道"一行数据有多个版本"后就止步于此。

答案就在每条元组的八个字节中。

xmin 和 xmax:唯二重要的两个 XID

如果你读过《深入理解 8KB 页面》,就知道每条元组以 23 字节的头部开始。头部的头八个字节是两个 32 位事务 ID:t_xmin(插入这个版本的 transaction)和 t_xmax(删除或更新它的 transaction,如果是 0 则表示仍存活)。

这就是 MVCC 在存储层面的核心。PostgreSQL 不维护单独的"当前版本"表。它不标记行为最新。每条元组都携带自己的双字段时间戳,当你的查询读取一个页面时,PostgreSQL 必须逐条元组地决定你的事务是否可以看到它。

一个最小演示:

CREATE TABLE mvcc_demo (id int, val text);
INSERT INTO mvcc_demo VALUES (1, 'alpha'), (2, 'beta');

pageinspect 查看原始页面:

SELECT lp, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
 lp | t_xmin | t_xmax | t_ctid
----+--------+--------+--------
  1 |    100 |      0 | (0,1)
  2 |    100 |      0 | (0,2)
(2 rows)

两条元组。都以 t_xmin = 100(执行 INSERT 的事务)和 t_xmax = 0(没有人删除它们)标记。在这个时刻,数据库上的每个会话都会看到这些行,因为所有人的快照都认定事务 100 已提交。

现在打开两个并发会话。会话 A 执行一个未提交的 UPDATE:

-- session A
BEGIN;
UPDATE mvcc_demo SET val = 'alpha-new' WHERE id = 1;
-- do not commit yet

再次查看页面:

SELECT lp, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
 lp | t_xmin | t_xmax | t_ctid
----+--------+--------+--------
  1 |    100 |    101 | (0,3)
  2 |    100 |      0 | (0,2)
  3 |    101 |      0 | (0,3)
(3 rows)

一次 UPDATE,三条元组。id=1 的旧版本仍在行指针 1 处,带有 t_xmax = 101 的标记,新版本在行指针 3 处,t_xmin = 101

会话 A 尚未提交。事务 101 仍在进行中。正在执行 SELECT * FROM mvcc_demo 的会话 B 仍然看到原始的 alpha,而不是 alpha-new。三条元组都在页面上,但会话 B 的快照认为 XID 101 正在进行中,忽略了它所做的任何修改。可见性判断是实时进行的,每次触碰元组时都会发生。

这是 MVCC 反直觉的部分:磁盘上的字节不会因为询问者的不同而改变。 改变的是读取它们时规划器应用的可视性判决。

快照

pg_current_snapshot() 是查看你的会话实际持有什么的最清晰方式。

SELECT pg_current_snapshot();
 pg_current_snapshot
---------------------
 101:103:101
(1 row)

这是 xmin:xmax:xip_list,这就是整个快照:

  • xmin:可能仍在进行中的最低 XID。低于此值的所有事务都已解决(已提交或已中止)。你可以信任它的 t_xmin/t_xmax 标记而无需进一步检查。
  • xmax:第一个尚未分配的 XID。等于此值或高于此值的任何值都不存在 yet。带有此值标记的元组必须被忽略。
  • xip_list:xmin 和 xmax 之间仍在运行的 XID。这些是"进行中"的事务,它们的写入对你不可见。

PostgreSQL 逐条元组地应用这个测试。如果你的快照认为 t_xmin 已中止或仍在进行中,这条元组对你来说不存在,PostgreSQL 会跳过它。如果 t_xmin 已提交,则由 t_xmax 决定:0 表示元组存活,已提交的 t_xmax 表示有人删除了它你看不到,进行中或已中止的 t_xmax 表示删除尚未到达你的快照。

相同的页面。相同的字节。不同的会话有不同的快照,所以对同一条元组会得出不同的结果。

交互式 MVCC 可视化器

针对同一个堆页面驱动两个并发会话。观察 xmin 和 xmax 标记的变化,在 READ COMMITTED 和 REPEATABLE READ 之间切换,逐条元组地追踪可见性规则,并在死版本堆积时运行 VACUUM。

打开可视化器

READ COMMITTED 与 REPEATABLE READ 的区别

PostgreSQL 两个最常用的隔离级别之间的差异归结为一个问题:快照何时捕获?

READ COMMITTED(默认)在每个语句开始时捕获一个的快照。如果另一个会话在你的第一个和第二个 SELECT 之间提交,你的第二个 SELECT 会看到变化。世界在你的事务下逐语句前进。

REPEATABLE READ 在事务开始时捕获一个快照,并在每个后续语句中重用它。从你的事务角度来看,世界是冻结的。其他会话可以提交上千次更改;你的查询持续返回在 BEGIN 时可见的内容。

页面上的字节在两种情况下完全相同。唯一的区别是你的事务携带哪个快照。

-- session A, READ COMMITTED (default)
BEGIN;
SELECT val FROM mvcc_demo WHERE id = 1;  -- 'alpha'

-- session B, in another terminal:
UPDATE mvcc_demo SET val = 'alpha-new' WHERE id = 1;
-- (auto-commits)

-- back in session A:
SELECT val FROM mvcc_demo WHERE id = 1;  -- 'alpha-new' new statement, new snapshot
COMMIT;

用 REPEATABLE READ 重复:

-- session A, REPEATABLE READ
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT val FROM mvcc_demo WHERE id = 1;  -- 'alpha-new'

-- session B:
UPDATE mvcc_demo SET val = 'alpha-newer' WHERE id = 1;
-- (auto-commits)

-- Back in session A:
SELECT val FROM mvcc_demo WHERE id = 1;  -- still 'alpha-new'  same snapshot as BEGIN
COMMIT;

可视化器直接展示这一点:每个会话上都有一个隔离级别选择器。在 REPEATABLE READ 下,快照在 BEGIN 时捕获并持久化。在 READ COMMITTED 下,每次运行 SELECT 时都会刷新。观察每条元组上的可见性标记如何相应地翻转。

每次 UPDATE 都会留下死元组

PostgreSQL 中的每次 UPDATE 都会创建一个新的元组版本。旧版本不会消失。它被标记上 t_xmax 并留在页面上占用空间,直到 VACUUM 清理它。

在有大量更新的繁忙表上,死元组的堆积速度可能超过 VACUUM 清理的速度。这就是"膨胀",它是团队认为 Postgres 需要调优的最常见原因。MVCC 契约("永不阻塞,始终提供一致的视图")是用磁盘空间支付的。

可以看到死元组的堆积情况用 pgstattuple

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- After lots of updates
SELECT table_len, tuple_count, dead_tuple_count, dead_tuple_percent
FROM pgstattuple('mvcc_demo');
 table_len | tuple_count | dead_tuple_count | dead_tuple_percent
-----------+-------------+------------------+--------------------
      8192 |           2 |                3 |              42.15
(1 row)

三条死元组,两条活元组,42% 的页面空间被浪费。这 42% 会一直浪费下去,直到 VACUUM 运行,或者直到下一个触碰这个页面的查询注意到死空间并触发页面级清理。

xmin 地平线

VACUUM 只能在没有运行中的事务可能仍需要看到它时回收死元组。如果会话 B 五分钟前启动了一个 REPEATABLE READ 事务并一直空闲,它的快照仍然认为 id=1 的更新前版本是活的。VACUUM 无法触碰它而不破坏那个会话。

所以 VACUUM 找到系统中最旧的活动事务,并拒绝清理任何比它更新的东西。一个长时间运行的 REPEATABLE READ 事务(比如,一个需要一小时的分析查询)实际上锁定在这段时间内产生的每个元组版本。表会持续膨胀。autovacuum 运行,发现没有允许它清理的东西,然后退出。

长时间运行的事务问题不是 MVCC 的 bug。它是 MVCC 按设计工作的结果。"读者永不阻塞"的代价是读者可以阻塞清理。如果你曾经在有问题的生产数据库上检查过 pg_stat_activity 并发现一个 14 小时前的 idle in transaction,你就知道这是怎么回事。

可视化器清楚地展示这一点:在会话 B 中启动一个 REPEATABLE READ 事务,让会话 A 运行大量 UPDATE 并 COMMIT,然后运行 VACUUM。回收计数不会包括会话 B 仍能看到的元组版本。

提示位:为什么 SELECT 会弄脏页面

第一次触碰有新写入的页面的 SELECT 可能导致页面被写回磁盘。不是因为 SELECT 修改了任何数据,而是因为它设置了提示位

当 PostgreSQL 遇到带有 t_xmin = 101 的元组并需要知道 101 是否已提交时,它不会凭空知道。它必须在 pg_xact(以前叫 pg_clog)中查找 101,即 commit log。一旦找到答案,它就将该答案缓存在元组的 t_infomask 位中(HEAP_XMIN_COMMITTEDHEAP_XMIN_INVALID)。未来的读者完全跳过 pg_xact 查找。

设置这些位是一次写操作。页面变脏了。最终被刷新。你无辜的 SELECT 最终触发了 I/O。

这就是为什么在冷表上运行 EXPLAIN (ANALYZE, BUFFERS) 有时会在计划只包含读取的情况下显示 dirtied 缓冲区。这也是为什么"批量加载后的第一次查询"模式有那个神秘的慢运行:你要为在数千个新写入的页面上设置提示位支付一次性成本。参见《理解 EXPLAIN Buffers》 了解更多关于这些计数器如何显示的信息。

一段话总结 MVCC 契约

每条元组携带 t_xmint_xmax。每个事务携带一个 (xmin, xmax, xip_list) 的快照。可见性是一个两阶段查找,比较两者。UPDATE 和 DELETE 不就地修改字节。它们在旧版本上标记 t_xmax 并追加新版本。VACUUM 清理死版本,但只能清理没有活动事务可能仍需要的那些。长时间运行的事务阻塞 VACUUM。每个 SELECT 第一次看到新数据时都可能弄脏一个页面,因为它在提示位中缓存提交状态。

每条元组 8 字节的 XID,加上每个事务一个三数快照,加上一个可见性函数。这就是整个机制,但后果蔓延到 PostgreSQL 运维的每个角落,从膨胀监控到复制到 autovacuum 调优。

关于完整的字节级tour(提示位编码、可见性图、冷冻、XID 回绕),存储系列详细涵盖这些。如果你从未观察过 MVCC 的发生,可视化器是建立直观理解最快的方式。让两个会话相互对抗,切换隔离级别,然后再回到这篇文章。

昨天以前首页

PostgreSQL User Management: Create Users and Grant Privileges

When you run a PostgreSQL database in production, you rarely want every application and every developer to connect as the postgres superuser. A clean setup gives each service its own login role with a scoped set of privileges, so a bug or a leaked password cannot touch unrelated data.

PostgreSQL handles this with roles. A role can represent a single user, a group, or both at the same time, and privileges are granted to roles rather than to raw login accounts. This guide explains how to create roles, set passwords, grant and revoke privileges, and clean up roles you no longer need.

Roles vs Users in PostgreSQL

Historically, PostgreSQL had separate CREATE USER and CREATE GROUP statements. Modern versions replaced both with a single concept: the role. A role with the LOGIN attribute can connect to the server, which makes it a user. A role without LOGIN is typically used as a group that other roles inherit privileges from.

In practice, CREATE USER is still valid and is treated as a shortcut for CREATE ROLE ... LOGIN. We will use both forms in this guide.

Connecting to PostgreSQL

All the commands below run inside the psql shell. On most systems you can open it as the postgres system user:

Terminal
sudo -u postgres psql

You will see a prompt like this:

output
postgres=#

Every SQL statement ends with a semicolon. If you forget it, psql keeps waiting for more input.

Creating a Role

The simplest form of CREATE ROLE takes just a name:

sql
CREATE ROLE linuxize;

This role exists but cannot log in yet and has no password.

To create a login role with a password, use LOGIN:

sql
CREATE ROLE linuxize_login WITH LOGIN PASSWORD 'strong_password_here';

The equivalent shortcut is:

sql
CREATE USER linuxize_user WITH PASSWORD 'strong_password_here';

Both statements produce the same result. Use whichever form reads more clearly in your scripts.

Warning
Do not commit passwords to version control. Keep them in a .env file, a secrets manager, or a provisioning tool, and make sure the file is listed in .gitignore.

Useful Role Attributes

You can combine several attributes in a single CREATE ROLE statement. These are the ones you will reach for most often:

  • LOGIN - The role can connect to the server.
  • PASSWORD 'secret' - Sets the login password.
  • SUPERUSER - Grants full access, equivalent to the postgres role. Use sparingly.
  • CREATEDB - The role may create new databases.
  • CREATEROLE - The role may create and modify other roles.
  • INHERIT - The role automatically inherits privileges of roles it is a member of. This is the default.
  • VALID UNTIL 'timestamp' - Expires the password at the given time.
  • CONNECTION LIMIT n - Caps the number of concurrent connections for this role.

For example, to create a login role that can also create databases and is limited to ten concurrent connections:

sql
CREATE ROLE app_owner WITH LOGIN PASSWORD 'strong_password_here' CREATEDB CONNECTION LIMIT 10;

Listing Existing Roles

To see every role on the server, use the \du meta-command:

sql
\du
output
 List of roles
Role name | Attributes
-----------+------------------------------------------------------------
app_owner | Create DB, 10 connections
linuxize |
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS

The Attributes column tells you what each role can do. A blank column means the role has no special attributes beyond the defaults, such as NOLOGIN and INHERIT.

Changing a Role

Use ALTER ROLE to change attributes, rename a role, or reset a password. To change the password:

sql
ALTER ROLE linuxize WITH PASSWORD 'new_password_here';

To grant an additional attribute:

sql
ALTER ROLE linuxize CREATEDB;

To remove one, prefix it with NO:

sql
ALTER ROLE linuxize NOCREATEDB;

Creating a Database for the Role

It is common to give each application its own database owned by its login role. Create the database and assign ownership in one statement:

sql
CREATE DATABASE linuxize_app OWNER linuxize;

The owner of a database has full control over it, so the role can create tables, schemas, and other objects without any further grants.

Granting Privileges

Privileges in PostgreSQL are granted at several levels: database, schema, table, column, sequence, and function. The syntax is consistent across levels:

sql
GRANT privilege_list ON object_type object_name TO role_name;

Database-level Privileges

To let a role connect to a database and create objects in it:

sql
GRANT CONNECT ON DATABASE linuxize_app TO linuxize;
GRANT CREATE ON DATABASE linuxize_app TO linuxize;

CONNECT controls whether the role can open a session to the database. CREATE controls whether it can create schemas.

Schema-level Privileges

To let a role create and use objects inside a schema:

sql
GRANT USAGE ON SCHEMA public TO linuxize;
GRANT CREATE ON SCHEMA public TO linuxize;

USAGE is required for almost every operation inside the schema. CREATE lets the role add new tables, views, or functions.

Table-level Privileges

Table privileges match the common SQL operations:

sql
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE orders TO linuxize;

To grant every available table privilege at once:

sql
GRANT ALL PRIVILEGES ON TABLE orders TO linuxize;

To grant the same privileges on every existing table in a schema:

sql
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO linuxize;

This only affects tables that exist at the moment you run the command. Tables created later are not covered.

Default Privileges for Future Objects

To cover tables created in the future, set default privileges:

sql
ALTER DEFAULT PRIVILEGES IN SCHEMA public
 GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO linuxize;

From this point on, any new table created in public by the current role inherits the listed privileges for linuxize.

Revoking Privileges

REVOKE is the mirror of GRANT and uses the same structure:

sql
REVOKE INSERT, UPDATE, DELETE ON TABLE orders FROM linuxize;

To strip every privilege on a table:

sql
REVOKE ALL PRIVILEGES ON TABLE orders FROM linuxize;

Revoking a privilege only affects the privileges you previously granted. Ownership is a separate concept: the owner of an object always keeps full control over it, regardless of grants.

Group Roles

To manage privileges for a team, create a role without LOGIN and add members to it:

sql
CREATE ROLE readonly;
GRANT CONNECT ON DATABASE linuxize_app TO readonly;
GRANT USAGE ON SCHEMA public TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

GRANT readonly TO linuxize;

linuxize now inherits every privilege granted to readonly. To remove the membership later:

sql
REVOKE readonly FROM linuxize;

This pattern scales better than granting privileges role by role, because you change permissions in one place.

Deleting a Role

To drop a role, use DROP ROLE:

sql
DROP ROLE linuxize;

PostgreSQL refuses the command if the role still owns any objects or holds any privileges. To clean these up first, reassign the objects and drop dependent privileges in each database where the role owns objects:

sql
REASSIGN OWNED BY linuxize TO postgres;
DROP OWNED BY linuxize;
DROP ROLE linuxize;

REASSIGN OWNED transfers ownership of objects owned by the role in the current database, and DROP OWNED removes any remaining privileges there. If the role owns objects in other databases, repeat the cleanup in each one before dropping the role.

Quick Reference

Task Statement
Create a login role CREATE ROLE name WITH LOGIN PASSWORD 'pass';
Create a user (shortcut) CREATE USER name WITH PASSWORD 'pass';
List roles \du
Change a password ALTER ROLE name WITH PASSWORD 'new';
Add attribute ALTER ROLE name CREATEDB;
Remove attribute ALTER ROLE name NOCREATEDB;
Create database with owner CREATE DATABASE db OWNER name;
Grant table privileges GRANT SELECT, INSERT ON TABLE t TO name;
Grant all table privileges in schema GRANT ... ON ALL TABLES IN SCHEMA public TO name;
Default privileges for new tables ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ... TO name;
Revoke privileges REVOKE ALL PRIVILEGES ON TABLE t FROM name;
Add role to group GRANT group_role TO name;
Drop role and its objects REASSIGN OWNED BY name TO postgres; DROP OWNED BY name; DROP ROLE name;

Troubleshooting

ERROR: permission denied for schema public
Grant USAGE on the schema: GRANT USAGE ON SCHEMA public TO role_name;. On PostgreSQL 15 and later, the public schema is no longer writable by default, so you may also need GRANT CREATE ON SCHEMA public.

ERROR: role "name" cannot be dropped because some objects depend on it
Reassign and drop the role’s objects first: REASSIGN OWNED BY name TO postgres; DROP OWNED BY name;, then run DROP ROLE name; again.

FATAL: password authentication failed for user
Check that the role has LOGIN and that the authentication method in pg_hba.conf matches the client (for example, md5 or scram-sha-256). Reload the configuration with SELECT pg_reload_conf(); after editing pg_hba.conf.

New tables are not visible to a read-only user
Grants on ALL TABLES IN SCHEMA only cover tables that exist at grant time. Set ALTER DEFAULT PRIVILEGES for the owning role so new tables inherit the read permissions automatically.

FAQ

What is the difference between CREATE USER and CREATE ROLE?
CREATE USER is a shortcut for CREATE ROLE ... LOGIN. Both create the same kind of object. Use CREATE USER when you want a login account and CREATE ROLE when you are creating a group role without login.

Can a single role be both a user and a group?
Yes. A role with LOGIN can still be granted to other roles. Members inherit its privileges as long as INHERIT is set, which is the default.

How do I change the owner of an existing database?
Use ALTER DATABASE db_name OWNER TO new_owner;. The new owner gains full control over the database and its objects.

Do I need to restart PostgreSQL after creating a role?
No. Role changes take effect immediately. Only changes to pg_hba.conf or the main server configuration require a reload or restart.

Conclusion

Roles are the single unit of access control in PostgreSQL. Give each application its own login role, group shared privileges into group roles, and use ALTER DEFAULT PRIVILEGES so future objects stay consistent with your current policy.

For related reading, see how to check the PostgreSQL version and the installation guide for your distribution under the PostgreSQL tag .

❌
❌