PostgreSQL窗口函数中 ROWS 和 RANGE 模式的区别
Contents
示例表数据如下:
test=# select * from tcost ;
path | cost
------+-------
111 | 23.3
111 | 33.4
111 | 3.4
222 | 3.4
222 | 33.4
222 | 333.4
32 | 3.4
32 | 0.4
32 | 0.04
(9 rows)
test=#
ROWS
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (order by cost desc rows between current row and 2 following ) as row from tcost;
path | cost | sum_cost | row
------+-------+----------+-------
222 | 333.4 | 333.4 | 400.2
111 | 33.4 | 400.2 | 90.1
222 | 33.4 | 400.2 | 60.1
111 | 23.3 | 423.5 | 30.1
111 | 3.4 | 433.7 | 10.2
32 | 3.4 | 433.7 | 7.2
222 | 3.4 | 433.7 | 3.84
32 | 0.4 | 434.1 | 0.44
32 | 0.04 | 434.14 | 0.04
(9 rows)
test=#
可以看到 max2 的值都是由 current row (当前行) and 2 following(紧接着2行)的sum()结果出来的。 即 400.2 = 333.4 + 33.4 + 33.4 得出的。 90.1 = 33.4 + 33.4 + 23.3 得出的。
注意,上面那条SQL没有写 partition by ,那默认情况下就是以整个表来表示窗口化的,即只有一个窗口。现在试着,添加上partition by 语句的结果看看:
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (partition by path order by cost desc rows between current row and 2 following ) as row from tcost;
path | cost | sum_cost | row
------+-------+----------+-------
32 | 3.4 | 433.7 | 3.84
32 | 0.4 | 434.1 | 0.44
32 | 0.04 | 434.14 | 0.04
111 | 33.4 | 400.2 | 60.1
111 | 23.3 | 423.5 | 26.7
111 | 3.4 | 433.7 | 3.4
222 | 333.4 | 333.4 | 370.2
222 | 33.4 | 400.2 | 36.8
222 | 3.4 | 433.7 | 3.4
(9 rows)
test=#
当 partition by path时,可以看到,这些ROWS 模式,都是在当前所在的窗口来进行的,并不会跨窗口来进行。 所以,在这里强调一下,ROWS表示的是物理行。
RANGE
看看,当是RANGE时的结果
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (order by cost desc range between current row and UNBOUNDED following ) as range from tcost;
path | cost | sum_cost | range
------+-------+----------+--------
222 | 333.4 | 333.4 | 434.14
111 | 33.4 | 400.2 | 100.74
222 | 33.4 | 400.2 | 100.74
111 | 23.3 | 423.5 | 33.94
111 | 3.4 | 433.7 | 10.64
32 | 3.4 | 433.7 | 10.64
222 | 3.4 | 433.7 | 10.64
32 | 0.4 | 434.1 | 0.44
32 | 0.04 | 434.14 | 0.04
(9 rows)
test=#
可以看到,RANGE时,相同数据的会被合并到一起再来进行计算,也表明,列中具有相同值的range的值也是相同的,并且结果是它们合并后进行计算后的结果。
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (order by cost desc range between current row and UNBOUNDED following ) as range,sum(cost) over (order by cost desc rows between current row and UNBOUNDED following ) as row from tcost;
path | cost | sum_cost | range | row
------+-------+----------+--------+--------
222 | 333.4 | 333.4 | 434.14 | 434.14
111 | 33.4 | 400.2 | 100.74 | 100.74
222 | 33.4 | 400.2 | 100.74 | 67.34
111 | 23.3 | 423.5 | 33.94 | 33.94
111 | 3.4 | 433.7 | 10.64 | 10.64
32 | 3.4 | 433.7 | 10.64 | 7.24
222 | 3.4 | 433.7 | 10.64 | 3.84
32 | 0.4 | 434.1 | 0.44 | 0.44
32 | 0.04 | 434.14 | 0.04 | 0.04
(9 rows)
test=#
这里可以非常明显看到RANGE和ROWS的区别。 ROWS:是按物理行来进行区分的 RANGE:是按数值进行逻辑区分的
RANGE 和 ROWS 在PostgreSQL中的语法
[ RANGE | ROWS ] frame_start
[ RANGE | ROWS ] BETWEEN frame_start AND frame_end
frame_start 和 frame_end可以是:
UNBOUNDED PRECEDING
value PRECEDING
CURRENT ROW
value FOLLOWING
UNBOUNDED FOLLOWING
特别注意:value PRECEDING和value FOLLOWING 当前只允许ROWS模式。 RANGE模式后面只能接 UNBOUNDED FOLLOWING。
默认的框架选项是RANGE UNBOUNDED PRECEDING,该选项与 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW相同。有ORDER BY, 它设置框架从分区的开始一直到与当前行相同的最后一行。没有ORDER BY, 那么就是当前分区的所有行都包含在框架中,因为所有行都会成为当前行的相同行。
RANGE时,请注意有没有 order by 的区别
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (range between current row and UNBOUNDED following ) as no_order_by_range,sum(cost) over (order by cost desc range between current row and UNBOUNDED following ) as has_order_by_range from tcost;
path | cost | sum_cost | no_order_by_range | has_order_by_range
------+-------+----------+-------------------+--------------------
222 | 333.4 | 333.4 | 434.14 | 434.14
111 | 33.4 | 400.2 | 434.14 | 100.74
222 | 33.4 | 400.2 | 434.14 | 100.74
111 | 23.3 | 423.5 | 434.14 | 33.94
111 | 3.4 | 433.7 | 434.14 | 10.64
32 | 3.4 | 433.7 | 434.14 | 10.64
222 | 3.4 | 433.7 | 434.14 | 10.64
32 | 0.4 | 434.1 | 434.14 | 0.44
32 | 0.04 | 434.14 | 434.14 | 0.04
(9 rows)
test=#
没有ORDER BY, 那么就是当前分区的所有行都包含在框架中,因为所有行都会成为当前行的相同行。
ROWS时,请注意有没有 order by 的区别
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (rows between current row and UNBOUNDED following ) as no_order_by_rows,sum(cost) over (order by cost desc rows between current row and UNBOUNDED following ) as has_order_by_rows from tcost;
path | cost | sum_cost | no_order_by_rows | has_order_by_rows
------+-------+----------+------------------+-------------------
222 | 333.4 | 333.4 | 434.14 | 434.14
111 | 33.4 | 400.2 | 100.74 | 100.74
222 | 33.4 | 400.2 | 67.34 | 67.34
111 | 23.3 | 423.5 | 33.94 | 33.94
111 | 3.4 | 433.7 | 10.64 | 10.64
32 | 3.4 | 433.7 | 7.24 | 7.24
222 | 3.4 | 433.7 | 3.84 | 3.84
32 | 0.4 | 434.1 | 0.44 | 0.44
32 | 0.04 | 434.14 | 0.04 | 0.04
(9 rows)
test=#
有没有ORDER BY,都是一样的,因为ROWS是按物理分行的,而不是按逻辑分行的。
总结
ROWS:是按物理行来进行窗口级别里再次进行范围选择的。
RANGE:是按逻辑行来进行窗口级别里再次进行范围选择的。RANGE时,相同行会被合并成同一条数据再进行计算,相同行窗口计算时的结果也是相同的。
是否是相同行,是根据ORDER BY排序时的结果决定的。
有ORDER BY时:同行是说在ORDER BY排序时不唯一的行。【即具有相同数值的行】
不同行是说ORDER BY排序时具有不同的数值的行。
没有ORDER BY:那么就是当前分区的所有行都包含在框架中,因为所有行都会成为当前行的相同行。【特别要注意最后一句的意思】