Search notes:

SQL: PERCENTILE_DISC

This is an attempt at trying to explain the percentile_disc() SQL analytic function.
percentile_disc(p) within group (over order by val), with 0<= p <= 1, returns a value r such that p*100 percent of the values of val are less than r, within a given group (thus the clause within group).
I am trying to demonstrate that with the following SQL snippets.

Test data

First, we need to create a table …
create table tq84_p (
   item      varchar(10),
   val_one   integer    ,
   val_two   integer   
);
Github repository about-SQL, path: /select/analytic/percentile/disc/create-table.sql
… and fill it with some simple test data:
insert into tq84_p values ('foo',  0,       1);
insert into tq84_p values ('foo',  1,       7);
insert into tq84_p values ('foo',  2,       3);
insert into tq84_p values ('foo',  3,       3);
insert into tq84_p values ('foo',  4,       4);
insert into tq84_p values ('foo',  5,      10);
insert into tq84_p values ('foo',  6,      13);
insert into tq84_p values ('foo',  7,      11);
insert into tq84_p values ('foo',  8,      14);
insert into tq84_p values ('foo',  9,      13);
insert into tq84_p values ('foo', 10, 9999999);
Github repository about-SQL, path: /select/analytic/percentile/disc/insert-1.sql

Calculate some percentiles

I now let SQL calculate the 20th, 40th and 70th percentile of the data.
select
   percentile_disc(0.2) within group (order by val_one) percentile_20th,
   percentile_disc(0.4) within group (order by val_one) percentile_40th,
   percentile_disc(0.7) within group (order by val_one) percentile_70th
from
   tq84_p;
Github repository about-SQL, path: /select/analytic/percentile/disc/select-some-percentiles.sql
The select statement returns 2, 4 and 7.
2 is the 20th percentile of the data because 20 % of data data are less than 2 (namely 0 and 1). The same goes for the other two numbers as well.

Implementation on different databases

I was able to run this Query on Oracle and PostgreSQL.
SQL Server 2017 returns the error message Msg 10753 The function 'percentile_disc' must have an OVER clause.
SQLite 3.27.2 returned near "(": syntax error
MySQL was unable to execute the query, too.

50th percentile / median vs average

The 50th percentile is also called the median. The number of records with a value less than the median is (± 1) the same as the number of records with a value greater then the median.
The following query compares the median with the average or mean (avg()).
select
   percentile_disc(0.5) within group (order by val_one) median_val_one,
   avg            (                            val_one) mean_val_one  ,
   percentile_disc(0.5) within group (order by val_two) median_val_two,
   avg            (                            val_two) mean_val_two
from
   tq84_p;
Github repository about-SQL, path: /select/analytic/percentile/disc/compare-median-avg.sql
The query returns 5 (= the median aka 50th percentile of val_one), 5 (the average of val_one), 10 (the median of val_two) and 909098 (the mean of val_two).
This result demonstrates that the median is much more robust against outliers. The high value of 9999999 drags the average into a region that does not look like the original data at all while the median still is in the region of most data of the original data.

Combining with group by

percentile_disc (as also any other analytic function) can also be combined with group by.
I Insert a few more records for another group (item = 'bar'):
insert into tq84_p values ('bar', 1, 3);
insert into tq84_p values ('bar', 3, 4);
insert into tq84_p values ('bar', 8, 8);
Github repository about-SQL, path: /select/analytic/percentile/disc/insert-item-bar.sql
select
   item,
   percentile_disc(0.5) within group (order by val_one) median_val_one
from
   tq84_p
group by
   item;
Github repository about-SQL, path: /select/analytic/percentile/disc/select-group-by.sql
The query now returns the median for both groups (bar: 3, baz: 5).

Combining with over clause

percentile_disc can also be combined with the over() clause so that the query returns both, the original data and the calculated percentile.
select
   item,
   val_two,
   percentile_disc(0.5) within group (order by val_two) over (partition by item) median_val_two
from
   tq84_p;
Github repository about-SQL, path: /select/analytic/percentile/disc/select-over.sql
Now, that the select statement has the over() clause, it is also possible to run in in SQL Server 2017.

See also

PERCENTILE_CONT interpolates the returned value.
Relationship between the analytic functions ntile and percentile_disc
Analytic functions
percentile

Index