René Nyffenegger's collection of things on the web
René Nyffenegger on Oracle - Most wanted - Feedback
 

November 4, 2005: On eliminating outliers in avg() queries

Suppose, you have a table that stores some text along with a number:
create table elim_lowest_highest (
  txt varchar2(10),
  num number
);
and is filled with the following values:
insert into elim_lowest_highest values ('foo',  1);
insert into elim_lowest_highest values ('foo', 10);
insert into elim_lowest_highest values ('foo', 11);
insert into elim_lowest_highest values ('foo', 12);
insert into elim_lowest_highest values ('foo', 90);

insert into elim_lowest_highest values ('bar', 12);
insert into elim_lowest_highest values ('bar', 28);
insert into elim_lowest_highest values ('bar', 29);
insert into elim_lowest_highest values ('bar', 30);
insert into elim_lowest_highest values ('bar', 31);
insert into elim_lowest_highest values ('bar', 32);
insert into elim_lowest_highest values ('bar',154);
Now, you want to select the average num grouped by txt from this table. However, you realize that the lowest and highest values per group (txt) are outliers and you want to eliminate those. In our case, these outliers are 1 and 90 where txt='foo' and 12 and 154 where txt='bar'.
The row_number() analytic function comes in handy for this task.
select txt, avg(num) 
  from (
    select row_number() over (partition by txt order by num     ) row_num_asc ,
           row_number() over (partition by txt order by num desc) row_num_desc,
           txt, num
      from elim_lowest_highest
  )
where row_num_desc <> 1 and row_num_asc <> 1
group by txt;
Here's the result:
TXT          AVG(NUM)
---------- ----------
bar                30
foo                11
The outliers have effectively been eliminated: 30 is the average of 28, 29, 30, 31 and 32 while 11 is the average of 10, 11 and 12.
Of course, if you suspect that there are more outliers per groups, for example three low outliers and three high outliers, the where clause would then read where row_num_desc > 3 and row_num_asc > 3. Also, a group needs then to have at least seven rows to show up in the result set.

How does this work

IMHO, the easiest way to understand how this query works is by inspecting what the query's subselect returns:
select row_number() over (partition by txt order by num     ) row_num_asc ,
       row_number() over (partition by txt order by num desc) row_num_desc,
       txt, num
  from elim_lowest_highest;
ROW_NUM_ASC ROW_NUM_DESC TXT               NUM
----------- ------------ ---------- ----------
          1            7 bar                12
          2            6 bar                28
          3            5 bar                29
          4            4 bar                30
          5            3 bar                31
          6            2 bar                32
          7            1 bar               154
          1            5 foo                 1
          2            4 foo                10
          3            3 foo                11
          4            2 foo                12
          5            1 foo                90
The result shows that the row_number() function results in an ascending and a descending series of numbers. It is reset for each new group (through the power of partition by txt). The outer select statement will then disregard all rows that have either a 1 in row_num_asc or a 1 in row_num_desc.

More on Oracle

This is an on Oracle article. The most current articles of this series can be found here.