November 4, 2005: On eliminating outliers in avg() queries

create table elim_lowest_highest (
  txt varchar2(10),
  num number
);

insert into elim_lowest_highest values ('foo',  1);
insert into elim_lowest_highest values ('foo', 10);
insert into elim_lowest_highest values ('foo', 11);
insert into elim_lowest_highest values ('foo', 12);
insert into elim_lowest_highest values ('foo', 90);

insert into elim_lowest_highest values ('bar', 12);
insert into elim_lowest_highest values ('bar', 28);
insert into elim_lowest_highest values ('bar', 29);
insert into elim_lowest_highest values ('bar', 30);
insert into elim_lowest_highest values ('bar', 31);
insert into elim_lowest_highest values ('bar', 32);
insert into elim_lowest_highest values ('bar',154);

Now, you want to select the average num grouped by txt from this table. However, you realize that the lowest and highest values per group (txt) are outliers and you want to eliminate those. In our case, these outliers are 1 and 90 where txt='foo' and 12 and 154 where txt='bar'.

select txt, avg(num) 
  from (
    select row_number() over (partition by txt order by num     ) row_num_asc ,
           row_number() over (partition by txt order by num desc) row_num_desc,
           txt, num
      from elim_lowest_highest
  )
where row_num_desc <> 1 and row_num_asc <> 1
group by txt;

TXT          AVG(NUM)
---------- ----------
bar                30
foo                11

Of course, if you suspect that there are more outliers per groups, for example three low outliers and three high outliers, the where clause would then read where row_num_desc > 3 and row_num_asc > 3. Also, a group needs then to have at least seven rows to show up in the result set.

How does this work

select row_number() over (partition by txt order by num     ) row_num_asc ,
       row_number() over (partition by txt order by num desc) row_num_desc,
       txt, num
  from elim_lowest_highest;

ROW_NUM_ASC ROW_NUM_DESC TXT               NUM
----------- ------------ ---------- ----------
          1            7 bar                12
          2            6 bar                28
          3            5 bar                29
          4            4 bar                30
          5            3 bar                31
          6            2 bar                32
          7            1 bar               154
          1            5 foo                 1
          2            4 foo                10
          3            3 foo                11
          4            2 foo                12
          5            1 foo                90

The result shows that the row_number() function results in an ascending and a descending series of numbers. It is reset for each new group (through the power of partition by txt). The outer select statement will then disregard all rows that have either a 1 in row_num_asc or a 1 in row_num_desc.


	René Nyffenegger's collection of things on the web
	René Nyffenegger on Oracle - Most wanted - Feedback - Follow @renenyffenegger

November 4, 2005: On eliminating outliers in avg() queries

How does this work

More on Oracle