René Nyffenegger's collection of things on the web | |
René Nyffenegger on Oracle - Most wanted - Feedback
- Follow @renenyffenegger
|
November 4, 2005: On eliminating outliers in avg() queries | ||
Suppose, you have a table that stores some text along with a number:
create table elim_lowest_highest ( txt varchar2(10), num number );
and is filled with the following values:
insert into elim_lowest_highest values ('foo', 1); insert into elim_lowest_highest values ('foo', 10); insert into elim_lowest_highest values ('foo', 11); insert into elim_lowest_highest values ('foo', 12); insert into elim_lowest_highest values ('foo', 90); insert into elim_lowest_highest values ('bar', 12); insert into elim_lowest_highest values ('bar', 28); insert into elim_lowest_highest values ('bar', 29); insert into elim_lowest_highest values ('bar', 30); insert into elim_lowest_highest values ('bar', 31); insert into elim_lowest_highest values ('bar', 32); insert into elim_lowest_highest values ('bar',154);
Now, you want to select the average num grouped by txt from this table. However, you realize that the lowest and highest
values per group (txt) are outliers and you want to eliminate those. In our case, these outliers are 1 and 90 where txt='foo' and
12 and 154 where txt='bar'.
The row_number() analytic function
comes in handy for this task.
select txt, avg(num) from ( select row_number() over (partition by txt order by num ) row_num_asc , row_number() over (partition by txt order by num desc) row_num_desc, txt, num from elim_lowest_highest ) where row_num_desc <> 1 and row_num_asc <> 1 group by txt;
Here's the result:
TXT AVG(NUM) ---------- ---------- bar 30 foo 11
The outliers have effectively been eliminated: 30 is the average of 28, 29, 30, 31 and 32 while 11 is the average of
10, 11 and 12.
Of course, if you suspect that there are more outliers per groups, for example three low outliers and three high outliers, the
where clause would then read
where row_num_desc > 3 and row_num_asc > 3 . Also, a group needs then to have at least
seven rows to show up in the result set.
How does this work
IMHO, the easiest way to understand how this query works is by inspecting what the query's subselect returns:
select row_number() over (partition by txt order by num ) row_num_asc , row_number() over (partition by txt order by num desc) row_num_desc, txt, num from elim_lowest_highest; ROW_NUM_ASC ROW_NUM_DESC TXT NUM ----------- ------------ ---------- ---------- 1 7 bar 12 2 6 bar 28 3 5 bar 29 4 4 bar 30 5 3 bar 31 6 2 bar 32 7 1 bar 154 1 5 foo 1 2 4 foo 10 3 3 foo 11 4 2 foo 12 5 1 foo 90
The result shows that the row_number() function results in an ascending and a descending series of numbers. It is reset for each new
group (through the power of
partition by txt ). The outer select statement will then disregard all rows that have either a
1 in row_num_asc or a 1 in row_num_desc.
More on OracleThis is an on Oracle article. The most current articles of this series can be found here.
|