| René Nyffenegger's collection of things on the web | |
|
René Nyffenegger on Oracle - Most wanted - Feedback
|
Poor man's text index in Oracle (optimizing SQL queries with like '%...%') | ||
|
One of the problems with like is that if a select statement is performed like this:
select foo from bar where baz like '%something%', it most likely results in a
full table scan, because it can't use an index, which in most cases (but a few special cases)
makes the query slow.
Here, I want to present a solution to circumvent this problem.
However, the faster answer time comes at a cost: More space is needed in a tablespace because
additional data must be stored.
The idea is the following: for a word, all possible substrings are stored. For example, for the word apple, the substrings
a, p, l , e, ap, pp, pl, le, app, ppl, ple, appl, pple and apple will be stored together with the corresponding
rowid. Now, if someone is searching for %ppl%, we have ppl already stored, and find
the rowid and hence the row containing ppl.
First, we need a collection type. This is sort of an array that can store multiple
varchar2's. It will be needed for SplitLine (which uses the technique outlined in
returning a table from a pl/sql function).
create or replace type table_of_vc as table of varchar2(100); /
SplitLine is a function, that returns such an array. This array contains all words that are found within p_w. A word is
delimited by the elements of p_delim. In our case, the default of a space will do.
create or replace Function SplitLine(
p_line in Varchar2,
p_delim in table_of_vc default table_of_vc(' '),
p_min_length in Number default 3)
return table_of_vc
is
/* v_delim_len containts the length of the used delimiter after each split */
v_delim_len number;
v_split_pos number;
v_split_from_pos number := 1;
v_split_str varchar2(100);
v_ret table_of_vc := table_of_vc();
begin
begin
select
pos , len into v_split_pos, v_delim_len
from (
select
len,
pos,
row_number () over (order by pos) r
from (
select
length(column_value) len,
instr(p_line,column_value,1) pos
from
table(p_delim)
) where pos > 0
) where r = 1;
exception
when no_data_found then
v_split_pos := 0;
when others then
return v_ret;
end;
while v_split_pos > 0 loop
v_split_str := substr(p_line, v_split_from_pos, v_split_pos-v_split_from_pos);
if length(v_split_str) >= p_min_length then
v_ret.extend;
v_ret(v_ret.count) := v_split_str;
end if;
v_split_from_pos := v_split_pos + v_delim_len;
begin
select
pos , len into v_split_pos, v_delim_len
from (
select
len,
pos,
row_number () over (order by pos) r
from (
select
length(column_value) len,
instr(p_line,column_value,v_split_from_pos) pos
from
table(p_delim)
) where pos > 0
) where r = 1;
exception
when no_data_found then
v_split_pos := 0;
end;
end loop;
v_split_str := substr(p_line,v_split_from_pos);
if length (v_split_str) >= p_min_length then
v_ret.extend;
v_ret(v_ret.count) := v_split_str;
end if;
return v_ret;
end SplitLine;
/
Partials is now the function, that returns all substrings of a word whose length is at least 3 (parameter min_length).
The returned substrings are all in uppercase.
create or replace Function Partials(
p_w in Varchar2,
min_length in Number default 3,
p_delim in table_of_vc default table_of_vc(' ',',','(',')','-')) return table_of_vc
is
v_beg number;
v_end number;
v_len number;
v_ret table_of_vc := table_of_vc();
begin
for r in (select column_value w from Table(cast(SplitLine(p_w,p_delim) as table_of_vc))) loop
v_beg:=1;
v_len:=length(r.w);
while v_beg + min_length <= v_len+1 loop
v_end := v_beg + min_length - 1;
while v_end <= v_len loop
v_ret.extend;
v_ret(v_ret.count) := upper(substr(r.w,v_beg,v_end-v_beg+1));
v_end := v_end + 1;
end loop;
v_beg := v_beg + 1;
end loop;
end loop;
return v_ret;
end Partials;
/
Here's the table that needs to be indexed:
create table table_with_words ( words varchar2(50), n number );
And it's going to be filled with some values:
insert into table_with_words values ('apple banana tomato' , 10);
insert into table_with_words values ('chicago geneva london',20);
insert into table_with_words values ('thomas barbara mickey',30);
insert into table_with_words values ('green blue orange', 40);
commit;
Also, our self made index table is needed:
create table table_with_words_ix( r rowid, s varchar2(20) );
And this table must be filled, of course:
insert into table_with_words_ix
select
rowid,
column_value
from
table_with_words,
table(partials(words));
commit;
Now, we want to know where we find rang:
select n from table_with_words_ix i join table_with_words t on i.r = t.rowid where i.s = 'RANG';
This works on Oracle 9i. Probably, it will be possible to adapt it for 8i as well.
A note on using rowid
Daniel A. Morgan points out that a possible problem with using rowids is that it
assumes that rowids are stable. This assumption is not valid with
index organized tables and partitioned tables.
Links
See also On splitting a string into words with regular expressions where a function uses
regexp_substr to split a string.
|