René Nyffenegger's collection of things on the web
Poor man's text index in Oracle (optimizing SQL queries with like '%...%')
One of the problems with like is that a select statement such as
select foo from bar where baz like '%something%'
most likely results in a full table scan. Because the pattern starts with a wildcard, Oracle cannot use
an ordinary index on baz to narrow down the rows, which in all but a few special cases makes the query slow.
Here, I want to present a solution that circumvents this problem.
However, the faster response time comes at a cost: more space is needed in a tablespace because
additional data must be stored.
The idea is the following: for each word, all possible substrings are stored. For example, for the word apple, the substrings
a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple and apple are stored together with the corresponding
rowid. Now, if someone searches for %ppl%, ppl is already stored, so we find
the rowid and hence the row containing ppl. (The implementation below stores only substrings of at least
three characters, converted to uppercase.)
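To make the idea concrete, here is a small sketch that lists every distinct substring of apple with plain SQL. It is not part of the solution itself. The connect by level row generator is an assumption on my side: it behaves reliably on Oracle 10g and later, and possibly not on older releases.

```sql
-- Enumerate all (start position, length) pairs that fit into the word
-- and cut out the corresponding substring.
select part
from (
  select distinct substr('apple', b.pos, l.len) as part
  from   (select level as pos from dual connect by level <= length('apple')) b,
         (select level as len from dual connect by level <= length('apple')) l
  where  b.pos + l.len - 1 <= length('apple')
)
order by length(part), part;
```

The query should list the 14 distinct substrings named above, from the single letters up to apple.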
First, we need a collection type. This is a sort of array that can store multiple
varchar2 values. It is needed for SplitLine (which uses the technique outlined in
returning a table from a pl/sql function).
create or replace type table_of_vc as table of varchar2(100);
/
SplitLine is a function that returns such an array. The array contains all words that are found within p_line. A word is
delimited by the elements of p_delim, which defaults to a single space. Words shorter than p_min_length
(3 by default) are left out.
create or replace Function SplitLine(
    p_line       in Varchar2,
    p_delim      in table_of_vc default table_of_vc(' '),
    p_min_length in Number      default 3)
  return table_of_vc
is
  /* v_delim_len contains the length of the delimiter used at each split */
  v_delim_len      number;
  v_split_pos      number;
  v_split_from_pos number := 1;
  v_split_str      varchar2(100);
  v_ret            table_of_vc := table_of_vc();
begin
  /* find the first delimiter, that is the one with the smallest position */
  begin
    select pos, len into v_split_pos, v_delim_len
    from (
      select len, pos, row_number() over (order by pos) r
      from (
        select length(column_value) len,
               instr(p_line, column_value, 1) pos
        from   table(p_delim)
      )
      where pos > 0
    )
    where r = 1;
  exception
    when no_data_found then v_split_pos := 0;
    when others        then return v_ret;
  end;

  while v_split_pos > 0 loop
    v_split_str := substr(p_line, v_split_from_pos, v_split_pos - v_split_from_pos);

    /* keep the word only if it is long enough */
    if length(v_split_str) >= p_min_length then
      v_ret.extend;
      v_ret(v_ret.count) := v_split_str;
    end if;

    v_split_from_pos := v_split_pos + v_delim_len;

    /* find the next delimiter, starting after the previous one */
    begin
      select pos, len into v_split_pos, v_delim_len
      from (
        select len, pos, row_number() over (order by pos) r
        from (
          select length(column_value) len,
                 instr(p_line, column_value, v_split_from_pos) pos
          from   table(p_delim)
        )
        where pos > 0
      )
      where r = 1;
    exception
      when no_data_found then v_split_pos := 0;
    end;
  end loop;

  /* whatever follows the last delimiter is the last word */
  v_split_str := substr(p_line, v_split_from_pos);
  if length(v_split_str) >= p_min_length then
    v_ret.extend;
    v_ret(v_ret.count) := v_split_str;
  end if;

  return v_ret;
end SplitLine;
/
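A quick way to check SplitLine is to select from its result. The sample sentence is made up for this illustration, and the cast is written in the same form that Partials uses further down.

```sql
-- Default delimiter (a space) and default minimum length (3):
-- the two-letter word "to" is expected to be filtered out.
select column_value as word
from   table(cast(SplitLine('apple banana to tomato') as table_of_vc));
```

With the defaults, this should return apple, banana and tomato.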
Partials is the function that splits its argument into words and returns, for each word, all substrings whose
length is at least min_length (3 by default).
The returned substrings are all in uppercase.
create or replace Function Partials(
    p_w        in Varchar2,
    min_length in Number      default 3,
    p_delim    in table_of_vc default table_of_vc(' ', ',', '(', ')', '-'))
  return table_of_vc
is
  v_beg number;
  v_end number;
  v_len number;
  v_ret table_of_vc := table_of_vc();
begin
  /* min_length is passed on so that SplitLine does not drop words
     that are shorter than its own default of 3 */
  for r in (select column_value w
            from   Table(cast(SplitLine(p_w, p_delim, min_length) as table_of_vc)))
  loop
    v_beg := 1;
    v_len := length(r.w);

    /* v_beg is the start position of the substring, v_end its end position */
    while v_beg + min_length <= v_len + 1 loop
      v_end := v_beg + min_length - 1;

      while v_end <= v_len loop
        v_ret.extend;
        v_ret(v_ret.count) := upper(substr(r.w, v_beg, v_end - v_beg + 1));
        v_end := v_end + 1;
      end loop;

      v_beg := v_beg + 1;
    end loop;
  end loop;

  return v_ret;
end Partials;
/
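As a sanity check, Partials can be queried directly. Following the two loops, a word of length n produces (n - m + 1) * (n - m + 2) / 2 entries for a minimum length m, so apple (n = 5, m = 3) should produce 6.

```sql
-- Expected entries for 'apple': APP, APPL, APPLE, PPL, PPLE, PLE
select column_value as part
from   table(cast(Partials('apple') as table_of_vc));
```

This count is what drives the extra space mentioned at the beginning: longer words add entries roughly quadratically.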
Here's the table that needs to be indexed:
create table table_with_words (
  words varchar2(50),
  n     number
);
And it's going to be filled with some values:
insert into table_with_words values ('apple banana tomato'  , 10);
insert into table_with_words values ('chicago geneva london', 20);
insert into table_with_words values ('thomas barbara mickey', 30);
insert into table_with_words values ('green blue orange'    , 40);
commit;
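Also, our self-made index table is needed. The column s must be wide enough for the longest possible
substring, which is a whole word of up to 50 characters:
create table table_with_words_ix (
  r rowid,
  s varchar2(50)
);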
And this table must be filled, of course:
insert into table_with_words_ix
  select rowid, column_value
  from   table_with_words, table(partials(words));
commit;
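The lookup table only pays off if searching it by substring is cheap, so it presumably wants an ordinary index on s. The statement below is a sketch. The index name is my own choice, and including r as a second column is an assumption that lets Oracle answer the lookup from the index alone.

```sql
-- Hypothetical index name; the columns are those of table_with_words_ix.
create index table_with_words_ix_s on table_with_words_ix (s, r);
```

An equality search on s can then use an index range scan instead of reading the whole lookup table.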
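Now, we want to know where we find rang. Because the substrings are stored in uppercase, the search term
must be in uppercase as well (and be at least three characters long):
select n
from   table_with_words_ix i
join   table_with_words t on i.r = t.rowid
where  i.s = 'RANG';
With the data above, this returns 40, the row that contains orange.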
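One thing to watch with the join: a substring can occur more than once in the same row (ANA occurs twice in banana), in which case the lookup table holds it twice and the join returns the row twice. A possible variant, sketched here, uses an in subquery and applies upper to the search term.

```sql
-- Each matching row is returned once, however often the fragment
-- occurs within it; upper() makes the search case-insensitive.
select t.n, t.words
from   table_with_words t
where  t.rowid in (
         select i.r
         from   table_with_words_ix i
         where  i.s = upper('ana')
       );
```

With the sample data this should return the single row with n = 10.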
This works on Oracle 9i. It can probably be adapted for 8i as well.
A note on using rowid
Daniel A. Morgan points out that a possible problem with using rowids is that the approach
assumes that rowids are stable. This assumption does not hold for
index-organized tables and partitioned tables.
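If rowid stability is a concern, one conceivable way around it is to store a key value instead of the rowid. The following is only a sketch under that assumption: the tables keyed_words and keyed_words_ix and the column id are hypothetical and do not appear in the example above.

```sql
-- Hypothetical variant of the source table with an explicit primary key.
create table keyed_words (
  id    number primary key,
  words varchar2(50)
);

-- The lookup table refers to the key instead of the rowid.
create table keyed_words_ix (
  id number references keyed_words (id),
  s  varchar2(50)
);

insert into keyed_words values (1, 'apple banana tomato');

insert into keyed_words_ix
  select k.id, p.column_value
  from   keyed_words k,
         table(cast(Partials(k.words) as table_of_vc)) p;
commit;

-- The lookup goes through the key rather than the rowid.
select k.words
from   keyed_words k
where  k.id in (select i.id from keyed_words_ix i where i.s = 'PPL');
```

The price is one more indexed access through the primary key where the rowid version goes to the row directly.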
Links
See also On splitting a string into words with regular expressions where a function uses
regexp_substr to split a string.