René Nyffenegger's collection of things on the web
René Nyffenegger on Oracle - Most wanted - Feedback -
 

Poor man's text index in Oracle (optimizing SQL queries with like '%...%')

One of the problems with like is that if a select statement is performed like this: select foo from bar where baz like '%something%', it most likely results in a full table scan, because it can't use an index, which in most cases (but a few special cases) makes the query slow.
Here, I want to present a solution to circumvent this problem.
However, the faster answer time comes at a cost: More space is needed in a tablespace because additional data must be stored.
The idea is the following: for a word, all possible substrings are stored. For example, for the word apple, the substrings a, p, l , e, ap, pp, pl, le, app, ppl, ple, appl, pple and apple will be stored together with the corresponding rowid. Now, if someone is searching for %ppl%, we have ppl already stored, and find the rowid and hence the row containing ppl.
First, we need a collection type. This is sort of an array that can store multiple varchar2's. It will be needed for SplitLine (which uses the technique outlined in returning a table from a pl/sql function).
create or replace type table_of_vc as table of varchar2(100);
/
SplitLine is a function, that returns such an array. This array contains all words that are found within p_w. A word is delimited by the elements of p_delim. In our case, the default of a space will do.
create or replace Function SplitLine(
            p_line       in Varchar2,
            p_delim      in table_of_vc  default table_of_vc(' '),
            p_min_length in Number default 3) 
         return table_of_vc 
is
  /* v_delim_len containts the length of the used delimiter after each split */
  v_delim_len  number;

  v_split_pos  number;
  v_split_from_pos number := 1;
  v_split_str  varchar2(100);
  v_ret        table_of_vc := table_of_vc();

begin

  begin
    select 
      pos , len into v_split_pos, v_delim_len
    from (
      select 
        len,
        pos,
        row_number () over (order by pos) r
      from (
        select 
          length(column_value)               len,
          instr(p_line,column_value,1)       pos
        from 
          table(p_delim)
        ) where pos > 0
    ) where r = 1;

  exception
    when no_data_found then
      v_split_pos := 0;
    when others then
      return v_ret;
  end;


  while v_split_pos > 0 loop 

    v_split_str  := substr(p_line, v_split_from_pos, v_split_pos-v_split_from_pos);
    if length(v_split_str) >= p_min_length then
      v_ret.extend;
      v_ret(v_ret.count) := v_split_str;
    end if;

    v_split_from_pos  := v_split_pos + v_delim_len;

    begin
      select 
        pos , len into v_split_pos, v_delim_len
      from (
        select 
          len,
          pos,
          row_number () over (order by pos)    r
        from (
          select 
            length(column_value)                    len,
            instr(p_line,column_value,v_split_from_pos) pos
          from 
            table(p_delim)
          ) where pos > 0
      ) where r = 1;
    exception
      when no_data_found then
        v_split_pos := 0;
    end;


  end loop;

  v_split_str  := substr(p_line,v_split_from_pos);
  if length (v_split_str) >= p_min_length then
    v_ret.extend;
    v_ret(v_ret.count) := v_split_str;
  end if;

  return v_ret;
end SplitLine;
/
Partials is now the function, that returns all substrings of a word whose length is at least 3 (parameter min_length). The returned substrings are all in uppercase.
create or replace Function Partials(
             p_w        in Varchar2, 
             min_length in Number default 3,
             p_delim    in table_of_vc default table_of_vc(' ',',','(',')','-')) return table_of_vc
is
  v_beg number;
  v_end number;
  v_len number;
  v_ret table_of_vc := table_of_vc();
begin
 
  for r in (select column_value w from Table(cast(SplitLine(p_w,p_delim) as table_of_vc))) loop

    v_beg:=1;
    v_len:=length(r.w);

    while v_beg + min_length <= v_len+1 loop
      v_end := v_beg + min_length - 1;
      while v_end <= v_len loop

        v_ret.extend;
        v_ret(v_ret.count) := upper(substr(r.w,v_beg,v_end-v_beg+1));

        v_end := v_end + 1;
      end loop;
      v_beg := v_beg + 1;
    end loop;


  end loop;

  return v_ret;
end Partials;
/
Here's the table that needs to be indexed:
create table table_with_words (
  words varchar2(50),
  n     number
);
And it's going to be filled with some values:
insert into table_with_words values ('apple banana tomato' , 10);
insert into table_with_words values ('chicago geneva london',20);
insert into table_with_words values ('thomas barbara mickey',30);
insert into table_with_words values ('green blue orange',    40);

commit;
Also, our self made index table is needed:
create table table_with_words_ix(
  r rowid,
  s varchar2(20)
);
And this table must be filled, of course:
insert into table_with_words_ix 
  select 
    rowid,
    column_value
  from
    table_with_words,
    table(partials(words));

commit;
Now, we want to know where we find rang:
select
  n
from
  table_with_words_ix  i
join
  table_with_words     t
  on i.r = t.rowid
where
  i.s = 'RANG';
This works on Oracle 9i. Probably, it will be possible to adapt it for 8i as well.

A note on using rowid

Daniel A. Morgan points out that a possible problem with using rowids is that it assumes that rowids are stable. This assumption is not valid with index organized tables and partitioned tables.

Links

See also On splitting a string into words with regular expressions where a function uses regexp_substr to split a string.