René Nyffenegger's collection of things on the web
René Nyffenegger on Oracle - Most wanted - Feedback
 

August 20, 2006: On splitting a string into words with regular expressions

With Oracle's regular expressions, it's relatively easy to split a string into it's words. Easy, that is, compared with a solution that doesn't use regexp_substr. It still requires some amount of setup as the following steps show...
First, I need a nested table type. This nested table type (t_vc2_50) is basically an array of varchar2(50) and will contain the words from the splitted string.
create or replace type t_vc2_50 as table of varchar2(50);
/
Then, I need a function that actually splits the string into its words:
create or replace function split_vc2(str in varchar2) return t_vc2_50 as
  ret  t_vc2_50 := t_vc2_50();
  i    number := 1;
  part varchar2(50);

begin

  loop
    
    part := regexp_substr(str, '([^[:blank:]]+)', 1,  i);
  
    exit when part is null;
  
    ret.extend;
    ret(i) := part;
    i := i+1;
  
  end loop;
  
  return ret;

end split_vc2;
/
I want a table that stores the words when I execute the function:
create table splitted_words (
  id   number primary key,
  word varchar2(50)
);
This function basically searches for continuous non white space characters. [:blank:] finds white space characters, the preceeding caret (^) negates it (which then searches for non white space characters), then, the + searches for one ore more (continuous) non white space characters.
The loop variable i, used as fourth parameter to regexp_substr, tells to give the ith matched substring.
The loop is exited when a returned substring is null, that is, when the last possible substring has been found.
Each found substring is stored at position i in the collection variable ret. At the end of the function, ret is returned.
Finally, I execute the function. The table() creates a virtual table (if I may use this word) that, in this case here, consists of one column whose type is varchar2(50). The number of rows in this virtual table is given by the number of elements in ret.
Since the column in the virtual table has no name, I have to use Oracle's predefined name column_value, which refers to that virtual column.
insert into splitted_words 
select rownum, column_value 
  from table(split_vc2(
     'Hello world, here''s    something   4 u  '
  ));
Let's check what I have inserted into splitted_words:
select * from splitted_words;
Indeed, the words are splitted on white space characters:
        ID WORD
---------- ---------------
         1 Hello
         2 world,
         3 here's
         4 something
         5 4
         6 u
It might be argued that the comma in world, should be stripped out as well, but that's left as an exercise for the reader :-).

Links

See also Poor man's text index where I created a function to split text into words without using Oracle's regular expressions (that was at the time of 8i and 9i).

More on Oracle

This is an on Oracle article. The most current articles of this series can be found here.