May 11, 2005: On using regexp_replace to format data

I have a table with raw data:

create table table_raw_data (
  txt  varchar2(50)
);

This table is filled with names and first names. Two people also have a middle name (Michelle and George). For each of these people, there is also their office location (for example 22B) and their phone extension (for example +1409 or 1457):

insert into table_raw_data values ('Steven Green    22B +1409');
insert into table_raw_data values ('Edward Allen 18A 1457');
insert into table_raw_data values ('Ruth Michelle Campbell  9D 1466');
insert into table_raw_data values ('Paul Turner +1402 13B');
insert into table_raw_data values ('Kimberly Thompson 9B 1451');
insert into table_raw_data values ('Kevin   George Young 1424 15A');
insert into table_raw_data values ('Charles  Jackson   16B 1428');
insert into table_raw_data values ('Margaret Collins +1461 22A');
insert into table_raw_data values ('Brian Moore 5C   +1472');
insert into table_raw_data values ('Betty Davis 1412   7B');
insert into table_raw_data values ('Paul Mitchell  18A +1421');
insert into table_raw_data values ('Richard   Brown 22C 4155');

Unfortunately, the data is not structured very well. Sometimes, names are delimited by a single space, sometimes, they are delimited by multiple spaces. The extension has a leading + sometimes, but not always. The office comes after or before the extension:

What I want, however, is a table that is structured like this:

create table table_formatted (
  name     varchar2(20),
  fst_name varchar2(20),
  mdl_name varchar2(20),
  office   varchar2( 3),
  phone    varchar2( 4)
);

In order to fill this table, regexp_replace does a good job:

insert into table_formatted
select
  regexp_replace(txt, pat, '\4'),
  regexp_replace(txt, pat, '\1'),
  regexp_replace(txt, pat, '\3'),
  regexp_replace(txt, '.*[[:space:]]([[:digit:]]{1,2}[[:upper:]]).*', '\1'),
  regexp_replace(txt, '.*[[:space:]]\+?([[:digit:]]{4})([[:space:]]|$).*', '\1')
from
  table_raw_data cross join (
    select
    '^([[:alpha:]]+)'               ||   -- find the 1st word (name)
    '[[:space:]]+'                  ||   -- find 1 or more spaces
    '(([[:alpha:]]+)[[:space:]]+)?' ||   -- find an optional middle name
    '([[:alpha:]]+)'                ||   -- find the last name
    '.*' pat
  from
    dual);

Now, I can query the table:

select 
  name,
  fst_name,
  mdl_name,
  office,
  phone
from 
  table_formatted;

Here's the output of the query:

NAME                 FST_NAME             MDL_NAME             OFF PHON
-------------------- -------------------- -------------------- --- ----
Green                Steven                                    22B 1409
Allen                Edward                                    18A 1457
Campbell             Ruth                 Michelle             9D  1466
Turner               Paul                                      13B 1402
Thompson             Kimberly                                  9B  1451
Young                Kevin                George               15A 1424
Jackson              Charles                                   16B 1428
Collins              Margaret                                  22A 1461
Moore                Brian                                     5C  1472
Davis                Betty                                     7B  1412
Mitchell             Paul                                      18A 1421
Brown                Richard                                   22C 4155

More on Oracle

This is an on Oracle article. The most current articles of this series can be found here.


	René Nyffenegger's collection of things on the web
	René Nyffenegger on Oracle - Most wanted - Feedback - Follow @renenyffenegger