Friday, January 25, 2008

Regular Expressions in Oracle Database 10g

by Fábio Souza

Hello everybody!

This is my debut post to Java 2 Go! (ironically it's not about Java) and I'm really happy for having been invited to write by Eduardo. I hope you will enjoy as much as I am!

As the title says, this is a short how-to about the use of regular expressions in Oracle Database. This feature was introduced in Oracle 10g and is really helpful, useful and, better off, not too complicated. Let's take a look.

There are three built-in SQL functions and one operator for working with regular expressions in Oracle 10g:
  • OPERATOR:

    • REGEXP_LIKE
      It's used in the WHERE clause with column filtering purpose. The filter is the informed pattern.

  • FUNCTIONS:

    • REGEXP_INSTR
      Returns the initial position where the informed pattern was found.

    • REGEXP_SUBSTR
      Returns the result of the informed pattern application.

    • REGEXP_REPLACE
      Replaces the informed pattern with another given string.

NOTE: pattern is the regular expression which is used to match specific regions of a string.

This example will show only the REGEXP_REPLACE function and the intention is to return only the value of field "Login".

-- Description:
-- First argument: string which the pattern will be applied to (it's possible to use a column instead).
-- Second argument: the pattern.
-- Third argument: new string that will replace the portion of the original string matching the given pattern.

select REGEXP_REPLACE(
   'ID = 13737009 - Login = T123456 - Status nonAdmin - WorkingPlace = HeadQuarter - Sector = B',
   '(.+)(Login = )([^ ]+)(.+)',
   '\3')
from dual;


The pattern used above defines four groups between parenthesis:
  1. "(.+)" matches 1 or more characters

  2. "(Login = )" matches the exact substring "Login = "

  3. "([^ ]+)" matches 1 or more character different from " " (space)

  4. "(.+)" again, matches 1 or more characters

In our example, the result for each group above is:
  1. "ID = 13737009 - "

  2. "Login = "

  3. "T123456"

  4. " - Status nonAdmin - WorkingPlace = HeadQuarter - Sector = B"

For the simple query above, the result of function "REGEXP_REPLACE" will be the replacement of the substring matching the given pattern with "\3". The presence of character "\" tells the function to replace "\3" with the substring matching the third group defined in the given pattern, which is "([^ ]+)". As you may notice, in this case, the given pattern matches the entire string so, at the end, the entire string will be replaced with "T123456".

Despite the beauty of this, there's one thing we should always keep in mind: performance. As we already know, applying functions to table columns in WHERE clauses typically makes Oracle's choice optimizer to ignore any index on that particular column which may result in a table full scan. If we're dealing with big volumes, this might be a huge concern. Thank God Oracle gives us some options like Partitioning and Parallel Queries.

That's all for now. If you'd like to get further information on the subject I recommend reading this article from OTN:

http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html

Cheers! (and keep reading...)

0 comments: