Using Oracle LONG and RAW Data Types to Store Unstructured Data in your Database Roger Schrag, Database Specialists, Inc. Abstract The LONG, RAW, and LONG RAW data types allow you to store unstructured data inside your Oracle database—data such as graphics, sounds, EDI files, and bitmaps. There are pros and cons to storing such information in an Oracle database instead of in ordinary files on an external file system, and this presentation will list factors to consider when making such a decision. We'll look at selecting the right data type to hold your unstructured data, and at the limitations imposed by the different data types. Since data stored in LONG, RAW, and LONG RAW columns can't always be manipulated with traditional SQL methods, we'll look at techniques you can use to load, update, and search your unstructured data using SQL, PL/SQL, and OCI. Case studies and actual code samples will be used throughout the presentation to illustrate concepts and techniques. The presentation is primarily geared to Oracle 7, however significant new features in Oracle 8 for managing unstructured data are also discussed. Introduction Most data that you find in a relational database has a well defined structure. An employee table, for example, may store the first name, last name, hire date, and department for each employee. This is considered structured data because each element of data is constrained to a specific data type and must conform to business rules that can be enforced in the database. Perhaps the first and last names must be character data not to exceed fifteen bytes each, the hire date must be a valid date after February of 1990 when the company started doing business, and the department must be a valid department listed in the departments table. But suppose you want to also store a digitized image of each employee. You could add a column to the employee table to hold this data, but the database would not constrain the data stored there to a specific type or enforce any rules on it. The database would simply store such information and retrieve it on demand. This is an example of unstructured data being stored in the database. There is a wide variety of interesting unstructured data that could be stored in a database. A web site that lets you search for songs by title and then listen to the songs could store sound samples in the database along with the song titles. An EDI translator could load EDI files from business partners directly into a database for storage and future processing. A data analysis tool could store complex structures that model the data in a database for analysis on demand. (Oracle’s own ConText Option uses this approach to store information about text documents in a way that allows quick text searches.) Oracle Data Types Oracle 7 offers three data types that are especially useful for storing unstructured data. These are LONG, RAW, and LONG RAW, and they represent some of the least understood of the Oracle 7 data types. * A LONG column in a table can hold character data of virtually any size. (The limit is actually just under two gigabytes.) The LONG data type is similar to VARCHAR2, but no maximum length is specified in the declaration and the 2000 byte limit of VARCHAR2 does not apply. * A RAW column in a table can hold binary data up to 255 bytes in length. As with the VARCHAR2 data type, you must declare the maximum length up front. * A LONG RAW column in a table can hold binary data up to just under two gigabytes in size. Like the LONG, no maximum length is specified in the declaration. The LONG data type has a lot in common with LONG RAW (as VARCHAR2 does with RAW) except that LONG is used to store character data and LONG RAW is used to store binary data. Since all data is ultimately stored in binary anyhow, you might wonder what the real difference is between character and binary data. One crucial difference is character set translation. If an application running in an environment that uses one character set issues a query to an Oracle database that uses a different character set, then SQL*Net will translate the character data from the character set of the database to the character set of the application as the data is retrieved. Binary data is delivered as-is with no translation, but character data always undergoes translation when the client and server character sets are not identical. Character set translation may be more significant than it appears. Even if all of your computers use ASCII, there are different implementations of ASCII. Most Unix servers use a seven bit ASCII, while web applications use an ISO code page and Macintosh has an eight bit ASCII character set all its own. In general, you should never store binary data in a VARCHAR2 or LONG data type, because it could get confused in a heterogenous computing environment. Another important difference between character and binary data is that each offers a different set of built-in operations that may be performed. Character data can be capitalized and lower cased, for example, while binary data (starting in Oracle 7.3, anyway) can be bitwise ANDed and ORed. Limitations Imposed on Some Data Types Because LONG and LONG RAW data types are unconstrained when stored in the database, there are limitations on what Oracle can do with them. Traditional character manipulation functions like upper() and initcap() cannot be used with LONG data. Likewise, binary manipulation functions like rawtohex() and utl_raw.bit_and() cannot be used with LONG RAW data. LONG and LONG RAW data also cannot be used as criteria in a WHERE clause, and certain SQL functionality like INSERT INTO...SELECT is not supported when LONG or LONG RAW data types are involved. Further, a table may have at most one column with the LONG or LONG RAW data type, and Oracle’s Replication Option will not replicate LONG or LONG RAW data.. Prior to Oracle 7.3, the OCI does not support piecewise variable binding for LONG and LONG RAW data. This means that if an application wants to insert a large image into the database, for example, it must first store the entire image in one contiguous block of memory before issuing the INSERT statement to Oracle. (Beginning in Oracle 7.3, an application can submit the data in separate pieces that will be concatenated when inserted into the database.) Additional Data Types in Oracle 8 Oracle 8 offers a whole new set of data types that will prove useful for storing unstructured data in the database: * CLOB: large character data from a single-byte character set * NCLOB: large character data, possibly multi-byte * BLOB: large binary data * BFILE: large binary data, with the data itself stored entirely outside the database CLOBs and NCLOBs are similar to LONGs in Oracle 7 (and BLOBs are similar to LONG RAWs), but there are some key differences: The new data types, collectively called LOBs, are stored much more efficiently than LONGs and LONG RAWs. Also a new PL/SQL package called DBMS_LOB makes LOBs much easier to work with in PL/SQL and new OCI calls make LOBs easier for external applications to work with. Oracle 8 supports piecewise random access of LOBs, allowing applications to access just the part of a LOB they are interested in instead of having to fetch an entire LOB. The BFILE data type allows you to store just a pointer to the data in the database, with the data itself stored in a file outside the database. In this case the database will not be able to ensure the integrity or availability of the binary data. Should You Store Unstructured Data in Oracle? Most applications have a need for at least some unstructured data. The question of where to store this data often comes up during application design. You could use some of the data types described in the previous section to store the unstructured data entirely in the database, or you could store the unstructured data in operating system files outside the database and store just the names of the files in the database. Suppose you are building a search application for a web site where the user can get a list of web pages that contain a certain word or phrase. You could store the actual documents to be searched inside the database in a table created as follows: CREATE TABLE web_pages ( page_id NUMBER PRIMARY KEY, url VARCHAR2(2000) NOT NULL, title VARCHAR2(2000), content LONG NOT NULL ); Another example would be an application where you could query a song by its title and retrieve the audio for the song in one of two different formats. This time, the unstructured data will be stored outside the database, with a file name stored in the database so that the sound files for a given song can be found quickly: CREATE TABLE songs ( song_id NUMBER PRIMARY KEY, song_title VARCHAR2(2000) NOT NULL, real_audio_file_name VARCHAR2(255), au_format_file_name VARCHAR2(255) ); There are several advantages and disadvantages to consider when deciding if you should store unstructured data inside an Oracle database. We’ll explore some of them in the following two sections. Advantages Storing unstructured data in an Oracle database gives you the security of knowing the data is permanent and won’t be lost. Oracle’s robust redo log facility assures you that the data will be recoverable should a hardware or software failure occur; the unstructured data is permanent just like the structured data. Unstructured data stored in a database falls under the database’s transaction control, ensuring read consistency and data integrity. This means that the unstructured data will never suddenly change in the midst of a transaction. For example, suppose you store dozens of graphical images in the database for use on web pages. If you need to update the graphics to give your site a new look, you can perform the updates all in one transaction and be assured that no end user will see a page that has some of the old graphics and some of the new ones because their page was served while you were updating the images. Finally, storing unstructured data in the database makes it easier for you to access and manipulate the data with SQL and PL/SQL. You can write queries that retrieve unstructured data along with the structured data, for example. Prior to Oracle 7.3, PL/SQL did not offer a straightforward way to access operating system files outside the database, so this sort of query was not possible without writing an external daemon in a language such as C. Even with Oracle 7.3, file access is still limited to sequential access of character data only; random access and binary files are not supported. Thus it is much easier to manipulate unstructured data with SQL and PL/SQL if the data is stored in the database. Disadvantages The LONG, RAW, and LONG RAW data types have many limitations as we saw earlier, so data stored in these data types can be hard to work with. Another issue is chaining of rows in a table. In Oracle 7, data is stored in logical units called blocks. Block size is chosen when a database is created, and is typically 2K, 4K, 8K, or 16K. When you store a row of data in a table and the entire row won’t fit inside one block, then part of the row is stored in one block and a pointer links the block to another block where more of the row is stored. This is called a chained row. A 60K graphic image stored in a database with an 8K block size could actually be spread across eight different blocks, as an example. These blocks could reside on different physical areas of the disk, causing retrieval to be very slow as the disk head needs to seek to eight different locations. In this case, it would be much faster to retrieve the image if it were stored in an operating system file that was stored contiguously in one spot on the disk. Another concern is interleaving of structured and unstructured data. In Oracle 7, unstructured data is stored in the same blocks as the structured data. Each row of an employee table that includes employee photographs, for example, might require 50 bytes to store structured data (such as employee name and department) and 2000 bytes to store the image. To retrieve the names of 100 employees would require accessing 205K of data because the employee names are sandwiched in between the images. If the images were stored elsewhere, this query would only require accessing 5K of data. Also, Oracle’s robust transaction control and read consistency model comes at a cost. Each Oracle block has an area set aside for transaction control, and this means that you can’t fit 8K of data in an 8K Oracle block. So unstructured data takes up more storage space when stored in Oracle instead of operating system files. Disks are cheap these days, but also think about the extra memory you’d want to devote to the buffer cache, and the extra I/Os that will be necessary to access the data. Oracle’s robust recoverability comes at a high cost, too. When data is inserted into a table in Oracle, the data is written to a rollback segment and the redo log in addition to being written into the table itself. This causes a lot of extra disk I/O when loading data into an Oracle database—up to three times as much as simply writing the data to operating system files. Oracle’s backup methods can also be extremely inefficient when large volumes of unstructured data that changes now and then is stored in the database. Complete database recoveries can only be made from backups taken at the tablespace level, meaning that you can’t backup only the individual data that has changed since the last backup. You can mark an entire tablespace as read-only and then you don’t need to back it up more than once, but as soon as you want to update one record of data in that tablespace, you will need to back up the entire tablespace again. This means that if you store large volumes of unstructured data in your database and some of this data changes from time to time, then backing up this data will be much more time consuming than if the data had been stored in operating system files outside the database. This is because most operating systems allow you to backup only files that have changed since the last backup, while Oracle has the “all or nothing” approach at the tablespace level. Yet another concern about storing unstructured data in the database is that it can lead to database instability. As we discussed earlier, long pieces of unstructured data can lead to many chained rows in the database. It’s an unfortunate fact of life that the Oracle database can become unstable in situations where huge numbers of chained rows exist. Oracle has definitely fixed a lot of bugs in this area over the years, but my clients have consistently run into stability problems when storing large volumes of frequently changing large unstructured data in Oracle. As recently as Oracle release 7.3.2.1.0 running on Sun Solaris 2.5.1, I’ve had to help clients rebuild tables and recreate data lost to ORA-00600 errors and interesting corruptions such as an 1186 byte value appearing in a VARCHAR2(1) column. Manipulating Unstructured Data in Oracle First, a schema design tip: If you choose to store unstructured data in Oracle using the LONG or LONG RAW data types, store the unstructured data in a separate table used just for that purpose. The table should hold only the unstructured data itself and a foreign key to link the data to structured data. This foreign key should also be the primary key for the table. Use Oracle’s declared integrity constraints in order to ensure integrity between the structured and unstructure data. This strategy can improve performance when accessing the structured data (by reducing chaining and interleaving with unstructured data) and will confine any potential database instability to the unstructured data only. Sample employee tables might be created as follows: CREATE TABLE employees ( employee_id NUMBER(9) PRIMARY KEY, first_name VARCHAR2(15) NOT NULL, last_name VARCHAR2(15) NOT NULL, hire_date DATE NOT NULL, dept_id NUMBER(9) NOT NULL ); CREATE TABLE employee_photos ( employee_id NUMBER(9) PRIMARY KEY, photo_image LONG RAW NOT NULL, CONSTRAINT employee_photos_employee FOREIGN KEY (employee_id) REFERENCES employees ON DELETE CASCADE ); Following are some tips you can use when manipulating unstructured data. These methods will help you get around the limitations Oracle imposes on the RAW, LONG, and LONG RAW data types. Using SQL and PL/SQL VARCHAR2 and RAW variables in PL/SQL blocks may be declared with lengths up to 32,767 bytes. These variables may be manipulated within PL/SQL statements with all of the regular built-in character and binary functions, even though they can have lengths longer than the usual 2000 or 255 bytes respectively. This can allow you to process longer data, as long as it is still shorter than 32K. Figure 1 shows a simple PL/SQL function that can search for text in LONGs up to 32,767 bytes in length. CREATE TABLE html_page_texts ( page_id NUMBER PRIMARY KEY, html_text LONG ); CREATE FUNCTION my_instr ( search_page IN NUMBER, search_text IN VARCHAR2, start_pos IN NUMBER DEFAULT 1 ) RETURN NUMBER AS data VARCHAR2(32767); BEGIN SELECT html_text INTO data FROM html_page_texts WHERE page_id = search_page; RETURN INSTR (data, search_text, start_pos); END my_instr; Figure 1: A PL/SQL function that can search for text in a LONG column, provided the LONG data never exceeds 32,767 bytes in length. The my_instr() function show in Figure 1 is simplistic and limited in usefulness, but offers an example of the possibilities. Of course, this technique only works when your LONG or LONG RAW data won’t exceed 32K in length. Starting in Oracle release 7.2, there is a way to fetch LONG data of arbitrary length in PL/SQL by fetching and processing the data in parts that are each of a managable size. Figure 2 shows an enhanced version of the my_instr() function, now supporting LONG data of virtually any length. CREATE FUNCTION my_instr2 ( search_page IN NUMBER, search_text IN VARCHAR2, start_pos IN NUMBER DEFAULT 1 ) RETURN NUMBER AS c NUMBER := dbms_sql.open_cursor; i NUMBER; pos NUMBER := start_pos; len NUMBER; data VARCHAR2(32767); BEGIN dbms_sql.parse (c, 'SELECT html_text FROM html_page_texts WHERE page_id = :p', dbms_sql.native); dbms_sql.bind_variable (c, 'p', search_page); dbms_sql.define_column_long (c, 1); i := dbms_sql.execute_and_fetch (c); IF i = 1 THEN LOOP dbms_sql.column_value_long (c, 1, 32767, pos, data, len); i := INSTR (data, search_text); IF i > 0 THEN dbms_sql.close_cursor (c); RETURN i + pos - 1; END IF; EXIT WHEN len < 32767; pos := pos + 32767 - LENGTH (search_text); END LOOP; END IF; dbms_sql.close_cursor (c); RETURN NULL; END my_instr2; Figure 2: A PL/SQL function that can search for text in LONG data of virtually any length. Prior to Oracle 7.3, there is not a whole lot you can do to manipulate binary data. You can use the hextoraw() and rawtohex() functions to convert binary data into hexadecimal character strings, but that’s about it. Starting in Oracle 7.3, though, you get the UTL_RAW built-in package which allows you to bitwise AND and OR, concatenate, pattern match, and complement binary data among other things. (UTL_RAW might not be covered in your documentation and it may not be installed by the Oracle Installer, but look for the file utlraw.sql in your rdbms/admin directory; the package specification contains enough comments to get you going.) Remember that beginning in Oracle release 7.1 you can embed PL/SQL functions in ordinary SQL statements. This allows you to use the UTL_RAW functions in ordinary SQL. Here’s a simple example: CREATE TABLE application_users ( user_id NUMBER PRIMARY KEY, user_name VARCHAR2(30), session_state RAW(10) ); SELECT user_id, user_name FROM application_users WHERE UTL_RAW.COMPARE (session_state, hextoraw ('FF0088A3B5')) IN (4, 5); The ability to embed PL/SQL functions into straight SQL statements can be combined with the ability to better manipulate LONG and LONG RAW data in PL/SQL to allow accessing LONG data in the WHERE clause of a SQL statement. Using the my_instr() function shown in Figure 1, here is a SQL statement that searches LONG data in the WHERE clause: SELECT page_id FROM html_page_texts WHERE my_instr (page_id, '