Using Oracle LONG and RAW Data Types to Store Unstructured Data in your Database Roger Schrag, Database Specialists, Inc. Abstract The LONG, RAW, and LONG RAW data types allow you to store unstructured data inside your Oracle database—data such as graphics, sounds, EDI files, and bitmaps. There are pros and cons to storing such information in an Oracle database instead of in ordinary files on an external file system, and this presentation will list factors to consider when making such a decision. We'll look at selecting the right data type to hold your unstructured data, and at the limitations imposed by the different data types. Since data stored in LONG, RAW, and LONG RAW columns can't always be manipulated with traditional SQL methods, we'll look at techniques you can use to load, update, and search your unstructured data using SQL, PL/SQL, and OCI. Case studies and actual code samples will be used throughout the presentation to illustrate concepts and techniques. The presentation is primarily geared to Oracle 7, however significant new features in Oracle 8 for managing unstructured data are also discussed. Introduction Most data that you find in a relational database has a well defined structure. An employee table, for example, may store the first name, last name, hire date, and department for each employee. This is considered structured data because each element of data is constrained to a specific data type and must conform to business rules that can be enforced in the database. Perhaps the first and last names must be character data not to exceed fifteen bytes each, the hire date must be a valid date after February of 1990 when the company started doing business, and the department must be a valid department listed in the departments table. But suppose you want to also store a digitized image of each employee. You could add a column to the employee table to hold this data, but the database would not constrain the data stored there to a specific type or enforce any rules on it. The database would simply store such information and retrieve it on demand. This is an example of unstructured data being stored in the database. There is a wide variety of interesting unstructured data that could be stored in a database. A web site that lets you search for songs by title and then listen to the songs could store sound samples in the database along with the song titles. An EDI translator could load EDI files from business partners directly into a database for storage and future processing. A data analysis tool could store complex structures that model the data in a database for analysis on demand. (Oracle’s own ConText Option uses this approach to store information about text documents in a way that allows quick text searches.) Oracle Data Types Oracle 7 offers three data types that are especially useful for storing unstructured data. These are LONG, RAW, and LONG RAW, and they represent some of the least understood of the Oracle 7 data types. * A LONG column in a table can hold character data of virtually any size. (The limit is actually just under two gigabytes.) The LONG data type is similar to VARCHAR2, but no maximum length is specified in the declaration and the 2000 byte limit of VARCHAR2 does not apply. * A RAW column in a table can hold binary data up to 255 bytes in length. As with the VARCHAR2 data type, you must declare the maximum length up front. * A LONG RAW column in a table can hold binary data up to just under two gigabytes in size. Like the LONG, no maximum length is specified in the declaration. The LONG data type has a lot in common with LONG RAW (as VARCHAR2 does with RAW) except that LONG is used to store character data and LONG RAW is used to store binary data. Since all data is ultimately stored in binary anyhow, you might wonder what the real difference is between character and binary data. One crucial difference is character set translation. If an application running in an environment that uses one character set issues a query to an Oracle database that uses a different character set, then SQL*Net will translate the character data from the character set of the database to the character set of the application as the data is retrieved. Binary data is delivered as-is with no translation, but character data always undergoes translation when the client and server character sets are not identical. Character set translation may be more significant than it appears. Even if all of your computers use ASCII, there are different implementations of ASCII. Most Unix servers use a seven bit ASCII, while web applications use an ISO code page and Macintosh has an eight bit ASCII character set all its own. In general, you should never store binary data in a VARCHAR2 or LONG data type, because it could get confused in a heterogenous computing environment. Another important difference between character and binary data is that each offers a different set of built-in operations that may be performed. Character data can be capitalized and lower cased, for example, while binary data (starting in Oracle 7.3, anyway) can be bitwise ANDed and ORed. Limitations Imposed on Some Data Types Because LONG and LONG RAW data types are unconstrained when stored in the database, there are limitations on what Oracle can do with them. Traditional character manipulation functions like upper() and initcap() cannot be used with LONG data. Likewise, binary manipulation functions like rawtohex() and utl_raw.bit_and() cannot be used with LONG RAW data. LONG and LONG RAW data also cannot be used as criteria in a WHERE clause, and certain SQL functionality like INSERT INTO...SELECT is not supported when LONG or LONG RAW data types are involved. Further, a table may have at most one column with the LONG or LONG RAW data type, and Oracle’s Replication Option will not replicate LONG or LONG RAW data.. Prior to Oracle 7.3, the OCI does not support piecewise variable binding for LONG and LONG RAW data. This means that if an application wants to insert a large image into the database, for example, it must first store the entire image in one contiguous block of memory before issuing the INSERT statement to Oracle. (Beginning in Oracle 7.3, an application can submit the data in separate pieces that will be concatenated when inserted into the database.) Additional Data Types in Oracle 8 Oracle 8 offers a whole new set of data types that will prove useful for storing unstructured data in the database: * CLOB: large character data from a single-byte character set * NCLOB: large character data, possibly multi-byte * BLOB: large binary data * BFILE: large binary data, with the data itself stored entirely outside the database CLOBs and NCLOBs are similar to LONGs in Oracle 7 (and BLOBs are similar to LONG RAWs), but there are some key differences: The new data types, collectively called LOBs, are stored much more efficiently than LONGs and LONG RAWs. Also a new PL/SQL package called DBMS_LOB makes LOBs much easier to work with in PL/SQL and new OCI calls make LOBs easier for external applications to work with. Oracle 8 supports piecewise random access of LOBs, allowing applications to access just the part of a LOB they are interested in instead of having to fetch an entire LOB. The BFILE data type allows you to store just a pointer to the data in the database, with the data itself stored in a file outside the database. In this case the database will not be able to ensure the integrity or availability of the binary data. Should You Store Unstructured Data in Oracle? Most applications have a need for at least some unstructured data. The question of where to store this data often comes up during application design. You could use some of the data types described in the previous section to store the unstructured data entirely in the database, or you could store the unstructured data in operating system files outside the database and store just the names of the files in the database. Suppose you are building a search application for a web site where the user can get a list of web pages that contain a certain word or phrase. You could store the actual documents to be searched inside the database in a table created as follows: CREATE TABLE web_pages ( page_id NUMBER PRIMARY KEY, url VARCHAR2(2000) NOT NULL, title VARCHAR2(2000), content LONG NOT NULL ); Another example would be an application where you could query a song by its title and retrieve the audio for the song in one of two different formats. This time, the unstructured data will be stored outside the database, with a file name stored in the database so that the sound files for a given song can be found quickly: CREATE TABLE songs ( song_id NUMBER PRIMARY KEY, song_title VARCHAR2(2000) NOT NULL, real_audio_file_name VARCHAR2(255), au_format_file_name VARCHAR2(255) ); There are several advantages and disadvantages to consider when deciding if you should store unstructured data inside an Oracle database. We’ll explore some of them in the following two sections. Advantages Storing unstructured data in an Oracle database gives you the security of knowing the data is permanent and won’t be lost. Oracle’s robust redo log facility assures you that the data will be recoverable should a hardware or software failure occur; the unstructured data is permanent just like the structured data. Unstructured data stored in a database falls under the database’s transaction control, ensuring read consistency and data integrity. This means that the unstructured data will never suddenly change in the midst of a transaction. For example, suppose you store dozens of graphical images in the database for use on web pages. If you need to update the graphics to give your site a new look, you can perform the updates all in one transaction and be assured that no end user will see a page that has some of the old graphics and some of the new ones because their page was served while you were updating the images. Finally, storing unstructured data in the database makes it easier for you to access and manipulate the data with SQL and PL/SQL. You can write queries that retrieve unstructured data along with the structured data, for example. Prior to Oracle 7.3, PL/SQL did not offer a straightforward way to access operating system files outside the database, so this sort of query was not possible without writing an external daemon in a language such as C. Even with Oracle 7.3, file access is still limited to sequential access of character data only; random access and binary files are not supported. Thus it is much easier to manipulate unstructured data with SQL and PL/SQL if the data is stored in the database. Disadvantages The LONG, RAW, and LONG RAW data types have many limitations as we saw earlier, so data stored in these data types can be hard to work with. Another issue is chaining of rows in a table. In Oracle 7, data is stored in logical units called blocks. Block size is chosen when a database is created, and is typically 2K, 4K, 8K, or 16K. When you store a row of data in a table and the entire row won’t fit inside one block, then part of the row is stored in one block and a pointer links the block to another block where more of the row is stored. This is called a chained row. A 60K graphic image stored in a database with an 8K block size could actually be spread across eight different blocks, as an example. These blocks could reside on different physical areas of the disk, causing retrieval to be very slow as the disk head needs to seek to eight different locations. In this case, it would be much faster to retrieve the image if it were stored in an operating system file that was stored contiguously in one spot on the disk. Another concern is interleaving of structured and unstructured data. In Oracle 7, unstructured data is stored in the same blocks as the structured data. Each row of an employee table that includes employee photographs, for example, might require 50 bytes to store structured data (such as employee name and department) and 2000 bytes to store the image. To retrieve the names of 100 employees would require accessing 205K of data because the employee names are sandwiched in between the images. If the images were stored elsewhere, this query would only require accessing 5K of data. Also, Oracle’s robust transaction control and read consistency model comes at a cost. Each Oracle block has an area set aside for transaction control, and this means that you can’t fit 8K of data in an 8K Oracle block. So unstructured data takes up more storage space when stored in Oracle instead of operating system files. Disks are cheap these days, but also think about the extra memory you’d want to devote to the buffer cache, and the extra I/Os that will be necessary to access the data. Oracle’s robust recoverability comes at a high cost, too. When data is inserted into a table in Oracle, the data is written to a rollback segment and the redo log in addition to being written into the table itself. This causes a lot of extra disk I/O when loading data into an Oracle database—up to three times as much as simply writing the data to operating system files. Oracle’s backup methods can also be extremely inefficient when large volumes of unstructured data that changes now and then is stored in the database. Complete database recoveries can only be made from backups taken at the tablespace level, meaning that you can’t backup only the individual data that has changed since the last backup. You can mark an entire tablespace as read-only and then you don’t need to back it up more than once, but as soon as you want to update one record of data in that tablespace, you will need to back up the entire tablespace again. This means that if you store large volumes of unstructured data in your database and some of this data changes from time to time, then backing up this data will be much more time consuming than if the data had been stored in operating system files outside the database. This is because most operating systems allow you to backup only files that have changed since the last backup, while Oracle has the “all or nothing” approach at the tablespace level. Yet another concern about storing unstructured data in the database is that it can lead to database instability. As we discussed earlier, long pieces of unstructured data can lead to many chained rows in the database. It’s an unfortunate fact of life that the Oracle database can become unstable in situations where huge numbers of chained rows exist. Oracle has definitely fixed a lot of bugs in this area over the years, but my clients have consistently run into stability problems when storing large volumes of frequently changing large unstructured data in Oracle. As recently as Oracle release 7.3.2.1.0 running on Sun Solaris 2.5.1, I’ve had to help clients rebuild tables and recreate data lost to ORA-00600 errors and interesting corruptions such as an 1186 byte value appearing in a VARCHAR2(1) column. Manipulating Unstructured Data in Oracle First, a schema design tip: If you choose to store unstructured data in Oracle using the LONG or LONG RAW data types, store the unstructured data in a separate table used just for that purpose. The table should hold only the unstructured data itself and a foreign key to link the data to structured data. This foreign key should also be the primary key for the table. Use Oracle’s declared integrity constraints in order to ensure integrity between the structured and unstructure data. This strategy can improve performance when accessing the structured data (by reducing chaining and interleaving with unstructured data) and will confine any potential database instability to the unstructured data only. Sample employee tables might be created as follows: CREATE TABLE employees ( employee_id NUMBER(9) PRIMARY KEY, first_name VARCHAR2(15) NOT NULL, last_name VARCHAR2(15) NOT NULL, hire_date DATE NOT NULL, dept_id NUMBER(9) NOT NULL ); CREATE TABLE employee_photos ( employee_id NUMBER(9) PRIMARY KEY, photo_image LONG RAW NOT NULL, CONSTRAINT employee_photos_employee FOREIGN KEY (employee_id) REFERENCES employees ON DELETE CASCADE ); Following are some tips you can use when manipulating unstructured data. These methods will help you get around the limitations Oracle imposes on the RAW, LONG, and LONG RAW data types. Using SQL and PL/SQL VARCHAR2 and RAW variables in PL/SQL blocks may be declared with lengths up to 32,767 bytes. These variables may be manipulated within PL/SQL statements with all of the regular built-in character and binary functions, even though they can have lengths longer than the usual 2000 or 255 bytes respectively. This can allow you to process longer data, as long as it is still shorter than 32K. Figure 1 shows a simple PL/SQL function that can search for text in LONGs up to 32,767 bytes in length. CREATE TABLE html_page_texts ( page_id NUMBER PRIMARY KEY, html_text LONG ); CREATE FUNCTION my_instr ( search_page IN NUMBER, search_text IN VARCHAR2, start_pos IN NUMBER DEFAULT 1 ) RETURN NUMBER AS data VARCHAR2(32767); BEGIN SELECT html_text INTO data FROM html_page_texts WHERE page_id = search_page; RETURN INSTR (data, search_text, start_pos); END my_instr; Figure 1: A PL/SQL function that can search for text in a LONG column, provided the LONG data never exceeds 32,767 bytes in length. The my_instr() function show in Figure 1 is simplistic and limited in usefulness, but offers an example of the possibilities. Of course, this technique only works when your LONG or LONG RAW data won’t exceed 32K in length. Starting in Oracle release 7.2, there is a way to fetch LONG data of arbitrary length in PL/SQL by fetching and processing the data in parts that are each of a managable size. Figure 2 shows an enhanced version of the my_instr() function, now supporting LONG data of virtually any length. CREATE FUNCTION my_instr2 ( search_page IN NUMBER, search_text IN VARCHAR2, start_pos IN NUMBER DEFAULT 1 ) RETURN NUMBER AS c NUMBER := dbms_sql.open_cursor; i NUMBER; pos NUMBER := start_pos; len NUMBER; data VARCHAR2(32767); BEGIN dbms_sql.parse (c, 'SELECT html_text FROM html_page_texts WHERE page_id = :p', dbms_sql.native); dbms_sql.bind_variable (c, 'p', search_page); dbms_sql.define_column_long (c, 1); i := dbms_sql.execute_and_fetch (c); IF i = 1 THEN LOOP dbms_sql.column_value_long (c, 1, 32767, pos, data, len); i := INSTR (data, search_text); IF i > 0 THEN dbms_sql.close_cursor (c); RETURN i + pos - 1; END IF; EXIT WHEN len < 32767; pos := pos + 32767 - LENGTH (search_text); END LOOP; END IF; dbms_sql.close_cursor (c); RETURN NULL; END my_instr2; Figure 2: A PL/SQL function that can search for text in LONG data of virtually any length. Prior to Oracle 7.3, there is not a whole lot you can do to manipulate binary data. You can use the hextoraw() and rawtohex() functions to convert binary data into hexadecimal character strings, but that’s about it. Starting in Oracle 7.3, though, you get the UTL_RAW built-in package which allows you to bitwise AND and OR, concatenate, pattern match, and complement binary data among other things. (UTL_RAW might not be covered in your documentation and it may not be installed by the Oracle Installer, but look for the file utlraw.sql in your rdbms/admin directory; the package specification contains enough comments to get you going.) Remember that beginning in Oracle release 7.1 you can embed PL/SQL functions in ordinary SQL statements. This allows you to use the UTL_RAW functions in ordinary SQL. Here’s a simple example: CREATE TABLE application_users ( user_id NUMBER PRIMARY KEY, user_name VARCHAR2(30), session_state RAW(10) ); SELECT user_id, user_name FROM application_users WHERE UTL_RAW.COMPARE (session_state, hextoraw ('FF0088A3B5')) IN (4, 5); The ability to embed PL/SQL functions into straight SQL statements can be combined with the ability to better manipulate LONG and LONG RAW data in PL/SQL to allow accessing LONG data in the WHERE clause of a SQL statement. Using the my_instr() function shown in Figure 1, here is a SQL statement that searches LONG data in the WHERE clause: SELECT page_id FROM html_page_texts WHERE my_instr (page_id, 'Database Specialists, Inc. ') > 0; User beware: Referencing a PL/SQL function in a SQL statement is not as efficient as referencing a built-in function, and it’s still not possible to index LONG or LONG RAW data. So while a lot of things are possible, they may not always be very efficient! Using SQL*Plus SQL*Plus has an environment setting called LONG that dictates how many bytes of a LONG column should be retrieved from the database in a query. On most platforms, SQL*Plus defaults LONG to 80, meaning that when you query a LONG column from the database you will only see the first 80 bytes of the LONG value. If you want to see more of the data in your LONG columns, use the SET LONG command to increase this setting. But this can lead to another problem: SQL*Plus fetches data from the database several rows at a time in order to reduce network overhead. Before submitting a query, SQL*Plus allocates a buffer in memory to hold the data that will come back. If you set LONG too high, or if you try to fetch lots of columns from a table, SQL*Plus may give you a “buffer overflow” error. This means that SQL*Plus didn’t allocate enough memory to store several rows at once during a fetch—all of the rows in one set could take up too many bytes. The solution to this problem is to use the SET ARRAYSIZE command to reduce the number of rows SQL*Plus tries to fetch at once, or use the SET MAXDATA command to allocate more memory for SQL*Plus to use during the fetch. Figure 3 shows an example of using SET LONG and SET ARRAYSIZE to allow you to better access LONG data in SQL*Plus. SQL> SELECT * 2 FROM sys.dba_views 3 WHERE owner = 'SYS' 4 AND view_name = 'USER_DB_LINKS'; OWNER VIEW_NAME TEXT_LENGTH ------------------------------ ------------------------------ ----------- TEXT ------------------------------------------------------------------------------- SYS USER_DB_LINKS 107 select l.name, l.userid, l.password, l.host, l.ctime from sys.link$ l where l.ow SQL> SET LONG 5000 SQL> LIST 1 SELECT * 2 FROM sys.dba_views 3 WHERE owner = 'SYS' 4* AND view_name = 'USER_DB_LINKS' SQL> / buffer overflow. Use SET command to reduce ARRAYSIZE or increase MAXDATA. SQL> SET ARRAYSIZE 1 SQL> LIST 1 SELECT * 2 FROM sys.dba_views 3 WHERE owner = 'SYS' 4* AND view_name = 'USER_DB_LINKS' SQL> / OWNER VIEW_NAME TEXT_LENGTH ------------------------------ ------------------------------ ----------- TEXT ------------------------------------------------------------------------------- SYS USER_DB_LINKS 107 select l.name, l.userid, l.password, l.host, l.ctime from sys.link$ l where l.owner# = userenv('SCHEMAID') SQL> Figure 3: Using SET LONG and SET ARRAYSIZE to allow you to better access LONG data in SQL*Plus. Unfortunately, SQL*Plus is very limited in how it can access binary data. There is no support whatsoever for the LONG RAW data type. RAW data, meanwhile, must be queried or manipulated as hexadecimal strings. For example: CREATE TABLE members ( member_id NUMBER PRIMARY KEY, login_name VARCHAR2(10) NOT NULL, password RAW(10) NOT NULL ); INSERT INTO members (member_id, login_name, password) VALUES (1001, 'rschrag', HEXTORAW ('8E9F24A6EE001743C45D')); Oracle Call Interface The OCI allows applications external to the database to retrieve LONG data from the database piecewise. This means that the application does not need to allocate a large contiguous buffer in memory to receive an entire LONG value all in one shot. It also allows applications random access into the data. This can greatly reduce network traffic because the appliction can selectively retrieve just the data it desires, instead of having to retrieve the entire LONG value. Starting in Oracle 7.3, the OCI also supports piecewise access for binding of variable values. This means that in the VALUES clause of an INSERT statement or the SET clause of an UPDATE statement, applications can provide the data to Oracle in parts. For example, suppose an application will read an EDI file from the operating system and save the contents into a LONG column of a database table. Prior to Oracle 7.3, the application would need to read the entire file into a contiguous buffer in memory before submitting the INSERT statement to the database. The new functionality allows the same application to instead read the file in managable pieces and submit each piece separately to Oracle, reducing demands for memory. When working with character data in OCI or Pro*C applications, watch out for character set translation. As we discussed earlier, it can come into play in places you may not expect. Every Oracle database has a character set which is chosen at the time the database is created. You can see your database’s character set with the following query: SELECT value FROM SYS.nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET'; Also, every application that creates a session with Oracle specifies the character set of the client environment. (This is usually set with shell variables or the Windows registry and varies by client platform.) If the client and the database do not share the same character set, then Oracle will perform character set translation—and this could lead to unwanted manipulation of your data. Here’s a strange, but true example: A company built a comprehensive database of commercially available musical recordings. They initially used an Oracle V6 database running on the Macintosh platform so that they could take advantage of the Mac’s more complete character set with regard to bullets, special quotes, and European alphabetic characters. They eventually migrated to an Oracle 7.2 database running on Unix, but they retained the Mac character set as the database character set. Since the client applications ran on Macs, both the client and database used the same character set and the application ran well. However, the company backed up its databases by running Oracle’s export utility on the Unix machine. On most Unix platforms, the default character set is seven bit ASCII, and this company’s server was no exception. Since the database used the Mac character set and the client appliction (export in this case) used seven bit ASCII, Oracle “translated” the special Mac characters into seven bit ASCII while writing the export file. Since seven bit ASCII does not offer equivalents for many of the special Mac characters, question marks were substituated in their place. The export file now did not represent the data because many quotes and apostrophes and other special characters in the data had been replaced by question marks. The problem was ultimately resolved by setting Unix environment variables to trick Oracle into thinking the export was being taken in a Mac character set environment and thus disabling character set translation. Granted, this was a pretty exotic situation, but it demonstrates the possible issues character set translation can create. Conclusion Oracle provides a variety of data types to allow storage of unstructured data in a relational database. Storing your unstructured data in Oracle gives you the benefits of recoverability, transaction control, and access from SQL and PL/SQL. But these benefits come at some cost—less efficient physical storage, slower loading due to redo log and rollback segment I/O, and slower backups to name a few. Ultimately the pros and cons must be considered on an application by application basis in order to determine whether or not to store unstructured data in the database instead of operating system files outside the database. As a high level generalization, though, it seems that if you will have large volumes of sometimes changing unstructured data you probably should not store it in an Oracle 7 database. (If you’re using Oracle 8 you might want to try it.) At the other end of the spectrum, if you have small pieces of unstructured data that change frequently and must be kept synchronized with structured data and transactions, then it would be beneficial to store the unstructured data in either an Oracle 7 or Oracle 8 database. Clearly, the trend is moving toward storing more in your database and less in operating system files. Back in Oracle V6, LONGs and LONG RAWs were limited to 64K in size. Oracle 7.0 raised the bar to two gigabytes. Oracle 7.2 brought piecewise fetching of arbitrary length LONG data, and Oracle 7.3 brought PL/SQL functions for manipulating binary data. Oracle 8 took support for unstructured data to a new level with the new LOB family of data types, the DBMS_LOB built-in package, and more efficient physical storage. The steady progress Oracle is making in the area of support for unstructured data means more exciting applications for Oracle databases in the future. This is great news for all those committed to Oracle technology. The author is a DBA who has been working with Oracle database technology since Oracle V5.1 in the late 1980s. Although identifying characteristics have been changed to protect client confidentiality, the anecdotes and code samples presented here are all based on real life situations encountered by Oracle users. If you have any questions or comments, you may contact the author by email at rschrag@dbspecialists.com.