Chapter 2. XML Databases
An XML document consists of both data and tags. The tags give a context for the data, so that we can figure out whether the data item, “union”, for example, is referring to a pipefitting, a labor organization, a data structure, or a marriage. By itself as a piece of raw data, “union” is meaningless. It becomes information only when it is combined with a context. Information equals data plus context.
The need for data to have a context in order to become information is not unique to XML. Tag structures are not the only way to provide context. We also provide context by putting headers in our tables, by agreeing on a specific organization for the fields in a flat file, by creating name/value pairs, by giving function definitions containing the order and the type of arguments, by providing “Usage” messages when a user’s argument list seems to be outside the recognized context, or by organizing data into table and column structures in a relational database management system (RDBMS).
Most techniques for providing context are static. If, in a four-column table with four appropriate headings, you try to insert a row with five fields, then you have tried to do something the context does not allow. The same thing holds true if you try to insert “-u union” into an argument list when the coding doesn’t recognize such a pattern as part of the recognized context. If you want to store “union” in an RDBMS, there had better be a pigeonhole for it somewhere in the static table/column structure; the static context is provided by the person who created the table/column structure in the first place.
As its name, “Extensible Markup Language,” suggests, XML provides a way of communicating context dynamically: you can provide any context you want, on the fly, in a way we never could with the static context structures we were used to. When a sending process includes a new “<type>union</type>” element embedded within an otherwise familiar tag structure, it gives the receiving process a new context for the data item. The context has been extended on the fly, that is, dynamically.
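A minimal sketch of this idea, using Python’s standard XML parser: the receiver has no pre-agreed schema, yet it discovers the new element’s context (its tag) along with the data. The document and tag names here are illustrative, not from any real system.

```python
import xml.etree.ElementTree as ET

# A hypothetical message whose sender has extended a familiar
# structure with a new <type> element on the fly.
doc = """<order>
  <part>elbow-joint</part>
  <type>union</type>
</order>"""

root = ET.fromstring(doc)

# The receiver needs no pre-agreed schema: it recovers the context
# (the tag names) together with the data items.
for child in root:
    print(child.tag, "->", child.text)
```

A receiver written this way treats the tag structure itself as part of the payload, which is precisely the “dynamic context” being described.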
We are used to dynamic data when computers talk to one another. The primary feature of XML is that it allows dynamic context also. With XML, instead of merely exchanging dynamic data, we can now exchange dynamic information.
Standard RDBMSs were designed to store dynamic data in a single static context. With the emergence of XML, however, a new challenge arises. Now we have to store not only data but also context. Instead of mere databases, we now need information bases. Not only will the data be dynamic; the context will be dynamic as well. We will need to query, update, and insert not only data but also context. The next generation of databases (I suppose we will still call them “databases”) will need to be more than databases; they will equally be contextbases, or tagbases, or informationbases. Since their invention, databases have always assumed a single stable, static context. The new challenge is to devise appropriate storage structures when the tag structures are as dynamic as the data.
With the appearance of XML, the software development community did not suddenly have a mass transformation in their “dynamic data and static context” mindsets. In many applications XML explicitly communicates the same fixed context that was previously passed only implicitly. The tag structure of XML is often used to communicate the same part numbers, customer addresses, and delivery dates that we used before XML. Sometimes the data is extracted from the XML document and stored in the same static RDBMS that the data has always been stored in, and the tagset is simply discarded. In some ways, XML communications are far less efficient than previous methods: XML communicates the entire tag structure—and sometimes we do not really need the tag structure, because the context simply does not change all that often. (IT professionals embrace the XML technology even where tags are extraneous, simply because they find it worth the extra expense to have a standard, portable data format.)
A legitimate question arises. While XML might be important when dynamic processes are communicating with each other, when it comes to databases, might we not merely modify the table/column structure on those relatively rare occasions when the context is modified? After all, we have been successfully doing just that for decades. Even with the availability of XML, the context changes come relatively infrequently; usually only the data changes. And if this thinking is correct, we don’t really need special databases for dynamic context; on those relatively rare occasions when the context changes, we simply modify the table/column structure as needed.
The response from XML enthusiasts is that the reason context rarely changes is that the data community is still locked into the “static context” mindset. There is a revolution in thinking that needs to match the technical revolution that XML exhibits, or so it is claimed. Once we genuinely grasp the possibilities that are now available, we will discover new and exciting applications. And we need to be ready with databases which can deal with genuinely dynamic tagsets.
But is such a claim from XML enthusiasts true? Do we need databases which can deal with genuinely dynamic tagsets?
Mr. Ronald Bourret, whose expertise includes providing tools to transform XML tag structures into RDBMS table/column structures, discusses the possibility that a “structure varies enough that mapping it to a relational database results in either a large number of columns with null values (which wastes space) or a large number of tables (which is inefficient).” In a separate private discussion, Mr. Bourret also suggests there might be more fundamental problems than just inefficient use of database space. He suggests, first, the possibility that the structures might vary frequently enough (many times a second) that it is simply too time-consuming to transform the tagset into table/column structures and then create the tables and columns in a database. Second, Mr. Bourret suggests there might be tagsets which are so widely varying that each tagset represents virtually an entirely unique table/column structure. Both of these problems might be exhibited, Mr. Bourret suggests, by an “XML-aware search engine,” where the search engine “understands the rules governing XML documents,” so that users can search explicitly for context as well as data.
Another example where tagsets can vary rapidly and significantly is in scientific experiments which generate new data contexts. Dr. Harry Direen discusses the tagsets which are generated when doing research into DNA structures and their components, where experimenters can predict they will encounter new tagsets with great regularity, but they have no way of predicting what the tagsets might be.
One approach to XML storage has been mentioned already: use an existing RDBMS and its pre-defined table/column structure. A vendor with a proven RDBMS will usually also provide such an “XML-enabled RDBMS” variant. The advantages of this technique are numerous, especially if you anticipate a stable tag structure and already have the RDBMS and the appropriate tables and columns. You get all the triggers, stored procedures, indexing, and querying capability that we have grown comfortable with from our RDBMS experiences. We can update and insert data easily. And efficient searches can be built onto the table/column structure—you can search specific columns rather than an entire document. But there are also some problems. They center around the fact that you have to have a way to map the XML schema to the database schema. Without such a mapping, you don’t know where to store the data from the XML document, and you don’t know how to create the tagset for query replies.
Typically, tools are developed to ease the mapping of document schemas to database schemas. Even so, problems remain: tables proliferate when the document schemas vary significantly; inefficient document schemas are hard to interpret; and it is difficult to tell when a schema has changed fundamentally and when a previous structure has merely been extended. Mixed content (where an element contains both data items and child elements) is particularly difficult to map, especially when a child element is embedded within the data. Furthermore, the order of child elements can be significant in an XML document, but row order carries no meaning in a relational database. Using an XML-enabled relational database thus assumes, for the most part, stable and predictable tag structures. It uses XML as a highly portable communication format but de-emphasizes the “extensibility” built into XML.
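The mapping (“shredding”) approach can be sketched in miniature. This is an illustrative toy, not any vendor’s actual tool: element tags are mapped to columns of a pre-defined table, and the tagset is discarded on storage.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical <customer> document mapped onto a fixed table/column
# structure (the table and element names are invented for illustration).
doc = """<customer>
  <name>Acme Pipefitting</name>
  <city>Chicago</city>
</customer>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")

root = ET.fromstring(doc)
row = {child.tag: child.text for child in root}   # shred: tag -> column
conn.execute("INSERT INTO customer (name, city) VALUES (:name, :city)", row)

# Note what is lost: mixed content and the order of child elements
# have no natural home in the resulting row.
print(conn.execute("SELECT name, city FROM customer").fetchone())
```

The mapping works smoothly only because this document matches the table exactly; a fifth element, or text interleaved with child elements, would break the simple tag-to-column rule, which is the difficulty the paragraph above describes.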
The other major approaches to storing XML documents are generically called “native XML databases.” They share this in common: (1) they are prepared to store any incoming syntactically correct XML document, whatever its tag structure, without any separate off-line preparation of the database, and (2) in response to appropriate queries, they will reply with a tag structure that is reasonably faithful to the original document structure. The architectures used by such native XML databases are quite varied.
One approach used in a native XML database is to store the XML document in its entirety. Separate columns in a more-or-less traditional database might be used for such metadata as the date of the document, the user’s original filename, perhaps a document ID assigned by the system, or keywords for indexing. Such an approach to storing an XML document is quite efficient when it comes to the speed for the initial storage of the document and for retrieving a whole document or a major part of a document. However, without the table/column structures to limit the initial search object, such operations as searching, joins with data items or tags from more than one document, updates, and inserts all present their own challenges. Various tools and structures have been developed to deal with such challenges, and in many cases using such an approach is preferred over using an XML-enabled RDBMS, especially when there is no stable tag structure.
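The whole-document approach can likewise be sketched. The metadata column names here are illustrative assumptions; the point is the trade-off: retrieval of a whole document is a single cheap lookup, while searching inside documents means scanning every stored body.

```python
import sqlite3
from datetime import date

# Sketch of whole-document storage: the XML is kept intact, with a few
# metadata columns alongside it (column names are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    doc_id   INTEGER PRIMARY KEY,
    filename TEXT,
    stored   TEXT,
    keywords TEXT,
    body     TEXT)""")

doc = "<order><part>elbow-joint</part><type>union</type></order>"
conn.execute(
    "INSERT INTO documents (filename, stored, keywords, body) VALUES (?, ?, ?, ?)",
    ("order123.xml", date.today().isoformat(), "order,union", doc))

# Retrieving the whole document is a single cheap lookup...
body, = conn.execute("SELECT body FROM documents WHERE doc_id = 1").fetchone()

# ...but without table/column structure to narrow the search, finding a
# data item means scanning every document body.
hits = conn.execute(
    "SELECT doc_id FROM documents WHERE body LIKE '%union%'").fetchall()
```

Real native XML databases layer indices and path-aware query tools on top of this to mitigate the scan problem, as noted above.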
NeoCore’s approach to providing a native XML database is unique. It uses neither of the two approaches described above: there are neither tables nor columns, but neither are the documents stored in their entirety. Instead, NeoCore provides a storage architecture in which pieces of the original document are separated out and stored in NeoCore’s own structures.
Chapters 3 through 6 of this document explain the NeoCore architecture in detail; here we give a “big-picture” architectural overview.
NeoCore provides both UNIX and Windows variants. All database files are managed by the operating system in the same way it manages any other file.
The NeoCore XMS operates in the client-server setup shown in Figure 2-1. On the client side there is, first, a NeoCore-provided “Console,” a tool which runs in the user’s browser and through which various administrative tasks can be performed and individual store, query, insert, update, and delete commands can be issued manually. Also on the client side is a NeoCore-provided API, available in Java, C++, Visual Basic, and COM, which allows client-side programs to communicate with the database. The actual communication between client and server uses the HTTP protocol, although that is typically invisible to the user. (Of course, the user doesn’t have to use the NeoCore APIs; the user might prefer to generate the HTTP independently, particularly if there is a browser-based data input mechanism. NeoCore provides in its documentation all the HTTP sequences recognized by the XMS.)
On the server side, the NeoCore XMS listens on two ports: one for communicating with the Console and one for all other transactions. At the core of the XMS is the XML database itself. Between the two ports and the core database sits a system which manages login/logout, user and password control, user groups, transactions, archiving, logging, metadata generation, error handling, resizing, and so on. In our study of the NeoCore database, we are interested only in the core storage, not the broader management system, because the core storage contains NeoCore’s unique, patented architectural features.
Now let’s look at the architecture within the core database. See Figure 2-2. NeoCore stores tags and attribute names in a “Tag Dictionary”; it stores data items and attribute values in a “Data Dictionary.” The use of the term “Dictionary” is perhaps unfortunate. For one thing, “dictionary” is often used as a synonym for “schema,” referring to the entire database schema; NeoCore’s “Dictionary” is something altogether different. Further, NeoCore’s dictionaries contain nothing resembling “definitions,” which is what one normally expects of a dictionary. Instead, the two dictionaries are simply lists: the Data Dictionary is a list of data items, and the Tag Dictionary is a list of the corresponding tag paths. Neither dictionary contains duplicates: if there are fifty thousand customers in the database whose city is Chicago, then “Chicago” appears only once in the database. The dictionaries are explained more thoroughly in Chapter 3.
A “Map Store” provides the structure of each XML document. For each data item in the XML document, there is an entry in the Map Store. The entry provides a pointer to the data item in the Data Dictionary, a pointer to its tag path in the Tag Dictionary, and a set of codes which allow the system to reconstruct the original document. The Map Store is explained more thoroughly in Chapter 3.
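The dictionary-plus-map idea can be shown in miniature. This is a toy illustration, not NeoCore’s actual (patented) structures: each string is stored once, and the map records document structure as pairs of pointers into the two dictionaries.

```python
# Toy sketch of the Dictionary / Map Store idea. Strings are interned
# exactly once; the map holds only pointers (indices) into the dictionaries.

def intern(dictionary, value):
    """Return the index of value, appending it only if unseen (no duplicates)."""
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    return dictionary[value]

tag_dict, data_dict, map_store = {}, {}, []

# Two customers in Chicago: "Chicago" is stored only once.
for name, city in [("Acme", "Chicago"), ("Zenith", "Chicago")]:
    map_store.append((intern(tag_dict, "customer/name"), intern(data_dict, name)))
    map_store.append((intern(tag_dict, "customer/city"), intern(data_dict, city)))

# The Data Dictionary holds each distinct string once...
assert list(data_dict) == ["Acme", "Chicago", "Zenith"]
# ...while the map preserves one entry per data item in document order.
assert len(map_store) == 4
```

A real Map Store entry would also carry the structural codes needed to reconstruct the original document, which this sketch omits.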
There are six indices which work together to provide rapid searching capability. Three of these, the “Core Indices,” are explained thoroughly in Chapter 5; the other three, the “Duplicate Indices,” are treated in Chapter 6. All data items are automatically indexed, as are all tag elements and all sets of tag elements. Finally, the combination of each data item with its associated tag path is automatically indexed.
The other architectural feature we will examine in some depth is the “Icon Generator.” We will discuss the Icon Generator itself in Chapter 4 and the use of Icons in Chapter 5. In the NeoCore lexicon, an “Icon” is a large binary number used to represent a data or tag string. An icon is related to its string in much the same way that a sophisticated checksum is related to its associated string: because of the icon’s size, the probability that two strings will generate the same icon is very small, and in small-to-medium-sized databases the probability of any collision at all is negligible. For NeoCore, all indexed searching is done with icons and pointers, not with strings. NeoCore claims that this allows very rapid searching even when the indices are full; we will examine the rationale behind this claim fairly thoroughly in Chapter 5, and in our testing (see “Test Five” in Chapter 7) we will determine whether the claim holds in practice.
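The icon idea can be illustrated with an ordinary cryptographic hash standing in for NeoCore’s proprietary icon generator (SHA-256 here is purely a stand-in; NeoCore’s actual algorithm is different and is discussed in Chapter 4).

```python
import hashlib

# Toy analogue of a NeoCore "icon": a large binary number derived from a
# string and compared instead of the string itself.
def icon(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")

# An index keyed by icons: lookups compare fixed-size numbers,
# never variable-length strings.
index = {icon("Chicago"): "ptr-to-chicago-entry"}

assert index.get(icon("Chicago")) == "ptr-to-chicago-entry"
assert index.get(icon("Chicag0")) is None  # different string, different icon
```

The benefit is that every indexed comparison takes the same small, fixed cost regardless of string length, which is the basis of the rapid-search claim examined in Chapter 5.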
In some ways the NeoCore XMS should be regarded as a work in progress. For example, at present it runs efficiently as a small-to-medium-sized database but is not configured to work well with large databases, say in the 100-gigabyte range; that functionality is planned for version 3.0, originally scheduled for release in the summer of 2003 but now delayed. For another example, XQuery can be used as the query language as of version 2.6, but version 2.6 did not support the full XQuery functionality; some additional capability was added in version 2.7, with more still to come. As a final example, NeoCore has patented an entirely new method for performing substring searches of text data, as mentioned in Chapter 6 of this document, but the new technique has not yet been implemented in the XMS.
There are five major XML database benchmarks today. We chose to use the XOO7 benchmark. We based our decision on the fact that the benchmark was extremely flexible in generating documents of many sizes and in building at least some variation into its basically static schema. The source code for the C++ benchmark generation tool was readily available for download, and although created for a UNIX environment, it was readily modified for our Windows testing environment. The tool for generating the XML benchmark documents worked quickly and accurately once modified for our environment.
One objection to the XOO7 benchmark pertains not to the benchmark itself but to the published testing done with it. The benchmark developers preferred large XML documents: the smallest configuration files provided on the XOO7 website generate a document of some four and a half megabytes. A decent database should handle such documents without problems, and, admittedly, we could modify the configuration files to produce far smaller or far larger documents. We believe, however, that real-life applications involve many more files that are much smaller, and that if the benchmark is to imitate reality to any significant degree, most of its files should be smaller. We did use the XOO7 generation tool to produce many tiny files and some significantly larger ones, but in running the specific tests documented by the XOO7 team we used the 4.5-megabyte files, so that we could compare our findings with theirs.
A second objection concerns the benchmark’s use of queries which generated huge responses: one of them was over 16 megabytes. Much the same queries could have been posed with far smaller returns. The problem is that with large responses we necessarily measure a fair amount of socket and transmission time, along with the time our own client-side processes spend buffering the lengthy response. We would have preferred queries which put less emphasis on processing and communication overhead and more on the search time the database actually used.
A third objection is more fundamental. None of the five XML database benchmarks really challenges the ability of the database under test to deal with both stores and queries for multiple documents arriving at the database in rapid succession with wildly disparate tag structures. All five of them recognize that not every XML document to be stored in a database will display the same tag structure. Nonetheless, they all provide a static schema for the XML documents. They seem to assume that the databases under test will either require a relatively stable table/column structure or will simply store the XML document intact. As a result, they don’t really test the ability of the databases to deal with the extensibility of XML.
Because of this characteristic of the established benchmarks, there is no standardized test for evaluating one of the alleged strengths of the NeoCore approach: handling a variety of rapidly changing, unpredictable tagsets in such a way that storage, queries, inserts, and updates are all done with both space and time efficiency.
Therefore, as part of our testing, we developed a program which generates multiple documents with extremely diverse tag structures; no two documents produced by this program have similar tag structures. We believe that XML databases should be able to handle such a set of documents and that benchmark testing should include such an evaluation. If a particular XML-enabled RDBMS cannot handle such a document set, then actually performing the test would be meaningless, but in that case the reported results need at least to show that this part of the benchmark could not be completed. If XML databases which store the entire document cannot automatically index such documents and search them rapidly, then the benchmark results should document that finding.
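A generator in the spirit just described can be sketched as follows. This is an assumed illustration, not our actual test program: both the tag names and the nesting of each document are randomized, so that no two generated documents share a tag structure.

```python
import random
import string

# Sketch of a diverse-tagset generator: random tag names, random nesting.
def random_tag(rng):
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))

def random_doc(rng, depth=3):
    tag = random_tag(rng)
    if depth == 0:
        return f"<{tag}>{rng.randint(0, 9999)}</{tag}>"
    children = "".join(
        random_doc(rng, depth - 1) for _ in range(rng.randint(1, 3)))
    return f"<{tag}>{children}</{tag}>"

rng = random.Random()
docs = [random_doc(rng) for _ in range(5)]
for d in docs:
    print(d)
```

Each document is well-formed XML, yet with eight-letter random tag names the chance of two documents sharing even a root tag is negligible, so a database under test must cope with an essentially new tagset on every store.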
In Chapters 3 through 6 of this document, we explain in greater depth the architecture of the NeoCore database. Chapter 7 explains the testing procedures we followed. Chapter 8 presents the results of the testing. Chapter 9 summarizes the major findings of the project.
 Ronald Bourret, “XML and Databases,” Section 5.3, “Storing Data in a Native XML Database,” at http://www.rpbourret.com/xml/XMLAndDatabases.htm#datainnative.
 Harry Direen, Ph.D., “Chapter 10: Knowledge Management in Bioinformatics,” in Akmal B. Chaudhri, Awais Rashid, and Roberto Zicari, XML Data Management: Native XML and XML-Enabled Database Systems (Addison-Wesley, 2003, ISBN 0-201-84452-4).
 See the website of Mr. Ronald Bourret for references to the five XML database benchmarks: http://www.rpbourret.com/xml/XMLDBLinks.htm#Benchmarks