Chapter 1. Introduction
This is the final report of a study conducted in the Department of Computer Science at the University of Colorado at Colorado Springs during the 2002–2003 academic year. The purpose of the study was to understand and evaluate the architecture used to store and later retrieve information contained in Extensible Markup Language (XML) documents by the NeoCore XML Information Management System (XMS). Kenneth H. Wenker, Ph.D., was the principal researcher for this master's degree project; Edward Chow, Ph.D. (http://www.cs.uccs.edu/~chow/), served as Advisor.
The NeoCore XMS is a commercially available product of NeoCore, Inc. (http://www.neocore.com/). At its center is a database which the company claims is the first “self-constructing” database to offer both speed and flexibility. Our study concerned only this core database. The broader NeoCore XMS also contains a variety of tools and utilities to administer the system and to manage and analyze the information it contains; examining these tools and utilities is outside the scope of this study, although some of them would be of considerable interest to professionals in the area of data analysis.
The reason for conducting the study is that the NeoCore database uses a unique storage architecture which has not been subjected to published academic scrutiny. It is distinctive enough to represent a major conceptual alternative. There is a need to bring the details of the architecture to the attention of the academic community. Furthermore, as part of the larger architecture, NeoCore offers an alternative approach to hash tables and a faster way to generate large transforms such as cyclic redundancy codes (CRCs). Independent of the database architecture, the algorithms used in these hash table and transform generation alternatives need to be examined by the academic community. Our initial academic examination of these two algorithms is included in Chapter 5 and Chapter 4 respectively. (Further, a future release of the XMS is to include what is claimed to be a faster way to search a string for multiple sub-strings of varying sizes. This also should be of great interest for academic examination. This string search technique was not examined as part of this study, since it is not currently part of the NeoCore product. It is described briefly in “Searching for Non-Indexed Strings Within Text” in Chapter 6 of this report.)
The details of the NeoCore database architecture are generally unknown outside of NeoCore, even by those whose expertise is in the area of XML databases. A fairly intensive study of the literature failed to find a report of any objective evaluation of XMS performance. Journals aimed at the XML community, when describing the XMS, contain little more than the content of NeoCore press releases—the journals we found contain no detailed description of its architecture, and there is no evidence that those reporting on the product had ever installed it and run it. Even NeoCore white papers (available on its web site and referenced later in this report) fail to explain the core architecture in enough detail to permit academic evaluation; they seem to have potential customers and investors as their audience, rather than the academic community.
Normally, companies in NeoCore’s situation would not reveal the architecture of their commercially available products in sufficient detail for the academic community to evaluate it. Companies typically guard such details as proprietary information, since there is little to be gained by letting competitors know exactly what you are doing. However, in the case of NeoCore, the situation is significantly different: NeoCore has patented several features of its architecture. The patent documents necessarily reveal many of the architectural details of the NeoCore database, and most of those documents are now published on the United States Patent and Trademark Office (USPTO) website (http://www.uspto.gov/).
One part of this study was to examine the patent documents, which describe significant portions of the NeoCore database architecture. We wanted to look beneath the legal framework of the patent documents to extract, relate, and then explain the technical details of the NeoCore database architecture.
The other part of this study was to install and run the NeoCore XMS and subject it to some benchmark testing. We also ran further live tests exploring performance issues not addressed in standard benchmarks; we believe these extra tests are needed to evaluate the performance of the unique architecture.
This project was not NeoCore’s idea; they did not ask for it, fund it, or participate in it in any formal way. They were usually extremely supportive when asked for clarifications of difficult material contained in the patent documentation or in their own white papers. However, when asked to explain material not contained in the patent documentation and not included in the white papers on their web site, they typically suggested, gently, that such material was proprietary. Accordingly, there are some frustrating holes in the descriptions of the architecture contained in Chapter 3 through Chapter 6, below.
NeoCore employees have been aware of this project from the start. Indeed, we initiated their involvement, because standard wording in the licensing agreement for the evaluation copy of the XMS required that we not divulge testing data without their permission. We wanted them to allow us to publish the results, good or bad, without having them filter the report of our findings. They supported us in that request. Since then, their attitude seems to have been that it is in their best interest that this project be truly objective. On every occasion, we were the ones to initiate any interaction.
For the sake of full disclosure, we want to explain the extent to which NeoCore was involved. NeoCore provided us the following assistance. (1) They advised us, when asked, which of their many patents and pending patents were actually implemented in the NeoCore XMS. Their thinking seemed to be that they did not want us asserting that some feature was implemented within the database when as a matter of fact it was not there. (2) When we asked for clarification of materials contained within the patent documentation, they attempted to help. The help could take the form of an oral explanation, or, if they had an internal paper covering the same material, they would make it available. (3) Attachments to the patent materials are part of the public record, but they are not posted on the USPTO website. When there were references to these attachments within the body of the patent documentation, NeoCore provided us with a copy of those attachments when we asked for them. (4) When we asked for an overview of architectural features, so that we could put a specific patent into a broader context, they provided an oral overview similar to what they might give to a prospective customer. On one occasion the oral presentation was supported by slides which had been used in an earlier presentation to a prospective customer. (5) When we asked for suggestions about how to configure the system in order to achieve some desired outcome in the live testing, they either told us it could not be done (for example, the logs record times only in hours, minutes, and seconds, not milliseconds) or told us how to do what we wanted, sometimes gently pointing us to the appropriate reference in the documentation. (6) We asked them to review each chapter of this report for us, to verify that it contained no errors in our explanation of the architectural details. In reply they tried to maintain a hands-off approach so as to support the objectivity of the study. At the same time they pointed out any specific points which were erroneous or misleading; they were as interested in our being accurate as in our being objective.
Our thanks go to Dr. Harry Direen (http://www.direentech.com/), our contact at NeoCore, for his delicate walk between letting us do what we needed to do without interference and at the same time helping us when we needed it, as described in the previous paragraph. Our thanks also go to Mr. Ronald Bourret (http://www.rpbourret.com/), an independent consultant on XML and databases, who reviewed draft versions of the chapters in this document. We wanted his input from the perspective of the broader XML database community. He was particularly helpful in identifying material which was unclear, incomplete, or misleading. Mr. Heng Wang saved us many hours with his spontaneous but absolutely perfect tutorial orientation to the MS Visual Studio environment.
Chapter 2 of this document provides an overview of the NeoCore architecture. Chapter 3 through Chapter 6 provide a more detailed examination of the unique, patented architectural features. Chapter 7 explains our procedures for the operational testing. Chapter 8 presents the results of our operational testing. Chapter 9 summarizes our major findings.