Chapter 8. Findings
In the sections that follow, we examine the findings from each test in some detail. Before doing so, we need to discuss several issues that cut across all the testing.
1. THE IMPORTANCE OF SUFFICIENT RAM. The performance of the NeoCore XMS depends to a very significant degree on the amount of available RAM. When there is enough RAM to hold all five indices completely, with plenty of buffer space left over, the system is very fast. If you restrict the amount of RAM or make the indices too big, so that the operating system has to page slices of index back and forth to disk, the effect is dramatic.
While doing the preliminary extensibility testing on our limited development platform, we saw how severely insufficient RAM degrades NeoCore performance. The extensibility testing, with few duplicates, requires relatively large core indices, challenging the Windows operating system's ability to retain them all in RAM.
The database administrator can configure the amount of RAM that each database file will use or, instead, let the operating system manage it. When RAM is limited and past usage is not a good indicator of future needs, the operating system might not manage RAM the way the user would prefer. For example, we know, given the way the core indices are structured, that any part of a core index is equally likely to be needed at the next read, regardless of which portion of it the operating system has been reading recently. Windows might foolishly choose to keep only part of a core index in RAM, in favor of some less significant process.
Figure 8-1 compares times for storing two similar sets of 15 XML documents. In both cases we used "extensible" files, so the duplicate indices saw virtually no use and extremely high demands were placed on the core indices; we ran the test on our "development" environment, where RAM was limited. In case "A" we had a large database and were therefore configured for a fairly large tag core index; in case "B" we were configured for a small database and hence a small tag core index. In case "A" we were forced to restrict the amount of RAM allocated to that index; in case "B" we let the operating system decide how much of the core indices to keep in RAM (presumably it kept all three in memory most of the time). As Figure 8-1 shows, case "B" store times were typically 100 times faster than case "A" store times, except early in the sequence, while the operating system was still learning that it is best to keep all three core indices in RAM at all times. The entire difference in processing time is due to the need, in case "A", to move unneeded slices of the core indices from RAM to disk and then move the needed slices from disk to RAM.
Figure 8-1. Effects of Insufficient RAM on Storage Speeds
Remember that the results shown are not typical of real-world applications: we were using “extensibility” documents which have unique characteristics that would not be encountered in normal usage. Using the “extensibility” documents exaggerated the effect observed. Further, we were using our very limited development platform; in practice, no one would run a significant database with such limited resources.
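The paging effect described above can be illustrated with a small simulation (our own sketch in Python, not NeoCore code): uniformly random reads over an index, filtered through an LRU page cache standing in for available RAM. When the whole index fits in the cache, nearly every read is a hit after warm-up; when only a fraction fits, most reads must go to disk.

```python
import random
from collections import OrderedDict

def hit_rate(index_pages, cache_pages, accesses=10_000, seed=1):
    """Simulate uniformly random reads over an index through an LRU page cache."""
    random.seed(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(accesses):
        page = random.randrange(index_pages)
        if page in cache:
            hits += 1
            cache.move_to_end(page)       # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / accesses

# Case "B": the whole index fits in RAM; after warm-up almost every read hits.
fits = hit_rate(index_pages=100, cache_pages=100)
# Case "A": only a tenth of the index fits; most reads would go to disk.
thrashes = hit_rate(index_pages=1000, cache_pages=100)
```

With these parameters the fully cached case hits on well over 95% of reads, while the restricted case hits on roughly one read in ten; every miss in the real system is a disk round trip, which is the kind of gap Figure 8-1 reflects.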
Because of the significance of available RAM, we now question our initial decision to run both client and server on the same testing platform. Whatever efficiency we gained by avoiding cross-platform communications was probably more than offset by the client processes competing with the server for the same RAM (and the same CPU cycles).
2. SCALING IS AN ART. Several tests required us to work with near-full databases. Storage needs are difficult to calculate beforehand, because the requirements for each file depend on the characteristics of the input XML: some XML requires relatively large duplicate indices and some requires small ones. Furthermore, it is awkward to resize the database files. Rather than resize, we found it easier, in a testing environment, to wipe out the entire database, change the configuration settings, and start from scratch.
Because of the difficulty of predicting how much storage might someday be needed and because of the inconvenience of resizing the database, a prudent database manager will want to initialize the database with a far greater size than is really needed. This is not unique to the NeoCore product; every database manager I have met does the same thing with traditional databases. The point is that if you are trying to glean some hints about how large a storage capacity you are likely to need with the NeoCore product, be prudent and get more than you think you'll need.
It should be noted that NeoCore claims that when version 3.0 is released, resizing an existing database will be much easier.
3. NO DIRECT COMPARISON BETWEEN THE ORIGINAL XOO7 BENCHMARK TESTING AND OUR TESTING OF THE NEOCORE PRODUCT. The NeoCore architecture is so different from the databases originally tested by the XOO7 Benchmark team that it is difficult to compare results. The XOO7 researchers, given the nature of the databases available to them, were less interested in the actual database storage and more interested in what their systems had to go through to prepare a document for storage. They simply did not report on database footprint or the speed of storing a document in the database. Instead, they reported how long it took to prepare an XML document for storage, usually by creating a file which enabled converting the XML into something that could be stored in an XML-enabled RDBMS and then reconstructing it back into XML later. One of the databases stored the incoming XML document intact and then had to deal with how it would perform updates, inserts, and complex queries. For both stores and queries, comparing NeoCore performance with what the XOO7 team reported is comparing two totally different things.
Even though what NeoCore is doing is completely different from what the XOO7 team was dealing with, we still provide both sets of figures. Even when you are comparing apples to oranges, it is still useful to note that oranges are a better source of vitamin C. And it is equally meaningful to say that the NeoCore XMS could store a document faster than the other systems could prepare the item for storage.
4. WE DID NOT FINISH STORING THE NO-DUPE XML DOCUMENTS. When we began our first attempt to store 100 megabytes of No-Dupe files, we failed to realize how large a Tag Core Index would be required: we ran out of space almost immediately. When we reconfigured the system with a huge Tag Core Index, we ran squarely into the problems noted above about RAM needing to hold the core indices if possible. Data was being stored, and eventually everything would have succeeded, but we ran out of time and patience. We could probably have benefited from more experience tuning the database; in the end we decided there was too much overlap with the extensibility testing to make it worthwhile to continue. Note that the No-Dupe documents bear no similarity to the documents a real-life database would contain; they represent merely a logical extreme.
5. THE DUPLICATE INDICES CAN BE A BOTTLENECK. When a duplicate index has to reallocate more quanta to a particular string, the index must be locked during the reallocation. Reallocating quanta in a duplicate index is one of the slower processes that take place within the database proper, and the lock becomes especially costly when the indices are large and consume a great deal of RAM.
If, in addition, several applications are trying to input documents at the same time, we have to enlarge the buffer areas in RAM to handle the additional inputs, further restricting the already-limited RAM. When the first application locks a duplicate index, the others continue what they are doing until they, too, need to access the locked index. Pretty soon all the client applications are queued up, waiting for the same resource. If there are many duplicates, so that the duplicate index has to lock again immediately for the next cycle, it becomes a serious bottleneck.
We exacerbated the problem by running clients and server on the same system, restricting RAM availability even more.
Our recommendation is to modify the system so that the XMS can create several sub-indices instead of just one; for example, there would be several tag duplicate indices rather than a single one. Then, when a quanta reallocation occurs, at least some of the other applications can continue processing, because they will not happen to need that particular sub-index at that moment.
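The sub-index suggestion can be pictured with a toy sketch (the names and structure here are our own, hypothetical, not NeoCore's): hash each key to one of N shards, each guarded by its own lock, so that a reallocation in one shard blocks only the writers that happen to need that shard.

```python
import threading

class ShardedDuplicateIndex:
    """Toy model of splitting one duplicate index into independently locked shards."""

    def __init__(self, shards=8):
        self.locks = [threading.Lock() for _ in range(shards)]
        self.tables = [dict() for _ in range(shards)]

    def _shard(self, key):
        # Route each key to a fixed shard.
        return hash(key) % len(self.tables)

    def add(self, key, doc_id):
        i = self._shard(key)
        with self.locks[i]:  # only this shard is locked during the update
            self.tables[i].setdefault(key, []).append(doc_id)

    def lookup(self, key):
        i = self._shard(key)
        with self.locks[i]:
            return list(self.tables[i].get(key, []))
```

Whether the extra shards pay for their bookkeeping depends on how skewed the key distribution is; a single very hot key still serializes on its one shard.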
Data generated by the storage footprint testing is detailed in Appendix E. The first chart shows how much disk space was needed. The second chart shows which files took up the space.
An “empty” database contained 3,067,712 bytes in our configuration. This is the size of the binaries and configuration files, the small Data Duplicate Index (unused in our testing, but still present), and nine stored XML documents that the NeoCore XMS uses to set up initial users, user groups, passwords, permission levels, and other administrative settings.
For XOO7 benchmark documents, which would be the equivalent of data-oriented business-to-business applications, we used about 2.25 bytes of disk storage for every byte input. For document (text) storage, we used about 1.2 bytes of disk storage for every input byte. The XOO7 report did not calculate storage in this direct way. They instead showed that the databases they tested required intermediate files to interpret the XML tag structure that were about three times, on average, the size of the input documents.
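As a rough sizing illustration using the ratios above (the 2x headroom factor below is our own suggested cushion, following the sizing advice in point 2, and not a NeoCore figure):

```python
def estimated_footprint_bytes(input_bytes, ratio, headroom=2.0):
    """Disk estimate: measured expansion ratio times a prudent headroom factor."""
    return input_bytes * ratio * headroom

GB = 2**30
# Data-oriented XML expanded about 2.25x in our tests; text about 1.2x.
data_estimate = estimated_footprint_bytes(10 * GB, 2.25)  # ~45 GB with 2x headroom
text_estimate = estimated_footprint_bytes(10 * GB, 1.2)   # ~24 GB with 2x headroom
```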
Appendix F contains the mini-documents we stored for this test, as well as the replies received on querying for the entire document. To make it easier to compare the input and output, in Appendix F we dropped the “<ND>” tags the NeoCore XMS adds to each document; we also dropped all the metadata (document ID, original filename, etc).
There are no clear standards for what an XML database should return when we query for the entire original document. In the absence of clear standards, it is difficult to assess how well the NeoCore XMS performs. Determining how well NeoCore performs in this regard is more of a philosophical undertaking than one of information management.
For standard business-to-business communications, typified by the benchmark documents, the NeoCore performance is fine. Purchase orders, customer information, and shipping data will all be stored and retrieved accurately.
For the archiving and retrieval of text (as might be required by publishers and document storage professionals), the NeoCore database, in its current version, is weak. The primary reason is its imprecise handling of whitespace; in effect, NeoCore's handling of whitespace amounts to altering the data in order to store it. The XML standards require that, between an open/close tag pair with no intervening tags, "processors" return exactly the same whitespace as the original document contained. "Processor" is normally understood to mean "parser", but a good case can be made that the same should be demanded of databases; otherwise it becomes impossible to store such things as poetry accurately. Note that there is nothing inherent in the NeoCore architecture that makes it difficult or impossible to store whitespace accurately. Because NeoCore's customers tend to use it for data-oriented business-to-business applications, precise handling of whitespace has never been important to them.
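The requirement on processors is easy to demonstrate with an ordinary XML parser (Python's standard library here, purely as an illustration): the character data between the tags, whitespace included, survives parsing exactly.

```python
import xml.etree.ElementTree as ET

# The indentation and line break inside the element are significant character data.
poem = "<poem>  Roses are red,\n      violets are blue.  </poem>"
elem = ET.fromstring(poem)

# A conforming processor hands the text back exactly as it appeared.
assert elem.text == "  Roses are red,\n      violets are blue.  "
```

A database that collapses or trims that text on retrieval is, by the same standard, returning different data than it was given.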
There is no good reason for NeoCore to drop such things as comments (Document C1), the XML declaration (Documents I1 and I2), and DOCTYPE declarations (Documents D1 and H1). Someone wanted these things in the document; the document is being stored for some possible future use, which could well require the material just described. Further, there is no need for the "<prolog>" tag (Document D2): if NeoCore accepted the original document as legal XML without the "<prolog>" tag, then NeoCore is equally entitled to return it without that tag. While the issues in this paragraph are perhaps not crucial for NeoCore's typical current customer, they must eventually be dealt with. For example, some users will not understand the inconsistency between the replies returned for documents D1 and D2. Another example: NeoCore's own Console would not accept the query reply for document H1.
Intuitively, a traditional database should not change any data; similarly, an XML database should not change the XML document. Ideally, an XML database should be able to "round trip" all syntactically correct XML, returning exactly what was input. Most of NeoCore's competitors would have a difficult time meeting that ideal, except those that simply store the entire document intact. But the NeoCore architecture can, in principle, meet the round-trip ideal, and NeoCore should take advantage of that by promoting round-trip functionality as one of its features. Round-tripping would allow NeoCore to become a general-purpose XML database. NeoCore could profit from adding a NeoCore-unique function to supplement XQuery: "for $a in original('MyFileName.xml') return $a". This "original()" function would return the exact document the user originally input. Combined with its other functionality, the capability of returning the exact original document would be a big plus for NeoCore. No one can predict what uses future customers might have for XML databases, but it is likely that some will need to get out exactly what they put in, and there is no architectural reason why NeoCore could not provide that capability.
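A minimal sketch of the round-trip idea (our own illustration; NeoCore's internals would of course differ): retain the exact input bytes alongside whatever indexed form the engine builds, and let an original() call hand them back untouched.

```python
class RoundTripStore:
    """Sketch: index documents for querying, but also retain the input byte-for-byte."""

    def __init__(self):
        self._originals = {}

    def store(self, name, xml_bytes):
        # ... the engine would build its flat-file and index entries here ...
        self._originals[name] = xml_bytes

    def original(self, name):
        # Return the document exactly as it was input: whitespace, comments,
        # XML declaration, DOCTYPE and all.
        return self._originals[name]

store = RoundTripStore()
doc = b"<?xml version='1.0'?>\n<poem>  line one\n    line two  </poem>"
store.store("MyFileName.xml", doc)
```

The cost is modest (one extra copy of each document on disk), and queries continue to run against the indexed form; only the original() path bypasses it.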
Appendix G contains the graphs which show storage speeds for the various kinds of XML documents. In general, the documents whose characteristics are closest to those used in real-life applications took about seven seconds per megabyte to store (Multi-Bmark) and four seconds per megabyte (text). The original XOO7 team found it took over a minute per megabyte to process the document for storage.
The spread from slowest to fastest store times, even within a given category, was greater than we had expected. Perhaps that is simply due to normal background processes becoming active and then inactive, taking RAM and CPU cycles away from the database. This was especially true for the extensibility data: there were over 3,500 extensibility documents and their points were scattered all over the graph, so we had to resort to a trendline to get any semblance of readability out of the statistics.
The documents that took the longest time per megabyte to store were the short poems in the text documents. They also came near the very beginning of the text documents; it is possible the operating system was still working out how to manage RAM.
The original benchmark queries and the variations we used are at Appendix D. All the queries we could use produced the expected results.
Testing results are at Appendix H.
Here are comparisons with the results observed by the original XOO7 research team:
MEAN QUERY TIME FOR NEOCORE
MEAN QUERY TIME FOR XOO7 TEAM (APPROXIMATE)
Here are the results of the test of the Core Indices:
FIGURE 8-2: Core Index Test Results
The difference between the queries run when the index is full and the queries run when it is empty is negligible. This is what we expected.
The footprint and storage times for the extensibility documents are included with similar data in Tests 1 and 3.
The extensibility data generates few duplicates, either in the data items or in the tagsets. Because virtually every tagset must be stored, the extensibility documents have a very large Tag Dictionary and hence a very large footprint. The lack of duplicates also means that the Tag Core Index will be very large, causing problems managing RAM. So the extensibility documents also take longer than the other documents to store.
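Why unique tagsets inflate the Tag Dictionary can be seen with a toy model (a sketch of the idea only, not the actual dictionary format): if the dictionary holds one entry per distinct tag path, data that repeats the same tagsets keeps it tiny, while extensibility-style data grows it almost linearly with the number of documents.

```python
def tag_dictionary_size(docs):
    """Toy model: the dictionary holds one entry per distinct tag path seen."""
    paths = set()
    for doc_paths in docs:
        paths.update(doc_paths)
    return len(paths)

# Ordinary business data repeats the same tagsets: the dictionary stays small.
repetitive = [["order/item/price", "order/item/qty"]] * 1000
# Extensibility-style data introduces a new tagset in nearly every document.
extensible = [[f"order/item{i}/price"] for i in range(1000)]
```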
While the extensibility documents cause these expected problems for the NeoCore database, the basic NeoCore claim remains true: you can drop documents with strange or unexpected tagset structures into the NeoCore database with no extra preparation. From new CD to installation on the computer to creating and initializing the database to storing data to getting a successful response to a query takes less than ten minutes if you are willing to accept default configurations.
The insertion test worked exactly as expected. We did the insertions and deletions manually from the Console, and so we did not time them. But they were obviously very quick: the system made about 250 insertions in what seemed like two seconds or so (two seconds for all 250, not two seconds each).
When we did the query, each one averaged 72 milliseconds the first time through (i.e., with the ordered map store). The second time through (i.e., with the jumps in the map store and the extra duplicate entries in the Tag Duplicate Index), each query averaged 74 milliseconds.
The queries, the responses they generated, and the time they took are at Appendix I. We did the queries twice—once when only the six megabytes of classical texts were in the database and once when we added the 94 megabytes of nonsense text.
We were not expecting either fast or slow searches: NeoCore has not yet added its patented sub-string search technique to the XMS. But we have documented the speeds we found, giving us a baseline for future testing.
It should also be pointed out that our substring searches take a lot of memory. We had intended to do some text queries which were more demanding, but they were simply too RAM intensive and we abandoned them.