Specifications - Page 2
December 2, 2002
The types of documents that may be stored via this application are limitless, provided that they all follow the
same XML skeleton (discussed in the next section). This necessitates the use of an XML authoring tool,
whether manual or automated, like XML Spy (see http://www.xmlspy.com/) or oXygen (see
http://www.oxygenxml.com/). The examples in this chapter provide a sample of what can be stored.
However, because XML is used, the average application user requires some technical skills. Alternatively, a
more user-friendly Graphical User Interface (GUI) can be wrapped around this application engine, but that is
beyond the scope of this case study.
Note that the source documents do not necessarily represent web pages, although they can. In fact, we could
store HTML, WML, CML, SVG, or any other document in the database as well. In the case of WML,
however, the <?xml ...?> Processing Instruction (PI) does not get stored; it has to be added back by
the application component that manipulates WML documents. The same goes for XML documents in
general: the XML PI is stripped off before the document is stored.
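As a rough illustration of the point above, an export component could put the <?xml ...?> line back before handing a stored document to a consumer. The sketch below is in Python purely for illustration and is not the chapter's own code; the function names and the default declaration text are assumptions.

```python
import re

# Assumed default; a real component might also need an encoding attribute.
XML_DECLARATION = '<?xml version="1.0"?>'

# Matches a leading <?xml ...?> line but not <?xml-stylesheet ...?>.
_DECL_RE = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')


def strip_declaration(document):
    """Remove a leading <?xml ...?> line, mimicking what happens before storage."""
    return _DECL_RE.sub('', document, count=1)


def restore_declaration(stored, declaration=XML_DECLARATION):
    """Prepend the declaration again; safe to call on text that already has one."""
    return declaration + "\n" + strip_declaration(stored)
```

Because restore_declaration() strips any existing declaration first, calling it twice does not produce a duplicate line.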
This chapter does not cover XSLT stylesheets. However, they can be used in various application components
that manipulate the stored documents. Such stylesheets could be used to export document content in various
formats including HTML, WML, RTF, PDF, and so on.
The list below highlights some of the functional specifications and features that can be a part of the
application (not all of these items are covered in this chapter):
- Private System
  - Technically skilled users could use this system. There is no login feature in the present application, but this could be added later.
- Variety of document content
  - Web pages
  - Wireless web pages
  - General XML documents
  - Articles
  - Product manuals (print and online versions)
  - Indexes
  - Bibliography
  - Research notes
  - Non-fiction books
  - General lists
  - Company records
  - Recipes
  - Letters
  - Poems
  - Short stories
  - Fiction/novels
- Variety of export facilities
  - Text
  - XML
  - WML
  - HTML
  - RTF
  - PDF
Because this application stores a wide variety of XML-based document structures, its export facilities are
limited to pre-defined formats based on content type and subtype. The only facility for changing the
appearance of finished documents is XSLT, which we do not discuss here.
On the surface, this application appears to be an awkward alternative to a word processing package. However,
for users who want to store a wide range of documents in XML format, this is an ideal, albeit simplified,
application. The advantage is that we now have an abstract model of a document to which we can add
interfaces. Using these we can generically search for information regardless of the type of the stored document.
To aid in building a generic search engine, we can go a step further by building a document corpus. A corpus
is a list of unique words used in the sampling of documents. By temporarily stripping the XML tags from each
document, we can add any new unique words we encounter to a web corpus.
A web corpus is the set of database tables, together with their contents, that represents the list of unique
words and pointers to their occurrences in documents.
For a simple web corpus, we simply log one occurrence of a word in a document. For example, if the word
'science' occurs one or more times in three different imported documents, we would catalog one occurrence
in each document by creating three occurrence records. Each record would contain a reference to the actual
word, as well as a reference to the document where it occurs.
Refer to Chapter 6 for more information about web corpora and how to build a search engine
based on a web corpus.
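The simple web corpus described above can be sketched in a few lines. This is an in-memory illustration in Python, not the chapter's database-backed code: the class and method names are assumptions, the tag stripping is a naive regular expression (it ignores comments, CDATA sections, and entities), and the two Python collections merely stand in for the word and occurrence tables.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')              # naive tag stripper (assumption)
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")  # lowercase words, optional apostrophes


def unique_words(xml_text):
    """Strip tags temporarily and return the set of distinct lowercase words."""
    text = TAG_RE.sub(' ', xml_text).lower()
    return set(WORD_RE.findall(text))


class SimpleCorpus:
    """Stand-in for the corpus tables: one word list, one occurrence list."""

    def __init__(self):
        self.word_ids = {}        # word -> word id (the unique-word list)
        self.occurrences = set()  # (word id, document id) pairs

    def add_document(self, doc_id, xml_text):
        # One occurrence record per word per document, however often it appears.
        for word in sorted(unique_words(xml_text)):
            word_id = self.word_ids.setdefault(word, len(self.word_ids) + 1)
            self.occurrences.add((word_id, doc_id))

    def documents_containing(self, word):
        word_id = self.word_ids.get(word.lower())
        return sorted(doc for (wid, doc) in self.occurrences if wid == word_id)
```

Importing three documents that each mention 'science' at least once yields exactly three occurrence records for that word, one per document, which is the behavior the text describes.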
A different type of search engine could be designed by parsing an imported document's body markup and
recording its tag and attribute names. This would allow users to search for all documents containing a
specific tag name, or even for all tags that have a particular attribute. This is particularly useful if we are
designing parsing components for this application on an as-needed basis and have recurring tag names that
must have the same attributes.
For example, if we are storing company documents that are mostly about employees, some document types
might have an <employee> tag with the following attributes:
<employee lname="mouse" empid="1234567890"/>
Storing occurrences of a tag name and its attribute names in a database gives us the ability to query for specific
documents based on the inclusion of this tag and its attributes. In the case of the <employee> tag, we can query
for a list of documents in which this tag occurs either with or without the attributes. This makes it easy to pinpoint
markup errors, for instance in legacy documents that have been converted into XML through brute-force methods.
The type of search engine mentioned here is briefly touched upon towards the end of this chapter.
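To make the tag-based search idea concrete, the following sketch records which tag names and attribute names occur in each document and then queries that record. It is written in Python for illustration only and is not the code presented later in the chapter; the function names and the in-memory set (standing in for a database table) are assumptions, and it indexes the whole document rather than just the body.

```python
import xml.etree.ElementTree as ET


def index_tags(doc_id, xml_text, index):
    """Add one (document id, tag name, attribute names) entry per element."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    for element in root.iter():
        index.add((doc_id, element.tag, tuple(sorted(element.attrib))))


def documents_with_tag(index, tag, required_attrs=()):
    """Documents where the tag occurs with all of the required attributes."""
    required = set(required_attrs)
    return sorted({doc for (doc, name, attrs) in index
                   if name == tag and required <= set(attrs)})


def documents_missing_attrs(index, tag, required_attrs):
    """Documents where the tag occurs but lacks a required attribute."""
    required = set(required_attrs)
    return sorted({doc for (doc, name, attrs) in index
                   if name == tag and not (required <= set(attrs))})
```

Calling documents_missing_attrs(index, 'employee', ['lname', 'empid']) would list the documents in which an <employee> tag appears without both attributes, which is one way to surface the markup errors mentioned above.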
|
All content blocks are supplied using custom XML tagsets defined by the end user.
Documents can be defined in an incomplete manner (provided they are still well-formed
XML) and updated later (the application code for update is not covered here). The code
that is covered in the rest of this chapter is the basic import code required to catalog each
document. Numerous interfaces can be built over the top of the database that results from
running the import code.
|
Several examples for different types of content are detailed after a discussion of the general structure required.
As usual, we will use PEAR::DB for communicating with the MySQL database on which this whole application
relies. By using PEAR::DB, we also make it easy to switch to a different database server should that ever become necessary.
Because we are only covering a batch-mode document import component, there are no particular
web page requirements.
Professional PHP4 Web Development Solutions
General Structure of XML Documents - Page 3
|