Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



Specifications - Page 2

December 2, 2002

The types of documents that may be stored via this application are limitless, provided that they all follow the same XML skeleton (discussed in the next section). Creating such documents calls for an XML authoring tool, whether manual or automated, such as XML Spy (see http://www.xmlspy.com/) or oXygen (see http://www.oxygenxml.com/). The examples in this chapter provide a sample of what can be stored. Because the documents are XML, however, the average application user needs some technical skills. Alternatively, a more user-friendly Graphical User Interface (GUI) could be wrapped around this application engine, but that is beyond the scope of this case study.

Note that the source documents do not necessarily represent web pages, although they can. In fact, we could store HTML, WML, CML, SVG, or any other document in the database as well. In the case of WML, however, the <?xml ...?> Processing Instruction (PI), strictly speaking the XML declaration, does not get stored. It has to be added back by the application component that manipulates WML documents. The same goes for XML documents in general: the XML PI is stripped off before storage.
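The stripping and re-adding described above can be sketched as follows. This is only an illustration: strip_xml_declaration() and add_xml_declaration() are hypothetical helper names, not functions from the chapter's code, and the default encoding is an assumption.

```php
<?php
// Hypothetical helpers, not taken from the chapter's code.

// Remove a leading XML declaration (and the whitespace after it) so that
// only the document body is stored.
function strip_xml_declaration($xml)
{
    return preg_replace('/^\s*<\?xml\s[^?]*\?>\s*/', '', $xml);
}

// Put a declaration back in front of a stored body before it is handed
// to a component that needs a complete document.
// The default encoding is an assumption made for this sketch.
function add_xml_declaration($body, $encoding = 'UTF-8')
{
    return '<?xml version="1.0" encoding="' . $encoding . '"?>' . "\n" . $body;
}

$source = "<?xml version=\"1.0\"?>\n<wml><card id=\"home\"/></wml>";
$stored = strip_xml_declaration($source);

echo $stored, "\n";                       // the body as it would be stored
echo add_xml_declaration($stored), "\n";  // the body with a declaration restored
```

A component that exports WML would also need to emit the appropriate DOCTYPE, which this sketch leaves out.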

This chapter does not cover XSLT stylesheets. However, they can be used in various application components that manipulate the stored documents. Such stylesheets could be used to export document content in various formats including HTML, WML, RTF, PDF, and so on. The list below highlights some of the functional specifications and features that can be a part of the application (not all of these items are covered in this chapter):

  • Private System
    • Technically skilled users could use this system. There is no login feature in the present application, but this could be added later.
  • Variety of document content
    • Web pages
    • Wireless web pages
    • General XML documents
    • Articles
    • Product manuals (print and online versions)
    • Indexes
    • Bibliographies
    • Research notes
    • Non-fiction books
    • General lists
    • Company records
    • Recipes
    • Letters
    • Poems
    • Short stories
    • Fiction/novels
  • Variety of export facilities
    • Text
    • XML
    • WML
    • HTML
    • RTF
    • PDF

Because the application stores a wide variety of XML-based document structures, its export facilities are limited to pre-defined formats based on content type and subtype. The only facility for changing the appearance of finished documents is XSLT, which we do not discuss here.

On the surface, this application appears to be an awkward alternative to a word processing package. However, for users who want to store a wide range of documents in XML format, this is an ideal, albeit simplified, application. The advantage is that we now have an abstract model of a document to which we can add interfaces. Using these we can generically search for information regardless of the type of the stored document.

To aid in building a generic search engine, we can go a step further by building a document corpus. A corpus is a unique-word list used in the sampling of documents. By temporarily stripping off the XML tags of each document, we can add any new unique words we encounter to a web corpus.

A web corpus is the set of database tables, and their contents, that represents the list of unique words and pointers to their occurrences in documents.

For a simple web corpus, we log only one occurrence of a word per document. For example, if the word 'science' occurs one or more times in three different imported documents, we would catalog one occurrence in each document by creating three occurrence records. Each record would contain a reference to the actual word, as well as a reference to the document where it occurs.
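As a rough in-memory sketch of this idea (plain PHP arrays stand in for the database tables, and the function name, document ids, and sample documents are all invented for illustration), the cataloging could look like this:

```php
<?php
// Hypothetical sketch: arrays stand in for the word and occurrence tables.
// $documents maps a document id to its XML source.
function build_corpus($documents)
{
    $words = array();        // word => word id
    $occurrences = array();  // one record per (word, document) pair

    foreach ($documents as $docId => $xml) {
        // Naive tag stripping: replace every tag with a space so that
        // words on either side of a tag do not run together.
        $text = strtolower(preg_replace('/<[^>]*>/', ' ', $xml));

        // Assumption: words are runs of ASCII letters and digits.
        $tokens = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

        // array_unique() ensures one occurrence record per word per document.
        foreach (array_unique($tokens) as $word) {
            if (!isset($words[$word])) {
                $words[$word] = count($words) + 1;
            }
            $occurrences[] = array('word_id' => $words[$word], 'doc_id' => $docId);
        }
    }
    return array($words, $occurrences);
}

$documents = array(
    1 => '<article><title>Science news</title><body>Science matters.</body></article>',
    2 => '<poem><line>The science of sleep</line></poem>',
    3 => '<recipe><step>Kitchen science: boil water</step></recipe>',
);

list($words, $occurrences) = build_corpus($documents);

// Count the occurrence records that point at the word 'science'.
$scienceRecords = 0;
foreach ($occurrences as $record) {
    if ($record['word_id'] === $words['science']) {
        $scienceRecords++;
    }
}
echo $scienceRecords, "\n"; // prints 3
```

The regular expression used to strip tags ignores comments, CDATA sections, and entities, so a real import component would more likely collect text through an XML parser.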

Refer to Chapter 6 for more information about web corpora and how to build a search engine based on one.

A different type of search engine could be designed by parsing an imported document's body markup and recording tag and attribute names. This would allow users to search for all documents containing a specific tag name, or even all tags that have a particular attribute. This is particularly useful if we are designing parsing components for this application on an as-needed basis and have recurring tag names that must always carry the same attributes.

For example, if we are storing company documents that are mostly about employees, some document types might have an <employee> tag with the following attributes:

<employee lname="mouse" empid="1234567890"/>

Storing occurrences of a tag name and its attribute names in a database gives us the ability to query for specific documents based on the inclusion of this tag and its attributes. In the case of the <employee> tag, we can query for a list of documents in which this tag occurs either with or without the attributes. This makes it easy to pinpoint markup errors, for instance in legacy documents that have been converted into XML through brute-force methods. The type of search engine mentioned here is briefly touched upon towards the end of this chapter.
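A minimal sketch of such a tag index, using PHP's expat-based XML parser functions, might look like the following. The function names, document ids, and sample documents are hypothetical, and an array of rows stands in for the database table that the chapter's import code would populate.

```php
<?php
// Hypothetical sketch: record each tag occurrence and its attribute names.
// Returns tag name => list of attribute-name lists, or false if the
// document is not well-formed.
function index_tags($xml)
{
    $found = array();

    $parser = xml_parser_create();
    // Keep tag names as written instead of folding them to upper case.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$found) {
            $found[$name][] = array_keys($attrs);
        },
        function ($parser, $name) {
            // End tags carry nothing we need to record.
        }
    );

    return xml_parse($parser, $xml, true) ? $found : false;
}

// Documents in which $tag occurs at all.
function docs_with_tag($rows, $tag)
{
    $docs = array();
    foreach ($rows as $row) {
        if ($row['tag'] === $tag) {
            $docs[$row['doc']] = true;
        }
    }
    return array_keys($docs);
}

// Documents in which $tag occurs at least once without attribute $attr.
function docs_missing_attr($rows, $tag, $attr)
{
    $docs = array();
    foreach ($rows as $row) {
        if ($row['tag'] === $tag && !in_array($attr, $row['attrs'])) {
            $docs[$row['doc']] = true;
        }
    }
    return array_keys($docs);
}

$documents = array(
    'staff-list'  => '<staff><employee lname="mouse" empid="1234567890"/></staff>',
    'legacy-memo' => '<memo><employee/> joined today.</memo>',
    'recipe'      => '<recipe><step>Boil water</step></recipe>',
);

// Flatten the per-document indexes into rows, as a database table would hold them.
$rows = array();
foreach ($documents as $docId => $xml) {
    $tags = index_tags($xml);
    if ($tags === false) {
        continue; // skip documents that fail to parse
    }
    foreach ($tags as $tag => $occurrences) {
        foreach ($occurrences as $attrNames) {
            $rows[] = array('doc' => $docId, 'tag' => $tag, 'attrs' => $attrNames);
        }
    }
}

echo implode(', ', docs_with_tag($rows, 'employee')), "\n";              // staff-list, legacy-memo
echo implode(', ', docs_missing_attr($rows, 'employee', 'empid')), "\n"; // legacy-memo
```

The second query is the one that helps locate markup errors: it lists the documents where an <employee> tag appears without the attribute it is expected to carry.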

All content blocks are supplied using custom XML tagsets defined by the end user. Documents can be defined in an incomplete manner (provided they are still well-formed XML) and updated later (the application code for updates is not covered here). The code that is covered in the rest of this chapter is the basic import code required to catalog each document. Numerous interfaces can be built on top of the database that results from running the import code.
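Since an incomplete document is acceptable only if it is still well-formed, an import component could run a check along these lines before cataloging it. This is a sketch under assumptions: check_well_formed() is a hypothetical name and is not part of the chapter's import code.

```php
<?php
// Hypothetical pre-import check, not part of the chapter's import code.
// Returns true for well-formed XML, or a short error description otherwise.
function check_well_formed($xml)
{
    $parser = xml_parser_create();

    // The third argument marks this as the final (and only) chunk of data.
    if (xml_parse($parser, $xml, true)) {
        return true;
    }

    return sprintf(
        '%s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

// An incomplete but well-formed draft: the body is still empty.
$draft = '<article><title>Draft</title><body/></article>';

// A broken document: the title element is never closed.
$broken = '<article><title>Draft</article>';

var_dump(check_well_formed($draft));   // true
var_dump(check_well_formed($broken));  // a string describing the parse error
```

The exact wording of the error string depends on the underlying XML library, so calling code should test for true rather than compare messages.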

Several examples for different types of content are detailed after a discussion of the general structure required. As usual, we will use PEAR::DB for communicating with the MySQL database on which this whole application relies. By using PEAR::DB, we also make it easy to switch to a different database server should that ever become necessary.

Because we are only covering a batch-mode document import component, there are no particular web page requirements.

Professional PHP4 Web Development Solutions