Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions




Well-Formed or Toast?

April 12, 2002

Ignoring for a moment the potential importance of validity to data-oriented applications, you might wonder why even when an XML document does not require a DTD (i.e., is standalone), it still must be well-formed. In fact, if a document is not well-formed, it cannot even be called an XML document.

The reason for insisting on well-formedness is to counteract the "browser bloat" syndrome that occurred when the major browser vendors decided they wanted their browsers to be able to render the horribly inaccurate HTML developed by graduates (or perhaps flunkies) of the Learn HTML in 2 Days or Less school. Many Web pages contain completely invalid HTML, with improperly nested elements, missing end tags, misspelled element names, missing delimiters, and other aberrations. Browsers such as Netscape Communicator and Internet Explorer do an admirable job of recovering from these errors, but only at the expense of a considerable amount of built-in recovery code.

Fortunately, with XML (and XHTML), parsers do not need to implement recovery code and can therefore stay trim and lightweight. If the parser encounters a well-formedness problem, it should only report the problem to the calling application. It explicitly must not attempt to correct what might be missing, overlapping, or misspelled. Violations of well-formedness constraints are considered fatal errors, according to the XML 1.0 Recommendation. The bottom line here is: either a document is well-formed XML, or it's toast.
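To make the "report, don't repair" rule concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module (chosen only for illustration; it is not one of the parsers discussed in this article). The helper name check_well_formed and the sample documents are hypothetical. The parser raises an error on the first well-formedness violation and makes no attempt to guess what the author meant.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser only reports the problem; it never tries to fix it.
        return False, str(err)
    return True, None


good = "<note><to>Ann</to><from>Bob</from></note>"
bad_mismatch = "<note><to>Ann</from></note>"      # end tag does not match
bad_overlap = "<p><b><i>text</b></i></p>"         # overlapping elements

for sample in (good, bad_mismatch, bad_overlap):
    ok, message = check_well_formed(sample)
    print(ok, message)
```

The calling application decides what to do with the reported error; the parser itself stays small because it carries no recovery logic.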

The extra code necessary to do the HTML-like corrections might not be a significant problem for a desktop PC with lots of memory. It's more of an issue as XML is fed to handheld PCs and other devices with limited memory and/or processing power.

Validating and Nonvalidating Parsers

The differences between validating and nonvalidating parsers are not quite as clear as you might think. According to the XML 1.0 specification (http://www.w3.org/TR/REC-xml#proc-types):

Validating processors must, at user option, report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification. To accomplish this, validating XML processors must read and process the entire DTD and all external parsed entities referenced in the document. Non-validating processors are required to check only the document entity, including the entire internal DTD subset, for well-formedness.

In other words, validating parsers must read the entire DTD and check the document against the structural constraints it describes. You might conclude, therefore, that nonvalidating parsers do not need to consult the DTD, but that turns out to be incorrect. Even nonvalidating parsers need to supply default values for attributes and to replace text based on internal entities (discussed in chapter 4).
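The following sketch illustrates this behavior with Python's standard xml.etree.ElementTree module, which is built on the nonvalidating expat parser. The memo document, its priority attribute, and the company entity are made up for this example. The parser performs no validation, yet it still reads the internal DTD subset to supply the attribute default and expand the internal entity.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an internal DTD subset that declares
# a default attribute value and an internal entity.
doc = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ATTLIST memo priority CDATA "normal">
  <!ENTITY company "Example Corp">
]>
<memo>Sent by &company;</memo>"""

root = ET.fromstring(doc)

# The start tag carries no priority attribute, but the declared
# default is supplied by the parser.
priority = root.get("priority")

# The entity reference is replaced with its declared text.
body = root.text

print(priority)
print(body)
```

Note that nothing here checks the document against a content model: adding an undeclared child element to memo would still parse without complaint.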

Although there used to be a class of strictly nonvalidating parsers, they tend to be much less popular of late. Most modern parsers (2000 and beyond) can be run in either validating or nonvalidating mode. Why run in nonvalidating mode when a parser is capable of validation? Because validation can significantly impact performance, especially when long and complex DTDs are involved. Some developers find that while enabling validation during development and test phases is crucial, it's sometimes beneficial to suppress validation in production systems where document throughput is most valued and the reliability of the data is already known. Consult the documentation of prospective parsers to determine how to toggle this switch, and which is the default mode. For example, the Apache Xerces parser is nonvalidating by default.

Some of the more highly regarded XML parsers include:

  • Apache XML Project's Xerces
  • IBM's XML Parser for Java (xml4j)
  • JavaSoft's XML Parser
  • MSXML 4.0 Release: Microsoft XML Core Services component (aka MSXML Parser) and SDK
  • Oracle's XML Parser
  • ElCel Technology's XML Validator

URLs for these parsers and many more can be found on the XML Parsers/Processors list at XMLSoftware.com, http://www.xmlsoftware.com/parsers/.

Event-Based vs. Tree-Based Parsing

We will cover tree-based and event-based parsing in some depth when we cover SAX and DOM in chapters 7 and 8, respectively. For now, an overview should be sufficient.

Event-Based Parsing

Event-based parsers (SAX) provide a data-centric view of XML. When an element is encountered, the idea is to process it and then forget about it. The event-based parser returns the element, its list of attributes, and the content. This is more efficient for many types of applications, especially searches. It requires less code and less memory since there is no need to build a large tree in memory as you are scanning for a particular element, attribute, and/or content sequence in an XML document.
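As a rough illustration of the event-based style, the sketch below uses Python's standard xml.sax module to collect the text of every title element in a small, made-up catalog document. The TitleFinder class and the catalog markup are assumptions for this example. The handler keeps only the data it is searching for; no tree of the document is ever built.

```python
import xml.sax


class TitleFinder(xml.sax.ContentHandler):
    """Collect the text content of each <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # a list while inside <title>, else None

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


catalog = (
    b"<catalog>"
    b"<book id='1'><title>XML Basics</title></book>"
    b"<book id='2'><title>SAX and DOM</title></book>"
    b"</catalog>"
)

handler = TitleFinder()
xml.sax.parseString(catalog, handler)
print(handler.titles)
```

Each element is handled as it streams past and then forgotten, so memory use depends on what the handler chooses to keep rather than on the size of the document.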

Tree-Based Parsing

On the other hand, tree-based parsers (DOM) provide a document-centric view of XML. In tree-based parsing, an in-memory tree is created for the entire document, which is extremely memory-intensive for large documents. All elements and attributes are available at once, but not until the entire document has been parsed. This technique is useful if you need to navigate around the document and perhaps change various document chunks, which is precisely why it is useful for the Document Object Model (DOM), the aim of which is to manipulate documents via scripting languages or Java.
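For comparison, here is a minimal tree-based sketch using Python's standard xml.dom.minidom module, a lightweight DOM implementation. The catalog markup and the status attribute are invented for this example. The whole document is parsed into memory first, after which any node can be reached and changed in any order.

```python
from xml.dom import minidom

# Hypothetical catalog document; the whole thing is loaded into a tree.
doc = minidom.parseString(
    "<catalog>"
    "<book id='1'><title>XML Basics</title></book>"
    "<book id='2'><title>SAX and DOM</title></book>"
    "</catalog>"
)

books = doc.getElementsByTagName("book")

# Jump straight to the second book and annotate it.
second = books[1]
second.setAttribute("status", "reviewed")

# Then go back to the first book and change the text of its title.
title_node = books[0].getElementsByTagName("title")[0]
title_node.firstChild.data = "XML Basics, 2nd Edition"

# Serialize the modified tree back to markup.
result = doc.documentElement.toxml()
print(result)
```

Moving backward to an earlier element after visiting a later one is trivial here, which is exactly what an event-based parser cannot do without re-reading the document.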

David Megginson, the main force behind SAX (Simple API for XML), contrasts these two approaches in "Events vs. Trees" on the SAX site (http://www.saxproject.org/?selected=event). The W3C presents its viewpoint in an item from the DOM FAQ, "What is the relationship between the DOM and SAX?" (http://www.w3.org/DOM/faq#SAXandDOM).

XML Family of Specifications: A Practical Guide