XML and Java: Definitions
November 16, 1998
To understand the relative pros and cons of the diverse
Java XML software discussed in Parts 2 and 3 of this article,
there are several terms which must be clarified.
APIs
API stands for Application Programming Interface. An API
describes how a programmer must use software written by others.
In Java, an API specifies the class name and usually its
superclass (parent), the methods (functions), their return types,
and the arguments (parameters) to the methods. In some languages,
this is referred to as the signature of a function. The following
API example is the startElement method of SAX:
public void startElement(String name,
AttributeList attributes) throws SAXException
In Java, APIs are described using javadoc. For example, see the
JDK 1.1 API Documentation or the JDK 1.2 API Documentation.
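To make the parts of an API concrete, here is a minimal sketch of a javadoc-commented class. The names Markup, ElementInfo, and summary are hypothetical and exist only for this illustration; they are not part of SAX, the DOM, or the JDK.

```java
/** Hypothetical base class, standing in for "the superclass (parent)". */
class Markup {
    /** @return a short label for this kind of markup */
    String kind() {
        return "markup";
    }
}

/**
 * Hypothetical description of an XML element. The class name, its
 * superclass, and the methods below together make up its API.
 */
class ElementInfo extends Markup {
    private final String name;
    private final int attributeCount;

    /**
     * Creates a description of an element.
     *
     * @param name           the element (tag) name
     * @param attributeCount how many attributes the element carries
     */
    ElementInfo(String name, int attributeCount) {
        this.name = name;
        this.attributeCount = attributeCount;
    }

    /**
     * Builds a one-line summary of the element.
     *
     * @return the name followed by the attribute count in brackets
     */
    String summary() {
        return name + "[" + attributeCount + "]";
    }

    @Override
    String kind() {
        return "element";
    }
}
```

Running the javadoc tool over such a file produces HTML pages that list each method with its parameters and return type, much like the JDK documentation.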
Document Object Model
The DOM, an October 1, 1998 W3C Recommendation,
specifies a standard tree-based API for both XML and
HTML documents. The DOM provides "a platform- and
language-neutral interface that allows programs and
scripts to dynamically access and update the content,
structure and style of documents."
To quote from our own WDVL DOM page, "The goal of the DOM specification
is to define a programmatic interface for XML and HTML. It
defines the logical structure of documents and
the way a document is accessed and manipulated. This
specification defines the foundation of a platform- and
language-neutral interface to access and update
dynamically a document's content, structure, and style.
Programmers can build documents, navigate their structure,
and add, modify, or delete elements and content. Anything
found in an HTML or XML document can be accessed, changed,
deleted, or added using the Document Object Model, with a
few exceptions."
Since early 1998, a number of APIs and tools have emerged
that support the DOM Recommendation or its earlier Working
Drafts. The Java API to the DOM, called the
Java Language Binding, describes the Java DOM interface
which we will examine in Part 2.
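As a rough sketch of what building, modifying, and deleting content through the DOM looks like in Java, the following uses the org.w3c.dom interfaces. It obtains an empty document through JAXP's DocumentBuilderFactory, which ships with current JDKs but postdates this article, so treat that part as an assumption. The element names catalog, book, and draft are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomBuildSketch {
    /** Builds a small document in memory, then modifies and prunes it. */
    static Document build() {
        DocumentBuilder builder;
        try {
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Document doc = builder.newDocument();

        // Build: a root element with two children.
        Element catalog = doc.createElement("catalog");
        doc.appendChild(catalog);
        Element book = doc.createElement("book");
        book.setAttribute("id", "b1");
        book.appendChild(doc.createTextNode("XML Basics"));
        catalog.appendChild(book);
        Element draft = doc.createElement("draft");
        catalog.appendChild(draft);

        // Modify: change an attribute value on an existing element.
        book.setAttribute("id", "b2");

        // Delete: remove an element from the tree.
        catalog.removeChild(draft);
        return doc;
    }
}
```

After build() returns, the tree holds a single catalog element containing one book element whose id attribute is "b2".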
Parsing
Parsing is the process of splitting up a stream of information
into its constituent pieces (often called tokens). In the
context of XML, parsing refers to scanning an XML document
(which need not be a physical file -- it can be a data stream)
in order to split it into its various elements (tags)
and their attributes. XML parsing reveals the structure
of the information since the nesting of elements implies a
hierarchy. It is possible for an XML document to fail to parse
completely if it does not follow the
well-formedness rules described in the
XML 1.0 Recommendation.
A successfully parsed XML document is, at a minimum, well-formed;
it may additionally be valid.
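A well-formedness check can be sketched in a few lines. This example assumes the JAXP SAXParserFactory and the SAX DefaultHandler class found in current JDKs, neither of which existed when this article was written; the class and method names WellFormedCheck and isWellFormed are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            // The document is a data stream here, not a physical file.
            InputSource source = new InputSource(new StringReader(xml));
            // DefaultHandler ignores all events; only the parse outcome matters.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(source, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // A well-formedness violation is a fatal error and ends the parse.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For instance, isWellFormed("<a><b/></a>") succeeds, while isWellFormed("<a><b></a>") fails because the b element is never closed.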
Non-validating Parser
A non-validating parser is the minimal case. The parser does not
check a document against any DTD (Document Type Definition); it
only checks that the document is well-formed (that it is properly
marked up according to XML syntax rules). Because a non-validating
parser is typically smaller than a validating one, it may
be more appropriate for use in a Java applet.
Validating Parser
In addition to checking well-formedness, a validating
parser verifies that the document conforms to a specific DTD
(either internal or external to the XML file being parsed).
Although a validating parser is generally larger than a
non-validating one, its rigor is necessary in cases where the
structural integrity of the XML data is important, such as in
database and eCommerce applications. It is likely that web
browsers will need to include validating parsers.
Note that for an XML document to be valid, it must either
contain or refer to a DTD. Authors of XML documents will
provide DTDs in situations where a group (company or industry)
wants to standardize on a particular set of elements.
A DTD is also necessary to supply default values for attributes
and to declare unparsed (binary) entities (NDATA).
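The difference between well-formed and valid can be sketched as follows. This example assumes the JAXP SAXParserFactory from current JDKs (not available when this article was written) with validation switched on; the class name ValidationSketch and the note/to/from element names are hypothetical. Validity problems are reported to the handler's error callback rather than ending the parse, so the sketch records them in a flag.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationSketch {
    /** Returns true only if the text is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        final boolean[] valid = { true };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // request a validating parser
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void error(SAXParseException e) {
                            // Validity violations arrive here as recoverable errors.
                            valid[0] = false;
                        }
                    });
        } catch (SAXException e) {
            return false; // not even well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }
}
```

A document such as `<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]><note><to>Ann</to></note>` passes, while the same DTD with `<note><from>Bob</from></note>` is well-formed but not valid, since from is not declared.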
Event-based Parsing (e.g., SAX)
Event-based parsers provide a data-centric view of XML.
When the parser encounters an element, it reports the element,
its list of attributes, and its content to the application, which
processes them and then forgets about them. This is more efficient
for many types of applications, especially searches. It typically
requires less code and less memory since there is no need to build a
large tree in memory while scanning for a particular
element, attribute, and/or content sequence in an
XML document.
In What is an Event-Based Interface?,
David Megginson, the SAX proponent, wrote:
"An event-based API.... reports parsing events
(such as the start and end of elements) directly to the
application through callbacks, and does not usually build
an internal tree. The application implements handlers to deal
with the different events, much like handling events in a
graphical user interface....[A]n event-based API provides a
simpler, lower-level access to an XML document:
you can parse documents much larger than your available
system memory, and you can construct your own data structures
using your callback event handlers."
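A callback handler of this kind can be sketched as below. The example assumes the SAX 2 DefaultHandler class and the JAXP SAXParserFactory found in current JDKs; both postdate this article, and the SAX 2 startElement callback takes more arguments than the SAX 1 signature shown earlier. The class name SaxCountSketch and the item element name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxCountSketch {
    /** Counts start tags with the given name without building a tree. */
    static int countElements(String xml, String wanted) {
        final int[] count = { 0 };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per start tag; nothing is retained afterwards.
                if (qName.equals(wanted)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }
}
```

Because the handler keeps only a counter, memory use stays roughly constant no matter how many item elements the document contains.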
Tree-based Parsing (e.g., DOM)
On the other hand, tree-based parsers provide a
document-centric view of XML. In tree-based parsing,
an in-memory tree is created for the entire document
(which can be extremely memory-intensive for large documents).
All elements and attributes are available at once, but not until
the entire document has been parsed. This technique is useful if
you need to navigate around the document and perhaps change various
document chunks, which is precisely why it suits the
Document Object Model (DOM), the aim of which is to manipulate
documents via scripting languages or Java.
According to David Megginson,
"A tree-based API compiles an XML document into an
internal tree structure, then allows an application to
navigate that tree. The Document Object Model
(DOM) working group at the World-Wide Web consortium is
developing a standard tree-based API for XML and HTML
documents....Tree-based APIs are useful for a wide range
of applications, but they often put a great strain on system
resources, especially if the document is large (under very
controlled circumstances, it is possible to construct the
tree in a lazy fashion to avoid some of this problem).
Furthermore, some applications need to build their own,
different data trees, and it is very inefficient to build
a tree of parse nodes, only to map it onto a new tree."
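The navigate-and-change style of tree-based processing can be sketched as follows. The example assumes JAXP's DocumentBuilderFactory and the DOM Level 3 getTextContent and setTextContent methods available in current JDKs, all of which postdate this article. The names DomEditSketch, parse, and upperCaseText are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomEditSketch {
    /** Parses the whole text into an in-memory tree before returning. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Upper-cases the text of every element with the given tag name and
     * returns how many elements were changed. Assumes the matching
     * elements contain only text, not nested elements.
     */
    static int upperCaseText(Document doc, String tagName) {
        // The complete tree exists, so any part of it can be reached at will.
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            node.setTextContent(node.getTextContent().toUpperCase(Locale.ROOT));
        }
        return nodes.getLength();
    }
}
```

The cost of this convenience is that parse must hold the full document in memory before upperCaseText can touch the first element.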
Coming Next Month
Visit WDVL in mid-December for Part 2: "XML and Java: A
Perfect Pair: APIs" in which we will discuss a number of
Java APIs for XML: Sun's XML Library, W3C's DOM, Coins, DOM SDK,
SAXON, Koala XML Serialization, JPython, and XML Testbed.