Java board (Digest)
From: rhine (有雨无风), Board: Java
Subject: Programming XML in Java[2]
Posted at: HIT Lilac BBS (Sunday, December 17, 2000, 15:10:46), on-site post
Programming XML in Java, Part 1
Customize the parser with org.xml.sax.DocumentHandler
Since the DocumentHandler interface is so central to processing XML with
SAX, it's worthwhile to understand what the methods in the interface
do. I'll cover the essential methods in this section, and skip those
that deal with more advanced topics. Remember, DocumentHandler is an
interface, so the methods I'm describing are methods that you will
implement to handle application-specific functionality whenever the
corresponding event occurs.
Document initialization and cleanup
For each document parsed, the SAX XML parser calls the DocumentHandler
interface methods startDocument() (called before processing begins)
and endDocument() (called after processing is complete). You can use
these methods to initialize your DocumentHandler to prepare it for
receiving events and to clean up or produce output after parsing is
complete. endDocument() is particularly interesting, since it's normally
called only if an input document has been successfully parsed. If the
Parser generates a fatal error, it typically aborts the event stream and
stops parsing, so you should not count on endDocument() being called.
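The initialize-then-report pattern described above might look like the following minimal sketch. It is not from the article: ElementCounter and its run() helper are hypothetical names, and the JDK's built-in JAXP parser (obtained through SAXParserFactory) stands in for the IBM parser the article uses.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

/** Hypothetical handler: counts elements between startDocument() and endDocument(). */
@SuppressWarnings("deprecation") // SAX1 classes are deprecated but still ship with the JDK
class ElementCounter extends HandlerBase {
    int count;        // number of start tags seen in the current document
    boolean finished; // set only when endDocument() is reached

    @Override
    public void startDocument() {
        // Reset state so one handler instance can be reused across documents.
        count = 0;
        finished = false;
    }

    @Override
    public void startElement(String name, AttributeList attrs) {
        count++;
    }

    @Override
    public void endDocument() {
        // Produce output only once the whole document has been seen.
        finished = true;
        System.out.println("Elements seen: " + count);
    }

    /** Assumed helper: parses an XML string and returns the handler for inspection. */
    static ElementCounter run(String xml) {
        ElementCounter handler = new ElementCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return handler;
    }

    public static void main(String[] args) {
        run("<POEM><TITLE>Roses are Red</TITLE><LINE>Violets are blue</LINE></POEM>");
    }
}
```

Resetting the counters in startDocument() rather than in the constructor is a design choice that lets the same handler object be handed to the parser more than once.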
Processing tags
The SAX parser calls startElement() whenever it encounters an open tag,
and endElement() whenever it encounters a close tag. These methods
often contain the code that does the majority of the work while
parsing an XML file. startElement()'s first argument is a string,
which is the tag name of the element encountered. The second argument is
an object of type AttributeList, an interface defined in package
org.xml.sax that provides sequential access to element attributes by
index and random access by name. (You've undoubtedly seen attributes before in HTML;
in the line <TABLE BORDER="1">, BORDER is an attribute whose value is
"1"). Since Listing 1 includes no attributes, they don't appear in Table
1. You'll see examples of attributes in the sample application later in
this article.
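As a rough illustration of the two access styles AttributeList offers, here is a small sketch. AttributeDump and run() are hypothetical names, the input is an XML-style version of the TABLE example above, and the JDK's built-in JAXP parser is assumed in place of the article's IBM parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

/** Hypothetical handler: records attributes by index and looks one up by name. */
@SuppressWarnings("deprecation") // SAX1 classes are deprecated but still ship with the JDK
class AttributeDump extends HandlerBase {
    final Map<String, String> seen = new LinkedHashMap<>(); // every attribute encountered
    String border;                                          // value of BORDER on <TABLE>, if any

    @Override
    public void startElement(String name, AttributeList attrs) {
        // Sequential access: walk the list by index.
        for (int i = 0; i < attrs.getLength(); i++) {
            seen.put(attrs.getName(i), attrs.getValue(i));
        }
        // Random access: look a value up by attribute name (null if absent).
        if (name.equals("TABLE")) {
            border = attrs.getValue("BORDER");
        }
    }

    /** Assumed helper: parses an XML string and returns the handler for inspection. */
    static AttributeDump run(String xml) {
        AttributeDump handler = new AttributeDump();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return handler;
    }

    public static void main(String[] args) {
        AttributeDump h = run("<TABLE BORDER=\"1\" WIDTH=\"80%\"/>");
        System.out.println(h.seen + ", border=" + h.border);
    }
}
```

The AttributeList is only guaranteed to be valid during the startElement() call, so the sketch copies the values it wants to keep.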
Since SAX doesn't provide any information about the context of the
elements it encounters (that <AUTHOR> appears inside <POEM> in Listing 1
above, for example), it is up to you to supply that information.
Application programmers often use stacks in startElement() and
endElement(), pushing objects onto a stack when an element starts, and
popping them off of the stack when the element ends.
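The stack technique can be sketched as follows. This is an illustration under assumptions, not the article's code: PathTracker and run() are hypothetical names, and the JDK's built-in JAXP parser is used instead of the IBM parser.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

/** Hypothetical handler: rebuilds each element's context with a stack. */
@SuppressWarnings("deprecation") // SAX1 classes are deprecated but still ship with the JDK
class PathTracker extends HandlerBase {
    private final Deque<String> open = new ArrayDeque<>(); // open elements, innermost on top
    final List<String> paths = new ArrayList<>();          // one path per start tag

    @Override
    public void startElement(String name, AttributeList attrs) {
        open.push(name);          // element starts: push
        paths.add(currentPath());
    }

    @Override
    public void endElement(String name) {
        open.pop();               // element ends: pop
    }

    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        // descendingIterator() walks from the outermost element to the innermost.
        for (Iterator<String> it = open.descendingIterator(); it.hasNext();) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }

    /** Assumed helper: parses an XML string and returns the handler for inspection. */
    static PathTracker run(String xml) {
        PathTracker handler = new PathTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return handler;
    }

    public static void main(String[] args) {
        String xml = "<POEM><AUTHOR><LASTNAME>Doe</LASTNAME></AUTHOR><TITLE>t</TITLE></POEM>";
        run(xml).paths.forEach(System.out::println);
    }
}
```

With the stack in place, a handler can ask what the parent of the current element is (the entry just below the top) before deciding how to react.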
Processing blocks of text
The characters() method indicates character content in the XML
document -- characters that don't appear inside an XML tag, in other
words. This method's signature is a bit odd. The first argument is an
array of characters (char[]), the second is an index into that array
indicating the first character of the range to be processed, and the
third argument is the length of the character range.
It might seem that an easier API would have simply passed a String
object containing the data, but characters() was defined in this way for
efficiency reasons. The parser has no way of knowing whether or not
you're going to use the characters, so as the parser parses its input
buffer, it passes a reference to the buffer and the indices of the
string it is viewing, trusting that you will construct your own String
if you want one. It's a bit more work, but it lets you decide whether or
not to incur the overhead of String construction for content pieces
in an XML file.
The characters() method handles both regular text content and content
inside CDATA sections, which are used to prevent blocks of literal
text from being parsed by an XML parser.
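A minimal sketch of consuming characters() events follows. TextCollector and textOf() are hypothetical names, and the JDK's built-in JAXP parser is assumed rather than the article's IBM parser. Because a SAX parser is allowed to deliver one run of text in several characters() calls, the sketch appends each slice to a StringBuilder instead of treating a single call as the complete text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

/** Hypothetical handler: gathers all character content of a document. */
@SuppressWarnings("deprecation") // SAX1 classes are deprecated but still ship with the JDK
class TextCollector extends HandlerBase {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this event;
        // the rest of the array is the parser's own buffer.
        text.append(ch, start, length);
    }

    /** Assumed helper: returns the concatenated character content of an XML string. */
    static String textOf(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        // Regular text and CDATA content both arrive through characters().
        System.out.println(textOf("<TITLE>Roses <![CDATA[are <red>]]></TITLE>"));
    }
}
```

A handler that does not care about text simply never touches the array, which is the saving the char[]-based signature is designed to allow.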
Other methods
There are three other methods in the DocumentHandler interface:
ignorableWhitespace(), processingInstruction(), and
setDocumentLocator(). ignorableWhitespace() reports occurrences of white
space, and is usually unused in nonvalidating SAX parsers (such as
the one we're using for this article); processingInstruction() handles
most things within <? and ?> delimiters; and setDocumentLocator() is
optionally implemented by SAX parsers to give you access to the
locations of SAX events in the original input stream. You can read up on
these methods by following the links on the SAX interfaces in
Resources.
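For a feel of two of these methods, here is a small sketch, again using hypothetical names (InstructionLogger, run()) and the JDK's built-in JAXP parser instead of the IBM parser. It keeps the Locator the parser may supply and records each processing instruction it is told about; the "render" instruction in the sample input is made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;

/** Hypothetical handler: logs processing instructions and where they were seen. */
@SuppressWarnings("deprecation") // SAX1 classes are deprecated but still ship with the JDK
class InstructionLogger extends HandlerBase {
    private Locator locator;                      // stays null if the parser supplies none
    final List<String> log = new ArrayList<>();   // "target -> data" per instruction
    int lastLine = -1;                            // line of the last instruction, if known

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called (if at all) before any other event; keep the reference for later.
        this.locator = locator;
    }

    @Override
    public void processingInstruction(String target, String data) {
        log.add(target + " -> " + data);
        if (locator != null) {
            lastLine = locator.getLineNumber();
        }
    }

    /** Assumed helper: parses an XML string and returns the handler for inspection. */
    static InstructionLogger run(String xml) {
        InstructionLogger handler = new InstructionLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return handler;
    }

    public static void main(String[] args) {
        InstructionLogger h = run("<?xml version=\"1.0\"?>\n<POEM><?render style=\"bold\"?></POEM>");
        System.out.println(h.log + " (line " + h.lastLine + ")");
    }
}
```

The XML declaration at the top of the input is not reported as a processing instruction, which is one reason the text says "most things" within the delimiters.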
Implementing all of the methods in an interface can be tedious if you're
only interested in the behavior of one or two of them. The SAX
package includes a class called HandlerBase that basically does nothing,
but can help you take advantage of just one or two of these methods.
Let's examine this class in more detail.
HandlerBase: A do-nothing class
Often, you're only interested in implementing one or two methods in an
interface, and want the other methods to simply do nothing. The class
org.xml.sax.HandlerBase simplifies the implementation of the
DocumentHandler interface by implementing all of the interface's methods
with do-nothing bodies. Then, instead of implementing DocumentHandler,
you can subclass HandlerBase, and only override the methods that
interest you.
For example, say you wanted to write a program that just printed the
title of any XML-formatted poem (like the poem in Listing 1). You
could define a new DocumentHandler, like TitleFinder in Listing 2 below,
that subclasses HandlerBase and overrides only the methods you need.
(See Resources for an HTML file of TitleFinder.)
    import org.xml.sax.AttributeList;
    import org.xml.sax.HandlerBase;
    import org.xml.sax.InputSource;
    import org.xml.sax.Parser;
    import org.xml.sax.helpers.ParserFactory;

    /**
     * SAX DocumentHandler class that prints the contents of the "TITLE"
     * element of an input document.
     */
    public class TitleFinder extends HandlerBase {
        // True while the parser is inside a <TITLE> element.
        private boolean _isTitle = false;

        public TitleFinder() {
            super();
        }

        /**
         * Print any text found inside a <TITLE> element.
         */
        public void characters(char[] chars, int iStart, int iLen) {
            if (_isTitle) {
                String sTitle = new String(chars, iStart, iLen);
                System.out.println("Title: " + sTitle);
            }
        }

        /**
         * Mark title element end.
         */
        public void endElement(String element) {
            if (element.equals("TITLE")) {
                _isTitle = false;
            }
        }

        /**
         * Find contents of titles in the document named by args[0].
         */
        public static void main(String args[]) {
            TitleFinder titleFinder = new TitleFinder();
            try {
                Parser parser = ParserFactory.makeParser("com.ibm.xml.parsers.SAXParser");
                parser.setDocumentHandler(titleFinder);
                parser.parse(new InputSource(args[0]));
            } catch (Exception ex) {
                // Report the failure instead of silently swallowing it.
                ex.printStackTrace();
            }
        }

        /**
         * Mark title element start.
         */
        public void startElement(String element, AttributeList attrlist) {
            if (element.equals("TITLE")) {
                _isTitle = true;
            }
        }
    }
Listing 2. TitleFinder: A DocumentHandler derived from HandlerBase
that prints TITLEs
This class's operation is very simple. The characters() method prints
character content if it's inside a <TITLE>. The boolean field
_isTitle keeps track of whether the parser is in the process of
parsing a <TITLE>. The startElement() method sets _isTitle to true
when a <TITLE> is encountered, and endElement() sets it to false when
</TITLE> is encountered.
To extract <TITLE> content from <POEM> XML, simply create a Parser
(I'll show you how to do this in the sample code below), call the
Parser's setDocumentHandler() method with an instance of TitleFinder,
and tell the Parser to parse XML. The handler will print any text the
parser reports inside a <TITLE> element.
The TitleFinder class only overrides three methods: characters(),
startElement(), and endElement(). The other methods of the
DocumentHandler are implemented by the HandlerBase superclass, and those
methods do precisely nothing -- just what you would have done if
you'd implemented the interface yourself. A convenience class like
HandlerBase isn't necessary, but it simplifies the writing of handlers
because you don't need to spend a lot of time writing idle methods.
As an aside, sometimes in Sun documentation you'll see javadocs with
method descriptions like "deny knowledge of child nodes." Such a
description has nothing to do with paternity suits or Mission:
Impossible; instead, it is a dead giveaway that you're looking at a
do-nothing convenience class. Such classes often have the words Base,
Support, or Adapter in their names.
TitleFinder does the job, but still isn't quite
smart enough. It doesn't limit you to a <TITLE> element inside a
<POEM>; it would print the titles of HTML files, too, for example. And
any tags inside a <TITLE>, such as <B> tags for bolding, would be lost.
Since SAX is a simplified interface, it's left up to the application
developer to handle things like tag context.
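One way an application might handle that context itself is sketched below. This is not the article's code: PoemTitleFinder and run() are hypothetical names, and the JDK's built-in JAXP parser is assumed in place of the IBM parser. The handler accepts a <TITLE> only when its parent is <POEM>, and it keeps collecting text while nested elements such as <B> are open.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

/** Hypothetical handler: collects the text of <TITLE> elements directly inside <POEM>. */
@SuppressWarnings("deprecation") // SAX1 classes are deprecated but still ship with the JDK
class PoemTitleFinder extends HandlerBase {
    private final Deque<String> open = new ArrayDeque<>(); // open elements, innermost on top
    private final StringBuilder title = new StringBuilder();
    private int titleDepth = 0; // > 0 while inside an accepted <TITLE>, counting nested tags
    final List<String> titles = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList attrs) {
        if (titleDepth > 0) {
            titleDepth++;                 // a tag nested inside the title, e.g. <B>
        } else if (name.equals("TITLE") && "POEM".equals(open.peek())) {
            titleDepth = 1;               // accept only <TITLE> whose parent is <POEM>
            title.setLength(0);
        }
        open.push(name);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (titleDepth > 0) {
            title.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String name) {
        open.pop();
        if (titleDepth > 0 && --titleDepth == 0) {
            titles.add(title.toString()); // the accepted <TITLE> just closed
        }
    }

    /** Assumed helper: parses an XML string and returns the titles found. */
    static List<String> run(String xml) {
        PoemTitleFinder handler = new PoemTitleFinder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(run("<POEM><TITLE>Roses <B>are</B> Red</TITLE></POEM>"));
    }
}
```

The sketch keeps the text inside nested tags but still discards the tags themselves; preserving the markup would mean writing the tag names back out in startElement() and endElement().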
Now you've seen a useless, simple example of SAX. Let's get into
something more functional and interesting: an XML language for
specifying AWT menus.
Resources
"XML for the Absolute Beginner," Mark Johnson (JavaWorld, April 1999):
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml.html
David Megginson, creator of SAX, has an excellent SAX site:
http://www.megginson.com/SAX/index.html
"Portable Data/Portable Code: XML & Java Technologies," JP Morgenthal --
Sun whitepaper on the combination of XML and Java:
http://java.sun.com/xml/ncfocus.html
"XML and Java: A Potent Partnership, Part 1," Todd Sundsted (JavaWorld,
June 1999) gives an example of how XML and SAX can be useful for
enterprise application integration:
http://www.javaworld.com/javaworld/jw-06-1999/jw-06-howto.html
"Why XML is Meant for Java," Matt Fuchs (WebTechniques, June 1999) is an
excellent article on XML and Java:
http://www.webtechniques.com/archives/1999/06/fuchs/
Download the source files for this article in one of the following
formats:
In jar format (with class and java files):
http://www.javaworld.com/javaworld/jw-03-2000/xmlsax/SAXMar2000.jar
In tgz format (gzipped tar):
http://www.javaworld.com/javaworld/jw-03-2000/xmlsax/SAXMar2000.tgz
In zip format:
http://www.javaworld.com/javaworld/jw-03-2000/xmlsax/SAXMar2000.zip
--
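The sea takes in a hundred rivers;
its capacity is what makes it great.
The cliff stands a thousand feet tall;
free of desire, it stands firm.
※ Source: HIT Lilac BBS bbs.hit.edu.cn [FROM: dip.hit.edu.cn]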