ACADEMIC COMPUTING and COMMUNICATIONS CENTER

XML and the Future of the Web

That's right, a new acronym: XML, eXtensible Markup Language. Sure, buzzwords come and buzzwords go, but XML is going to be with us for quite a while. It's more than a replacement for HTML; it will enable structured information exchange and interactivity. Are you at all interested in how the Web is going to evolve? XML has reached critical mass.

Warning: High Risk of Acronym Overload!

But don't panic yet; most of the acronyms and initialisms in this article are discussed in the articles that follow. Or check this newsletter out online; all the blue words are links to further info there.

About these Articles

The articles about XML and its related protocols were written by Bob Goldstein, who, in one of his many guises, is head of the ACCC Web staff. Bob's an old hand at SGML (see the section "SGML and the Solution") and he's really enthusiastic about XML. Any questions or comments? You can reach Bob at bobg@uic.edu.

Ancient History: Plain Text

Our purpose in this article is to answer the question: What is XML? But let's start with another question: What is text and how do we use it?

Does anyone remember Gopher? A Gopher page was plain text, pretty much like a page from a book, except it used only one font. No bold, no italic, no larger or smaller type, no wrapping to fit your window, no hyperlinks, no images. Everything was displayed exactly as if it had been typed on a typewriter. You recognized the title because it was at the top of the page. Maybe there was something that looked like a name near the title; from its position and the fact it looked like a name, you could guess that was the page's author.

The point is that only the human brain could extract information from a Gopher text page. A computer couldn't do it; it would have no standard way to break the page into its constituent parts.

Not only is this lack of structure bad for computer applications and databases, it's not all that good for humans. You couldn't change the display to fit your window size, or change the default font, or find that one specific stock price to display on your Palm Pilot.

Although modern Web pages look a lot better than Gopher pages did, Web browsers still can't adapt their display based on content. This means no real interactivity, no e-commerce, no business-to-business transactions, at least not without different custom applications for each case.

Recent History and the Web

HTML was a good start toward organizing the information on a page. HTML introduced the concept of markup as a way to designate which parts of the text were which. HTML has <h1> for a big heading, <h2> for a smaller heading, <p> for paragraph, <b> for boldface, <font> to change fonts, <br> for a line break, <tr> for a row in a table, and so on. And two crucial tags: <a> for links and <img> for images. These tags give the Web its hypertext navigation and its graphical look. Finally, a browser had a fighting chance at adaptive display.

As good as HTML is, it has two significant drawbacks. The first is that HTML is not easily extensible. Mathematicians can't add a <polynomial> tag, chemists can't add <benzene>, and stockbrokers can't add <stockprice company="cisco">376</stockprice>. New tags had to be pre-understood by the browser, so authors were stymied. Of course, given the competition between Netscape and Microsoft, new tags were added to each browser's HTML in incompatible ways. These new tags only made life worse, because now authors could write HTML that would work well on one browser but poorly on the other. And it is still impossible to add enough tags to suit everyone without making both the tag set and the browsers much too large and cumbersome.

The second drawback of HTML is the nature of the tags themselves, coupled with the desire of advertisers-cum-Web-authors to control the look and feel of their Web pages in every detail. That is, the tags have become largely layout-oriented, rather than meaning-oriented. If a designer decided that <h1> was usually rendered in too large a font, the designer would use some combination of <font>, <b>, and <center> instead. While this means that the browser will display the page more to its designer's liking, it also means there is no <h1> tag that can be interpreted as an "important heading" for search and retrieval. Designers also are too often tempted to control the width and placement of text, not allowing the end user to make use of wider windows or higher-resolution screens.
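
To make the value of meaning-oriented tags concrete, here is a minimal sketch in Python of what a program could do if the stockbroker's <stockprice> tag were allowed. It borrows a generic XML parser from Python's standard library, which anticipates the XML discussion later in this article; the <quotes> wrapper, the "acme" entry, and the price_of function are assumptions invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document built around the article's <stockprice> example.
# The <quotes> wrapper and the "acme" entry are made up for this sketch.
DOC = """<quotes>
  <stockprice company="cisco">376</stockprice>
  <stockprice company="acme">12</stockprice>
</quotes>"""


def price_of(xml_text, company):
    """Return the price listed for one company, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # The tag name says what the text means, so no guessing by position.
    for node in root.iter("stockprice"):
        if node.get("company") == company:
            return float(node.text)
    return None


if __name__ == "__main__":
    print(price_of(DOC, "cisco"))
```

The same number wrapped in <font> and <b> would look fine on screen but would give a program nothing to search for.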

SGML and the Solution

What to do? The answer was clear to those in the field, even before the popularity of HTML was apparent. That is, there needs to be a separation between the tags in a document and the rules for dealing with those tags. Roughly speaking, a browser needs to download a set of rules for each Web page to use in rendering that particular page. This would allow authors to invent new tags because they could also specify the rules for dealing with those tags. It would also encourage designers to put styles, fonts, and layout into the rules, not the tags. This way, a given document could be rendered for different displays by changing the rules, not by changing the markup tags.

Fortunately for the Web's creators, SGML, Standard Generalized Markup Language, predated HTML; in fact, SGML inspired HTML. The answers to HTML's problems are also coming from SGML.

However similar the names, HTML and SGML are quite different. HTML is a set of markup tags (and more-or-less a set of rules for interpreting the tags). SGML, on the other hand, is a meta-language for defining general tag sets. In fact, the HTML tag set is now defined in SGML. There are any number of other instances of SGML tag sets. I wrote the one we use for the ACCC home page, for example, and I might write another one tomorrow.

SGML provides a standard way of saying which tags belong in a set, what attributes the tags have, and what arrangement of tags constitutes a valid document (one that strictly adheres to the grammar defined for the SGML tag set the document uses). It lets you specify syntax (how to produce a valid document), but it does not deal with semantics (what the tags mean).
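
The separation of tags from rules can be sketched in a few lines of Python. This is only an illustration of the idea, not how any browser or style-sheet language actually works: the <article>, <title>, and <para> tags, the two rule sets, and the render function are all hypothetical names. The document stays the same; only the rules change.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: the tags say what each part is, not how it looks.
DOC = "<article><title>XML News</title><para>Tags name the parts.</para></article>"

# Two hypothetical rule sets. Each maps a tag name to a rendering function.
WIDE_SCREEN = {
    "title": lambda text: text.upper() + "\n" + "=" * len(text),
    "para": lambda text: text,
}
SMALL_SCREEN = {
    "title": lambda text: "* " + text,
    "para": lambda text: text[:9] + "...",  # truncate for a tiny display
}


def render(xml_text, rules):
    """Render each child of the root element using the supplied rules."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root:
        # Tags with no rule fall back to showing their text unchanged.
        rule = rules.get(node.tag, lambda text: text)
        lines.append(rule(node.text or ""))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(DOC, WIDE_SCREEN))
    print(render(DOC, SMALL_SCREEN))
```

An author who invents a new tag only has to supply a rule for it; the document itself never mentions fonts or widths.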

Enter XML

The Web world could have just accepted SGML, but it didn't, because SGML is complicated in ways that many thought were unnecessary. Instead, the W3C (World Wide Web Consortium), the Web's standardizing body, convened a committee to simplify SGML. That committee gave birth to XML in 1998.

XML, like SGML, is a language for specifying tag sets. It lacks some of the SGML bells and whistles that were off-putting to the Web community. XML is being embraced not only by Netscape and Microsoft, but also by a host of other companies, for all sorts of applications that may never involve a Web browser.

One encouraging aspect of current XML activity is the development of related standards. XML is not the solution by itself; it's only a very good start. XML provides a way to mark parts of a document with arbitrary tags, so that a generic XML parser can identify these parts. It does not, by itself, say what these parts mean or what to do with them.

For that, we need other standards: style sheets (rules that tell browsers how to render the text), such as XSL and CSS, and rules that specify links between documents, such as XLink. XML extensions are also needed, such as XML namespaces, which allow the merging of different XML tag sets in the same document, or XML schemas, which allow authors to specify exactly what constitutes a valid document. And then there's DOM and SAX, which are programming interfaces for reading documents: DOM presents a document as a tree held in memory, and SAX presents it as a stream of events. There are specialized tag sets such as MathML, MatML, and ChemML, not to mention the crucial XHTML -- a bridge that is both HTML and XML that you can use now. Of course, there are new protocols that use XML, such as RSS and ebXML.
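
As a rough illustration of what a generic parser identifying the parts of a document looks like in practice, the sketch below reads the same small document through both a DOM and a SAX interface, using the versions that ship in Python's standard library. The <prices> wrapper and the function and class names are assumptions made up for this example. Note that neither interface knows what a <stockprice> means; each only reports that the element is there.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical document; the <prices> wrapper is invented for this sketch.
DOC = '<prices><stockprice company="cisco">376</stockprice></prices>'


def tags_via_dom(xml_text):
    """DOM style: load the whole document as a tree, then walk the tree."""
    dom = xml.dom.minidom.parseString(xml_text)
    names = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            names.append(node.tagName)
        for child in node.childNodes:
            walk(child)

    walk(dom.documentElement)
    return names


class TagCollector(xml.sax.ContentHandler):
    """SAX style: the parser calls back as it meets each start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def tags_via_sax(xml_text):
    handler = TagCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


if __name__ == "__main__":
    print(tags_via_dom(DOC))
    print(tags_via_sax(DOC))
```

Both functions find the same element names in the same order. The tree approach is convenient when a program needs to move around in the document, while the event approach avoids holding a large document in memory all at once.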

Netscape and XML

Mozilla is the open-source version of Netscape's Web browser. The upcoming, next-generation version of Mozilla uses XML in two interesting ways.

First, it supports the rendering of arbitrary XML documents, styled with CSS style sheets. This lets you go way beyond publishing customary HTML.

Second, it lets you customize the browser itself, with an XML configuration file written in XUL, eXtensible User interface Language. That's right, you can use XML to change the browser's "chrome", the parts of the browser normally outside of the page you are downloading. One could conceivably build new non-browser applications, with new chrome, based on the Mozilla Gecko rendering engine, without rewriting Gecko's internals.

And Getting Back to Plain Text

As complicated as some of these developments may become, one very simple and attractive characteristic of XML won't change -- an XML document is text. Marked text, to be sure, but readable with any word processor or editor. And this is a very good thing. Got old WordStar files? You've got a problem. Got old XML? You'll always be able to read it, even when you're using the latest Windows 2020 and the program that produced the XML is long gone. You won't need any special program, either; just whatever text editor is handy.

I've only scratched the surface in this article. And if all the acronyms I've given you are confusing, you're not the only one; many of them are still in development. I do know that XML and its cousins will change your cyberlife, and I think for the better.

The A3C Connection, July/Aug/Sept 2000
2000-10-13 connect@uic.edu