The two perspectives on XML: data and documents

I have been working with XML since it was a glimmer in the eye of Jon Bosak. Before XML was conceived, there was SGML; XML streamlined SGML for the web, but at its core there was not much functional difference. Indeed, the new invention was defined as a mere subset of SGML. The key concept of semantic markup is central to the value of SGML as well as its “streamlined for mass consumption” child.

The two main perspectives I have seen are document-centric and data-centric. SGML initially appeared in support of document-centric work: managing all of the technical documents or contracts of IBM or Boeing, for example. Charles Goldfarb has maintained that “SGML literally makes the infrastructure of modern society possible” and I think he’s right – hmm, should we blame him for the lengths to which humans have gone to destroy the earth?

The gentle document-centric world

The document-centric world is really a direct continuation of SGML. When XML came out as a standard in 1998, those of us working with document-centric use cases became giddy with excitement, anticipating that the standards being proposed at the time (notably XML itself, XLink, XML Schema, RDF, XSL and precursors to SVG) would finally facilitate tools that made publishing work for organizations that weren’t quite as big as IBM or the Department of Defense. The vision of a semantic web and ubiquitous multi-channel publishing seemed to be gaining a foundation: the theories were reaching critical mass, with the apparent support of software companies. It appeared these vendors might actually adopt the standards of the committees they were sitting on. “Throw away Xyvision!” I told my boss at Bertelsmann, “this XSL-FO will completely revolutionize database publishing!”

We were sorely disappointed over the next five years. In the years before 1998, W3C standards seemed magical; concepts from the standards were implemented relatively quickly, not perfectly but with steady progress: browser updates would reflect CSS and HTML advances; even Microsoft was shamed into some level of compliance. But the monopolistic tendencies of some on the standards committees, coupled with the academic approach of others, made it less and less likely that a given standard would enjoy a functional implementation.

Data-centric newbies crash the party

And there was that other perspective – the data-centric side of things. For many reasons, XML was at the right place at the right time in terms of data management and information exchange. In fact, in the very year that it became a standard, it also became the dominant way that machines (servers) talked to each other around the world. It was highly convenient for exchanging information: firewalls tended to block anything but text over HTTP, semantic markup allowed any sort of data structure to be specified, and validation tools could ensure that no information was lost.

In 1998, when you asked a programming candidate “what do you know about XML?” only the document-centric people would know anything. By 2000, everyone doing any serious programming “knew” about the acronym. Trouble was, they typically knew about it only in the much easier-to-use, barely-relevant-to-publishing sense.
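To make the difference between the two senses concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`order`, `para`, `part`, `torque`) and both sample fragments are invented for illustration; they come from no real schema. The data-centric record flattens into a dictionary in one line, while the same treatment applied to a document-centric paragraph silently drops the prose between the inline elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-centric record: every leaf element holds exactly one value.
ORDER_XML = "<order><id>1042</id><customer>Acme</customer><total>19.95</total></order>"

# Hypothetical document-centric paragraph: text and inline elements
# interleave (so-called mixed content).
PARA_XML = (
    "<para>Tighten the <part>flange bolt</part> to "
    "<torque>40 Nm</torque> before use.</para>"
)


def record_to_dict(xml_text):
    """Flatten the children of the root element into a tag -> text mapping."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


def mixed_content_runs(xml_text):
    """Return (tag, text) runs in document order; tag is None for bare prose."""
    root = ET.fromstring(xml_text)
    runs = []
    if root.text:
        runs.append((None, root.text))
    for child in root:
        runs.append((child.tag, child.text or ""))
        # ElementTree stores the prose that follows an element in its .tail
        if child.tail:
            runs.append((None, child.tail))
    return runs


order = record_to_dict(ORDER_XML)          # complete: nothing is lost
para_as_record = record_to_dict(PARA_XML)  # lossy: the surrounding prose vanishes
runs = mixed_content_runs(PARA_XML)        # preserves prose and markup in order

print(order)
print(para_as_record)
print(runs)
```

The second function has to track both `.text` and `.tail` just to walk one short sentence without losing words. That is a small taste of why tools built around the record view tend to be a poor fit for authoring and publishing.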

And the standards now had to accommodate two crowds. The work of the W3C XML Schema Working Group, in particular, showed the disconnect. Should a schema be easily human readable? What was the primary purpose of a schema? Goals were not shared by the document- and data-centric sides, and the data-centric side won out, as it has tended to dominate the standards space around structure ever since. RELAX NG came about as an alternative, and if you compare RELAX NG with W3C XML Schema, you will see the contrast between the power of a few brilliant individuals aligned in purity of purpose and the impotence of a committee with questionable motives and conflicting goals. Concurrent with a decline in the altruism of committee participants was the huge advance of data-centric XML, and the disproportionate representation of that perspective.

XML tooling solves mainly the trivial data-centric challenges

Ten years later, we find in the document-centric world that toolsets related to XML in a data sense (parsing, transforming, exchanging information) have made great leaps forward, but we are in many ways still stuck in the 1990s in terms of core authoring and publishing technologies. It is telling that descendants of the three great SGML authoring tools as of 1995 (FrameMaker+SGML, Arbortext Epic, and SoftQuad’s Author/Editor) are, lo and behold, the three leading XML authoring tools in 2009.

There have been some slow-paced advances in document-centric XML standards and tool chains as well, especially the single bright light out there for us, the Darwin Information Typing Architecture (DITA), which came out of IBM, much like SGML’s precursor, GML. Yet standards for rendition, XSL-FO and SVG especially, have not advanced along with core proprietary rendition technologies such as InDesign, Flash, or Silverlight, though all of these enjoy nicely copied underpinnings pillaged from the standards. More important, nothing has stepped in to replace the three core authoring tools: the “XML support” of Microsoft Word and Adobe InDesign, for example, does not approach the capabilities of a true structured authoring application. There is a proliferation of XML “editors”, but most of the new ones are appropriate for editing a WSDL file or a SOAP message (data-centric forms of XML), not a full-fledged document.

Meanwhile, on the data-centric front, XML has simply permeated every aspect of computing. There are XML data types in database systems, XML features in most programming languages, XML configuration files at the heart of most applications, and XML-based Web Services available in countless flavors. With the advent of JSON at the turn of the 21st century, the torch was passed on to an even more streamlined and “web-convenient” approach for managing semantic content. And while JSON is finding its way into ever-richer content, it is used first and foremost in a data-centric way.

Document-centric XML is simply a deep challenge that will take more time (and probably more of a commercial incentive) to tackle. For the time being, structured authoring managed the XML way is still implemented mainly by very large organizations: such an approach has “trickled down” from organizations the size of IBM to organizations the size of Adobe (which does, in fact, use DITA now), but there are no tool chains yet available that will bring it down much further. The consequences of the failure of the W3C XML Schema Working Group to provide a functional specification supporting document-centric XML can hardly be overestimated.

As long as content is not easily authored in a semantically rich, structured fashion, the vision of the semantic web will remain an illusion. Should document-centric XML get more attention from standards bodies and software vendors, human communications might become far more efficient and effective. Yet the challenges are substantial and the short-term gain is not so obvious. It appears that semantic depth will not commonly be available in such a controlled and intentional fashion, but instead will be deduced, after the fact, through analysis of “unstructured” and pseudo-structured content.
