Keywords: DAML, DTD, Dublin Core, RDF, Semantic Web, XML, XSLT
Biography
Dave is a researcher at the ILRT, University of Bristol and works on the Semantic Web Advanced Development Europe (SWAD-E) project. He has been working with metadata and the Dublin Core since 1995, RDF since 1998, is a member of the W3C RDF Core Working Group, and editor of the RDF/XML Syntax W3C Recommendation. Dave is the author of the Redland, Raptor and Rasqal RDF tools and maintainer of the RDF Resource Guide.
The Resource Description Framework (RDF) web metadata format has an XML syntax, RDF/XML, which has been described as ugly and flawed, mainly as a consequence of it being an early XML format dating from 1998. This presentation will describe the perceived and real problems and select appropriate modern XML and web best practices for improving RDF markup so that it can be better used with the latest XML technologies such as XSLT 2 and XQuery.
The presentation will distinguish a semantic web markup format from a format intended solely for software: the former is intended to be easier for end users to author and to be more clearly appropriate for typical application areas such as lightweight web metadata and authored web ontologies.
XML best practice in any area is a tricky subject to discuss and agree on, but the XML technologies considered include XML Namespaces and XML QNames in content, the omission of some darker corners of the XML specification, and the use of clear, user-friendly technologies such as the RELAX NG grammar-based XML schema language, part of the ISO DSDL work. The presentation will also discuss approaches starting from XHTML to generate semantic web data.
Introduction
Problems with RDF/XML
Recent RDF XML alternatives
Selecting XML for improved RDF markup
RXR (Regular XML RDF)
RDF and HTML/XHTML
Conclusions
Appendix A - XML Schemas
Bibliography
[RDF/XML Revised] is the W3C Recommendation that I edited for the W3C [RDF Core WG] 2001-2004. It re-defines the RDF/XML syntax designed in 1998 by the original RDF working group in terms of the XML Infoset (with XML Base). The original RDF/XML syntax was created with a variety of goals that somewhat clashed. This created a syntax that is often criticized as not meeting modern XML best practice. This paper discusses some of the issues that have been raised with the syntax, looks at other work in creating new XML syntaxes for RDF and describes a strawman new XML syntax for RDF graphs, RXR.
The goals for this work are to design a simple XML syntax that covers RDF in a straightforward fashion and that is quickly understandable. The intended users of this format are new authors, or those who have read the RDF documentation and want to write the triples down simply, for possible later XML-level processing.
Things that are out of scope for the syntax include RDF model extensions (contexts, quotation, literal subjects, nested graphs, named graphs) and complex triple structures in the RDF model (reification, collections). Making a simple XML format imposes restrictions that will be discussed in later sections.
The RDF Core working group looked at comments on the original RDF Model & Syntax document and the later work from the community and recorded these in the [RDF Core Issues List]. Not all of these issues could be addressed during the syntax revision without inventing a new syntax, which was out of scope for the working group. The major remaining problems were as follows:
xsi:type for specifying W3C XML Schema datatypes.
These cannot all be addressed while keeping the resulting XML format simple and suited to the intended use; some of the triple structures that could be generated are complex, and a typical user would not easily see at a glance how to write them in XML.
[RDF/XML Retrospective] described the history of the revision of RDF/XML and outlined some potential solutions to the various problems, for both users and machines, in XML and non-XML formats. This was not taken further for the XML formats, but it led to the development of a non-XML format, [Turtle], intended for quick writing of RDF, which is not discussed further here.
Carroll and Stickler in [TRiX] propose a new XML syntax TRiX based on a triples-level markup with the following form:
This beyond-RDF extension runs counter to the simple, unsurprising approach intended here, and the use of XSLT (especially XSLT2 and W3C XML Schemas) takes the XML tool requirements well beyond what could be called core XML.
A syntax is most likely to be user friendly if the terms used are minimal, consistent and appropriate. The syntax terms should correspond directly to RDF concepts such as triples and their parts (subject, predicate and object), so that the way the syntax is written clearly maps to the concepts. You should not need to understand what either an RDF schema language or an XML schema language is. An RDF schema language is a description of the vocabulary in the RDF graph; an XML schema language describes what an XML syntax looks like and how it is structured and constrained. They work at different levels.
The types of things that can be subjects, predicates and
objects in RDF are either RDF URI References, blank node identifiers
or literals. The literals can be datatyped (which can be XML
content) and may have a content language (xml:lang).
There are also some restrictions on which types can be used in the
subject, predicate and object fields. As far as is possible, these
constraints must be enforced by an appropriate XML schema language so
that if it is wanted, the user can use standard validation tools or
work from the schema. The XML syntax should be simple enough
that knowledge of any XML schema language is not required but
knowledge of such a language is beneficial.
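The position restrictions mentioned above can be made concrete with a short sketch. The Python below is illustrative only: the class names and the check_triple helper are assumptions of this example and are not part of RXR or of any RDF toolkit. It encodes the RDF rule that a subject is a URI reference or blank node, a predicate is a URI reference, and an object may be any of the three term types.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class URIRef:
    value: str


@dataclass(frozen=True)
class BlankNode:
    identifier: str


@dataclass(frozen=True)
class Literal:
    text: str
    datatype: Optional[str] = None   # datatype URI, if any
    language: Optional[str] = None   # xml:lang value, if any


Term = Union[URIRef, BlankNode, Literal]


def check_triple(subject: Term, predicate: Term, obj: Term) -> None:
    """Raise ValueError if a term is used in a position RDF does not allow."""
    if not isinstance(subject, (URIRef, BlankNode)):
        raise ValueError("subject must be a URI reference or a blank node")
    if not isinstance(predicate, URIRef):
        raise ValueError("predicate must be a URI reference")
    if not isinstance(obj, (URIRef, BlankNode, Literal)):
        raise ValueError("object must be a URI reference, blank node or literal")
```

An XML schema can enforce the same rules structurally, for example by not allowing literal content in the element used for subjects; the function above only shows which combinations such a schema would need to reject.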
The goals force restrictions on the complex XML detail that humans have problems constructing, in this case XML Literals in RDF, which use Exclusive XML Canonicalisation. If those are needed, RDF/XML provides that facility in a better form than can be given in a simple XML format. The XML format should also use the minimum of XML specifications and, in particular, stick to the ones most widely understood, used and deployed. Those include XML itself, possibly XML Namespaces, and some XML schema language, taking care to use the minimum possible complexity of the common schema languages: DTDs, W3C XML Schemas or RELAX NG. Another choice is to avoid some darker corners of the XML specifications such as processing instructions and entities, which show the SGML background of XML and are not seen in most modern XML designs.
XML namespaces are used in many modern XML formats, especially
in order to use W3C XML schema datatypes which use QNames to identify the
datatypes. QNames are thus good candidates for technology to use in
order to get familiar looking and modern XML. The W3C TAG has an issue
with QNames being used in places where they do not represent element
names. This is most tricky when they are used in attribute values in
content that does not have type support for QNames, as with DTDs. DTDs
are therefore not so appropriate for use here: although they do handle
namespaces, they cannot do the full type checking. The issue
for RDF and QNames is that RDF has historically used them
differently from how they are used as identifiers in XML schema
datatypes: RDF concatenates the namespace name and the local name
to form a URI reference, whereas W3C XSD keeps them as a pair. This
has an unexpected consequence for describing datatypes: the namespace
URI bound to xsd has to be different when it is used in
RDF, since constructing the URIs for the XML schema datatypes requires
a different namespace from that used in WXS schema documents.
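The difference between the two treatments of QNames can be shown with a small sketch. The function and constant names below are assumptions made for illustration; the two namespace strings are the W3C XML Schema namespace as used in schema documents and the form with a trailing '#' that is conventionally used when the datatypes are named from RDF.

```python
from typing import Tuple

XSD_NS_IN_SCHEMAS = "http://www.w3.org/2001/XMLSchema"
XSD_NS_IN_RDF = "http://www.w3.org/2001/XMLSchema#"


def rdf_expand(namespace_name: str, local_name: str) -> str:
    """RDF style: concatenate namespace name and local name into one URI reference."""
    return namespace_name + local_name


def xsd_expand(namespace_name: str, local_name: str) -> Tuple[str, str]:
    """W3C XML Schema style: keep the (namespace name, local name) pair."""
    return (namespace_name, local_name)


# Concatenating with the schema-document namespace does not give the datatype URI.
wrong = rdf_expand(XSD_NS_IN_SCHEMAS, "integer")
# Concatenating with the '#'-terminated namespace does.
right = rdf_expand(XSD_NS_IN_RDF, "integer")
```

Here right is "http://www.w3.org/2001/XMLSchema#integer", while wrong runs the two parts together without a separator, which is why the same xsd prefix cannot be bound to the same namespace name in both settings.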
RXR takes the approach of mapping the RDF concepts to the same XML element names in what might be called element-normal form where every choice point gives a new element. This would generate a very deep tree of tags that are rather verbose as shown in Figure 1 if elements alone were used.
<triple>
  <subject><uri>http://purl.org/net/dajobe/</uri></subject>
  <predicate><uri>http://purl.org/dc/elements/1.1/creator</uri></predicate>
  <object><literal>Dave Beckett</literal></object>
</triple>
Figure 1: An RDF XML syntax in element-normal form
Although very regular, this is rather verbose, and in particular it is not a natural way to write a string as a literal compared with a URI. It is more natural to make literal content appear as element content and to treat the other forms as alternatives; this suggested using attributes for the type indications rather than remaining totally in an element-normal format.
The XML attributes then become the modifiers for the XML
elements for the RDF concepts for the parts of the triples. The
triple element content enforces the standard order of
describing a triple as used in the RDF specifications. (The order is
not significant to RDF, but a single fixed order is used for
consistency and to ease learning.)
Rewriting Figure 1 into the final RXR form gives Figure 2 with the root element added.
<graph xmlns="http://ilrt.org/discovery/2004/03/rxr/">
  <triple>
    <subject uri="http://purl.org/net/dajobe/"/>
    <predicate uri="http://purl.org/dc/elements/1.1/creator"/>
    <object>Dave Beckett</object>
  </triple>
</graph>
Figure 2: RDF Triple in RXR
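Because the format is so regular, generating it needs only core XML tooling. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the triple_to_rxr function is a hypothetical helper written for this example, and it handles only the case shown in Figure 2 (URI subject, URI predicate, plain literal object).

```python
import xml.etree.ElementTree as ET

RXR_NS = "http://ilrt.org/discovery/2004/03/rxr/"


def triple_to_rxr(subject_uri: str, predicate_uri: str, literal: str) -> str:
    """Serialize one triple with a plain literal object as an RXR graph."""
    # Ask ElementTree to write the RXR namespace as the default namespace.
    ET.register_namespace("", RXR_NS)
    graph = ET.Element(f"{{{RXR_NS}}}graph")
    triple = ET.SubElement(graph, f"{{{RXR_NS}}}triple")
    # Children are added in the fixed subject, predicate, object order.
    ET.SubElement(triple, f"{{{RXR_NS}}}subject", {"uri": subject_uri})
    ET.SubElement(triple, f"{{{RXR_NS}}}predicate", {"uri": predicate_uri})
    obj = ET.SubElement(triple, f"{{{RXR_NS}}}object")
    obj.text = literal
    return ET.tostring(graph, encoding="unicode")
```

Calling triple_to_rxr("http://purl.org/net/dajobe/", "http://purl.org/dc/elements/1.1/creator", "Dave Beckett") should produce a document equivalent to Figure 2, apart from whitespace.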
The remaining detail is primarily with literals. They can have a
datatype URI, so an attribute datatype with URI content is added
for that, and a language, which can reuse xml:lang from XML.
Figure 3 shows examples of some ways
that literals can be written.
<graph xmlns="http://ilrt.org/discovery/2004/03/rxr/">
  <triple>
    <subject uri="http://example.org/res1"/>
    <predicate uri="http://example.org/pred1"/>
    <object>simple literal</object>
  </triple>
  <triple>
    <subject uri="http://example.org/res2"/>
    <predicate uri="http://example.org/pred2"/>
    <object datatype="http://example.org/mytype">1,2,3</object>
  </triple>
</graph>
Figure 3: RXR Literals - simple and datatyped
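Reading such a document back into triples is equally direct. The sketch below again assumes Python's standard xml.etree.ElementTree; read_rxr is a hypothetical function, it covers only URI subjects and predicates with literal objects as in Figures 2 and 3, and it does not handle an xml:lang value inherited from an ancestor element.

```python
import xml.etree.ElementTree as ET

RXR_NS = "http://ilrt.org/discovery/2004/03/rxr/"
# ElementTree exposes xml:lang under the XML namespace name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_rxr(document: str):
    """Return a list of (subject, predicate, object) tuples from an RXR string.

    Subjects and predicates are URI strings; each object is a
    (text, datatype, language) tuple, with None for absent parts.
    """
    triples = []
    root = ET.fromstring(document)
    for triple in root.findall(f"{{{RXR_NS}}}triple"):
        subject = triple.find(f"{{{RXR_NS}}}subject").get("uri")
        predicate = triple.find(f"{{{RXR_NS}}}predicate").get("uri")
        obj = triple.find(f"{{{RXR_NS}}}object")
        literal = (obj.text or "", obj.get("datatype"), obj.get(XML_LANG))
        triples.append((subject, predicate, literal))
    return triples
```

Applied to the document in Figure 3, this would return two triples, the second with the datatype URI http://example.org/mytype attached to the literal "1,2,3".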
A lesson learned when designing [Turtle]
was that RDF collections are tedious and error-prone to write down
as RDF triples, and so are worthy of special support.
Collections are used in the object part of triples, so an additional
collection tag is introduced, allowing a sequence
of object elements inside.
Figure 4 shows an example of
a collection of literals as the object of a triple.
<graph xmlns="http://ilrt.org/discovery/2004/03/rxr/">
  <triple>
    <subject uri="http://example.org/res"/>
    <predicate uri="http://example.org/pred"/>
    <collection>
      <object>a</object>
      <object>b</object>
      <object>c</object>
    </collection>
  </triple>
</graph>
Figure 4: RXR Collection of literals
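To see why collections are tedious to write by hand, it helps to spell out the triples that a collection stands for in the RDF model, using the rdf:first, rdf:rest and rdf:nil terms. The sketch below is an illustration under assumptions: expand_collection is a hypothetical helper, and the "_:cN" blank node labels are invented for the example.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF_NS + "first"
RDF_REST = RDF_NS + "rest"
RDF_NIL = RDF_NS + "nil"


def expand_collection(subject, predicate, items):
    """Return the plain triples that an RDF collection stands for.

    Blank nodes are written as "_:cN"; that labelling scheme is an
    assumption of this sketch.
    """
    if not items:
        # An empty collection is just rdf:nil in the object position.
        return [(subject, predicate, RDF_NIL)]
    nodes = [f"_:c{i}" for i in range(len(items))]
    triples = [(subject, predicate, nodes[0])]
    for i, item in enumerate(items):
        # Each list node points at its item and at the next node (or rdf:nil).
        rest = nodes[i + 1] if i + 1 < len(items) else RDF_NIL
        triples.append((nodes[i], RDF_FIRST, item))
        triples.append((nodes[i], RDF_REST, rest))
    return triples
```

For the three items of Figure 4 this yields seven triples where the collection element needs only three object elements, which is the saving the special support is meant to provide.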
As already discussed, RXR omits complex parts of RDF such as XML Literals, which move between levels of abstraction: the XML level of elements and attributes, and the encoded version as a string. There is no easy way to cross these layers in, say, XSLT without essentially writing an XML serializer performing Exclusive XML Canonicalization in XSLT, which probably requires XSLT2 at a minimum ([TRiX]).
The XML schemas for RXR were written in RELAX NG compact and
translated to W3C XML Schema and DTD using James Clark's
trang tool. The resulting schemas are
complete and relatively straightforward.
Another approach to making easy-to-use, user-friendly semantic web markup is to start from XHTML markup and transform or annotate it into triples. In February 2004, the HTML working group announced a draft document [RDF/XHTML] which defines an approach to semantic markup for XHTML2 (intended to be an XHTML2 module), adding two main new features:
The <meta> element can take element content and new attributes to define the subject or object of an RDF statement, including datatypes.
The <span> element can take a resource attribute that allows the covered content to be the literal object of an RDF triple.
A second approach is to describe how parts of XHTML can
be mapped into triples, typically via an XSLT transform. One of
these, described most recently, is [GRDDL],
which designates how RDF triples are generated via an HTML
head profile attribute and value.
The new W3C Semantic Web Best Practices and Deployment (SWBPD) Working Group is coordinating this development with the HTML WG using both approaches, as together they provide a way to get semantic markup from existing XHTML as well as from a new XHTML2. This remains draft and ongoing work.
RXR describes a simple and mostly regular triple XML format for RDF that is straightforward to explain and match to the RDF triple model. It is compatible with XML schemas in several languages and does not use XML QNames.
Avoiding URI abbreviation with QNames does have a downside: the
URIs of RDF are visible and verbose. Replacing these, or adding
QName alternatives, would have a cost in usability, since it would be
necessary to explain why some attribute values containing ':' are URIs
and others are QNames. These would have to be clearly distinguished
with new attributes. Adding support for, say, xsi:type
attributes for XML schema datatypes would also risk confusing
that attribute with datatype and with the
RDF property often abbreviated as rdf:type.
The RELAX NG schema is available at http://ilrt.org/discovery/2004/03/rxr/rxr.rng (XML) and http://ilrt.org/discovery/2004/03/rxr/rxr.rnc (Compact); the W3C XML Schema at http://ilrt.org/discovery/2004/03/rxr/rxr.xsd and the DTD at http://ilrt.org/discovery/2004/03/rxr/rxr.dtd.