SXML

From Seo Wiki - Search Engine Optimization and Programming Languages
Jump to navigationJump to search
SXML
Filename extension .sxml, .scm
Internet media type text/sxml
Type code TEXT
Type of format markup language

SXML is a way to write and process XML data in the form of S-expressions.

Textual correspondence between SXML and XML for a sample XML snippet is shown below:

XML SXML
<tag attr1="value1"
     attr2="value2">
  <nested>Text node</nested>
  <empty/>
</tag>
(tag (@ (attr1 "value1")
        (attr2 "value2"))
  (nested "Text node")
  (empty))

The following two observation can be drawn from the above example:

  1. Textual notations for XML and SXML are much alike; informally, SXML textually differs from XML in relying on round brackets instead of angular braces.
  2. Additionally, SXML is not only a straightforward textual notation for XML data, but also a primary data structure for a family of functional programming languages, thus providing an illustrative approach for processing XML data with a general-purpose programming language.

Similarity between XML and S-expressions reified in SXML allows achieving close integration between XML data and programming language expressions, resulting in illustrativeness and simplicity of XML data processing for an application programmer.

Motivation

XML data processing languages suggested by the W3 Consortium, such as XPath, XSLT, XQuery etc., are not general-purpose programming languages and are thus not sufficient for implementing complete applications[citation needed]. For this reason, most XML applications are implemented by means of traditional programming languages, such as C and Java; or scripting languages, for example, Perl, JavaScript and Python.

An attempt to combine two different languages (for example, XPath and Java) leads to a problem known as impedance mismatch. Impedance mismatch problem consists of two aspects:

Impedance mismatch requires complicated converters and APIs (for example, DOM) to be used for combining such two languages.

Impedance mismatch problem can be reduced and even eliminated using the Scheme functional programming language for XML data processing [1]:

  • Nested lists (S-expressions in Scheme) provide a natural representation for nested XML documents. Scheme represents its code and data as nested lists of dynamically typed elements. XML document, being a hierarchical structure of nested XML elements, can be thought of as a hierarchical nested Scheme list (so-called S-expression).
  • Scheme is a functional language, as most XML-languages are (e.g. XSLT and XQuery). Scheme processes (nested) lists in a recursive manner which can be thought of as traversing/transforming the document tree.

Scheme, being a Lisp dialect, is a widely recognized scripting language. It is one of the most elegant and compact programming languages practically used: the description of Scheme standard consists of 40 pages only. Scheme is a high-level programming language, suitable for fast prototyping. Moreover, Scheme programs are generally several times more compact than the equivalent C programs.

XML and SXML textual notations are much alike: informally, SXML replaces XML start/end tags with opening/closing brackets. On the other hand, SXML is an S-expression and is thus the main data structure for Scheme programming language, and consequently SXML can be easily and naturally processed via Scheme.

XML, XML Information Set and SXML

An XML document is essentially a tree structure. The start and the end tags of the root element enclose the whole content of the document, which may include other elements or arbitrary character data. Text with familiar angular brackets is an external representation of an XML document. Applications ought to deal with an internalized form: an XML Information Set, or its specializations (such as the DOM). This internalized form lets an application locate specific data or transform an XML tree into another tree.

The W3 Consortium defines the XML Information Set (Infoset) as an abstract data set that describes information available in a well-formed XML document. An XML document's information set consists of a number of information items, which denote elements, attributes, character data, processing instructions, and other components of the document. Each information item has a number of associated properties, e.g., name, namespace URI. Some properties -- for example, "children" and "attributes" -- are collections of other information items. Although technically Infoset is specified for XML, it largely applies to other semi-structured data formats, in particular, HTML.

XML document parsing is just one of possible ways to create an instance of XML Infoset.

It is worth a note that XML Information Set recommendation does not attempt to be exhaustive, nor does it constitute a minimum set of information items and properties. Its purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document.

The abstract data model defined in the XML Information Set Recommendation is applicable to every XML-related specification of the W3 Consortium. Namely, the Document Object Model can be considered the application programming interface (API) for dealing with information items; the XPath data model uses the concept of nodes which can be derived from information items, etc. The DOM and the XPath data model are thus two instances of XML Information Set.

XML Information Set Recommendation itself imposes no restrictions on data structures or interfaces for accessing information items. Different interpretations are thus possible for the XML Information Set abstract data model. For example, it is convenient to consider an XML Information Set a tree structure, and the terms "information set" and "information item" are then similar in meaning to the generic terms "tree" and "node" respectively.

An information item may be also considered as a container for its properties, either text strings (e.g. name, namespace URI) or containers themselves (e.g. child elements for an XML element). The information set is thus a hierarchy of nested containers. Such a hierarchy of containers comprised of text strings and other containers greatly lends itself to be described by an S-expression, because the latter is recursively defined as a list whose members are either atomic values or S-expressions themselves. S-expressions are easy to parse into an internal representation suitable for traversal; they also have a simple external notation, which is relatively easy to compose even by hand.

SXML is a concrete instance of the XML Infoset in the form of S-expressions. Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships with each other. SXML gives the nest of containers a concrete realization as S-expressions, and provides means of accessing items and their properties. SXML is a "relative" of XPath and the DOM, whose data models are two other instances of the XML Infoset. SXML is particularly suitable for Scheme-based XML/HTML authoring, XPath queries, and tree transformations.

XML and SXML can thus be considered two syntactically different representations for the XML Information Set.

SXML Specification

As noted in the previous section, SXML is the concrete instance of the XML Infoset in the form of S-expressions. Further discussion on SXML in this section is based on the SXML specification.

A simplified SXML grammar in EBNF notation is presented below. An SXML <name> is a single Scheme symbol.

[1]              <TOP> ::= ( *TOP* <PI>* <Element> )
[2]          <Element> ::= ( <name> <attributes-list>? <child-of-element>* )
[3]  <attributes-list> ::= ( @ <attribute>* )
[4]        <attribute> ::= ( <name> "value"? )
[5] <child-of-element> ::= <Element> | "character data" | <PI>
[6]               <PI> ::= ( *PI* pi-target "processing instruction content string" )

Since an information item in the XML Infoset is a sum of its properties, a list is a particularly suitable data structure to represent an item. The head of the list, a Scheme identifier, names the item. For many items this is their (expanded) item name. For an information item that denotes an XML element, the corresponding list starts with element name, optionally followed by a collection of attributes. The rest of the element item list is an ordered sequence of element children, character data, processing instructions, and other elements in turn. Every child is unique; items never share their children even if the latter have the identical content.

The following example illustrates an XML element and its SXML form (which satisfies the <Element> production in SXML grammar).

<WEIGHT unit="pound">
  <NET
   certified="certified">67</NET>
  <GROSS>95</GROSS>
</WEIGHT>
(WEIGHT (@ (unit "pound"))
  (NET
   (@ (certified)) "67")
  (GROSS "95"))

The value of an attribute is normally a string; it may be omitted (in case of HTML) for a boolean attribute, e.g., a "certified" attribute in the above example.

A collection of attributes is considered an information item in its own right, tagged with a special name @. The character "@" may not occur in a well-formed XML name; therefore an <attributes-list> cannot be mistaken for a list that represents an element. An XML document renders attributes, processing instructions and other meta-data differently from the element markup. In contrast, SXML represents element content and meta-data uniformly -- as tagged lists. RELAX NG -- a schema language for XML -- also aims to treat attributes as uniformly as possible with elements. This uniform treatment, argues James Clark, is a significant factor in simplifying the language. SXML takes advantage of the fact that every XML name is also a valid Scheme identifier, but not every Scheme identifier is a valid XML name. This observation lets us introduce administrative names such as @, *PI*, *TOP* without worrying about potential name clashes. The observation also makes the relationship between XML and SXML well-defined. An XML document converted to SXML can be reconstructed into an equivalent XML document (in terms of the Infoset). Moreover, due to the implementation freedom given by the Infoset specification, SXML itself is an instance of the Infoset.

The XML Recommendation specifies that processing instructions (PI) are distinct from elements and character data; processing instructions must be passed through to applications. In SXML, PIs are therefore represented by nodes of a dedicated type *PI*. XPath and the DOM Level 2 treat processing instructions in a similar way.

A sample XML document and its SXML representation are both shown below, thus providing an illustrative comparison between nested XML tags and nested SXML lists. Note that the SXML document is a bit more compact than its XML counterpart.

<?xml version='1.0'>
<di contract="728g">
  <wt refnum="345">
    <delivery>
       <date month="6" 
             day="01" 
             year="2001"/>
       <weight>783</weight>
    </delivery>
    <vehicle type="lorry" 
	     number="A567TP99"/>
  </wt>
  <wt refnum="459">
     <vehicle type="car" 
              number="25676043"/>
  </wt>
</di>
(*TOP* (*PI* xml "version='1.0'") 
(di (@ (contract "728g"))
  (wt (@ (refnum "345")) 
    (delivery
      (date (@ (month "6") 
               (day "1") 
               (year "2001")))
      (weight "783"))

    (vehicle (@ (type "lorry") 
                (number "A567TP99"))))

  (wt (@ (refnum "459")) 
    (vehicle (@ (type "car")
                (number "25676043"))))))

SXML can also be considered an abstract syntax tree for a parsed XML document. An XML document or a well-formed part of it can automatically be converted into the corresponding SXML form via a functional Scheme XML parsing framework SSAX.

It is worth a note that SXML represents all the information contained in XML documents, including comments, namespaces and external entities. These are omitted in this section for the sake of simplicity, but they are considered in the SXML specification.

Example

For example, a simple XHTML page which looks like this:

 <html xmlns="http://www.w3.org/1999/xhtml"
         xml:lang="en" lang="en">
    <head>
       <title>An example page</title>
    </head>
    <body>
       <h1 id="greeting">Hi, there!</h1>
       <p>This is just an &gt;&gt;example&lt;&lt; to show XHTML &amp; SXML.</p>
    </body>
 </html>
 

When translated to SXML it looks like this:

 (*TOP* (@ (*NAMESPACES* (x "http://www.w3.org/1999/xhtml")))
  (x:html (@ (xml:lang "en") (lang "en"))
    (x:head
       (x:title "An example page"))
    (x:body
       (x:h1 (@ (id "greeting")) "Hi, there")
       (x:p  "This is just an >>example<< to show XHTML & SXML."))))
 

Each element's tag pair is replaced by a set of parentheses. The tag's name is not repeated at the end, it is simply the first symbol in the list. The element's contents follow, which are either elements themselves or strings. There is no special syntax required for XML attributes. In SXML they are simply represented as just another node, which has the special name of @. This can't cause a name clash with an actual "@" tag, because @ is not allowed as a tag name in XML. This is a common pattern in SXML: Anytime a tag is used to indicate a special status or something that is not possible in XML, a name is used that does not constitute a valid XML identifier.

We can also see that there's no need to "escape" otherwise meaningful characters like & and > as &amp; and &gt; entities. All string content is automatically escaped because it is considered to be pure content, and has no tags or entities in it. This also means it is much easier to insert autogenerated content and there is no danger that we might forget to escape user input when we display it to other users (which could lead to all kinds of nasty cross-site scripting attacks or other annoyances).

SXML features

This section considers some important features of SXML, deductible from SXML grammar and properties of S-expressions.

SXML attributes

The cdr of an SXML attribute list forms an association list, so that, when SXML is read into a Lisp program, any SXML attribute can be extracted from an attribute list using Lisp's built-in assoc function.

SXML elements and attributes

The uniformity of the SXML representation for elements, attributes, and processing instructions simplifies queries and transformations. For the SXML data model, attributes and processing instructions look like regular elements with a distinguished name. Therefore, query and transformation functions dedicated to attributes become redundant, because ordinary functions with distinguished names can be used.

The uniform representation for SXML elements and attributes is especially convenient for practical tasks. Differences between elements and attributes in XML are blurred. Choosing either an element or an attribute for representing concrete practical information is often a question of style, and such a choice can later be changed. Such a change in a data structure is expressed in SXML as simply an addition/removal of one hierarchy level, namely an attribute-list. This requires the minimal modification of an SXML application. For the SXML notation, the only difference between an attribute and an element is that the former is contained within the attribute-list (which is a special SXML node) and cannot have nested elements.

For example, if data restructuring requires that the weight of the delivered load, initially represented as a nested element, to be represented as an attribute, the SXML element

(delivery
   ...
   (weight "789")))

will be changed to

(delivery
  (@ (weight "789"))
  ...)

Such a notation for elements and attributes simplifies SXML data restructuring and allows uniform queries to be used for data processing.

SXML document as a tree of uniform nodes

Since an SXML document is essentially a tree structure, it can be described in a more uniform way by introducing the term of an SXML node for nodes of this tree.

An SXML node can be defined on the basis of SXML grammar as a single production [N] given below. Alternatively, an SXML node can be defined as a set of two mutually-recursive datatypes: [N1], [N2] and [N3]. In the latter case, a Node is constructed by adding a name to the Nodelist as its leftmost member; a Nodelist is itself a (sorted) list of Nodes.

[N]      <Node> ::= <Element> | <attributes-list> | <attribute> | "character data: text string" | <TOP> | <PI>
[N1]     <Node> ::= ( <name1> . <Nodelist> ) | "text string"
[N2] <Nodelist> ::= ( <Node> <Node>* )
[N3]    <name1> ::= <name> | @ | *TOP* | *PI*

Such a consideration emphasizes SXML tree structure and the uniform representation for information items as S-expressions.

SXML as a Scheme program

The syntax of LISP family programming languages, in particular, Scheme, is based on S-expressions used for both data and code representation. This makes it possible and convenient for Scheme programs to be treated as a semi-structured data and vice versa.

Since an SXML document and its nodes are S-expressions, they can be used for representing a Scheme program. For making this possible, it is sufficient that the first member of every list contained in the SXML tree is a function; the use of macros offers more possibilities. The rest of the members of the list are then the arguments, which are passed to that function. In accordance with SXML grammar, attribute and element names and special names must be bound to functions.

An SXML document or an SXML node that fulfills these requirements may be considered a Scheme program which can be evaluated, for example, by means of eval function.

For example, if para and bold are defined as functions as follows:

(define (para . x) (cons 'p x))
(define (bold . x) (cons 'b x))

then the following SXML element

(para "plain"
      (bold "highlighted")
      "plain")

can be treated as a program, and the result of its evaluation is the SXML element:

(p "plain"
   (b "highlighted")
   "plain")

Note that the result of evaluating such a program is not necessarily an SXML element. Namely, a program may return a textual representation for the source data in XML or HTML; or even have a side effect, such as saving the SXML data in a relational database.

SXML shortcomings

Because the underlying list representation is singly linked, the nodes have no access to either their previous siblings or the parent. As such, not all DOM and XPath traversal operations are possible with SXML unless you keep a stack, where each position has the parent nodes and the index of the traversed child node, for the traversed path of the current node. But this is really just an implementation detail. The syntax is orthogonal to the internal data structures off a hypothetical parser.


Arbitrary position query like in DOM's NodeList.item or XPath's [] accessor is not optimal, because the list must be traversed until the nth node which is o(n), vesus a vector or array where the access is o(1).

In fact, the use of this and similar s-expression formats to represents XML is well suited for sequential processing, like starting SAX events, inserting attributes or serializing to XML. The possible transformations are limited by the requirement of not needing either previous siblings or parents, or sacrificing simplicity and use a stack.

Citations

  1. Kirill Lisovskiy, Dmitry Lizorkin. "SXML: an XML document as an S-expression" // Russian Digital Libraries Journal. – Moscow: IIS, 2003. – Vol. 6, No 2. – ISSN 1562-5419. - http://modis.ispras.ru/Lizorkin/Publications/sxml-eng.pdf

External links

Detailed introduction, motivation and real-life case-studies of SSAX, SXML, SXPath and SXSLT. The paper and the complementary talk presented at the International Lisp Conference 2002.

ja:SXML

If you like SEOmastering Site, you can support it by - BTC: bc1qppjcl3c2cyjazy6lepmrv3fh6ke9mxs7zpfky0 , TRC20 and more...