M. David Peterson

XSLT: Riding the challenge

All the things you thought impossible to do with XSLT
March 16

Jeni's "Grouper" (or how to specify and parse additional rules to a grammar) is just a minor task for LBNF

In her recent post "RELAX NG for matching" Jeni Tennison said:

 

The “grouper” is the most interesting and difficult of these. It needs to act like a tokeniser, except use regular expressions over markup rather than over text. For example, say I had:

<number>06</number><punc>/</punc><number>03</number><punc>/</punc><number>08</number>

I want to be able to create a rule that says “any sequence that looks like a number element that contains a two-digit number between 1 and 31, followed by a punc element that contains a slash, followed by another two-digit number between 1 and 12, followed by a punc element that contains a slash, followed by another two-digit number should be wrapped in a date element”.

Now this is something that XPath is really bad at. Try writing an expression that selects, from a sequence of elements that may contain other <number> and <punc> elements as well as other elements, only those sequences of elements that match the pattern I just described. It’s something like:

number[. >= 1 and . <= 31 and string-length(.) = 2]
      [following-sibling::*[1]/self::punc = '/']
      [following-sibling::*[2]/self::number[. >= 1 and . <= 12 and string-length(.) = 2]]
      [following-sibling::*[3]/self::punc = '/']
      [following-sibling::*[4]/self::number[string-length(.) = 2]]
  /(self::number, following-sibling::*[position() <= 4])

which is fiddly and messy and only works in this particular example because I know precisely how many elements there are supposed to be in the group.

 

Then she proceeds by making the suggestion that

    "As a language, RELAX NG is really good at this"

 

Jeni ends with the following statement:

 

"So I think a “grouper” component should use RELAX NG to identify sequences to be marked up. But I have no idea if there are RELAX NG libraries out there that can be used in this way: to identify and extract matching sequences rather than to validate entire documents"

 

Clearly, the solution to this problem is not tied to RELAX NG itself (which just happens to be able to parse a schema defined using the RNC grammar). The tool Jeni has in mind could be any compiler-writing system that allows additional grammar rules: rules that are not required to reach the start symbol of the language, but may be useful for other purposes.
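To make the "grouper" task concrete, here is a minimal sketch in Python, not in RELAX NG or LBNF. It assumes the markup has already been flattened into (name, text) pairs; that representation and the names is_two_digit, looks_like_date and group_dates are invented for illustration and come from neither Jeni's post nor any library. Like the XPath above, it relies on knowing that the group is exactly five elements long.

```python
import re


def is_two_digit(text, low, high):
    """True if text is exactly two digits with a value in [low, high]."""
    return re.fullmatch(r"[0-9]{2}", text) is not None and low <= int(text) <= high


def looks_like_date(window):
    """Check five (name, text) pairs against the day/month/year rule."""
    if len(window) != 5:
        return False
    (n1, day), (p1, s1), (n2, month), (p2, s2), (n3, year) = window
    return (
        (n1, n2, n3) == ("number", "number", "number")
        and (p1, s1) == ("punc", "/")
        and (p2, s2) == ("punc", "/")
        and is_two_digit(day, 1, 31)
        and is_two_digit(month, 1, 12)
        and is_two_digit(year, 0, 99)
    )


def group_dates(elements):
    """Wrap every matching run of five elements in a ("date", [...]) pair."""
    result, i = [], 0
    while i < len(elements):
        window = elements[i:i + 5]
        if looks_like_date(window):
            result.append(("date", window))
            i += 5  # skip past the elements that were just grouped
        else:
            result.append(elements[i])
            i += 1
    return result
```

Elements that are not part of a date pass through unchanged, so the output has the same order as the input with each matching run replaced by one "date" group.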

Very fortunately, I know at least one such system:

         the Labelled BNF Grammar Formalism (LBNF).

Jeni's "grouper" tool can be implemented by adding rules to the language being specified, using LBNF's "internal pragmas".
 

Here's how the authors of LBNF, Markus Forsberg and Aarne Ranta of Chalmers University of Technology and the University of Gothenburg, describe LBNF's internal pragmas:
  

6 LBNF Pragmas

6.1 Internal pragmas

Sometimes we want to include in the abstract syntax structures that are not part of the concrete syntax, and hence not parsable. They can be, for instance, syntax trees that are produced by a type-annotating type checker. Even though they are not parsable, we may want to pretty-print them, for instance, in the type checker’s error messages. To define such an internal constructor, we use a pragma 

   
"internal" Rule ";"

where Rule is a normal LBNF rule. For instance,

   internal EVarT. Exp ::= "(" Ident ":" Type ")";

introduces a type-annotated variant of a variable expression.

Of course, LBNF is a very nice tool for carrying out a number of really important language development tasks besides the "grouper".

November 09

Wide Finder in XSLT --> deriving new requirements for efficiency in XSLT processors.

With his twelve posts under the common title "The Wide Finder Project", Tim Bray formulated a problem that has since stirred up quite a few people anxious to prove that their programming language of choice fares well in implementing solutions for this class of problems.

While I am not aware of any XSLT implementation that provides explicit or implicit support for parallel processing (with the obvious goal to take advantage of the multi-core processors that have almost reached a "prevalent" status today), I was curious to determine at least two things:

  • How well do a good XSLT 2.0 processor and a straightforward solution fare against other languages/solutions?
  • Where is the XSLT code on the scale of "beautiful code"?

 

Before going further, let me remind you that there is a popular (and, as we'll see, unfounded!) belief that any XSLT solution must be hugely inefficient and that any XSLT code can only be ugly. In fact, the following comment on one of Tim's posts reflects exactly this mindset:

 

"From: Rornan (Sep 25 2007, at 08:31)

Tim,

If you had anything at all to do with creating XSLT, then you have no right at all to comment on any other language deficency ever ever again."

  

My initial XSLT solution to Tim's problem is below. No optimizations have been attempted, not only because I don't have an XSLT processor that utilizes multi-core processors, but also because there seems to be only limited room for optimization (adding more CPUs is not likely to speed up the reading of a huge file from a single drive).

 

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text"/>

  <xsl:variable name="vLines" as="xs:string*" select=
   "tokenize(unparsed-text('file:///C:/Log1000000.txt'),'\n')"/>

  <xsl:variable name="vRegEx" as="xs:string" select=
   "'^.*?GET /ongoing/When/[0-9]{3}x/([0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+) .*$|^.+$'"/>

  <xsl:template match="/">
    <xsl:for-each-group group-by="." select=
     "for $line in $vLines
        return replace($line,$vRegEx,'$1')[.]">
      <xsl:sort select="count(current-group())" order="descending"/>
      <xsl:if test="not(position() > 10)">
        <xsl:value-of select=
         "concat(count(current-group()),':',current-grouping-key(),'&#xA;')"/>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>

This 22-line transformation was run with Saxon 9.0 on a 3GHz DELL desktop. The file Log1000000.txt was constructed by copying the 10,000-line file provided by Tim into another file 100 times. The size of this file is about 200MB.

The results were produced in 36.175 seconds using about 929MB of RAM. Saxon should be timed using the -repeat:3 command-line option, which performs the transformation three times. Using only the result of the first (or a single) transformation would be misleading, as it includes the Java VM startup time.

 

Discussion

 

To answer the questions:

1. How well do a good XSLT 2.0 processor and a straightforward solution fare against other languages/solutions?

The non-optimized, uniprocessor version of this solution takes 36 seconds to process a log file of about 200MB. The timing for the full log file used by Tim Bray, which is five times bigger, is likely to be about five times longer: 180 seconds, or about 3 minutes. 3 minutes for processing 1GB of data is not a bad time for an XSLT transformation, considering that inputs of even a few hundred megabytes are still considered monstrous.

XSLT could in fact be one of the winners in the WF competition, had its creators gone just a little further in exploiting the natural, non-sequential XSLT processing model. Here I am talking about specifying a single f:parMap() function, which can be defined exactly as the current FXSL f:map() function:

   <xsl:function name="f:map" as="item()*">
      <xsl:param name="pFun" as="element()"/>
      <xsl:param name="pList1" as="item()*"/>

      <xsl:sequence select=
       "for $this in $pList1 return
          f:apply($pFun, $this)"
      />
   </xsl:function>


 

The only difference is that f:parMap() adds to the semantics of f:map() a hint to use as many of the available processors as appropriate when evaluating the "for" expression in the body of the function.
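The contract of such a function can be sketched in Python (this is not FXSL and not XSLT): same arguments and same result as the sequential map, with the freedom to evaluate the calls concurrently. The names seq_map and par_map are made up for this illustration. Note that in standard CPython a thread pool does not speed up CPU-bound work, so the sketch shows only the semantics, not the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def seq_map(fun, items):
    """Sequential map: the analogue of f:map()."""
    return [fun(item) for item in items]


def par_map(fun, items, max_workers=None):
    """Same result as seq_map, but the calls may run concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever order
        # the individual calls happen to finish in.
        return list(pool.map(fun, items))
```

Because the result is defined to be identical to that of the sequential version, a caller can switch between the two without changing anything else, which is the whole point of treating parallelism as a hint.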

Yes, this can be done for any "/" operator or for any "for" expression such as the one used in lines 13 - 14 in our XSLT solution:

"for $line in $vLines
     return replace($line,$vRegEx,'$1')[.]"

 

and for any <xsl:apply-templates .../> instruction, and for any <xsl:for-each .../> instruction, and for any <xsl:for-each-group ...> instruction and ... and ... and ...

 

Judging from the published results, this straightforward, non-optimized, uniprocessor XSLT solution is not too far behind some of the solutions optimized for multi-processing. I hope that in the not so distant future there will be XSLT processors exploiting the inherent non-sequential nature of XSLT processing to implement highly-optimized multi-processing solutions.

2. Where is the XSLT code on the scale of "beautiful code"?

By its compactness (22 lines) it is in 3rd place and rivals the 2nd place (17 lines), as we could easily remove 4 lines (3 of them blank) used to add readability. One of Tim's criteria for "beautiful code" is avoiding awkwardness. He speaks about Ruby not needing ending semicolons or variable type declarations. However, Tim's solution still had to use two lines for defining a hash and setting its default to 0. By contrast, the XSLT solution above does not require the programmer to introduce and initialize a special data structure -- the support for grouping is right there in the core of the language. There is simply no way the programmer can do something wrong in declaring or using a hash table.

 

What could the XSLT processor do better?

Both the timing and especially the amount of RAM used indicate how the XSLT processor did its work:

  • It read the whole text file into memory.
  • Then it produced all $vLines strings.
  • Then it produced the line replacements.
  • Then it did the grouping/sorting on the line replacements.

Imagine that the XSLT Processor is really lazy:

  • It will not read any contents of the file and will not do any tokenization, until absolutely necessary.
  • Whenever it is really necessary to use the next token (line), only the text necessary to determine that line shall be read.
  • Whenever a token/line is produced, all previously used / unused text shall be marked-for-garbage-collection/discarded/reused.
  • Whenever a string from $vLines is consumed and processed by the <xsl:for-each-group .../> instruction, this string is no-longer used and shall be marked-for-garbage-collection/discarded/reused.

Using these rules a truly lazy XSLT processor will only need space large enough for the longest line, and for a hash table to keep the keys and counters for all different articles being grouped. In this case there are just 562 unique values extracted from the strings of $vLines.

In this way, the processing -- reading a line, finding the article it contains and feeding this to the hash table --  could be accomplished in streaming mode while reading the text file. Upon reaching the end of the file, there would be very little left to do, and thus almost nothing to be further optimized.
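For comparison, the streaming behaviour described here is what a line-by-line loop gives in a language with explicit iteration. The following Python sketch is not XSLT and says nothing about how Saxon works internally. The function name top_articles is an assumption, and the pattern is the first alternative of $vRegEx without its anchors.

```python
import re
from collections import Counter

# First alternative of $vRegEx from the stylesheet, without the anchors.
ARTICLE = re.compile(
    r"GET /ongoing/When/[0-9]{3}x/([0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+) "
)


def top_articles(lines, limit=10):
    """Count article fetches in an iterable of log lines, one line at a time."""
    counts = Counter()  # the only state kept between lines
    for line in lines:
        match = ARTICLE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)
```

Passing an open file object as lines reads the log one line at a time, so memory use is bounded by the longest line plus the table of counters, as in the lazy processor imagined above.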

 

Conclusion

I truly believe that the improvements described here can be implemented by at least some of the best XSLT 2.0 processors around. For many years I have been using Saxon and am very grateful to its developer, Dr. Michael Kay. Its performance has been improving constantly -- a very good example for all other XSLT vendors to follow. If they do, there will be competition, and competition is good for us all.

November 04

Closure in XSLT

In a recent post to the xml-dev mailing list, "XSLT question on transitive closures", Rick Jelliffe wrote:

 

"Am I right in thinking that

 1) XPath2 functions don't have have a function for transitive closure (along provided xpaths)

2) SAXON 8 does not have the saxon:closure() extension function that older versions of SAXON had

3) The one to use is probably still Christian Nentwich's code from circa 2001 as adopted into EXSLT ?"

 

Michael Kay replied:

"I think it would be nice to do it properly based on FXSL higher-order functions, which are much more cleanly specified. Perhaps there is already a suitable function in FXSL.

The other thing that's needed is the ability to check for cycles. Simply blowing the stack or looping isn't good enough."

 

David Carlisle replied with a solution that dynamically generates a new XSLT stylesheet containing a closure function, which is tied to one specific function and computes the closure of that function only. He also said:

 

"> I think it would be nice to do it properly based on FXSL higher-order

True (but I'll leave that for Dimitre  :-)"

 

Thanks to these nice people I felt just like... something had been left for me  :o)

So, now we have in FXSL the function

   f:closure()

which, given a function pFun and a starting set of items pstartSet, produces the transitive closure of pFun.

 

While the complete 47 lines of code can be viewed using the above link, the essence of the implementation is in the following 20 lines:

 

 <xsl:function name="f:closure" as="node()*">
    <xsl:param name="pFun" as="element()"/>
    <xsl:param name="pstartSet" as="node()*"/>
    
    <xsl:sequence select="f:closure2($pFun, $pstartSet,$pstartSet)"/>
  </xsl:function>
  
  <xsl:function name="f:closure2" as="node()*">
    <xsl:param name="pFun" as="element()"/>
    <xsl:param name="pCurClosure" as="node()*"/>
    <xsl:param name="pstartSet" as="node()*"/>
    
    <xsl:if test="exists($pstartSet)">
      <xsl:variable name="vNew" select=
          "f:map($pFun,$pstartSet) except $pCurClosure"/>
      <xsl:sequence select=
           "$pstartSet 
           | $vNew 
           | f:closure2($pFun,$pCurClosure | $vNew, $vNew)"/>
    </xsl:if>
  </xsl:function>
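The same frontier-based idea can be written as a loop. The Python sketch below is an illustration, not the FXSL code. The name closure mirrors f:closure() only for readability, and the function passed in returns an iterable of successors instead of being applied through f:map().

```python
def closure(fun, start):
    """Return every item reachable from start by repeatedly applying fun.

    fun maps one item to an iterable of items. The start items are
    included in the result, as in the f:closure() listing.
    """
    reached = set(start)
    frontier = set(start)
    while frontier:
        # Apply fun to the whole frontier, then drop what is already known.
        new = {y for x in frontier for y in fun(x)} - reached
        reached |= new
        frontier = new  # an empty frontier ends the loop, even on cyclic input
    return reached
```

Only items that have not been seen before enter the next frontier, so a cycle cannot cause an infinite loop. For example, with an invented edge table such as {"V2": ["V6"], "V6": ["V7"], "V7": ["V2"]} (not the contents of any FXSL test file), closure(lambda v: edges.get(v, []), ["V2"]) terminates with the three nodes of the cycle.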

  

And here is my reply to David's post (code hyperlinked, full code omitted):

"
>> I think it would be nice to do it properly based on FXSL higher-order

> > True (but I'll leave that for Dimitre:-)

I am sorry I only read this a few days ago. Below is the code of the FXSL function. While the code is straightforward, the following must be noted:

1. The "set" at present is only a set of nodes. I will probably produce a more general f:closure() function, which operates on any set of items. This function would then also need to be passed a "union" function and a "difference" function as parameters.

2. It seems that David's solution would go into an infinite loop for more involved examples (see the second test with reachability of nodes in cyclic graphs below). Therefore, the algorithm was slightly changed and works correctly. The following files can be downloaded from the CVS of the FXSL project:

func-closure.xsl,

testFunc-closure.xsl,

testFunc-closure2.xsl

The last transformation should be applied on the following xml file:
testFunc-closure2.xml

 

The result of running the second test transformation above is shown below. It covers two cases of finding all nodes of a given graph that are reachable from a specific node; in the second case there is a cycle involving the nodes V2, V6, V7:

 === Reachable from V1 =======

<node id="V1"/>
<node id="V3"/>
<node id="V4"/>
<node id="V5"/>
=======================


=== Reachable from V2 =======
<node id="V2"/>
<node id="V4"/>
<node id="V5"/>
<node id="V6"/>
<node id="V7"/>
=======================

 

Due to its generality, the f:closure() function is a useful addition to the FXSL library.

Cheers, Dimitre Novatchev
"

 

October 13

Kurt Cagle: Bridging XML E4X and JSON

Under this title Kurt Cagle writes:

"I'd like to push a proposal to both the XML and AJAX communities, something that I think needs to be taken up by the W3C, the OpenAJAX alliance and JSON.org especially. Establish a set of conventions within JSON that most readily facilitate JSON being used in an XML context. These conventions should be syntactical, things that can be done with hash key naming conventions that can be picked up by a JSON/XML bridge to transform between the two formats."

 

Hmm...  here's what I have to say:

Kurt, this was quietly done some time ago. Just have a look at my blog ("Transforming JSON"). M. David Peterson has a nice web-service-based example, which gets the XML in real time out of the JSON produced by Yahoo's Traffic Service.

This example just uses the JSON-to-XML converter provided by FXSL -- that is, the function f:json-document(), written in pure XSLT.

So, no conventions are necessary for transforming JSON to XML!

July 05

Transforming JSON

Update: (Think that) finally managed to get from Windows Live Spaces the formatting I wanted...

Update: Enhanced the JSON Lexer to properly deal with escaped characters within strings. How to handle \b and \f ???  

====================================

Ever wanted to access and manipulate JSON as ordinary XML? To transform it with XSLT?

No problem: use the functions f:json-document() and f:json-file-document() provided by FXSL.

Here is a quick example:

Suppose we have (yes, this is the Yahoo Traffic Service):

 

 

<xsl:variable name="vStr">
 {"ResultSet":
   {"LastUpdateDate":"1178683597",
    "Result":[{"type":"construction",
               "Title":"Construction work, on I-5 NB at SENECA ST",
               "Description":"I 5 N Construction work, Left Lane Blocked on I 5 northbound from Seneca Street to Pine Street starting 11:00 PM, 05 08 07 for several days from 11:00pm to 05:00am on Tuesdays, Wednesdays and Thursdays . From milepost 165 to milepost 167",
               "Latitude":"47.614353",
               "Longitude":"-122.329586",
               "Direction":"NB",
               "Severity" :  2,
               "ReportDate":1178604000,
               "UpdateDate":1178608792,
               "EndDate":1178712000},

              {"type":"construction",
               "Title":"Construction work, on I-5 NB at UNIVERSITY ST",
               "Description":"I 5 N Construction work, On ramp Blocked on I 5 northbound at University Street starting 10:00 PM, 05 08 07 until further notice from 10:00pm to 05:00am on Tuesdays, Wednesdays and Thursdays . From milepost 165 to milepost 166",
               "Latitude":"47.615975",
               "Longitude":"-122.328988",
               "Direction":"NB",
               "Severity" : 2,
               "ReportDate":1178600400,
               "UpdateDate":1178608793,
               "EndDate":1178712000},

              {"type":"incident",
               "Title":"Lane closed, on WA-99 at 4TH AVE",
               "Description":"SR 99 Road construction, right lane closed on SR 99 in both directions from 4TH AVE W to 7TH AVE SE starting 8:00 PM, 05 07 07 for a week from 08:00pm to 06:00am on Mondays, Tuesdays, Wednesdays and Thursdays . From milepost 50 to milepost 51",
               "Latitude":"47.634877",
               "Longitude":"-122.344338",
               "Direction":"N\/A",
               "Severity":2,
               "ReportDate":1178506800,
               "UpdateDate":1178608788,
               "EndDate":1178715600}
            ]
   }
}
</xsl:variable>

 

 

Then,  f:json-document($vStr) evaluates to:

 

<ResultSet>
   <LastUpdateDate>1178683597</LastUpdateDate>
   <Result>
      <type>construction</type>
      <Title>Construction work, on I-5 NB at SENECA ST</Title>
      <Description>I 5 N Construction work, Left Lane Blocked on I 5 northbound from Seneca Street to Pine Street starting 11:00 PM, 05 08 07 for several days from 11:00pm to 05:00am on Tuesdays, Wednesdays and Thursdays . From milepost 165 to milepost 167</Description>
      <Latitude>47.614353</Latitude>
      <Longitude>-122.329586</Longitude>
      <Direction>NB</Direction>
      <Severity>2</Severity>
      <ReportDate>1178604000</ReportDate>
      <UpdateDate>1178608792</UpdateDate>
      <EndDate>1178712000</EndDate>
   </Result>
   <Result>
      <type>construction</type>
      <Title>Construction work, on I-5 NB at UNIVERSITY ST</Title>
      <Description>I 5 N Construction work, On ramp Blocked on I 5 northbound at University Street starting 10:00 PM, 05 08 07 until further notice from 10:00pm to 05:00am on Tuesdays, Wednesdays and Thursdays . From milepost 165 to milepost 166</Description>
      <Latitude>47.615975</Latitude>
      <Longitude>-122.328988</Longitude>
      <Direction>NB</Direction>
      <Severity>2</Severity>
      <ReportDate>1178600400</ReportDate>
      <UpdateDate>1178608793</UpdateDate>
      <EndDate>1178712000</EndDate>
   </Result>
   <Result>
      <type>incident</type>
      <Title>Lane closed, on WA-99 at 4TH AVE</Title>
      <Description>SR 99 Road construction, right lane closed on SR 99 in both directions from 4TH AVE W to 7TH AVE SE starting 8:00 PM, 05 07 07 for a week from 08:00pm to 06:00am on Mondays, Tuesdays, Wednesdays and Thursdays . From milepost 50 to milepost 51</Description>
      <Latitude>47.634877</Latitude>
      <Longitude>-122.344338</Longitude>
      <Direction>N/A</Direction>
      <Severity>2</Severity>
      <ReportDate>1178506800</ReportDate>
      <UpdateDate>1178608788</UpdateDate>
      <EndDate>1178715600</EndDate>
   </Result>
</ResultSet>

 

and you can now transform this XML document nicely with XSLT.
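The mapping visible in this example (each object key becomes an element, each array member becomes a repeated element named after its key, scalars become text) can be sketched in a few lines of Python. This is only an illustration of the mapping, not the FXSL implementation, which is written in pure XSLT. The names json_to_xml, _fill and _scalar are invented here, and the sketch assumes a single top-level key, keys that are valid XML names, and no arrays nested directly inside arrays.

```python
import json
import xml.etree.ElementTree as ET


def _scalar(value):
    """Render a JSON scalar as element text."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def _fill(parent, value):
    """Add the content of one JSON value to an existing element."""
    if isinstance(value, dict):
        for key, item in value.items():
            # An array becomes one element per member, all named after the key.
            members = item if isinstance(item, list) else [item]
            for member in members:
                _fill(ET.SubElement(parent, key), member)
    else:
        parent.text = _scalar(value)


def json_to_xml(text):
    """Convert a JSON object with a single top-level key to an Element."""
    data = json.loads(text)
    (root_name, root_value), = data.items()  # fails unless exactly one key
    root = ET.Element(root_name)
    _fill(root, root_value)
    return root
```

Calling ET.tostring(json_to_xml(text), encoding="unicode") on the traffic data above would serialize the element tree. Note that the JSON escape \/ in "N\/A" is resolved by the JSON parser, so the Direction element contains N/A, as in the listing.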

The functions f:json-document() and f:json-file-document() are available immediately from the CVS of the FXSL project.

All this pure XSLT magic (and sure, expect more to come) is made possible by the LR Parsing Framework implemented in FXSL.

More to come  soon.

June 09

Solving Sudoku with a single SQL statement and with a single RegEx

If you found my and Andrew Welch's XSLT Sudoku solvers grotesque, then hold on:
 
 
What seems even more bizarre is this one, solving Sudoku within a regular expression.
 
April 17

XSLT Text Processing: Fun with Anagrams

After so many nice experiences with XSLT text processing (dictionary lookup, spelling checking and suggesting candidate words, text justifying, text search and replacement, concordance over large corpora, even parsing LR-1 languages), I had almost decided that there was nothing much left to do in this area.

These days, while some people who don't know what they are talking about are discussing "Is XSLT dead?" on the XSL-List, it suddenly came to me that yes, there was something left: finding anagrams.

It turns out that finding anagrams with XSLT is as easy and elegant as shown in the following code:

 

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 >
 <xsl:import href="../f/func-qsort.xsl"/> 

 <xsl:key name="kAnagram" match="w"
   use="codepoints-to-string(f:qsort(string-to-codepoints(.)))"
  />

 <xsl:template match="/">
   <xsl:copy-of select="key('kAnagram', 'acert')"/>
 </xsl:template>

</xsl:stylesheet>  


Here I am reusing the same dictionary of 46379 English wordforms that I used for the spelling checking tasks. In fact, the transformation is applied to it -- the document dictEnglish.xml. Each word has its Anagram Key, which is the string formed by the sorted sequence of the characters of the word. It is easy to see that words that are anagrams of each other have the same anagram key.

The Anagram Key of a word is specified as the following XPath expression:

  codepoints-to-string(f:qsort(string-to-codepoints(.)))

where string-to-codepoints() and codepoints-to-string() are standard XPath 2.0 functions and f:qsort() is a function from FXSL.

In the code above the English dictionary is indexed using as key (yes!) the Anagram Key for each word. Then we get all words that have the same Anagram Key as the word "trace". The result is:  

<w>caret</w><w>cater</w><w>crate</w>
<w>react</w><w>recta</w><w>trace</w>
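The anagram-key idea does not depend on XSLT. As a cross-check, here is a minimal Python sketch of the same indexing scheme. The names anagram_key and build_index are invented for this illustration, and the word list in the usage note is just the six result words above, not the dictionary file.

```python
from collections import defaultdict


def anagram_key(word):
    """Sorted characters of the word, joined back into a string."""
    return "".join(sorted(word))


def build_index(words):
    """Map each anagram key to the list of words that share it."""
    index = defaultdict(list)
    for word in words:
        index[anagram_key(word)].append(word)
    return index
```

With words = ["caret", "cater", "crate", "react", "recta", "trace"], anagram_key("trace") is "acert", the same key used in the key() call of the stylesheet, and build_index(words)["acert"] returns all six words.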

While it is nice to be able to implement such functionality with just a few lines of code, it is not wise to index the huge English dictionary each time we need to get some anagrams. In fact, this indexing operation takes a lot of time -- in the example above it took about 280 seconds.

The next logical step is to persist the indexed document into an Anagram Dictionary and reuse it from here on. Here is the XSLT transformation, which creates the Anagram Dictionary:

<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 xmlns:f="http://fxsl.sf.net/" 
 exclude-result-prefixes="f xs" 
 > 
 <xsl:import href="../f/func-standardXSLTXpathFunctions.xsl"/>  

 <!-- To be applied on dictEnglish.xml -->  

 <