![]() |
|
Spaces home XSLT: Riding the challen...ProfileFriendsBlogMore ![]() | ![]() |
|
XSLT: Riding the challengeAll the things you thought impossible to do with XSLT
July 20 The Swiss Army Knife of Data Structures ... in C#Actually, it is not a knife and is known under the name of Finger Tree. Background: "a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece. Representations achieving these bounds have appeared previously, but 2-3 finger trees are much simpler, as are the operations on them. Further, by defining the split operation in a general form, we obtain a general purpose data structure that can serve as a sequence, priority queue, search tree, priority search queue and more." Why the finger tree deserves to be called the Swiss knife of data structures can best be explained by again quoting the introduction of the paper: "The operations one might expect from a sequence abstraction include adding and removing elements at both ends (the deque operations), concatenation, insertion and deletion at arbitrary points, finding an element satisfying some criterion, and splitting the sequence into subsequences based on some property. Many efficient functional implementations of subsets of these operations are known, but supporting more operations efficiently is difficult. The best known general implementations are very complex, and little used. Efficiency and universality are the two most attractive features of finger trees. Not less important is simplicity, as it allows easy understanding, straightforward implementation and uneventful maintenance. Stacks support efficient access to the first item of a sequence only, queues and deques support efficient access to both ends, but not to an randomly-accessed item. Arrays allow extremely efficient O(1) access to any of their items, but are poor at inserting, removal, splitting and concatenation. Lists are poor (O(N)) at locating a randomly indexed item. Remarkably, the finger tree is efficient with all these operations. One can use this single data structure for all these types of operations as opposed to having to use several types of data structures, each most efficient with only some operations. Note also the words functional and persistent, which mean that the finger tree is an immutable data structure. In .NET the IList<T> interface specifies a number of void methods, which change the list in-place (so the instance object is mutable). To implement an immutable operation one needs first to make a copy of the original structure (List<T>, LinkedList<T>, ..., etc). An achievement of .NET 3.5 and LINQ is that the set of new extension methods (of the Enumerable class) implement immutable operations. In the year 2008, Finger Tree implementations have been known only in a few programming languages: in Haskell, in OCaml, and in Scala. At least this is what the popular search engines say. What about a C# implementation? In February Eric Lippert had a post in his blog about finger trees. The C# code he provided does not implement all operations of a Finger Tree and probably this is the reason why this post is referred to by the Wikipedia only as "Example of 2-3 trees in C#", but not as an implementation of the Finger Tree data structure. My modest contribution is what I believe to be the first complete C# implementation of the Finger Tree data structure as originally defined in the paper by Hinze and Paterson (only a few exercises have not been implemented). Programming a Finger Tree in C# was as much fun as challenge. The finger tree structure is defined in an extremely generic way. At first I even was concerned that C# might not be sufficiently expressive to implement such rich genericity. It turned out that C# lived up to the challenge perfectly. Here is a small example of how the code uses multiple types and nested type constraints:
// Types:
// U -- the type of Containers that can be split
// T -- the type of elements in a container of type U
// V -- the type of the Measure-value when an element is measured
public class Split<U, T, V> Another challenge was to implement lazy evaluation (the .NET term for this is "deferred execution") for some of the methods. Again, C# was up to the challenge with its IEnumerable interface and the ease and finesse of using the "yield return" statement. The net result: it was possible to write code like this: public override IEnumerable<T> ToSequence() Another challenge, of course, was that one definitely needs to understand Hinze's and Ross' article before even trying to start the design of an implementation. While the text should be straightforward to anyone with some Haskell and functional programming experience, it requires a bit of concentration and some very basic understanding of fundamental algebraic concepts. In the text of the article one will find a precise and simple definition of a Monoid. My first thought was that such academic knowledge would not really be necessary for a real-world programming task. Little did I know... It turned out that the Monoid plays a central role in the generic specification of objects that have a Measure. I was thrilled to code my own version of a monoid in C#: public class Monoid<T> { T theZero;
public monOp theOp;
public Monoid(T tZero, monOp aMonOp) { theZero = tZero;
theOp = aMonOp; }
public T zero { get { return theZero; } } }
Without going into too-much details, here is how the correct Monoids are defined in suitable auxiliary classes to be used in defining a Random-Access Sequence, Priority Queue and Ordered Sequence:
public static class Size { public static Monoid<uint> theMonoid = new Monoid<uint>(0, new Monoid<uint>.monOp(anAddOp));
public static uint anAddOp(uint s1, uint s2) { return s1 + s2; } }
public static class Prio
{
public static Monoid<double> theMonoid =
new Monoid<double>
(double.NegativeInfinity,
new Monoid<double>.monOp(aMaxOp)
);
public static double aMaxOp(double d1, double d2)
{
return (d1 > d2) ? d1 : d2;
}
}
public class Key<T, V> where V : IComparable
{
public delegate V getKey(T t);
// maybe we shouldn't care for NoKey, as this is too theoretic
public V NoKey;
public getKey KeyAssign;
public Key(V noKey, getKey KeyAssign)
{
this.KeyAssign = KeyAssign;
}
}
public class KeyMonoid<T, V> where V : IComparable
{
public Key<T, V> KeyObj;
public Monoid<V> theMonoid;
public V aNextKeyOp(V v1, V v2)
{
return (v2.CompareTo(KeyObj.NoKey) == 0) ? v1 : v2;
}
//constructor
public KeyMonoid(Key<T, V> KeyObj)
{
this.KeyObj = KeyObj;
this.theMonoid =
new Monoid<V>(KeyObj.NoKey,
new Monoid<V>.monOp(aNextKeyOp)
);
}
}
Yet another challenge was to be able to create methods dynamically, as currying was essentially used in the specification of finger trees with measures. Once again it was great to make use of the existing .NET 3.5 infrastructure. Below is my simple FP static class, which essentially uses the .NET 3.5 Func object and a lambda expression in order to implement currying:
public static class FP { public static Func<Y, Z> Curry<X, Y, Z> { return (y) => func(x, y); } }
And here is a typical usage of the currying implemented above:
public T ElemAt(uint ind) { (new MPredicate<uint> ( FP.Curry<uint, uint, bool>(theLessThanIMethod2, ind) ), 0 ).splitItem.Element; }
Now, for everyone who have reached this point of my post, here is the link to the complete implementation. Be reminded once again that .NET 3.5 is needed for a successful build. In my next posts I will analyze the performance of this Finger Tree implementation and how it fares compared to existing implementations of sequential data structures as provided by different programming languages and environments. June 04 Oleg Tkachenko has joined MicrosoftLong time Microsoft MVP Oleg Tkachenko has joined Microsoft's Visual Studio team. Oleg is well-known for his work on a number of popular XML/XSLT - related open source projects, such as EXSLT.NET, MVP.XML, nXslt.exe, the Iron XSLT project type of Visual Studio 2008, ... , etc. Oleg is also a member of the FXSL open source project. I am glad for Oleg and want to congratulate him on his new role. May 18 The Balisage conference program is availableThe Balisage conference program is available and a quick look shows there are some very interesting presentations. May 14 Access online the books you bought from Amazon
Here's how the announcement looks like:
Now, if only this book could also be made accessible:
March 16 Jeni's "Grouper" (or how to specify and parse additional rules to a grammar) is just a minor task for LBNFIn her recent post "RELAX NG for matching" Jeni Tennison said:
Then she proceeds by making the suggestion that "As a language, RELAX NG is really good at this"
Jeni ends with the following statement:
It is obvious that the solution of this problem is not strictly bound to RELAX NG itself Very fortunately, I know at least one such system: the Labelled BNF Grammar Formalism (LBNF). Jeni's "grouper" tool can be implemented by adding additional rules to the language being specified Here's how the authors of LBNF Markus Forsberg, Aarne Ranta from Chalmers University of Technology
Of course, LBNF is a very nice and cool tool to carry out a number of really important language development tasks, besides the "grouper". November 09 Wide Finder in XSLT --> deriving new requirements for efficiency in XSLT processors.With his twelve posts under the common title of "The Wide Finder Project" Tim Bray formulated a problem, which obviously has since then stirred some people, anxious to prove that their programming language of choice fares well in implementing solutions for this class of problems. While I am not aware of any XSLT implementation that provides explicit or implicit support for parallel processing (with the obvious goal to take advantage of the multi-core processors that have almost reached a "prevalent" status today), I was curious to determine at least two things:
Before going further let me remind that there is a popular (and as we'll see unfounded!) belief that any XSLT solution must be hugely inefficient and that any XSLT code can only be ugly. In fact, the following comment to one of Tim's posts reflects exactly this mindset:
My initial XSLT solution to Tim's problem is below. No optimization attempts have been attempted, not only because I don't have an XSLT processor that utilizes multi-core processor, but also because it seems there's only a limited possibility for optimization (adding more CPU's is not likely to speed up the reading of a huge file from a single drive).
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xsl:output method="text"/> This 22-line-transformation is performed by Saxon 9.0 on a 3GHz DELL desktop. The file Log1000000.txt was constructed using the 10000 lines file provided by Tim and copying it into another file 100 times. The size of this file is about 200MB. The results were produced in 36.175 seconds using about 929MB of RAM. Saxon should be timed using the -repeat:3 command-line option, which performs the transformation three times. Using only the results from the first/single transformation will be misleading, as they include the time for the Java VM startup.
Discussion
To answer the questions:1. How well a good XSLT 2.0 processor and a straightforward solution fare against other languages/solutions? The non-optimized, uniprocessor version of this solution has a time of 36 seconds for processing about 200MB log file. It is likely the timing for the full , five times bigger log file used by Tim Bray, will be about 5 times bigger: 180 sec, or about 3 minutes. 3 minutes for processing 1GB of data is not a bad time for an XSLT transformation, considering that sizes even of a few hundreds of megs are still considered monstrous. XSLT could be in fact one of the winners in the WF competition, had its creators stepped just a little bit further exploiting the natural, non-sequential XSLT processing model. Here I am talking about specifying a single f:parMap() function, which can be defined exactly as the current FXSL f:map() function: <xsl:function name="f:map" as="item()*"> <xsl:param name="pFun" as="element()"/> <xsl:param name="pList1" as="item()*"/> <xsl:sequence select= "for $this in $pList1 return f:apply($pFun, $this)" /> </xsl:function>
The only difference is that f:parMap() adds to the semantics of f:map() the hint to use as many available CPU processors as appropriate when evaluating the "for" expression in the code of the function. Yes, this can be done for any "/" operator or for any "for" expression such as the one used in lines 13 - 14 in our XSLT solution: "for $line in $vLines
and for any <xsl:apply-templates .../> instruction, and for any <xsl:for-each .../> instruction, and for any <xsl:for-each-group ...> instruction and ... and ... and ...
Judging from the published results, this straightforward, non-optimized uniprocessor XSLT solution comes not too-far behind some of the optimized for multi-processing solutions. I hope that in the not so distant future there will be XSLT processors exploiting the inherent non-sequential nature of XSLT processing to implement highly-optimized multi-processing solutions. 2. Where is the XSLT code on the scale of "beautiful code"? By its compactness (22 lines) it is in 3rd place and rivals the 2nd place (17 lines), as we could easily remove 4 lines (3 of them blank) used to add readability. One of Tim's criteria for "beautiful code" is avoiding awkwardness.He speaks about Ruby not needing ending semicolons or variable type declarations. However, Tim's solution still had to use two lines for defining a hash and setting its default to 0. By contrast, the XSLT solution above does not require the programmer to introduce and initialize a special data structure -- the support for grouping is right there in the core of the language. There is simply no way the programmer can do something wrong in declaring or using a hash table.
What could the XSLT processor do better?Both the timing and especially the amount of RAM used indicate how the XSLT processor did its work:
Imagine that the XSLT Processor is really lazy:
Using these rules a truly lazy XSLT processor will only need space large enough for the longest line, and for a hash table to keep the keys and counters for all different articles being grouped. In this case there are just 562 unique values extracted from the strings of $vLines. In this way, the processing -- reading a line, finding the article it contains and feeding this to the hash table -- could be accomplished in streaming mode while reading the text file. Upon reaching the end of the file, there would be very little left to do, and thus almost nothing to be further optimized.
ConclusionI truly believe that the described improvements can be implemented by at least some of the best XSLT 2.0 processors around here. For many years I have been using Saxon and am very grateful to its developer Dr. Michael Kay. The performance efficiency has been constantly growing -- a very good example to follow by all other XSLT vendors and if they do there will be competition, and competition is good for us all. November 04 Closure in XSLTIn a recent post to the xml-dev mailing list, "XSLT question on transitive closures", Rick Jelliffe wrote:
"Am I right in thinking that 1) XPath2 functions don't have have a function for transitive closure (along provided xpaths) 2) SAXON 8 does not have the saxon:closure() extension function that older versions of SAXON had 3) The one to use is probably still Christian Nentwich's code from circa 2001 as adopted into EXSLT ?"
"I think it would be nice to do it properly based on FXSL higher-order functions, which are much more cleanly specified. Perhaps there is already a suitable function in FXSL. The other thing that's needed is the ability to check for cycles. Simply blowing the stack or looping isn't good enough."
David Carlisle replied by providing a solution that would dynamically generate a new XSLT stylesheet and a closure function in it that uses only a specific function and implements its closure. He also said:
"> I think it would be nice to do it properly based on FXSL higher-order True (but I'll leave that for Dimitre :-)"
Thanks to these nice people I felt just like... something had been left for me :o) So, now we have in FXSL the function which, given a function pFun and a starting set of items pstartSet, produces the transitive closure of pFun.
While the complete 47 lines of code can be viewed using the above link, the essence of the implementation is in the following 20 lines:
<xsl:function name="f:closure" as="node()*"> <xsl:param name="pFun" as="element()"/> <xsl:param name="pstartSet" as="node()*"/> <xsl:sequence select="f:closure2($pFun, $pstartSet,$pstartSet)"/> </xsl:function> <xsl:function name="f:closure2" as="node()*"> <xsl:param name="pFun" as="element()"/> <xsl:param name="pCurClosure" as="node()*"/> <xsl:param name="pstartSet" as="node()*"/> <xsl:if test="exists($pstartSet)"> <xsl:variable name="vNew" select= "f:map($pFun,$pstartSet) except $pCurClosure"/> <xsl:sequence select= "$pstartSet | $vNew | f:closure2($pFun,$pCurClosure | $vNew, $vNew)"/> </xsl:if> </xsl:function>
And here is my reply to David's post (code hyperlinked, full code omitted): " > > True (but I'll leave that for Dimitre:-) I am sorry I only read this a few days ago. Below is the code of the FXSL function. While the code is straightforward, the following must be noted: 1. The "set" at present is only a set of nodes. I will probably produce a more general f:closure() function, which operates on any set of items. Then this function should be also passed as parameters a "union" and a "difference" functions. 2. It seems that David's solution would go into an infinite loop for more involved examples (see the second test with reachability of nodes in cyclic graphs below). Therefore, the algorithm was slightly changed and works correctly. The following files can be downloaded from the CVS of the FXSL project: The last transformation should be applied on the following xml file:
The result from running the second test transformation above (two cases of finding all nodes of a given graph, reachable from a specific node. In the second case there is a cycle involving the nodes V2, V6, V7) is: === Reachable from V1 ======= <node id="V1"/>
Due to its generality, the f:closure() function is a useful addition to the FXSL library. Cheers, Dimitre Novatchev
October 13 Kurt Cagle: Bridging XML E4X and JSONUsing this title Kurt Cagle writes: "I'd like to push a proposal to both the XML and AJAX communities, something that I think needs to be taken up by the W3C, the OpenAJAX alliance and JSON.org especially. Establish a set of conventions within JSON that most readily facilitate JSON being used in an XML context. These conventions should be syntactical, things that can be done with hash key naming conventions that can be picked up by a JSON/XML bridge to transform between the two formats."
Hmm... here's what I have to say: Kurt, this was quietly done some time ago. Just have a look at my blog ("Transforming JSON"). M.David. Peterson has a nice web-service based example, which in real time gets the XML out of the JSON produced by the Yahoo's Traffic Service. This example just uses the JSON to XML convertor provided by FXSL -- that is the function f:json-document(), written in pure XSLT. So, no conventions are necessary for transforming JSON to XML ! July 05 Transforming JSONUpdate: (Think that) finally managed to get from Windows Live Spaces the formatting I wanted... Update: Enhanced the JSON Lexer to properly deal with escaped characters within strings. How to handle \b and \f ??? ==================================== Ever wanted to access and manipulate JSON as ordinary XML? To transform it with XSLT? No problem, use the f:json-document() and f:json-file-document() as provided by FXSL. Here is a quick example: Let's have (yes, this is the Yahoo Traffic Service):
| |||||