Nightly thoughts: Writing an extension step for Calabash, to use BaseX

Introduction

Writing an extension for Calabash in Java involves three different things: 1/ the Java class itself, which has to implement the interface XProcStep, 2/ binding a step name to the implementation class, and 3/ declaring the step in XProc.

Java

Let's take, as an example, a step that evaluates a query using the standalone BaseX processor. The goal is neither a fully functional step nor a production-quality one with error reporting and such, but rather to show how to glue all the pieces together. The step has one input port, named source, and one output port, named result. It takes the string value of the document on the input port (typically a c:query element) and evaluates it as an XQuery expression using BaseX. The serialized result is parsed as an XML document and sent to the output port (if the result of the query is not an XML document or element, parsing fails with an error). Let's start with the Java class implementing the extension step:

/****************************************************************************/
/*  File:       BasexStandaloneQuery.java                                   */
/*  Author:     F. Georges - H2O Consulting                                 */
/*  Date:       2011-08-31                                                  */
/*  Tags:                                                                   */
/*      Copyright (c) 2011 Florent Georges.                                 */
/* ------------------------------------------------------------------------ */


package org.fgeorges.test;

import com.xmlcalabash.core.XProcException;
import com.xmlcalabash.core.XProcRuntime;
import com.xmlcalabash.io.ReadablePipe;
import com.xmlcalabash.io.WritablePipe;
import com.xmlcalabash.library.DefaultStep;
import com.xmlcalabash.runtime.XAtomicStep;
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.DocumentBuilder;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XdmNode;
import org.basex.core.BaseXException;
import org.basex.core.Context;
import org.basex.core.cmd.XQuery;


/**
 * Sample extension step to evaluate a query using BaseX.
 *
 * @author Florent Georges
 * @date   2011-08-31
 */
public class BasexStandaloneQuery
        extends DefaultStep
{
    public BasexStandaloneQuery(XProcRuntime runtime, XAtomicStep step)
    {
        super(runtime,step);
    }

    @Override
    public void setInput(String port, ReadablePipe pipe)
    {
        mySource = pipe;
    }

    @Override
    public void setOutput(String port, WritablePipe pipe)
    {
        myResult = pipe;
    }

    @Override
    public void reset()
    {
        mySource.resetReader();
        myResult.resetWriter();
    }

    @Override
    public void run()
            throws SaxonApiException
    {
        super.run();

        XdmNode query_doc = mySource.read();
        String query_txt = query_doc.getStringValue();
        XQuery query = new XQuery(query_txt);
        Context ctxt = new Context();
        // TODO: There should be something more efficient than serializing
        // everything and parsing it again...  Plus, if the result is not an XML
        // document, wrap it into a c:data element.  But that's beside the point.
        String result;
        try {
            result = query.execute(ctxt);
        }
        catch ( BaseXException ex ) {
            throw new XProcException("Error executing a query with BaseX", ex);
        }
        DocumentBuilder builder = runtime.getProcessor().newDocumentBuilder();
        Source src = new StreamSource(new StringReader(result));
        XdmNode doc = builder.build(src);

        myResult.write(doc);
    }

    private ReadablePipe mySource = null;
    private WritablePipe myResult = null;
}

An extension step has to implement the Calabash interface XProcStep. Calabash provides a convenient class, DefaultStep, that implements all the methods with default behaviour, which is good enough for most uses. The only things we have to do are to save the input and output for later use, to reset them in case the step object is reused, and of course to provide the main processing in run(). In run(), we read the document from the source port, get its string value, execute it as a query using the BaseX API, and parse the result as XML to write it to the result port.
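
The last part of run(), parsing the serialized result, is where a query that does not return XML fails. Here is a minimal sketch of that behaviour, assuming plain JAXP instead of the Saxon DocumentBuilder used in the step, so that it runs without Calabash or BaseX on the classpath. The class and method names (ParseResultSketch, rootNameOrNull) are made up for the illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ParseResultSketch {

    /** Returns the root element name, or null if the text is not well-formed XML. */
    static String rootNameOrNull(String serialized) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(serialized)));
            return doc.getDocumentElement().getLocalName();
        } catch (SAXException ex) {
            // Same situation as the step: the query result is not an XML document.
            return null;
        } catch (ParserConfigurationException | IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // What the example query <res> { 1 + 1 } </res> would serialize to.
        System.out.println(rootNameOrNull("<res>2</res>"));
        // What a query such as 1 + 1 would serialize to: an atomic value, not XML.
        System.out.println(rootNameOrNull("2"));
    }
}
```

In the real step the parse error is not caught: it surfaces as a SaxonApiException thrown from run().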

As you can see, there is nothing in the class itself about the interface of the step: its type name, its inputs and outputs, its options, etc. This is done in two different places. First you link the step type to the implementation class, then you declare the step with XProc.

Tell Calabash about the class

Linking the step type to the implementation class is done in a Calabash config file. So you have to create a new config file and pass it to Calabash on the command line with the option --config (abbreviated -c). The file itself is very simple: it links the step type (a QName) to the class (a fully qualified Java class name):

<xproc-config xmlns="http://xmlcalabash.com/ns/configuration"
              xmlns:fg="http://fgeorges.org/ns/tmp/basex">

   <implementation type="fg:ad-hoc-query"
                   class-name="org.fgeorges.test.BasexStandaloneQuery"/>

</xproc-config>
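
To make the binding concrete, here is a rough sketch of how such a config file can be read into a map from step type to class name. It is only an illustration using JAXP, and says nothing about how Calabash reads its configuration internally. The names StepBindingSketch and readBindings are invented for the example, and the sketch assumes the type attribute is always a prefixed QName, as above:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StepBindingSketch {
    static final String CONFIG_NS = "http://xmlcalabash.com/ns/configuration";

    // The config file from this post, inlined so the sketch is self-contained.
    static final String EXAMPLE_CONFIG =
          "<xproc-config xmlns='http://xmlcalabash.com/ns/configuration'"
        + "              xmlns:fg='http://fgeorges.org/ns/tmp/basex'>"
        + "  <implementation type='fg:ad-hoc-query'"
        + "                  class-name='org.fgeorges.test.BasexStandaloneQuery'/>"
        + "</xproc-config>";

    /** Maps each step type, written as "{namespace-uri}local-name", to its class name. */
    static Map<String, String> readBindings(String configXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
            Map<String, String> bindings = new LinkedHashMap<>();
            NodeList impls = doc.getElementsByTagNameNS(CONFIG_NS, "implementation");
            for (int i = 0; i < impls.getLength(); i++) {
                Element impl = (Element) impls.item(i);
                // Assumption: the type is a prefixed QName (prefix:local-name).
                String[] qname = impl.getAttribute("type").split(":", 2);
                // Resolve the prefix with the namespace declarations in scope.
                String uri = impl.lookupNamespaceURI(qname[0]);
                bindings.put("{" + uri + "}" + qname[1], impl.getAttribute("class-name"));
            }
            return bindings;
        } catch (Exception ex) {
            throw new IllegalStateException("Cannot read the config", ex);
        }
    }

    public static void main(String[] args) {
        readBindings(EXAMPLE_CONFIG).forEach(
                (type, cls) -> System.out.println(type + " -> " + cls));
    }
}
```

The key is the expanded name (namespace URI plus local name) rather than the lexical fg:ad-hoc-query, because a pipeline is free to bind the same namespace to a different prefix.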

Declare the step

Finally, declaring the step in XProc is done using the standard p:declare-step. If it contains no subpipeline (that is, if it contains only p:input, p:output and p:option children), then it is a declaration of a step whose implementation lives somewhere else; if it contains a subpipeline, then it is a step type definition, with the implementation written in XProc itself. The declaration can be copied and pasted into the main pipeline itself, but as in any other language, the best practice is to put it in an XProc library (composed only of step declarations) and to import this library into the main pipeline using p:import. In our case, we declare the step type with an input port source and an output port result (both primary), and no options:

<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:fg="http://fgeorges.org/ns/tmp/basex"
           xmlns:pkg="http://expath.org/ns/pkg"
           pkg:import-uri="http://fgeorges.org/tmp/basex.xpl"
           version="1.0">

   <p:declare-step type="fg:ad-hoc-query">
      <p:input  port="source" primary="true"/>
      <p:output port="result" primary="true"/>
   </p:declare-step>

</p:library>
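
The difference between a bare declaration and a full definition can be checked mechanically. The sketch below applies only the simplified rule given above (nothing but p:input, p:output and p:option children means "declaration"); a real processor has more cases to consider, since the XProc specification allows a few other children as well. DeclarationSketch and isDeclarationOnly are names made up for the example:

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DeclarationSketch {
    static final String XPROC_NS = "http://www.w3.org/ns/xproc";
    static final List<String> SIGNATURE = Arrays.asList("input", "output", "option");

    static final String NS_DECLS =
          " xmlns:p='http://www.w3.org/ns/xproc'"
        + " xmlns:fg='http://fgeorges.org/ns/tmp/basex'";

    // The declaration from the library above: signature only.
    static final String DECLARATION =
          "<p:declare-step type='fg:ad-hoc-query'" + NS_DECLS + ">"
        + "  <p:input port='source' primary='true'/>"
        + "  <p:output port='result' primary='true'/>"
        + "</p:declare-step>";

    // Same signature, plus a (trivial) subpipeline written in XProc.
    static final String DEFINITION =
          "<p:declare-step type='fg:ad-hoc-query'" + NS_DECLS + ">"
        + "  <p:input port='source' primary='true'/>"
        + "  <p:output port='result' primary='true'/>"
        + "  <p:identity/>"
        + "</p:declare-step>";

    /** True if the p:declare-step has only p:input, p:output and p:option children. */
    static boolean isDeclarationOnly(String declareStepXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(declareStepXml)))
                    .getDocumentElement();
            for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                boolean signature = XPROC_NS.equals(c.getNamespaceURI())
                        && SIGNATURE.contains(c.getLocalName());
                if (!signature) {
                    return false; // any other element starts a subpipeline
                }
            }
            return true;
        } catch (Exception ex) {
            throw new IllegalStateException("Cannot parse the step", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(isDeclarationOnly(DECLARATION));
        System.out.println(isDeclarationOnly(DEFINITION));
    }
}
```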

Using it

Now that we have all the pieces, we can write an example main pipeline using this new extension step:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:fg="http://fgeorges.org/ns/tmp/basex"
                name="pipeline"
                version="1.0">

   <p:import href="basex-lib.xpl"/>

   <p:output port="result" primary="true"/>

   <fg:ad-hoc-query>
      <p:input port="source">
         <p:inline>
            <c:query>
               &lt;res> { 1 + 1 } &lt;/res>
            </c:query>
         </p:inline>
      </p:input>
   </fg:ad-hoc-query>

</p:declare-step>

To run it, just issue the following command on the command line (where basex-steps.jar is the JAR file you compiled the extension step class into):

> java -cp ".../calabash.jar:.../basex-6.7.1.jar:.../basex-steps.jar" \
       com.xmlcalabash.drivers.Main \
       -c basex-config.xml \
       example.xproc

If you use this script, you can then use the following command:

> calabash ++add-cp .../basex-6.7.1.jar \
           ++add-cp .../basex-steps.jar \
           -c basex-config.xml \
           example.xproc

Packaging

Update: The mechanism described in this section has been implemented, see this blog entry.

If you want to distribute your extension publicly, you have to provide your users with 1/ the JAR file, 2/ the config file and 3/ the library file. The user then needs to correctly configure Java with the JAR file, correctly configure Calabash with the config file, and use a suitable URI in the p:import/@href in their pipeline. That makes many different places where the user can make a mistake.

The EXPath Packaging open-source implementation for Calabash does not support Java extension steps yet, but support is planned, in order to handle that configuration part automatically. The goal is to have the library author define an absolute URI for the XProc library (declaring the steps), which the user uses in p:import regardless of where the library is actually installed (the URI will be resolved automatically). The details (classpath setting, XProc library resolution, and Calabash config) should then be handled by the packaging support. Once the package of the extension step has been installed in the repository, one can then execute the following pipeline (note the import URI has changed):

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:fg="http://fgeorges.org/ns/tmp/basex"
                name="pipeline"
                version="1.0">

   <p:import href="http://fgeorges.org/tmp/basex.xpl"/>

   <p:output port="result" primary="true"/>

   <fg:ad-hoc-query>
      <p:input port="source">
         <p:inline>
            <c:query>
               &lt;res> { 1 + 1 } &lt;/res>
            </c:query>
         </p:inline>
      </p:input>
   </fg:ad-hoc-query>

</p:declare-step>

by simply invoking the following command:

> calabash example.xproc
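
The resolution of the import URI can be pictured as a simple lookup from the public URI to wherever the library has been installed. The sketch below uses the standard javax.xml.transform.URIResolver interface for that. It is not how the EXPath packaging support is actually implemented: the class ImportUriSketch, its register method and the local location file:/repo/basex/basex-lib.xpl are all hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

class ImportUriSketch implements URIResolver {
    // Public import URI -> location of the installed library (hypothetical layout).
    private final Map<String, String> installed = new HashMap<>();

    void register(String importUri, String localLocation) {
        installed.put(importUri, localLocation);
    }

    @Override
    public Source resolve(String href, String base) {
        String local = installed.get(href);
        // Returning null lets the caller fall back to its default resolution.
        return local == null ? null : new StreamSource(local);
    }

    public static void main(String[] args) {
        ImportUriSketch repo = new ImportUriSketch();
        repo.register("http://fgeorges.org/tmp/basex.xpl",
                      "file:/repo/basex/basex-lib.xpl");
        Source src = repo.resolve("http://fgeorges.org/tmp/basex.xpl", null);
        System.out.println(src.getSystemId());
    }
}
```

With such a mapping in place, the pipeline only ever mentions the public URI, and moving the installed files means updating the repository, not the pipelines.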

Labels: basex, calabash, expath, xproc

1 Comment:

Christian Grün said...: Dear Florent, thanks for this blog entry. In the following, I have listed two quick alternatives for evaluating XQuery expressions in BaseX. The first version directly communicates with the XQuery processor of BaseX and caches the serialized byte stream (bypassing the string conversion):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.basex.core.Context;
import org.basex.data.Result;
import org.basex.io.serial.Serializer;
import org.basex.query.QueryException;
import org.basex.query.QueryProcessor;
...

@Override
public void run() throws SaxonApiException {
    super.run();

    XdmNode query_doc = mySource.read();
    String query_txt = query_doc.getStringValue();

    // Serialize the query result straight into a byte buffer.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Context ctx = new Context();
    QueryProcessor qp = new QueryProcessor(query_txt, ctx);
    try {
        Serializer ser = qp.getSerializer(baos);
        Result res = qp.execute();
        res.serialize(ser);
    } catch(QueryException ex) {
        throw new XProcException(ex);
    } catch(IOException ex) {
        throw new XProcException(ex);
    }
    // Let Saxon parse the buffered bytes.
    Source src = new StreamSource(new ByteArrayInputStream(baos.toByteArray()));

    DocumentBuilder builder = runtime.getProcessor().newDocumentBuilder();
    XdmNode doc = builder.build(src);
    myResult.write(doc);
}

The second variant communicates with the client/server architecture of BaseX:

import java.io.IOException;
import org.basex.core.BaseXException;
import org.basex.server.ClientSession;
...

@Override
public void run() throws SaxonApiException {
    super.run();

    XdmNode query_doc = mySource.read();
    String query_txt = query_doc.getStringValue();

    try {
        // Connect to a BaseX server running locally, with the default credentials.
        ClientSession cs = new ClientSession("localhost", 1984, "admin", "admin");
        final String result = cs.query(query_txt).execute();
        Source src = new StreamSource(new StringReader(result));

        DocumentBuilder builder = runtime.getProcessor().newDocumentBuilder();
        XdmNode doc = builder.build(src);
        myResult.write(doc);
    } catch (IOException ex) {
        throw new XProcException(ex);
    } catch (BaseXException ex) {
        throw new XProcException(ex);
    }
}

In both variants, the result is completely serialized before it is passed on to Saxon's node builder. If the intermediate result gets very large, we could try in a second step to merge the serializer and input stream.

All the best,
Christian; 00:18

Sunday, September 04, 2011
