Monday, December 10, 2007

ODF - Hello World with XQuery

This is the "How to create the simple hello-world document for OpenOffice using XQuery" example. I'm using MarkLogic Server for this of course. If you're interested you can download a free copy using a Community License here. I'm also using OpenOffice 2.3.1, available here.

An OpenOffice document is just a zip file. It's actually a .jar file, but we don't care about that right now. We can unzip it and extract the separate XML parts, just as we would with any zip file. In fact, in Windows, you can change the extension of your OpenOffice document from .odt to .zip, right-click, select "Extract All", then take a look at the files in the folder.
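The same unpacking can be done on the server. As a minimal sketch, assuming MarkLogic's xdmp:zip-create, xdmp:zip-manifest, and xdmp:zip-get built-ins, the query below zips a tiny stand-in package in memory, then lists its parts and pulls one back out. The <greeting> element is made up for illustration and is not real ODF.

```xquery
(: Stand-in package: one made-up XML part, zipped in memory. :)
let $zip :=
  xdmp:zip-create(
    <parts xmlns="xdmp:zip">
      <part>content.xml</part>
    </parts>,
    <greeting>Hello World!</greeting>)
return (
  (: The list of parts in the package. :)
  xdmp:zip-manifest($zip),
  (: One part, extracted by its path inside the zip. :)
  xdmp:zip-get($zip, "content.xml")
)
```

With a real .odt you would pass its binary to the same two functions instead of the stand-in.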

Similarly, we can create an OpenOffice file by creating the required parts and zipping them up into a package. When we say an ODF or OpenOffice document, we're usually referring to the collection of XML documents that make up the .odt you use with OpenOffice. ODF stands for OpenDocument Format, and is the name given to the XML that OpenOffice uses to store your documents.

This should all sound very familiar. It's very similar to what I've posted on Office Open XML and the .docx file in Word: we often say an Office 2007 / Word document, but mean the collection of XML files that make up the .docx.

The minimal .odt document has just 2 parts: content.xml and META-INF/manifest.xml. We place the main text and body of our document in content.xml, and list the files that compose the document in manifest.xml. I'll explore the other files in future posts; they have to do with styling your document, meta-information about your document (created by, created date, etc.), images, and so on. Since we aren't using any other files in this example, this document will have zero formatting and no meta-information associated with it.

OK, place the following in a file named openODF.xqy under /Docs of your MarkLogic install. You can then evaluate it by opening your browser and navigating to http://localhost:8000/openODF.xqy. Your test document will open directly in OpenOffice Writer. You can then mess with the XQuery and XML to create other types of documents. Good times!

Note: To keep the code readable I had to split a couple of nodes across lines. I was able to cut-and-paste this into a .xqy and evaluate it with no problems, but I mention it in case you run into any issues.
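
(: Zip the ODF manifest and the content node into an .odt package. :)
define function generate-odt(
  $docmanifest as node(),
  $content as node()
) as binary()
{
  (: The zip manifest lists one path per node in $parts, in the same order. :)
  let $manifest :=
    <parts xmlns="xdmp:zip">
      <part>META-INF/manifest.xml</part>
      <part>content.xml</part>
    </parts>
  let $parts := ($docmanifest, $content)
  return
    xdmp:zip-create($manifest, $parts)
}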

(: META-INF/manifest.xml: declares the package type and each file in it. :)
let $docmanifest :=
  <manifest:manifest
      xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
    <manifest:file-entry
        manifest:media-type="application/vnd.oasis.opendocument.text"
        manifest:full-path="/"/>
    <manifest:file-entry
        manifest:media-type="text/xml"
        manifest:full-path="content.xml"/>
  </manifest:manifest>

(: content.xml: the body of the document. :)
let $content :=
  <office:document-content
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
      office:version="1.1">
    <office:body>
      <office:text>
        <text:p text:style-name="Standard">
          <text:s text:c="5"/>Hello World! This is my first paragraph.</text:p>
        <text:p text:style-name="Standard"/>
        <text:p text:style-name="Standard">
          <text:s text:c="5"/>This is another paragraph.</text:p>
      </office:text>
    </office:body>
  </office:document-content>

(: Build the package and return it as a download named hello-world.odt. :)
let $package := generate-odt($docmanifest, $content)
let $filename := "hello-world.odt"
let $disposition := concat("attachment; filename=""", $filename, """")
let $x := xdmp:add-response-header("Content-Disposition", $disposition)
let $x := xdmp:set-response-content-type("application/vnd.oasis.opendocument.text")
return
  $package

For those interested in the ODF format, there's a free book, OpenDocument Essentials, as well as the ODF specification.

The content.xml in a nutshell: <office:document-content> is our root element. Its first children are optional and can be <office:scripts>, <office:font-face-decls>, and <office:automatic-styles>.

We don't see those here; we'll examine them more in the future. The only required element is <office:body>, and this is where the magic happens. Its first child element tells us what type of document we're actually dealing with; we have the choice of:

<office:text>
<office:drawing>
<office:presentation>
<office:spreadsheet>
<office:chart>
<office:image>

We're dealing with text. From there we see its child element <text:p>, which signifies a paragraph. The only funky thing above is the use of <text:s>, which stands for a run of space characters, with text:c giving the count. There are a couple of pages on how to handle whitespace in the Essentials book. You might not have noticed when you opened the document, but each paragraph was indented 5 spaces. You can safely remove the <text:s> nodes from the example above.
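To make the <text:p> and <text:s> pieces concrete, here is a rough sketch that builds the <office:text> element from a plain sequence of strings instead of hand-written paragraphs. The $indent and $lines variables are names I made up for illustration; the namespaces are the same ones used in content.xml above.

```xquery
(: Hypothetical inputs: number of leading spaces, and one string per paragraph. :)
let $indent := 5
let $lines := (
  "Hello World! This is my first paragraph.",
  "This is another paragraph."
)
return
  <office:text
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  {
    for $line in $lines
    return
      (: One paragraph per string: a run of $indent spaces, then the text. :)
      <text:p text:style-name="Standard">{
        <text:s text:c="{ $indent }"/>,
        $line
      }</text:p>
  }
  </office:text>
```

The result could be dropped inside <office:body> in place of the literal <office:text> element in the example.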

OK, so it's a little more than just a Hello World example, but we're not really interested in a one-paragraph, one-word document. For more fun, we can start extracting OpenOffice documents and inserting the pieces into our XML server. It's all just XML at the end of the day, and I actually think it's fun to dissect these formats and then transform them into whatever I want. So with ODF and Office Open XML documents in my server, I can write queries to find what I'm looking for and then deliver the content in any requested format. Sweet!
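As a sketch of that extraction step, the query below walks the parts of a package and inserts each one into the database as its own document. It assumes the .odt was already loaded as a binary document; the "/odf/hello-world.odt" URI and the "/odf/hello-world/" prefix are hypothetical names chosen for this example, and I haven't tried it against every kind of package.

```xquery
(: Hypothetical URI: assumes the .odt was already loaded as a binary document. :)
let $odt := doc("/odf/hello-world.odt")
for $part in xdmp:zip-manifest($odt)/*
let $name := string($part)
(: Skip directory entries, which have no content to extract. :)
where not(ends-with($name, "/"))
return
  (: Store each part under a URI built from its path inside the zip. :)
  xdmp:document-insert(
    concat("/odf/hello-world/", $name),
    xdmp:zip-get($odt, $name))
```

Once that transaction commits, content.xml is an ordinary XML document in the server and can be queried like any other.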

