Sunday, January 6, 2008

ODF vs. OpenXML: Win or Lose, We Win

Happy New Year All! Ok, I've been meaning to post this for awhile now, and I recently had a friend ask me about it, so here we go.

I'm very excited about XML, XQuery, and XML servers. As one who's toiled in many a RDBMS trying to do something useful with unstructured/semi-structured data, this tech brings some serious heat to the party. Right now, as many know, there's this war going on for what's going to be the document format of the future. It's like the documents are the spice and Baron Harkonnen can be heard saying "HE WHO CONTROLS THE FORMAT, CONTROLS THE UNIVERSE!". ( That's a Dune movie reference people. )

While the last comment is meant to bring levity to the situation, people really seem to take their document formats very, very seriously. With all that's going on in the world, I'm often amazed at the amount of energy put into championing one format over another, but this is the thing: at the end of the day, Open Office documents and Office 2007 documents are just bags of XML now, so win or lose in this battle, we win.

Once we have the XML, we can make the documents into anything we want them to be. With an XML server, XQuery, and minimal knowledge about the formats, it's ridiculously easy to change a document of one format type into the other or even some completely different format altogether. So today I thought I'd provide a simple example.

We'll start with the Excel 2007 spreadsheet from this previous post and an OpenOffice Calc spreadsheet using the parts I'll provide below. You can either create these using XQuery, or just create the individual files manually and zip them up to create the .xlsx and .ods documents. I've named my documents, creatively enough, as Sheet1.ods and Sheet1.xslx. Once you have the pieces that compose the files, load them into you XML server.

For this example, I'm using MarkLogic Server of course.
I've loaded the pieces into a directory structure, where the the filename is the name of the directory; Each XML file that comprises the package for the document can be found in the directory named for the file it originated from. An example file in my server has the name "/Sheet1.ods/content.xml".

Ok, it looks like there's a lot here, but there isn't really. And it's hard to publish stuff like this on the web, so, I suggest either A) downloading vim and cutting and pasting the XML and XQuery into a .xml or .xqy file. ( In vim, you'll get syntax highlighting.) or B) Crank up the size of the text in the browser. ( In firefox that's: View -> Text Size -> Increase++ )

Create A Calc Spreadsheet

A simple Calc spreadsheet has two pieces, the manifest and a content.xml file.
Save the following as META_INF/manifest.xml:
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
<manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.spreadsheet" manifest:full-path="/"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>
Save the following as content.xml:

<office:document-content office:version="1.1" xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
<office:body>
<office:spreadsheet>
<table:table table:name="Sheet1" table:print="false">
<office:forms form:automatic-focus="false" form:apply-design-mode="false" xmlns:form="form"/>
<table:table-column table:default-cell-style-name="Default"/>
<table:table-column table:default-cell-style-name="Default"/>
<table:table-column table:default-cell-style-name="Default"/>
<table:table-row>
<table:table-cell table:number-columns-repeated="3"/>
</table:table-row>
<table:table-row>
<table:table-cell office:value-type="string">
<text:p>Player Piano</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>Kurt Vonnegut</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>ISBN</text:p>
</table:table-cell>
</table:table-row>
<table:table-row>
<table:table-cell office:value-type="string">
<text:p>Ubik</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>Philip K. Dick</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>ISBN</text:p>
</table:table-cell></table:table-row>
<table:table-row>
<table:table-cell office:value-type="string">
<text:p>Stories of Your Life and Others</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>Ted Chiang</text:p>
</table:table-cell><table:table-cell office:value-type="string">
<text:p>ISBN</text:p>
</table:table-cell>
</table:table-row>
<table:table-row>
<table:table-cell office:value-type="string">
<text:p>World War Z</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>Max Brooks</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>ISBN</text:p>
</table:table-cell>
</table:table-row>
</table:table>
<table:table table:name="Sheet2" table:print="false">
<table:table-column table:default-cell-style-name="Default"/>
<table:table-row>
<table:table-cell/>
</table:table-row>
</table:table>
<table:table table:name="Sheet3" table:print="false">
<table:table-column table:default-cell-style-name="Default"/>
<table:table-row>
<table:table-cell/>
</table:table-row>
</table:table>
</office:spreadsheet>
</office:body>
</office:document-content>

Zip 'em up, and give the file a .ods extension.
If you have OpenOffice installed you'll be able to open and you'll see:











I use very simple examples for the documents, so you can get an understanding of some of the XML that's in there and we can equally focus on the XQuery used to manipulate the docs. However, if you were to open up Calc or Excel and create these spreadsheets using the applications, when you save and unzip the packages for examination, you'll notice the XML generated for the files will look different. If you do that make sure you update the XQuery in the example below accordingly.

Following simple examples like these, as well as creating documents using their respective applications and examining the deltas in the XML published is a great way to learn more about the ODF and Open XML formats, what XML is required for consumption by each application, and what XML is generated upon save.

Your Office 2007 document should look like:









If it looks strange for some reason, it's because I have the ribbon minimized.

Transformation with XQuery

Finally, here's some XQuery. You can test tranforming for both formats by just switching the $test variable accordingly. If you have the files loaded into your Server, named as I've described above, you should be able to just place the following in a .xqy file. Name the file test.xqy and place it under /Docs in your MarkLogic server. You can then evaluate the file by opening your browser and navigating to http://localhost:8000/test.xqy . I have OpenOffice and Office 2007 installed, so I just open into the appropriate application directly. You can easily change this to just insert the XML as another doc so you can validate for yourself in the Server.
declare namespace ms = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
declare namespace table = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
declare namespace office = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
declare namespace text= "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
declare namespace manifest = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
declare namespace form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"

define function generate-simple-ods-odf(
$docmanifest as node(),
$content as node()
) as binary()
{
let $manifest :=
<parts xmlns="xdmp:zip">
<part>META-INF/manifest.xml</part>
<part>content.xml</part>
</parts>
let $parts := ($docmanifest, $content)
return
xdmp:zip-create($manifest, $parts)
}

define function generate-simple-xl-ooxml(
$content-types as node(),
$workbook as node(),
$rels as node(),
$workbookrels as node(),
$sheet1 as node()
) as binary()
{
let $manifest := <parts xmlns="xdmp:zip">
<part>[Content_Types].xml</part>
<part>workbook.xml</part>
<part>_rels/.rels</part>
<part>_rels/workbook.xml.rels</part>
<part>sheet1.xml</part>
</parts>
let $parts := ($content-types, $workbook, $rels, $workbookrels, $sheet1)
return
xdmp:zip-create($manifest, $parts)
}

define function getOpenXMLRowsFromODF($odfrows as element(table:table-row)*) as element(ms:row)*
{
for $i in $odfrows
let $row := $i/table:table-cell/text:p
return
<ms:row>
{
for $text in $row/text() return
<ms:c t="inlineStr">
<ms:is>
<ms:t>{$text}</ms:t>
</ms:is>
</ms:c>
}
</ms:row>
}

define function getODFRowsFromOpenXML($ofcrows as element(ms:row)*) as element(table:table-row)*
{
for $i in $ofcrows
let $row := $i//ms:t
return
<table:table-row>
{
for $text in $row/text() return
<table:table-cell office:value-type="string">
<text:p>{$text}</text:p>
</table:table-cell>
}
</table:table-row>
}

(: ---------------- BEGIN OFFICE OPEN XML EXCEL FILES --------------------:)

let $content-types :=
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/workbook.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
<Override PartName="/sheet1.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
</Types>

let $workbook :=
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<sheets>
<sheet name="Sheet1" sheetId="1" r:id="rId1" />
</sheets>
</workbook>

let $rels :=
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="workbook.xml"/>
</Relationships>

let $workbookrels :=
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="sheet1.xml"/>
</Relationships>

let $odfrows:= doc("/Sheet1.ods/content.xml")//table:table[1]/table:table-row

let $sheet1 :=
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<sheetData>
{getOpenXMLRowsFromODF($odfrows)}
</sheetData>
</worksheet>

(: ------------------- END OFFICE OPEN XML EXCEL FILES --------------------:)

(: -------------------------- BEGIN ODF CALC FILES ------------------------:)

let $openxml_rows := doc("/Sheet1.xlsx/sheet1.xml")//ms:row

let $docmanifest:=
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
<manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.spreadsheet" manifest:full-path="/"/>
<manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>

let $content :=
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" office:version="1.1">
<office:body>
<office:spreadsheet>
<table:table table:name="Sheet1" table:print="false">
<office:forms form:automatic-focus="false" form:apply-design-mode="false"/>
<table:table-column table:default-cell-style-name="Default"/>
<table:table-column table:default-cell-style-name="Default"/>
<table:table-column table:default-cell-style-name="Default"/>
{getODFRowsFromOpenXML($openxml_rows)}
</table:table>
<table:table table:name="Sheet2" table:print="false">
<table:table-column table:default-cell-style-name="Default"/>
<table:table-row>
<table:table-cell/>
</table:table-row>
</table:table>
<table:table table:name="Sheet3" table:print="false">
<table:table-column table:default-cell-style-name="Default"/>
<table:table-row >
<table:table-cell/>
</table:table-row>
</table:table>
</office:spreadsheet>
</office:body>
</office:document-content>

(: ---------------------------- END ODF CALC FILES ------------------------:)

(: let $test := "officeopenxml" :)
let $test := "odf"

let $package := if($test eq "officeopenxml") then
generate-simple-xl-ooxml($content-types, $workbook, $rels, $workbookrels, $sheet1)
else
generate-simple-ods-odf($docmanifest, $content)
let $filename := if($test eq "officeopenxml") then
"ExcelTest.xlsx"
else
"CalcTest.ods"

let $disposition := concat("attachment; filename=""",$filename,"""")
let $x := xdmp:add-response-header("Content-Disposition", $disposition)

let $x := if($test eq "officeopenxml") then
xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
else
xdmp:set-response-content-type("application/vnd.oasis.opendocument.spreadsheet")
return
$package

In the example, we just created new documents, tranforming the content from one spreadsheet into the format of the other. I'm not going to dig too deep into this; You should be able to read the XQuery and see what's going on for yourself. But notice how we just have a variable set to the rows from our original spreadsheet, and when we go to create the new spreadsheet, we just call a function that transforms the rows from one format to the other. It's that simple.

If it helps, I've done the "Hello-World" document example with both OpenOffice and Office 2007, (a couple of times). Those previous posts should help answer some questions if this is all brand new to you. After evaluating the XQuery, you'll see your spreadsheets as follows:




















Neither of those files existed in those formats when we began, and when the next format comes along, we'll just transform our saved XML content into that format too. As long as it's XML, the content can be anything we want it to be. The XQuery looks lengthy because of all the nodes we're constructing, but once we're more familiar with the formats, we can re-use pieces we've already stored. I just want to highlight that the amount of XQuery required for a powerful transformation of our content is really minimal.

08's going to be a great year for XML and XQuery and I'm excited to be working with both. My New Year's resolution is to post here regularly. That doesn't necessarily mean often, but regularly. :)

To learn more about ODF:
Open Document Essentials
The ODF Specification

To learn more about Open XML:
Open XML Explained
The Open XML Specifcations

To learn more about XQuery:
x-query.com (get on the mailing list)
w3c XQuery .10
XQuery tutorial

No comments: