Partially automating Word documents makes our client's employees happy. Us as well.
What's the Story?
This blog entry looks at the completion (not generation) of Word templates in the backend. What are the tools, pitfalls and solutions?
Defining the Problem
Our client's product requires extensive documentation. Our assignment is to automate some preparatory work for a specific document. The client uses MS Word for word processing, and the solution should be a .NET web application.
- Ultimate Artifact
- Word Document: The result must be a Word document that may be post-processed by the user.
- Template Given: The to-be-processed template is based on a given (corporate) template document.
- Modification Possible: We are allowed to modify the template document. However, its base is changed from time to time (in a rigid corporate process), so modifications should be limited.
- User Input pilots the processing.
- External Sources provide the essential data.
- Logic: Some data needs to be processed to resolve the template.
We needed to process different types of content:
- Plain Text: Replacing, removing or amending text.
- Document Properties: Setting Word references (aka fields, e.g. page number).
- Tables: Updating cell text, adding/removing rows, marking a cell's background color.
- Checkboxes: Setting boolean values.
Selecting a Word Processing Framework
Although MS Word is one of Microsoft's major products, the company doesn't provide a state-of-the-art API to process documents.
Sure, you can use Microsoft's COM interface to drive an installed MS Word (desktop) application, but this would not fit well into our web application architecture. You can also use Microsoft's sound Open XML SDK, but we passed on this one as well, as we wanted to avoid the low-level complexity of Word's XML. John Atten's blog entry sheds some light on the situation.
Looking for an appropriate processing framework, we came up with two quite different options. There are more options, most notably the biggest framework, Aspose, which we didn't consider for cost reasons.
- DocX: An open source framework, developed by a (then) computer science student from Ireland. It essentially applies the MS OXML SDK to Word.
- Text Control: A proprietary .NET framework with a 10-year history. The smallest package is priced at 2500€.
In a Nutshell: We started out with DocX, switched to Text Control and ended up back with DocX.
Framework: Pros & Cons
Processing the Template
Our client gave us a document in the so-called binary Word file format (aka .doc), which Microsoft considers a legacy format. It was replaced by the OXML format (aka .docx) with MS Office 2007.
Document Properties (aka Fields): The document given is managed with SAP's DMS. It contains quite a number of DMS-related document properties. For our assignment we need to process a few. There seem to be different kinds of properties in Word: proprietary and generic. DocX could only set generic properties.
MS Word Template Format: We discovered that document properties were only rendered properly if the MS Word application created the final instance. Hence we processed a Word template file (aka .dotx) that is eventually opened by the user's MS Word.
Low-level document processing: There are multiple ways to implement checkboxes in MS Word. Unfortunately, none of them was supported by DocX. This was our motive to switch to Text Control. However, its checkbox feature behaved erratically (at least in conjunction with our client's template), so we returned to DocX and ended up implementing checkboxes at a low level, based on this post.
Markers: To identify document objects (e.g. tables, checkboxes, etc.) we used markers in the document. For tables, we simply added an identifier string with delimiters, e.g. @hist@. We put it in the top-left cell, as we can guarantee its existence. These strings can easily be modified by non-developers. BTW: Word offers a font property that keeps such text from being printed.
- Switch to OXML: To process the document with DocX, we need an XML format. We used Word for the conversion.
- Use a Word Template (aka .dotx): Subsequently the end user's MS Word application creates the final instance of the document. We assume that Microsoft applies the most resilient engine to render the final document.
- Hard Code Values: True, hard coding template variables contradicts the concept of variables (and templates). However, we suggest checking whether the template is too generic and could be "initialized".
- Set Markers: We suggest a simple identifier string with a predefined delimiter in tables or plain text. Checkboxes may be marked and identified with their
- Redo Corrupted Elements: We stumbled over corrupted tables and document properties in the final document. These could be fixed by deleting the object in the template and rebuilding it with MS Word.
- Manipulate XML If the framework doesn't cover the desired processing you can still default to XML manipulation (given
Two Approaches: Document Manipulation vs. Generation
Our assignment was to manipulate an existing Word document. This is quite different from generating a new document from scratch (e.g. an invoice), and the distinction essentially marks the two frameworks' different strengths.
Text Control has quite elaborate features to generate new documents from specifically designed templates. Since we could not completely rebuild the existing template, those qualities were irrelevant.
DocX, however, is an XML manipulator. It doesn't need to parse the whole document (to rebuild it) for isolated changes. Inconsistencies or corruption in other spots of the document may be ignored.
We think this is the most critical reason why DocX solved our problem better than Text Control.
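To illustrate what an isolated change means at the file level: an OXML file is a ZIP package of parts, so one part can be rewritten while all others are copied byte for byte without ever being parsed. The following Python sketch (standard library only) shows the idea. It is not how DocX is implemented internally, and the function names and the toy package are assumptions made for the example; the toy package is not a valid .docx.

```python
import io
import zipfile


def patch_part(package: bytes, part_name: str, transform) -> bytes:
    """Return a copy of a ZIP package in which only `part_name` is rewritten."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                data = transform(data)  # the only part we touch
            dst.writestr(info, data)    # every other part is copied verbatim
    return out.getvalue()


def make_demo_package() -> bytes:
    """Build a toy two-part package; the second part is malformed on purpose."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", "<doc>Hello @name@</doc>")
        z.writestr("word/styles.xml", "<styles><broken>")
    return buf.getvalue()


patched = patch_part(
    make_demo_package(),
    "word/document.xml",
    lambda xml: xml.replace(b"@name@", b"World"),
)

with zipfile.ZipFile(io.BytesIO(patched)) as z:
    document_xml = z.read("word/document.xml")
    styles_xml = z.read("word/styles.xml")
```

The malformed `word/styles.xml` passes through untouched, which mirrors the observation above: a tool that only touches the part it needs does not trip over problems elsewhere in the document.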
OXML Opens a Door to Low-Level Manipulation
We'd like to avoid the complexity of the Word file format - that's why we employ frameworks.
However, when we had to deal with it for checkboxes, the experience was quite positive: the XML manipulation integrates surprisingly smoothly with DocX. It's feasible, particularly if your concern is limited and you don't have to parse the complete document.
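As a rough idea of what such a limited XML change can look like, here is a Python sketch (standard library only, not our .NET code) that sets a checkbox directly in WordprocessingML. It assumes the checkbox is a legacy form field (`w:ffData` with a `w:checkBox` child) and that it is identified by its form field name; both are assumptions for this example, as are the function name and the sample XML. Content-control checkboxes use different markup and are not covered.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def set_checkbox(root, field_name, checked):
    """Set the legacy form-field checkbox named `field_name`.

    Returns True if a matching checkbox was found, False otherwise.
    """
    for ffdata in root.iter(w("ffData")):
        name = ffdata.find(w("name"))
        box = ffdata.find(w("checkBox"))
        if name is None or box is None or name.get(w("val")) != field_name:
            continue
        state = box.find(w("checked"))
        if state is None:
            # w:checked is the last child of w:checkBox, so appending is fine
            state = ET.SubElement(box, w("checked"))
        state.set(w("val"), "1" if checked else "0")
        return True
    return False


# Hypothetical fragment containing one unchecked checkbox named "Approved"
SAMPLE = f"""<w:p xmlns:w="{W}"><w:r><w:fldChar w:fldCharType="begin"><w:ffData>
  <w:name w:val="Approved"/><w:enabled/>
  <w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox>
</w:ffData></w:fldChar></w:r></w:p>"""

paragraph = ET.fromstring(SAMPLE)
found = set_checkbox(paragraph, "Approved", True)
missing = set_checkbox(paragraph, "Unknown", True)
```

The function only touches the one `w:checkBox` element it is asked for and leaves `w:default` alone, so the template's default state stays intact.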
We can only speak for .NET as we haven't tried other platforms. We'd honestly be surprised if other platforms do much better. However, the result is particularly disappointing as in the Windows environment MS Word processing should be a walk in the park.
Image credits for the cover image go to Ivo Novák.