Uneasy, but Feasible: Server-Side MS-Word Processing with .NET

Partial automation of Word documents makes our client's employees happy. Us as well.
28.09.2016
Mathias Münscher

What’s the Story?

This blog entry looks at completing (not generating) Word templates in the backend: which tools exist, where the pitfalls are, and how we solved them.

Defining the Problem

Our client's product requires extensive documentation. Our assignment is to automate some preparatory work for a specific document. The client uses MS Word for word processing, and the solution should be a .NET web application.

Key Factors

  • Ultimate Artifact
    • Word Document: The result must be a Word document that the user can post-process.
  • Template
    • Template Given: The template to be processed is based on a given (corporate) template document.
    • Modification Possible: We are allowed to modify the template document. However, its base changes from time to time (in a rigid corporate process), so modifications should be kept to a minimum.
  • Data
    • User Input drives the processing.
    • External Sources provide the essential data.
    • Logic: Some data needs to be processed to resolve the template.

Document Processing

We needed to process different types of content:

  • Plain Text: Replacing, removing or amending text
  • Document Properties: Setting Word references (aka fields, e.g. the page number)
  • Tables: Updating cell text, adding/removing rows, setting a cell’s background color
  • Checkboxes: Setting boolean values

Solution

Selecting a word processing framework

Although MS Word is one of Microsoft’s major products, the company doesn’t provide a state-of-the-art API to process documents.

Sure, you can use Microsoft’s COM interface to drive an installed MS Word (desktop) application, but this would not fit well into our web application architecture. You can also use Microsoft’s sound Open XML SDK, but we passed on that one as well, because we wanted to avoid the low-level complexity of Word's XML. John Atten’s blog entry sheds some light on the situation.

Looking for an appropriate processing framework, we came up with two quite different options. There are more, most notably the biggest framework, Aspose, which we ruled out for cost reasons.

  1. DocX: An open source framework, developed by a (then) computer science student from Ireland. It essentially applies the MS OXML SDK to Word.
  2. Text Control: A proprietary .NET framework with a 10-year history. The smallest package is priced at 2500€.

In a Nutshell: We started out with DocX, switched to Text Control and ended up back with DocX.

Framework: Pros & Cons

DocX
  • Pros
    • Free
    • Modern interface
    • Open source with a living community
    • Easily integrated with the .NET package manager NuGet
  • Cons
    • Few implemented features
    • Almost no documentation besides a thin class doc and a (living) forum

Text Control
  • Pros
    • Very feature rich
    • Really quick chat support
  • Cons
    • Awful integration with the installer and hassle with the license file
    • Deployment to the web server requires adding files manually (from C:\Program Files\.. )
    • 90's-style API (not really OO)
    • Fragmented, incomplete documentation: essentially skimpy blog entries from the last 10 years (besides a class doc)

Processing the Template

Our client gave us a document in the so-called binary Word file format (aka .doc), which Microsoft considers a legacy format. It was superseded by the OXML format (aka .docx) in MS Office 2007.

Document Properties (aka Fields): The given document is managed with SAP’s DMS. It contains quite a number of DMS-related document properties, a few of which we need to process for our assignment. There seem to be different kinds of properties in Word: proprietary and generic. DocX could only set generic properties.
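To make the notion of a "generic" property more tangible, here is a minimal sketch of how a custom document property can be set directly in the package part docProps/custom.xml. It is written in Python with the standard library only, as a language-neutral illustration of the file format, and it is not the DocX API. The property name `DocStatus` and the sample XML are hypothetical, and the sketch assumes the property is stored as a string (`vt:lpwstr`); DMS-specific properties may well be stored differently.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the custom-properties part (docProps/custom.xml).
CP = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
VT = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
ET.register_namespace("", CP)
ET.register_namespace("vt", VT)

# Hypothetical sample part; real DMS properties will have other names.
SAMPLE_CUSTOM_XML = (
    f'<Properties xmlns="{CP}" xmlns:vt="{VT}">'
    '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="DocStatus">'
    "<vt:lpwstr>draft</vt:lpwstr>"
    "</property>"
    "</Properties>"
).encode("utf-8")


def set_custom_property(custom_xml: bytes, name: str, value: str) -> bytes:
    """Return the part with the string property `name` set to `value`."""
    root = ET.fromstring(custom_xml)
    for prop in root.findall(f"{{{CP}}}property"):
        if prop.get("name") != name:
            continue
        text_node = prop.find(f"{{{VT}}}lpwstr")
        if text_node is None:
            raise ValueError(f"property {name!r} is not stored as a string")
        text_node.text = value
        return ET.tostring(root, encoding="utf-8")
    raise KeyError(name)


updated = set_custom_property(SAMPLE_CUSTOM_XML, "DocStatus", "released")
```

Note that fields in the document body which display such a property keep their cached text until Word updates them, so changing the stored value alone does not change what the reader sees.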

MS Word Template Format: We discovered that document properties were only rendered properly if the MS Word application itself created the final instance. Hence we processed a Word template file (aka .dotx) that is eventually opened by the user’s MS Word.

Low-level document processing: There are multiple ways to implement checkboxes in MS Word. Unfortunately, none of them was supported by DocX. This was our motive to switch to Text Control. However, its checkbox feature behaved erratically (at least in conjunction with our client’s template), so we returned to DocX and ended up implementing checkboxes at a low level, based on this post.
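As an illustration of what such low-level checkbox handling can look like, the following sketch sets a legacy form-field checkbox in word/document.xml, identified by its bookmark name. It is plain Python on the raw XML rather than DocX code, and it covers only one of the several checkbox variants Word knows; which variant the referenced post handles is not stated here. The bookmark name `Approved` in the usage line is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def set_checkbox(document_xml: bytes, bookmark: str, checked: bool) -> bytes:
    """Set the state of the legacy form-field checkbox named `bookmark`."""
    root = ET.fromstring(document_xml)
    for ff_data in root.iter(w("ffData")):
        name = ff_data.find(w("name"))
        box = ff_data.find(w("checkBox"))
        if name is None or box is None or name.get(w("val")) != bookmark:
            continue
        # Drop any previous explicit state, then append the new one.
        # w:checked is the last child allowed in w:checkBox, so appending
        # keeps the element order valid.
        for old in box.findall(w("checked")):
            box.remove(old)
        ET.SubElement(box, w("checked")).set(w("val"), "1" if checked else "0")
        return ET.tostring(root, encoding="utf-8")
    raise KeyError(bookmark)
```

A call such as `set_checkbox(xml_bytes, "Approved", True)` returns the modified part. One caveat of this sketch: ElementTree re-serializes the whole part and omits namespace declarations it considers unused, which may upset Word for real documents (for example when a prefix is only referenced in an `mc:Ignorable` attribute), so a production version would need an XML library that preserves them.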

Markers: To identify document objects (e.g. tables, checkboxes, etc.) we used markers in the document. For tables, we simply added an identifier string with delimiters, e.g. @hist@. We put it in the top-left cell, since that cell is guaranteed to exist. These strings can easily be modified by non-developers. BTW: Word lets you set a font property that excludes the marker from printing.
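The marker idea can be sketched on the raw XML as follows: find the table whose top-left cell contains the marker, then (as one example of a table operation) set a cell's background color. This is a Python illustration using only the standard library, not DocX code; the helper names are made up for this sketch. The cell text is joined across runs because Word may split a string like @hist@ into several `w:t` elements.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


# Children of w:tcPr that the schema places before w:shd.
BEFORE_SHD = {w(n) for n in ("cnfStyle", "tcW", "gridSpan", "hMerge", "vMerge", "tcBorders")}


def cell_text(cell) -> str:
    """Concatenate all text runs of a cell (markers may be split across runs)."""
    return "".join(t.text or "" for t in cell.iter(w("t")))


def find_marked_table(root, marker: str):
    """Return the first table whose top-left cell contains `marker`, else None."""
    for table in root.iter(w("tbl")):
        first_cell = table.find(f"{w('tr')}/{w('tc')}")
        if first_cell is not None and marker in cell_text(first_cell):
            return table
    return None


def shade_cell(cell, fill_hex: str) -> None:
    """Set the background color of a cell, e.g. fill_hex='FFFF00'."""
    tc_pr = cell.find(w("tcPr"))
    if tc_pr is None:
        tc_pr = ET.Element(w("tcPr"))
        cell.insert(0, tc_pr)  # cell properties come first inside w:tc
    shd = tc_pr.find(w("shd"))
    if shd is None:
        shd = ET.Element(w("shd"))
        # Insert right after the last element that must precede w:shd.
        index = max((i + 1 for i, c in enumerate(tc_pr) if c.tag in BEFORE_SHD), default=0)
        tc_pr.insert(index, shd)
    shd.set(w("val"), "clear")
    shd.set(w("color"), "auto")
    shd.set(w("fill"), fill_hex)
```

Looking only at the first cell of the first row keeps the lookup cheap and mirrors the convention described above; a marker placed anywhere else in the table is deliberately ignored.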

Generalized Proceeding

  1. Switch to OXML: To process the document with DocX we need an XML-based format. We used Word for the conversion.
  2. Use a Word Template (aka .dotx): The end user's MS Word application then creates the eventual instance of the document. We assume that Microsoft applies its most resilient engine to render the final document.
  3. Hard Code Values: True, hard-coding template variables contradicts the concept of variables (and templates). However, we suggest checking whether the template is too generic and could be “initialized” with fixed values.
  4. Set Markers: We suggest a simple identifier string with a predefined delimiter in tables or plain text. Checkboxes may be marked and identified with their Bookmark property.
  5. Redo Corrupted Elements: We stumbled over corrupted tables and document properties in the final document. These could be resolved by deleting the object in the template and rebuilding it with MS Word.
  6. Manipulate XML: If the framework doesn’t cover the desired processing, you can still fall back to XML manipulation (given .docx / .dotx).
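The last step relies on the fact that a .docx or .dotx file is a ZIP package of XML parts. As a rough, language-neutral sketch (Python standard library, not DocX), the following copies a package and applies a transformation to a single part such as word/document.xml. The function names and the marker @customer@ in the usage note are hypothetical.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def rewrite_part(package: bytes, part_name: str, transform) -> bytes:
    """Copy an OXML package, passing the bytes of one part through `transform`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                data = transform(data)
            # Reusing the ZipInfo keeps the part name and compression method.
            dst.writestr(info, data)
    return out.getvalue()


def replace_marker(xml: bytes, marker: str, value: str) -> bytes:
    """Replace a marker string with an XML-escaped value."""
    return xml.replace(marker.encode("utf-8"), escape(value).encode("utf-8"))
```

For example, `rewrite_part(data, "word/document.xml", lambda xml: replace_marker(xml, "@customer@", "Smith & Sons"))` would return a new package with the marker filled in. The byte-level replacement only works if Word stored the marker in a single text run; if it was split across runs, the part has to be parsed and the runs joined first.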

Lessons learned

Two Approaches: Document Manipulation vs. Generation

Our assignment was to manipulate an existing Word document. This is quite different from generating a new document from scratch (e.g. an invoice), and it essentially marks the difference between the two frameworks' strengths.

Text Control has quite elaborate features to generate new documents from specifically designed templates. Since we could not completely rebuild the existing template, those qualities were irrelevant.

DocX, however, is an XML manipulator. For isolated changes it doesn’t need to parse (and rebuild) the whole document, so inconsistencies or corruption elsewhere in the document can be ignored.

We think this is the most important reason why DocX solved our problem better than Text Control.

OXML Opens a Door to Low-Level Manipulation

We wanted to avoid the complexity of the Word file format; that’s why we employ frameworks.

However, when we had to deal with it for the checkboxes, the experience was quite positive: XML manipulation integrates surprisingly smoothly with DocX. It’s feasible, particularly if your concern is limited and you don’t have to parse the complete document.

.NET Disclaimer

We can only speak for .NET, as we haven’t tried other platforms. We’d honestly be surprised if other platforms did much better. Still, the result is particularly disappointing, since processing MS Word documents in a Windows environment ought to be a walk in the park.

Image credits for the cover image go to Ivo Novák.