Conrefs and “Shared Content”
XMetal implements a way of managing conrefs that I've heard a few people suggest as a method for getting around the dependency issues raised by conrefs. I think this is only half a solution, and it causes other long-term problems.
Here's the idea. When you want to conref an element from Topic A into Topic B, you open Topic A, copy the appropriate element into a new XML file, save that file somewhere (let's call this file SharedA.xml), and put a conref into both Topic A and Topic B. The way this is supposed to help is that when you are editing a file that has been marked as a shared file, you know other files depend on it, and you need to be careful about that dependency.
Nice idea, but look at the workflow of updates.
I'm writing Topic B. It has a conref to SharedA.xml. I need to make a change to that conref'd content. I know SharedA.xml is a special file, so I need to check the dependencies. How do I do that? The only way I can think of is to do a full text search of my repository (whether that's a filesystem, SCMS, or CMS) for all references to SharedA.xml. Fine. I do that and find out that Topic A and Topic B both depend on SharedA.xml. Now what do I do? I have three choices:
- Remove the conref in Topic B and write new material, or create a new shared content file for my special text and change the conref
- Make the change and notify the person responsible for Topic A
- Make the change and check the content in Topic A myself
With option 1, you decrease reuse.
With option 2, you have no process to ensure Topic A is still correct. Who's now responsible to make sure Topic A is still right?
Option 3 implies that I'm skilled enough to know how to write Topic A myself. If I'm not, I might foul the whole thing up, and, thus, there's no way to ensure that my change is correct.
Knowing that the file is a special file is nice, but this group of shared files adds some extra problems to project management.
- None of these files are in ditamap files
- Shared files don't expire
- Ownership becomes muddled
And I'm sure I could think of some other issues if I thought about it some more.
Not having a file in a ditamap is just asking for trouble. The best way to manage DITA projects is to use ditamaps. They tell you what files are included in which projects and they include a bunch of metadata about each topic. Imagine you have 50 shared files. You cannot look at your ditamap(s) to figure out where those files are used, so you have to fall back to full text searching to determine whether a file is needed in a project or not.
Point 2 is very similar to point 1. Shared files will never be deleted because you don't know whether they are being used. So, your shared file repository will get bigger, and bigger, and bigger, until your writers can't find anything in it.
Point 3 is more minor, but could be an issue for some companies. If more than one writer is depending on the content in a shared file, how do you determine ownership, and, as a consequence of ownership, who is responsible for maintaining and editing a shared file?
A better solution is to use dependency checking. When Topic A is opened, the writer is notified that an element in Topic A is conref'd by Topic B. The writer is then clearly responsible for ensuring that the content in Topic A and Topic B makes sense after editing, and no extra non-managed files are created.
Sadly, I don't think there are any non-CMS tools for doing this kind of dependency checking, nor can I think of an easy way to create one.
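For a plain filesystem repository, the lookup half of this can at least be roughly approximated with a script. The following is a minimal sketch, not a real tool: `build_where_used` and its input shape (a dict of file name to XML text) are made up for illustration, it assumes conref values look like `file.xml#topicid/elementid` with flat file names, and it does nothing about notifying the writer when a topic is opened.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_conrefs(xml_text):
    """Return every conref attribute value found in one topic."""
    root = ET.fromstring(xml_text)
    return [el.get("conref") for el in root.iter() if el.get("conref")]


def build_where_used(topics):
    """Map each conref target file to the set of topics that reuse content from it.

    `topics` maps a file name to its XML text (a hypothetical input shape).
    """
    where_used = defaultdict(set)
    for name, xml_text in topics.items():
        for conref in find_conrefs(xml_text):
            # A conref that starts with "#" points into the same file.
            target_file = conref.split("#", 1)[0] or name
            where_used[target_file].add(name)
    return where_used
```

Looking up `where_used["TopicA.xml"]` before editing Topic A would then list the topics that pull content from it. It is still a repository scan under the hood, just one that reads the conref attributes instead of matching text.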
I applaud XMetal for trying to find a solution to this problem, but I hope that other vendors don't follow their lead in this case.
DITA 2006 Conference
I attended the DITA 2006 conference in Raleigh, NC, last week. Overall, it was a very good conference. Good networking opportunities, some really good meetings with vendors, some really nice presentations.
One of the more interesting things that I noticed was the number of attendees that were surprised, perhaps astounded, that IBM doesn't use a CMS for their content.
I think I've covered this before here, but I think traditional content management systems, which are usually designed to work with binary files that represent documents, need to change to support a topic-based system like DITA. CMS vendors need to learn from SCMS tools like Perforce, ClearCase, etc. When you have more than 2,000 XML files, and a large team of writers without strict boundaries, diff and merge, build and test, dependency analysis, branching, and other common SCMS functions are vitally important.
SiberLogic seems to understand that software documentation and software development are moving closer together in their needs. However, I heard some rumblings from the crowd that SiberSafe, SiberLogic's CMS tool, may not yet be ready for prime time. It does have one feature that I think is vital for DITA CMS tools - "Where Used".
When a writer opens a topic for editing, they need to know the side effects of the changes they make. When you are writing code, the build will fail if you screw up a dependency. In DITA, your dependencies can fail in two ways:
- Just like code, you can break a dependency.
- Unlike code, you can change the context of an element in a way that makes a dependent topic invalid.
I've written about this topic before, so I won't go into it now. Suffice to say any information architect that isn't concerned about this isn't thinking clearly.
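Of the two failure modes, only the first lends itself to a mechanical check. The sketch below is an illustration under assumptions, not a feature of any tool mentioned here: `broken_conrefs` is a hypothetical helper, topics are passed in as a dict of file name to XML text, and a conref is assumed to have the form `file.xml#topicid/elementid`. It cannot detect the second failure mode, where the target still exists but its meaning has changed.

```python
import xml.etree.ElementTree as ET


def broken_conrefs(topics):
    """Return (source_file, conref) pairs whose target file or element id is missing."""
    parsed = {name: ET.fromstring(text) for name, text in topics.items()}
    # Collect every id declared in each file once, up front.
    ids = {
        name: {el.get("id") for el in root.iter() if el.get("id")}
        for name, root in parsed.items()
    }
    broken = []
    for name, root in parsed.items():
        for el in root.iter():
            conref = el.get("conref")
            if not conref:
                continue
            target_file, _, fragment = conref.partition("#")
            target_file = target_file or name  # "#id" points into the same file
            # The last step of the fragment is the id of the reused element.
            target_id = fragment.split("/")[-1]
            if target_file not in ids or (target_id and target_id not in ids[target_file]):
                broken.append((name, conref))
    return broken
```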
Eclipse and DITA
One of the challenges I've found in working with DITA is managing the build process, specifically my ant build files. Since ant build files are in XML, I've been using my XML editors (either oXygen or Stylus Studio) to work on them. Although they help a bit, since ant build files don't have a schema or DTD, they don't help too much.
I started using emacs, and that was easier, mostly because I am more familiar with emacs as a text editor than either of my XML tools. Then I thought I'd take a look at Eclipse. I'm glad I did.
Eclipse takes a little while to set up, but once you've created a project, it's very easy to work with. The biggest advantage of working in Eclipse when working with DITA projects is that you can easily find all your build targets and edit them. DITA 1.0.1 has multiple XML files that store build targets - build.xml, pretargets.xml and conductor.xml. When you are trying to debug the build process, it's very hard to figure out which file has which target. Eclipse helps you with that. Open build.xml, and it lists, in outline form, all the targets available, including those that are in imported files. The imported targets are clearly marked, and you can double-click on them to open that file at the appropriate target.
Since Eclipse understands ant build files, editing them is also much easier. Tasks, parameters (variables? I get confused with ant syntax), and macros are all available from the auto-complete menus. That alone has saved me hours of debug time.
Another advantage is that there is an oXygen plug-in for Eclipse, so if you own oXygen, you can do some pretty sophisticated editing and debugging without ever leaving Eclipse.
I still need to figure out how to kick off the ant builds from within Eclipse, but the time I've saved by using Eclipse has been a real help.
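Outside Eclipse, the "which file has which target" lookup can be roughly approximated with a short script. This is a sketch under assumptions: `list_targets` is a made-up helper, it only follows `<import file="..."/>` elements whose path is a literal, and it skips any import that uses an ant property such as `${...}` because it does not evaluate properties.

```python
import os
import xml.etree.ElementTree as ET


def list_targets(build_file, seen=None):
    """Return {target_name: file_path} for a build file and the files it imports."""
    seen = set() if seen is None else seen
    path = os.path.abspath(build_file)
    if path in seen:  # guard against circular imports
        return {}
    seen.add(path)
    root = ET.parse(path).getroot()
    targets = {}
    # Visit imports first so targets in the importing file override imported ones.
    for imp in root.findall("import"):
        ref = imp.get("file") or ""
        imported = os.path.join(os.path.dirname(path), ref)
        if "${" in ref or not os.path.isfile(imported):
            continue  # unresolved ant property or missing file: skip rather than guess
        targets.update(list_targets(imported, seen))
    for target in root.findall("target"):
        targets[target.get("name")] = path
    return targets
```

Calling `list_targets("build.xml")` from the toolkit directory returns a dictionary that can be printed as a target-to-file table.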
Getting eXist to work with Resin
eXist is an XML database that I'm interested in looking at. There's a web servlet front-end to it, so I thought I'd look into doing some things with that. My web hosting service uses Resin for servlets. I appreciate that. Resin has a lot of things going for it. If I do wind up using eXist, I may want to put it on my website, so I installed Resin on my Linux box at home to mimic my website.
You can download a war file of eXist, which makes it easy to set it up as a web application for Resin. There's a problem, though. The web.xml file for eXist assumes that you are using Jetty, the web application container that eXist includes with the stand-alone version of eXist. So, you need to make a couple of changes to get the war file to work with Resin.
Edit WEB-INF/web.xml, and search for "Jasper". I found two sections:
and
I commented those out, and eXist appears to work okay.
Indexing in XSL
Indexes are a vital part of documentation, whether it's online help or printed documentation.
I could point you to a lot of studies on this, but I'll just give you an example from my experiences. When I was working at Oracle, our users always claimed they liked search the best, but by studying how they used the product, we learned that what they were calling search was actually the index (the index has a search box in it, so they'd type a search in there and get the closest index result). Nearly all of our experienced and intermediate users used the index as their main interface to the online help. They hardly ever used the table of contents, and only went to the full text search when the index really failed them. This behavior is probably changing as full text search gets better and as search tools push indexed lists (like Yahoo!) aside on the web.
So, it's quite surprising that there is no indexing support in XSLT. RenderX has some nice extensions to make things easier, but those extensions are related to formatting the index, not creating it.
Here are the problems I'm running into.
- Uniqueness: Each index entry should only be output once. If it's a child indexterm, it should only be output once for each parent term. That's hard to do.
- Grouping: Index terms are grouped. For example, all the index terms that start with "A" are usually grouped in the same area.
- Sorting: You can't control sort order in xsl:for-each loops except by telling it you want a text or numeric sort. What if my text order is different than the one chosen by the XSLT processor? Asian languages often have multiple ways of sorting. Which one do I use? Can I do any of them without having to edit the FO by hand?
In case it's not clear, I'm pretty frustrated by this. Indexing should be easy to support in an almost functional language like XSLT. It's not, though.
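To make the three requirements concrete, here is what they look like outside XSLT, in a few lines of Python. This is only an illustration of the desired behavior, not a replacement for the stylesheet: `build_index_groups` is a hypothetical helper, and grouping by the first character is an assumption that does not hold for languages where the group comes from a reading rather than the spelling.

```python
from itertools import groupby


def build_index_groups(terms, sort_key=str.casefold):
    """Return [(group_label, [terms])] with duplicates removed.

    `sort_key` is pluggable so a different collation can be swapped in.
    """
    # Uniqueness: a set drops repeated terms. Sorting: the caller picks the key.
    unique = sorted({t for t in terms if t}, key=lambda t: (sort_key(t), t))
    # Grouping: adjacent terms that share a first letter form one group.
    return [
        (letter, list(group))
        for letter, group in groupby(unique, key=lambda t: t[0].upper())
    ]
```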
Installing DITA OT 1.2beta on Debian Linux
I'm just setting up the DITA Open Toolkit (DITA OT) on my Debian box, and I thought I'd take notes. This isn't really a how-to, but maybe it will help someone.
- Go to the DITA Open Toolkit webpage.
- In the left pane, click on Download.
- There are two choices: the dita-ot released under the CPL or the dita-ot released under the Apache ASL. Use whichever you prefer; the content of both packages is the same. I chose the Apache license.
- Look at the Linux Install Guide
- Extract the archive somewhere. In its current incarnation DITA is hard to share, so take that into account.
- Debian and Java have challenges. Make sure you install Java, at least 1.4 (there have been some problems with JDK 1.5, more on that later)
- Install (using apt) ant, ant-doc, and ant-optional.
- Install xalan, libxalan2-java and libxalan2-java-doc
- If you want to make PDFs, install fop.
- Go into the DITA directory and type "ant all" and you should be set to go.
Optional things:
If you are doing this for personal use, get RenderX's xep instead of fop. You can't really compare the two - xep is the better solution. Their free edition includes a little footer telling everyone about xep, but that's okay by me. They are giving me something free. I don't mind giving them some advertising for it. People pay $50 for a $5 t-shirt advertising Tommy or DKNY and they feel like they got a good deal.
Install eclipse. Managing ant projects can be a real pain. Take advantage of eclipse for doing that.
If you can afford it, buy oXygen. It's worth the $50 they charge, and their license is really good. Things may have changed, but the last time I tried the free XML editors on Linux, they left a lot to be desired, especially debugging support for XSL transformations. Really, I wanted to use emacs. I've used emacs for years, but I was wasting a lot of time and switched back to oXygen.
Use source control. Install CVS or something similar. It's too hard to manage XML projects without source control.
Unique items in xsl:for-each
xsl:for-each is very useful for looping through elements, especially elements that you have grouped together using the Muenchian Method (BTW, name dropping here, I used to work with Steve Muench, and he's one of the smartest, most dedicated people I've ever met - if you're just starting to develop a Java data-driven application, you owe it to yourself to try ADF, especially if you are considering similar frameworks like Spring).
There's one problem with xsl:for-each, though, it doesn't have any type of uniqueness testing. This is a problem for tasks like indexes, that need multi-level uniqueness. For example, not only do you only want to have one index entry for "validation", you only want to have one child index entry of "validation" for "SQL".
Here's an example XML snippet.
[code lang="xml"]
SQL validation
is ...
...
Validation for SQL queries
...
[/code]
and I need the index to look like this
validation SQL 3, 7 queries 7
where the first indexterm element appears on page 3 and the second is on page 7.
If you are using RenderX XEP's extension for indexing, the resulting FO should look something like this:
[code lang="xml"]
Validation
SQL
queries
[/code]
Using grouping, I can easily ensure that I only process "validation" once. The trouble comes when I loop through all the child indexterms.
Thinking this was easy, I tried a grouping (in this example, assume $parent_term is "validation" and the result of text() is "SQL"):
[code lang="xml"]
descendant::indexterm[generate-id(key('second-level-indexterms',text())[1])=generate-id(.)">
[/code]
but that doesn't work. The key will match all top level indexterms with the value "validation", which is both top-level indexterms above. The descendant::indexterm will find the first child of that term with the value of SQL - that's true for both elements in the example above.
Clearly, I can't look at just the child indexterms of the first indexterm with the value "validation"; I need to process all of the child indexterms of each indexterm with the value of "validation". Since these elements are children of different elements, I can't test their uniqueness using generate-id(). So how do I ensure that I only process one indexterm with the value of "SQL" that is a child of an indexterm "validation"?
I tried a few other things, but I couldn't get any closer. Either I duplicated the "SQL" entry, or I lost the other child entries. Eventually I just added a second transformation step to remove duplicates.
What I'd like to see is an attribute for xsl:sort called "unique" that sorts the entries, but restricts the loop to unique occurrences. It'd be tricky (which part of the node or node-tree has to be unique?), but very valuable.
If you can help me with a solution, I'd really appreciate it. I'd hate to suggest adding to XSLT, when there's a solution already available.
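For reference, the duplicate-removing second step boils down to merging entries by their full path of terms rather than by the child term alone. The sketch below shows that logic in Python instead of XSLT, so it is a stand-in for the technique and not the stylesheet itself. `merge_index_entries` and the `(path, page)` input shape are hypothetical, and the sample data in the usage note is one reading of the example above, with "SQL" and "queries" both as children of "validation".

```python
def merge_index_entries(entries):
    """Merge (path, page) pairs into a nested index with no duplicate terms.

    `path` is a tuple of terms from the top level down, e.g. ("validation", "SQL").
    """
    index = {}
    for path, page in entries:
        if not path:
            continue
        level = index
        for term in path:
            # Reuse the node if this term already exists under the same parent.
            node = level.setdefault(term, {"pages": [], "children": {}})
            level = node["children"]
        # Record the page on the deepest term of the path, once.
        if page not in node["pages"]:
            node["pages"].append(page)
    return index
```

Feeding it `[(("validation", "SQL"), 3), (("validation", "SQL"), 7), (("validation", "queries"), 7)]` yields a single "validation" node with one "SQL" child carrying pages 3 and 7 and one "queries" child carrying page 7.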
Can’t use variables in xsl:sort – Ugly
Here's what I'd like to be able to do:
[code lang="xml"]
....
[/code]
That doesn't work, though, because xsl:sort doesn't support using variables for the lang attribute. This bites for multi-language situations where you don't know the language until it's passed in at build time. To make matters worse, if the lang attribute isn't specified, the system language, not the input XML file's xml:lang attribute, is used. On top of that, you can't use a conditional, since the xsl:sort has to be immediately after the xsl:for-each!
So, if I'm on an American English computer, I can't make my Japanese index sort using xsl:sort unless I manually change the lang attribute before running the stylesheet.
Ugly.
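Two possible ways around this, neither tested against this stylesheet. First, the XSLT 1.0 recommendation lists lang on xsl:sort as an attribute value template, so writing it with braces, as in lang="{$lang}", may work on processors that implement that part of the spec. Second, the manual edit can be scripted as a build step that runs before the transformation. The sketch below shows the second idea; `set_sort_lang` is a hypothetical helper, and it rewrites every xsl:sort in the stylesheet, which may be more than you want.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def set_sort_lang(stylesheet_text, lang):
    """Return the stylesheet with lang set on every xsl:sort element."""
    ET.register_namespace("xsl", XSL_NS)  # keep the conventional xsl: prefix on output
    root = ET.fromstring(stylesheet_text)
    for sort in root.iter("{%s}sort" % XSL_NS):
        sort.set("lang", lang)
    # Caveat: ElementTree drops comments and any namespace declaration that is
    # only used inside XPath expressions, so check the output before relying on it.
    return ET.tostring(root, encoding="unicode")
```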
DITA XML, FOs, and Translations
We've gotten a batch of Japanese translated files back, and I'm trying to turn them into good online help and PDFs.
I've really done a lot of specialization on my DITA transformation architecture, so, be aware that some of the things I write concerning translation may be handled gracefully by the current, vanilla, DITA open toolkit.
Lessons I've learned:
- Make sure your font settings are parameterized. Not all fonts support all languages. You want to be able to change them based on the language.
- If your XML files don't have the language set (xml:lang="ja-jp"), and if you are using ant to kick off the processing, make sure you pass DEFAULTLANG to the xslt task in the build target.
- Follow the example of the DITA OT team and make sure you put all label text (footer text, note labels, etc.) in a labels file. In DITA OT, those files are in xslcommon.
Issues I still have:
- I can't pass a string of Japanese text as an ant parameter from the properties file or from the command-line. The text gets changed from UTF-8 to ASCII. It's annoying because I set the title of the PDF via an ant properties file. If I set the parameter in the ant build.xml file I'm okay, but if I set it in the properties file or via the command-line, it's garbled. Any tips?
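One possible cause for the properties-file case, offered as an assumption rather than a diagnosis: Java's Properties loader reads .properties files as ISO 8859-1, so non-Latin text is expected to be written as \uXXXX escapes, which is what the JDK's native2ascii tool produces. If that is what is happening here, escaping the title before it goes into the file should help. The sketch below does the same conversion in Python; `to_properties_escapes` is a made-up name, and it does not address the command-line case, which depends on the console encoding.

```python
def to_properties_escapes(text):
    """Return text with every non-ASCII character written as \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Go through UTF-16 so characters outside the BMP become surrogate pairs,
        # which is the form Java expects.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)
```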
DITA Topics and the section element
DITA has two ways of adding section-like structure to a topic - a topic within a topic and a section within a topic (of course, you can also have a section in a topic in a topic, or a topic in a topic in a topic, but you get the idea).
We've used section quite a bit, and it's really nice for online help. It's a way to keep the topic small, add some sections, and indicate structure. For example, if an object has two groups of fields, you could do something like this (note, this isn't really valid, but it's detailed enough to give you the idea):
[code lang="xml"]
This object is composed of two groups of fields: foo and bar.
Then you can add some more content that describes the object.
[/code]
You can do something similar using nested topics.
[code lang="xml"]
This object is composed of two groups of fields: foo and bar.
....
....
Then you can add some more content that describes the object.
[/code]
Notice how I had to add an extra topic, with a required title, if I wanted to include information about the object after foo and bar, something I didn't have to do with a section.
You can see why section is so useful - there are times when you want to set off a piece of information with a sub-title within a topic.
However, this throws everything for a loop when you want to create output that combines multiple topics, some with nested topics and some with sections.
If you combine those two examples, what hierarchy do you apply to the structure choices?
Imagine your output is HTML.
| DITA Element | HTML Element |
|---|---|
| topic/title | H1 |
| topic/section/title | H2 or H3? |
| topic/topic/title | H2 or H3? |
| topic/topic/section/title | H3 or H4? |
Etc.
Assume that you decide the topic/section/title is higher in the hierarchy than the topic/topic/title. You set topic/section/title to an H2, and topic/topic/title to an H3. But what happens if you have a topic/topic/title without a topic/section/title? Do you make topic/topic/title an H2 or an H3? If you make it an H2, then it's not consistent with the other topic/topic/titles. If you make it an H3, then your HTML isn't well structured (you don't have an intermediate H2 between the H1 and the H3).
Another issue is the body tag. The default DITA XSL FO generation indents the text based on the body element. A topic/section/title is within the body, but the topic/topic/title is not within that body element. So your output looks like this:
[code]
topic/title
    topic/body
    topic/section/title
topic/topic/title
[/code]
This implies, to me, that topic/section/titles should be treated as structurally different from topic/titles. In HTML, we should use H1-H6 only for topic titles, never for section titles.
I'm not sure what the solution should be for this. Section, while useful, just seems to break the whole hierarchy of DITA topics, and I think you should have a plan for how to deal with it before creating content using DITA.
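One way to make such a plan concrete is to write the mapping down as a rule and run it over a topic. The sketch below encodes the policy suggested above, where heading levels follow topic nesting only and section titles get a non-heading style. It is one possible policy, not the toolkit's behavior: `heading_plan` and the "section-title" label are hypothetical, and it only looks at generic topic, body, and section elements, not specializations.

```python
import xml.etree.ElementTree as ET


def heading_plan(topic, depth=1):
    """Yield (title_text, style) pairs for a topic and everything nested in it."""
    title = topic.find("title")
    if title is not None:
        # Topic titles map to h1..h6 according to how deeply the topic is nested.
        yield (title.text, "h%d" % min(depth, 6))
    body = topic.find("body")
    if body is not None:
        for section in body.findall("section"):
            section_title = section.find("title")
            if section_title is not None:
                # Section titles never consume a heading level in this policy.
                yield (section_title.text, "section-title")
    for child in topic.findall("topic"):
        yield from heading_plan(child, depth + 1)
```

With this rule, a nested topic is always an H2 whether or not its parent has sections, which avoids the inconsistency described above at the cost of dropping section titles from the HTML outline.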