Recently in XML, XSL, etc. Category

One of the projects I'm working on is conditional publishing of DITA content using ant. Basically what I want to do is, given a list of files, build the output that is affected when those files change.

We manage our files using Perforce, so the workflow is something like this:

  1. Check out files
  2. Edit/modify content
  3. Run conditional build
  4. Review output
  5. Check-in files

When you check files out of Perforce, they get put into a changelist, so, to get my file list, I need to ask Perforce what files are in the changelist. Perforce has a nice way to do that. The command is,

 p4 opened -c <changelist_number>

The problem is the output includes a bunch of information I don't want.

I get lines like this:

 //doc/main/core/build/build.xml#164 - edit change 1075456 (text) by sanderson@docbuild

what I want is

 //doc/main/core/build/build.xml

or, even better,

 files-in-changelist=/home/sanderson/doc/main/core/build/build.xml

That is the format ant expects for a property file. Property files are nice because you can load them and then use the property in ant.

What to do, what to do? In my case, I turn to ant. If I have a build issue, someone else has probably run into it, so, I look there first.

Sure enough, ant has a nested element called filterchain for exactly these kinds of uses. Here's what I wound up with:

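    <target name="getFilesFromPerforceChangelist" if="changelist">
        <!-- Use the Jakarta ORO regexp engine (its jar must be on ant's classpath) -->
        <property name="ant.regexp.regexpimpl" value="org.apache.tools.ant.util.regexp.JakartaOroRegexp"/>
        <!-- Write the raw "p4 opened" output to changelist.txt -->
        <exec executable="p4" output="changelist.txt" append="false" failonerror="true">
            <arg value="opened"/>
            <arg value="-c"/>
            <arg value="${changelist}"/>
        </exec>
        <copy file="changelist.txt" tofile="changelist.properties">
            <filterchain>
                <!-- Replace everything from the "#" revision marker onward with a comma -->
                <replaceregex pattern="#.*" replace=","/>
                <!-- Replace everything up to the last "build" on the line with ${basedir} -->
                <replaceregex pattern=".*build" replace="${basedir}"/>
                <!-- Join all lines into one, then turn that line into a property -->
                <striplinebreaks/>
                <prefixlines prefix="files-in-changelist="/>
            </filterchain>
        </copy>
    </target>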

Just another reason to love ant.
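
For anyone who wants to sanity-check the result outside ant, here is a rough shell sketch of a similar transformation. It is not what the build above runs. The function name changelist_to_property and its two arguments (the depot prefix and the local directory it maps to) are made up for illustration, and it anchors the path rewrite on the depot prefix instead of using the .*build pattern. It also joins the paths with commas without leaving a trailing comma.

```shell
#!/bin/sh
# Hypothetical helper: turn "p4 opened" output (on stdin) into one property line.
#   $1 = depot prefix to replace (assumed, e.g. //doc)
#   $2 = local directory that prefix maps to (assumed, e.g. /home/sanderson/doc)
changelist_to_property() {
    depot_root=$1
    local_root=$2
    # 1. drop the "#rev - edit change ..." tail
    # 2. swap the depot prefix for the local path
    # 3. join all lines with commas
    # 4. prefix the single resulting line with the property name
    sed -e 's/#.*$//' -e "s|^${depot_root}|${local_root}|" \
        | paste -s -d ',' - \
        | sed 's/^/files-in-changelist=/'
}

# Demo with the sample line from above
printf '%s\n' \
    '//doc/main/core/build/build.xml#164 - edit change 1075456 (text) by sanderson@docbuild' \
    | changelist_to_property //doc /home/sanderson/doc
# prints: files-in-changelist=/home/sanderson/doc/main/core/build/build.xml
```

In real use the input would come from p4 opened -c <changelist_number> piped into the function, with the output redirected to changelist.properties.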

I needed to put a time-stamp in a file I was generating with XSLT and Saxon. Saxon supports parts of EXSLT, and one of the parts supported is date:date-time. It's kind of challenging to figure out exactly how to make it work, though, at least it was for me. In hopes that I'll save someone else some work, here's how to add a time-stamp using Saxon 6.5.5.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:saxon="http://icl.com/saxon"
    xmlns:exsl="http://exslt.org/common"
    xmlns:date="http://exslt.org/dates-and-times"
    xmlns:func="http://exslt.org/functions"
    extension-element-prefixes="saxon exsl date func">

 <xsl:template match="/">
  Date-time: <xsl:value-of select="date:date-time()"/>
  Date-year: <xsl:value-of select="date:year()"/>
  Date-month-in-year: <xsl:value-of select="date:month-in-year()"/>
 </xsl:template>
</xsl:stylesheet>

I hope that helps someone.

oXygen and profiling

I ran into a little bit of a problem profiling an XSL transformation in oXygen today. Any errors at all in the transformation will cause the profiling to return no value. So, just as a note, make sure your transform doesn't return any errors, not even errors related to being unable to open files using the document() function.

MarkLogic on Debian

I really like the MarkLogic product offering. Not just because it's fast and stable, but also because they offer a community license (other CMS and XML database vendors take note - if I can't try it, I probably won't recommend it to my company). So, I decided to install it on my Debian box today. Here's my story.

First off, I didn't have the alien or daemon package installed. You'll need those (among others), so make sure you have those installed. I'm not going to list every package you need. If you're using Debian, you can probably figure that out yourself.

  1. Download it from MarkLogic. I'm running on a 32-bit processor, so I downloaded the Red Hat 32-bit rpm.
  2. Use alien to convert the rpm to a deb.
  3. Use dpkg -i debfile
  4. mkdir /var/lock/subsys
  5. Edit /etc/init.d/MarkLogic and change the line
    . /etc/rc.d/init.d/functions
    to
    #. /etc/rc.d/init.d/functions
  6. Start the server - /etc/init.d/MarkLogic start
  7. Follow the rest of the install guide
It's really easy, so you have no excuse for not using a fully functional XML database.
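
The steps above can be sketched as shell commands. This is a sketch under assumptions, not a tested installer: <rpmfile> and <debfile> are placeholders for whatever you downloaded and whatever alien generates, everything needs root, and the helper name comment_out_rc_functions is made up here. Only step 5 is written as real code; the rest is listed in comments.

```shell
#!/bin/sh
# Step 5 as a hypothetical helper: comment out the Red Hat-only include
# in the init script passed as $1. Running it twice is harmless.
comment_out_rc_functions() {
    init_script=$1
    # Write the edited copy to a temp file, then copy it back with cat
    # so the script keeps its original permissions (it must stay executable).
    sed 's|^\([[:space:]]*\)\. /etc/rc\.d/init\.d/functions|\1#. /etc/rc.d/init.d/functions|' \
        "$init_script" > "$init_script.tmp" \
        && cat "$init_script.tmp" > "$init_script" \
        && rm -f "$init_script.tmp"
}

# The full sequence, run as root (placeholders in angle brackets):
#   alien --to-deb <rpmfile>                          # step 2
#   dpkg -i <debfile>                                 # step 3
#   mkdir -p /var/lock/subsys                         # step 4
#   comment_out_rc_functions /etc/init.d/MarkLogic    # step 5
#   /etc/init.d/MarkLogic start                       # step 6
```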


Day 2 and 3 at CMS Strategies

Day 2 and 3 at the content management strategies conference - quick hits.

The TEXTML team does a brilliant job of marketing their XML database as a CMS. It is, if you do a bunch of implementation work, but out of the box, it isn't a CMS.

DITA is starting to move to other communities, not just technical publication groups. Legal and other industries are just starting to move into it, but that's a great sign for the long term health of DITA.

Scott Wolff gave a very good presentation on the "5 Mistakes to Avoid when Buying a CMS". I won't steal his thunder (I'm sure he'll have the presentation on his web site), but here are the big things I took away from it:

  1. Buy a hosted CMS if one meets your needs. If it doesn't, buy an on-site CMS, if one fits your needs. If neither is possible, then, and only then, build your own.
  2. Start small, and build. Implement a CMS for one group in your company and then add other groups over time.
  3. Not every group needs a CMS.

IBM still thinks putting all information that is accessed via conrefs into a single (or multiple) shared files is a good plan. I disagree. It's no better than having a bunch of topics with a single element (the method I discussed in Conrefs and “Shared Content”) that are designed for sharing. Yes, you need to avoid a "spaghetti sharing model", but neither of these methods solves the problem.

Last, but not least - I can be a real loud-mouth when it comes to open forums.

I'm attending the Content Management Strategies conference here in San Francisco (you can take a look at the agenda if you are interested in it).

Some quick comments on day 1.

Lots of people! Many more than last year's conference in Annapolis. The welcome presentation said it was 68% larger, and over 300 people were there.

The vendor area is too crowded. I wouldn't be very happy if I was a vendor.

The keynote presentation was slanted, and a bit dull. There was an assumption that using an enterprise CMS is the only way to fly, yet what a CMS is wasn't defined and the reasons why a CMS is required were not very detailed.

Search and context

John Battelle came by to talk to a bunch of us folks at salesforce.com yesterday. He gave an interesting talk that combined points from his book (The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture) and some other, related comments. One of the things he talked about is a search scenario he recently posted on his blog: Google Launches Biz Local AdWords: It's Just the Start..... Go read the scenario and then come back here (I know his blog is much more interesting than mine, but let's finish this dialog, then go read more there).

The scenario is interesting from my point of view as an information designer that wants to make information access more effective. The truth is, for most things, the problem isn't finding the information, it's finding the appropriate information. John's scenario is very similar to an existing search, but adds two contexts - location (the shopper is in a particular location) and interest (s/he wants to buy a bottle of wine, known from the location [in a grocery store's wine section] but also from the search engine choice).

The context is what makes the search results valuable. I can do this search now while at the grocery store using my Nokia 770. The information I want will be retrieved, but since the search engine won't know the context, I will have to filter it myself to gain the same value. I probably won't be very successful because there is too much information for me. People are amazing pattern matchers, but we cannot effectively handle the volume of content such a search would return.

Traditional documentation does the same thing. We put out there a bunch of information, and the users have to use the table of contents, index, or search to find the content they want, and then they have to filter it for their context. We need to add context if we're going to solve the information glut problem.

Context sensitive help is a step in the right direction. We link from a particular UI to a particular bit of information. That's good, but it's very rare that a UI is simple enough that one help topic can be sure to give the user exactly what they want. Usually the content is too broad, forcing the user to dig into it (which takes away most of the advantages you gain from the context sensitive link). Sometimes the content is too narrow, which is even worse.

Contextual embedded help is the next step. Some software products already do this. Tax software programs are a great example. They give you information on how to complete each task without forcing you to ask for help. They don't forget though, that users often need more help, and they include links to that as well.

One thing that I haven't seen in embedded contextual help is something I call "the onion" for lack of a better term.

Embedded context help needs to be specific enough to be useful, but if it's too specific, it might not help the user put the current task in the context of their goal (quick aside, a task is a specific action, and a goal is what you want to accomplish - for example, figuring out your gross income is a task, completing your taxes is a goal). Each embedded context help topic needs to include a weighted related topics list that gets more general the further out you go. Depending on the topic, it may also have a related topics list that gets more specific.

Here's an example. My goal is to complete my taxes. My current task is to compute my gross income. When I get to the UI page for doing that, the embedded help topic is "Computing your gross income". It has two sets of links, one a list of more specific tasks ("Calculate earned interest income", "Calculate income from salary", etc.). The second set of links is more general and includes concepts about gross income, other types of income, and a link to the overview of the entire tax preparation process.

It's an onion rather than a ladder because you may move out to a more general layer, but it may be to a topic that isn't a direct hierarchical parent of the topic you are looking at. For example, from computing your gross income, you may move to a conceptual topic on the alternative minimum tax. Related topics, sure, but you wouldn't put them in the same information hierarchy.

There's a huge information design problem that needs to be solved to make this model work, but I think it's worth the effort.

A neat idea for conref dependencies?

Still on conrefs and dependencies.

I think I've just come up with a neat idea for handling conrefs in a mythical CMS (a reminder, we don't use a CMS, so I don't know if any CMS already does this or could do this).

When I create a conref in Topic B to an element in Topic A, the CMS includes metadata that indicates which version of Topic A I'm conref'ing from. A writer changes Topic A. I open Topic B for editing, and my CMS tells me, "Topic A/elementname has changed. Do you want to review the change or keep your existing content?" If I review it, I see the standard three pane merge tool view - my current topic with the old content, my topic with the new content, and Topic A. There I can choose to consume the content from Topic A (the conref is unchanged), reject it completely (losing all content for that element), or keep the previous version of the content (the element with the conref gets replaced by an element with the content from the earlier version of Topic A).

The second part of this would be reporting on conrefs. Before publishing, you run a report that notifies you of all the topics with conrefs to versions of topics that are not the most recent version and have not been resolved. This allows you to see at a glance conrefs that may no longer be valid.

A neat idea or mad ravings of a hungry geek? You tell me.

Conrefs and “Shared Content”

XMetal implements a way of managing conrefs that I've heard a few people suggest as a method for getting around the dependency issues raised by conrefs. I think this is only half a solution, and causes other long term problems.

Here's the idea. When you want to conref an element from Topic A into Topic B, you open Topic A, copy the appropriate element into a new XML file, save that file somewhere (let's call this file SharedA.xml), and put a conref into both Topic A and Topic B. The way this is supposed to help is that when you are editing a file that has been marked as a shared file, you know other files depend on it, and you need to be careful about that dependency.

Nice idea, but look at the workflow of updates.

I'm writing Topic B. It has a conref to SharedA.xml. I need to make a change to that conref'd content. I know SharedA.xml is a special file so I need to check the dependencies. How do I do that? The only way I can think of is to do a full text search of my repository (whether that's a filesystem, SCMS, or CMS) for all references to SharedA.xml. Fine. I do that and find out that Topic A and Topic B both depend on SharedA.xml. Now what do I do? I have three choices:

  1. Remove the conref in Topic B and write new material, or create a new shared content file for my special text and change the conref
  2. Make the change and notify the person responsible for Topic A
  3. Make the change and check the content in Topic A myself

With option 1, you decrease reuse.

With option 2, you have no process to ensure Topic A is still correct. Who's now responsible to make sure Topic A is still right?

Option 3 implies that I'm skilled enough to know how to write Topic A myself. If I'm not, I might foul the whole thing up, and, thus, there's no way to ensure that my change is correct.

Knowing that the file is a special file is nice, but this group of shared files adds some extra problems to project management.

  1. None of these files are in ditamap files
  2. Shared files don't expire
  3. Ownership becomes muddled
And I'm sure I could think of some other issues if I thought about it some more.

Not having a file in a ditamap is just asking for trouble. The best way to manage DITA projects is to use ditamaps. They tell you what files are included in which projects and they include a bunch of metadata about each topic. Imagine you have 50 shared files. You cannot look at your ditamap(s) to figure out where those files are used, so you have to fall back to full text searching to determine whether a file is needed in a project or not.

Point 2 is very similar to point 1. Shared files will never be deleted because you don't know whether they are being used. So, your shared file repository will get bigger, and bigger, and bigger, until your writers can't find anything in it.

Point 3 is more minor, but could be an issue for some companies. If more than one writer is depending on the content in a shared file, how do you determine ownership, and, as a consequence of ownership, who is responsible for maintaining and editing a shared file?

A better solution is to use dependency checking. When Topic A is opened, the writer is notified that an element in Topic A is conref'd by Topic B. The writer is then clearly responsible for ensuring that the content in Topic A and Topic B makes sense after editing, and no extra non-managed files are created.

Sadly, I don't think there are any non-CMS tools for doing this kind of dependency checking, nor can I think of an easy way to create one.
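
Short of a real tool, the full-text search fallback can at least be scripted. The following is a crude sketch, not dependency checking: the function name where_used is made up, and it assumes topics are .xml or .dita files under one directory and that conrefs are written with double quotes, as in conref="path/SharedA.xml#id". It matches on the file name only, so two files with similar names in different directories will be confused, and the "." in the name is treated as a regex wildcard.

```shell
#!/bin/sh
# Hypothetical helper: list the files under a directory that conref a given file.
#   $1 = file name being referenced (e.g. SharedA.xml)
#   $2 = directory to search (defaults to the current directory)
where_used() {
    target=$1
    root=${2:-.}
    # grep -l prints only the names of files containing a matching conref;
    # the trailing [#"] requires the file name to end the path part of the value.
    find "$root" -type f \( -name '*.xml' -o -name '*.dita' \) \
        -exec grep -l -e "conref=\"[^\"]*${target}[#\"]" {} + | sort
}

# Usage: where_used SharedA.xml /path/to/topics
```

It tells you which files to look at before you edit a shared element. It does not tell you whether the change still makes sense in those files, which is the part that really needs a person.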

I applaud XMetal for trying to find a solution to this problem, but I hope that other vendors don't follow their lead in this case.

DITA 2006 Conference

I attended the DITA 2006 conference in Raleigh, NC, last week. Overall, it was a very good conference. Good networking opportunities, some really good meetings with vendors, some really nice presentations.

One of the more interesting things that I noticed was the number of attendees that were surprised, perhaps astounded, that IBM doesn't use a CMS for their content.

I've covered this here before, but I think traditional content management systems, which are usually designed to work with binary files that represent documents, need to change to support a topic based system like DITA. CMS vendors need to learn from SCMS tools like Perforce, ClearCase, etc. When you have more than 2,000 XML files, and a large team of writers without strict boundaries, diff and merge, build and test, dependency analysis, branching, and other common SCMS functions are vitally important.

SiberLogic seems to understand that software documentation and software development are moving closer together in their needs. However, I heard some rumblings from the crowd that SiberSafe, SiberLogic's CMS tool, may not yet be ready for prime time. It does have one feature that I think is vital for DITA CMS tools - "Where Used".

When a writer opens a topic for editing, they need to know the side effects of the changes they make. When you are writing code, the build will fail if you screw up a dependency. In DITA, your dependencies can fail in two ways:

  1. Just like code, you can break a dependency.
  2. Unlike code, you can change the context of an element in a way that makes a dependent topic invalid.
I've written about this topic before, so I won't go into it now. Suffice it to say any information architect that isn't concerned about this isn't thinking clearly.