DITA to WordPress Import Tool

This ‘plugin’ is a DITA to WordPress importer. Specifically, it is a WordPress import module that takes the two-pane ‘Web Help’ output from the DITA Open Toolkit and imports the hierarchy of XHTML pages into WordPress. It imports images too, though not as WordPress attachments.

This tool was written as part of an online help project in my last job. As an add-on to WordPress to be distributed to customers, it was licensed under the GNU GPL Version 2 with the explicit understanding of my employers.

I have retained their copyright notice, as it was written for them, though the concept, ideas and implementation are all mine.

There is also a zip file for you to download containing the sample DITA web help files that come with the DITA-OT.

Feedback is welcome. Please use the comment box below.


Here are the contents of the readme, almost verbatim:

 

This importer was written to import the XHTML output of the DITA[1] Open Toolkit[2], a tool which takes XML topics in DITA format and converts them to a number of formats, including PDF, Win Help, and XHTML. It uses the body tag to grab what it needs.

It is very rough and specific to the in-house requirements of Northgate (my last company). It also works on WPMU.

It uses PHP 5’s XML manipulation, and at least one part requires MySQL 5 (for sub-selects). It also has some quirky stuff in it. For instance, importing 1200 files in one go on Windows always used to time out (the PHP timeout calculation on Windows uses wall-clock time, not CPU time), so the import can be restarted and will carry on from where it left off.

I mentioned it and DITA over in this post [3] on WP-Docs as part of this conversation [4].

It expects the XHTML output from producing ‘web help’ with DITA-OT 1.4. This is a hierarchical tree of XHTML files with a top-level two-pane frame index file, which shows a table of contents in one pane and the help topics in the other.

It imports those help topics, grabbing the contents of the body tag and doing some manipulation to get everything to work in WordPress as well as satisfy the original requirements.

It uses a staging table (automatically created) and can be re-run to update the same topics (if you regenerate them). It can also be re-run to continue processing if there is a failure halfway through.
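
The restart behaviour can be sketched roughly as follows. This is an illustration in Python with SQLite, not the importer's own PHP and MySQL code; the table layout, the `done` column and the function names are all assumptions made for the example. The batch size of 75 matches the default loop size mentioned further down.

```python
import sqlite3

BATCH_SIZE = 75  # same figure as the readme's default loop size


def create_staging(conn):
    """Create the (hypothetical) staging table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging ("
        "slug TEXT PRIMARY KEY, body TEXT, done INTEGER NOT NULL DEFAULT 0)"
    )


def process_pending(conn, handle, batch_size=BATCH_SIZE):
    """Process rows not yet marked done, one batch at a time.

    Each batch is committed on its own, so a run that dies part-way
    can be started again and will skip the batches already finished.
    Returns the number of rows processed in this run.
    """
    processed = 0
    while True:
        rows = conn.execute(
            "SELECT slug, body FROM staging WHERE done = 0 "
            "ORDER BY slug LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return processed
        with conn:  # commits the batch, or rolls it back if handle() raises
            for slug, body in rows:
                handle(slug, body)
                conn.execute(
                    "UPDATE staging SET done = 1 WHERE slug = ?", (slug,)
                )
                processed += 1
```

Because a failed batch is rolled back, the rows in that batch are handled again on the next run, so `handle` needs to be safe to repeat.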

Basic processing is as follows:

You supply the path to the top of the DITA output tree (where index.html is generated). If under WPMU, you also supply the blog into which you want to import.

It then loads all the files it can find (explicitly ignoring index.html) into the staging table.
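
As a rough illustration of that file discovery step (in Python, not the importer's PHP), the sketch below walks the output tree and skips the top-level index.html. The function name, the `.html` extension filter and the choice to skip only the top-level index are assumptions for the example.

```python
from pathlib import Path


def find_topic_files(root):
    """Return every .html file under root except the top-level index.html."""
    root = Path(root)
    top_index = root / "index.html"
    # rglob walks the whole tree, so nested topic folders are included.
    return sorted(p for p in root.rglob("*.html") if p != top_index)
```

Calling `find_topic_files("../wp-content/web")` would return the topic files under that folder, which could then be loaded into the staging table one by one.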

The load process does the following:
* It converts the paths of any links to other files.
* It strips out the empty anchor tags that DITA-OT generates (it adds ids *and* empty anchors as fragment targets!).
* It takes the meta tag ‘description’ and uses that as the excerpt.
* It takes the meta tag ‘keywords’ and pops them into a page meta tag.
* It looks for some specific internal meta tags and saves those (deleted, replaced-by, and prodname).
* It then finds image references, copies the images to the blog directory and adjusts the paths in the HTML (for WPMU it puts them in the correct blog files directory; for standard WP it puts them in the blog root!).
* It then removes some DITA-OT specific stuff we didn’t want (the short description for related links, though it leaves the links).
* It finds a specific span that ought to be a heading and turns it into one (h3).
* It finds the parent of the page (if there is one) and stores it so the hierarchy will work.
* It extracts the cleaned body contents, ready to be the page contents.
* It grabs the HTML page title and uses that as the WP page title.
* It uses the DITA id as the slug for the page.
* We also had a requirement that the DITA id of the page match the HTML filename. I’ve made that optional, in which case it uses the filename as the slug.
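
A few of the load steps above (meta tags, title, empty anchors, body contents, slug) might look something like this. It is a Python sketch rather than the importer's PHP, it assumes each file parses as well-formed XML, and it assumes the DITA id ends up as the `id` attribute of the body tag; the function names and the returned dictionary keys are made up for the example.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Turn '{http://www.w3.org/1999/xhtml}p' style tags into plain 'p'."""
    for el in root.iter():
        if el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]


def is_empty_anchor(el):
    """True for <a> elements with no href, no text and no children."""
    return (
        el.tag == "a"
        and not el.get("href")
        and not (el.text or "").strip()
        and len(el) == 0
    )


def remove_keeping_tail(parent, child):
    """Remove child from parent but keep any text that followed it."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def load_topic(xhtml_text, fallback_slug=""):
    """Extract the pieces of one topic file that a staging row would need."""
    root = ET.fromstring(xhtml_text)
    strip_namespaces(root)
    head, body = root.find("head"), root.find("body")

    # Collect <meta name="..." content="..."/> pairs from the head.
    meta = {
        m.get("name"): m.get("content", "")
        for m in head.iter("meta")
        if m.get("name")
    }

    # Strip the empty anchors that only exist as fragment targets.
    for parent in list(body.iter()):
        for child in list(parent):
            if is_empty_anchor(child):
                remove_keeping_tail(parent, child)

    # Serialise everything inside <body>, without the body tag itself.
    content = (body.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in body
    )
    return {
        "title": head.findtext("title", default=""),
        "excerpt": meta.get("description", ""),
        "keywords": meta.get("keywords", ""),
        "slug": body.get("id") or fallback_slug,  # assumption: id is on <body>
        "content": content,
    }
```

Link rewriting, image copying and the span-to-h3 conversion are left out here to keep the sketch short.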

The next step of the import looks to see which, if any, of the imported pages are updates to existing ones (the id/filename will match an existing slug). For those it does an update, not an insert, and if an updated page had comments it records that page so it can be listed in a post about updates (an internal requirement).
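
The slug matching could be sketched like this (Python, for illustration only). The shape of `existing_pages`, a dictionary from slug to an id and comment count, is an assumption for the example and not how the importer stores things.

```python
def split_updates_and_inserts(staged_slugs, existing_pages):
    """Sort staged slugs into updates, inserts and commented updates.

    existing_pages maps slug -> {"id": int, "comment_count": int}.
    """
    updates, inserts, commented = [], [], []
    for slug in staged_slugs:
        page = existing_pages.get(slug)
        if page is None:
            # No page with this slug yet, so it is a new page.
            inserts.append(slug)
            continue
        updates.append((slug, page["id"]))
        if page["comment_count"] > 0:
            # Remember it for the post about updated pages.
            commented.append(slug)
    return updates, inserts, commented
```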

Then it processes those updates. By the way, the WP revision stuff works: it creates a new revision each time you update the page.

Next it inserts the new pages.

Then it has to flush the rewrite rules. We had great problems with internal links and rewrite rules, so there is probably a bit of belt and braces stuff going on.

Next it revisits all the pages, resolving the parents correctly so that the hierarchy is created properly. And then it has to flush the rewrite rules again (the paths have changed).
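
The reason for a second pass is that a parent's page id is only known once the parent has been inserted. A minimal Python sketch of that pass, assuming each page record carries its own slug and its parent's slug (the field names are invented for the example):

```python
def resolve_parents(pages):
    """Fill in parent_id for every page from the slug of its parent.

    pages is a list of dicts with "id", "slug" and "parent_slug" keys.
    Pages with no known parent get parent_id 0, which is what WordPress
    uses for a top-level page.
    """
    id_by_slug = {page["slug"]: page["id"] for page in pages}
    for page in pages:
        page["parent_id"] = id_by_slug.get(page["parent_slug"], 0)
    return pages
```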

Finally it calls update_guids, which is more belt and braces.

There is an option to empty your posts table before importing. You would not normally want to do that! There is another option to delete the pages which are still referenced in the staging table. It is a clean-up step after a failed import, less drastic than cleaning up everything. It is hopefully not needed now.

In the source file itself there are a couple more settings you can adjust. There are two different debug levels: set $debug and/or $debug_extra to true.

The loop size (how many records to process at once) is adjustable (default 75).

There is even an option to import posts instead of pages. This is experimental and probably wouldn’t work: for instance, it needs to detect category meta tags in the XHTML and add them to the post, and the add-to-post code is only half there.

[1] http://dita.xml.org/book/getting-started
[2] http://dita.xml.org/wiki/the-dita-open-toolkit
[3] http://comox.textdrive.com/pipermail/wp-docs/2009-January/001890.html
[4] http://comox.textdrive.com/pipermail/wp-docs/2009-January/001862.html

== Install­a­tion ==

Copy this file to the wp-admin/import folder. It is not a plugin.

16 thoughts on “DITA to WordPress Import Tool”

  1. I’ve just tested this with the test site and as you probably know it worked quite well. Do you have plans to add any functionality to the plugin? I’m not sure what I would have to do to prepare a regular HTML site for this to work. The dita.list file with the demo site, for example: how do I create one? Is it needed? The demo site is also a frame set with a TOC. That’s easy enough to create for an HTML site in prep for import, but again, does it need to be a frame set?
    Also the dita.list file refs: user.input.dir=/home/mike/Desktop/DITA/DITA-OT1.4/samples
    which I guess isn’t needed.
    Lastly, is there any way that you could have the images moved to the wp upload dir, maybe all into a new folder called dita_import?

    • Hi.
      I’m glad you were able to get the importer working.
      The dita.list file is simply an artifact of the DITA process; it is not used. Similarly, the index.html and toc.html are generated by the DITA process to create a complete (if plain) web help system; they are not used by the importer either.
      The importer very specifically expects the XHTML output of the DITA process. The XHTML files must be valid XML files, which is why I specify XHTML, and I expect the URLs between pages must be relative ones. Otherwise I don’t think there is anything special. If things like the short description are not present, it will probably just carry on.

      As for the images, it would be best if they were added as WordPress attachments, but so far I haven’t looked into that. I may have a look later today at making it copy them to wp-content/uploads as a starting point.

  2. Pingback: Merging Worlds: DITA and WordPress | I'd Rather Be Writing - Tom Johnson

  3. Hi Mike,
    You’ve done a great job in creating a DITA-import utility. I watched it work in Tom Johnson’s video.
    I’m learning to use DITA, and your work is a very valuable tool.
    I’m wondering if there is sensitivity to the path of the web sample which you included. I’ve found that ditahelp.php can find the web sample folder if I place it in wp-content. In that case, I type ../wp-content/web in the DITA help directory text field. ditahelp.php does find the directory, but it hangs:
    * Clearing staging table.
    * Processing directory: ../wp-content/web

    Should I place the web sample directory elsewhere? I’m using Yahoo webhosting. ditahelp.php doesn’t seem to recognize the DITA help directory path if I place it farther away. I don’t have access to the actual path on the Yahoo server, which is why I expressed the path in relative terms.

    Thank you for any advice you can offer. Again, great work.
    Bob Kauten

    • Hi Bob,
      I saw your other comment about PHP versions being the problem. But you do raise an important point.
      The path to which you upload your DITA output must be accessible to the user or process the web server runs as. For shared hosting that pretty much means a publicly visible web directory, e.g. wp-content.
      But if you have more control or options, e.g. a dedicated server, a local server, or an internal server, then you can be more flexible.

  4. Hi Mike,
    I found the script.log for ditahelp.php, and it says:

    “PHP Fatal error: Cannot instantiate non-existent class: recursivedirectoryiterator in /blog/wp-admin/import/ditahelp.php on line 360”

    The RecursiveDirectoryIterator class was introduced in PHP 5.0, and Yahoo webhosting is running PHP Version 4.3.11. I believe that explains the difficulty.

    Thanks,
    Bob Kauten

    • Hi Bob,
      It does mention in the readme and the page above that it uses PHP 5. I’m afraid it will only work with that version. Though it is possible to do the directory walking a different way in PHP 4, the XML parsing stuff won’t work, so there’s no point.

  5. Hi Mike,

    Congratulations on creating a great tool for the DITA community! If you’re interested in distributing your DITA2Wordpress tool (or at least having it listed as a resource) under the open-source DITA2Wiki Project on SourceForge, please contact me at lisa dot dyer at lombardi dot com.

    http://sourceforge.net/projects/dita2wiki/

    The DITA2Wiki Project promotes best practices and tools for marrying DITA with wikis and other Web 2.0 apps. Distributions currently include the DITA2Confluence tool (which bypasses the DITA OT to generate wiki output directly from DITA).

    Cheers,

    - lisa

  6. Hi,

    I’m using this DITA importer tool. Each and every step matches the demo video, but in the end the content does not appear in my blog. How do I rectify this?

    Thanks in advance
    Bal

    • Hi Bal,
      The import tool creates pages. Do you have a theme which lists pages? Try switching to the default theme to see whether they appear. They should be in the sidebar on the home page.

  7. Pingback: dita2wordpress – Import Tool installieren und anwenden » Ditalog 0.1

  8. Pingback: Wordpress feat. DITA: Publication d'articles à partir de sources XML | Docster

  9. Pingback: What are you doing with DITA? | DITA Chicks Blog
