DITA to WordPress Import Tool

This ‘plugin’ is a DITA to WordPress importer. Specifically it is a WordPress import module which will take the two-pane ‘Web Help’ output from the DITA Open Toolkit and import the hierarchy of XHTML pages into WordPress. It will import images too, though not as WordPress attachments.

This tool was written as part of an online help project in my last job. As an add-on to WordPress to be distributed to customers, it was licensed under the GNU GPL Version 2 with the explicit understanding of my employers.

I have retained their copyright notice, as it was written for them, though the concept, ideas and implementation are all mine.

There is also a zip file for you to download containing the sample DITA web help files that come with the DITA-OT.

Feedback is welcome. Please use the comment box below.

Here are the contents of the readme almost verbatim:

It was written to import the XHTML output of the DITA[1] Open Toolkit[2], a tool which takes XML topics in DITA format and converts them to a number of formats, including PDF, Win Help, and XHTML. The importer uses the body tag to grab what it needs.

It is very rough and specific to the in-house requirements of Northgate (my last company). It also works on WPMU.

It uses PHP5’s XML manipulation, at least one part requires MySQL 5 (for sub-selects), and it has some quirky stuff in it. For instance, importing 1200 files in one go on Windows used to always time out (PHP’s timeout calculation on Windows uses wall-clock time, not CPU time), so the import can be restarted and it will carry on from where it left off.

I mentioned it and DITA over in this post [3] on WP-Docs as part of this conversation [4].

It expects the XHTML output from producing ‘web help’ with DITA-OT 1.4. This is a hierarchical tree of XHTML files with a top level two pane frame index file with a table of contents in one pane and the help topics in the other.

It imports those help topics, grabbing the contents of the body tag and doing some manipulation to get everything to work in WordPress as well as satisfy the original requirements.

It uses a staging table (automatically created) and can be re-run to update the same topics (if you regenerate them). It can also be re-run to continue processing if there is a failure half way through.
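The re-run behaviour can be pictured with a small sketch. This is Python with SQLite, not the importer's PHP and MySQL, and the table layout, the `processed` flag and all function names are assumptions made for illustration only; the real staging table may look quite different.

```python
import sqlite3


def open_staging(conn):
    # Create the (hypothetical) staging table if it does not exist yet.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS staging (
               slug      TEXT PRIMARY KEY,
               title     TEXT NOT NULL,
               body      TEXT NOT NULL,
               processed INTEGER NOT NULL DEFAULT 0
           )"""
    )


def stage_topic(conn, slug, title, body):
    # Loading the same topic again overwrites it and marks it as pending,
    # which is what lets a regenerated topic be imported as an update.
    conn.execute(
        "INSERT OR REPLACE INTO staging (slug, title, body, processed) "
        "VALUES (?, ?, ?, 0)",
        (slug, title, body),
    )


def pending(conn):
    # Topics still waiting to be imported; a restarted run starts from here.
    rows = conn.execute(
        "SELECT slug FROM staging WHERE processed = 0 ORDER BY slug"
    )
    return [row[0] for row in rows]


def mark_done(conn, slug):
    conn.execute("UPDATE staging SET processed = 1 WHERE slug = ?", (slug,))
```

A run that dies half way through leaves the unprocessed rows behind, so the next run only has to ask `pending()` for what is left.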

Basic processing is as follows:

You supply the path to the top of the DITA output tree (where index.html is generated). If under WPMU, you also supply the blog into which you want to import.

It then loads all the files it can find (explicitly ignoring index.html) into the staging table.
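As a rough illustration of that directory walk, here is a Python sketch (the importer itself is PHP and uses RecursiveDirectoryIterator, as a comment below notes). The function name is hypothetical, and two details are assumptions: that topics have a `.html` extension and that only the top-level index.html is skipped.

```python
from pathlib import Path


def find_topic_files(root):
    """Return topic paths relative to root, skipping the frameset index."""
    root = Path(root)
    topics = []
    for path in root.rglob("*.html"):
        # Assumption: only the index.html at the top of the tree is ignored.
        if path.parent == root and path.name == "index.html":
            continue
        topics.append(path.relative_to(root).as_posix())
    return sorted(topics)
```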

The load process does the following:
* it converts the paths of any links to other files
* it strips out the empty anchor tags that DITA-OT generates (it adds ids *and* empty anchors as fragment targets!)
* it takes the meta tag ‘description’ and uses that as the excerpt
* it takes the meta tag ‘keywords’ and pops the keywords into a page meta tag
* it looks for some specific internal meta tags and saves those (deleted, replaced-by, and prodname)
* it then finds image references, copies the images to the blog directory and adjusts the paths in the HTML (for WPMU it puts them in the correct blog files directory, for standard WP it puts them in the blog root!)
* it then removes some DITA-OT specific stuff we didn’t want (the short description for related links, though it leaves the links)
* it finds a specific span that ought to be a heading and turns it into one (h3)
* it finds the parent of the page (if there is one) and stores it so the hierarchy will work
* it extracts the cleaned body contents ready to be the page contents
* it grabs the HTML page title and uses that as the WP page title
* it uses the DITA id as the slug for the page
* we also had a requirement that the DITA id of the page match the HTML filename; I’ve made that optional, in which case it uses the filename as the slug
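A few of the steps above (title, excerpt, keywords, empty anchors, body contents) can be sketched in Python with the standard library XML parser. This is not the importer's PHP code. The function names are hypothetical, and the sketch assumes that the DITA id appears as the `id` attribute of the body tag and that the file parses as plain XML with no undeclared entities.

```python
import xml.etree.ElementTree as ET


def _remove_keeping_tail(parent, child):
    # Drop child but keep any text that followed it in the document.
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def parse_topic(xhtml):
    root = ET.fromstring(xhtml)
    # Strip the XHTML namespace so tags can be matched by their plain names.
    for element in root.iter():
        element.tag = element.tag.rsplit("}", 1)[-1]

    meta = {
        m.get("name"): m.get("content", "")
        for m in root.iter("meta")
        if m.get("name")
    }
    body = root.find("body")

    # Remove anchors with no href, no children and no text (fragment targets).
    for parent in list(body.iter()):
        for child in list(parent):
            if (child.tag == "a" and child.get("href") is None
                    and len(child) == 0 and not (child.text or "").strip()):
                _remove_keeping_tail(parent, child)

    # Inner markup of the body becomes the page content.
    content = (body.text or "") + "".join(
        ET.tostring(child, encoding="unicode", method="html") for child in body
    )
    return {
        "title": root.findtext("head/title"),
        "excerpt": meta.get("description", ""),
        "keywords": meta.get("keywords", ""),
        "slug": body.get("id"),  # assumption: DITA id is carried on <body>
        "content": content,
    }


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta name="description" content="How to change the oil." />'
    '<meta name="keywords" content="oil, garage" />'
    "<title>Changing the oil</title></head>"
    '<body id="changingtheoil"><a name="changingtheoil"><!-- --></a>'
    "<h1>Changing the oil</h1><p>Drain it.</p></body></html>"
)
topic = parse_topic(SAMPLE)
```

With the made-up `SAMPLE` page, `topic["content"]` holds the heading and paragraph without the empty anchor, and `topic["slug"]` is `changingtheoil`.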

The next step of the import looks to see which, if any, of the imported pages are updates to existing ones (the id/filename will match an existing slug). It will do an update for those, not an insert, and it will record the updated ones if they had comments, to be squirted into a post about updates (internal requirement).
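The update-versus-insert decision boils down to a slug lookup. A minimal Python sketch, with a hypothetical function name and plain lists standing in for the staging table and the existing WordPress pages:

```python
def split_updates(staged_slugs, existing_slugs):
    """Split staged slugs into (updates, inserts), keeping staged order."""
    existing = set(existing_slugs)
    updates = [slug for slug in staged_slugs if slug in existing]
    inserts = [slug for slug in staged_slugs if slug not in existing]
    return updates, inserts
```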

Then it will process those updates. By the way, the WP revision stuff works — it will create a new revision for each time you update the page.

Next it inserts new pages.

Then it has to flush the rewrite rules. We had great problems with internal links and rewrite rules, so there is probably a bit of belt and braces stuff going on.

Next it revisits all the pages resolving the parents correctly — so that the hierarchy is created properly. And then it has to flush the rewrite rules again (the paths have changed).
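Why the parents need a second visit is easiest to see in a toy model. The Python sketch below is not the importer's code; the dictionary layout and names are made up, and ids are simply handed out in order. The only WordPress fact it relies on is that a top-level page has a parent of 0.

```python
def import_pages(topics):
    """topics: list of {"slug": ..., "parent_slug": ... or None}."""
    pages = {}
    next_id = 1

    # Pass 1: create every page with no parent, so every slug gets an id.
    for topic in topics:
        pages[topic["slug"]] = {"id": next_id, "parent_id": 0}
        next_id += 1

    # Pass 2: now that all ids exist, point each child at its parent,
    # even if the child was loaded before the parent.
    for topic in topics:
        parent_slug = topic.get("parent_slug")
        if parent_slug in pages:
            pages[topic["slug"]]["parent_id"] = pages[parent_slug]["id"]
    return pages
```

If the child is listed first, its parent id is still filled in correctly, because the parent's id is known by the time the second pass runs.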

Finally it calls update_guids (more belt and braces).

There is an option to empty your posts table before importing. You would not normally want to do that! There is another option to delete the pages which are still referenced in the staging table. It is a clean-up step after a failed import, less drastic than cleaning up everything, and hopefully it is not needed now.

In the source file itself there are a couple more settings you can adjust. There are two different debug levels: set $debug and/or $debug_extra to true.

The loop size (how many records to process at once) is adjustable (default 75).
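Processing in fixed-size chunks can be sketched like this in Python. The generator name is hypothetical and the importer's own loop is PHP; only the default of 75 comes from the readme.

```python
from itertools import islice


def batches(items, size=75):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

For 200 staged records this gives chunks of 75, 75 and 50, so each pass stays short enough to finish before a timeout.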

And there is even an option to import posts instead of pages — this is experimental and probably wouldn’t work. For instance it needs to detect category meta tags in the XHTML and add them to the post. The add to post code is half there.

[1] http://dita.xml.org/book/getting-started
[2] http://dita.xml.org/wiki/the-dita-open-toolkit
[3] http://comox.textdrive.com/pipermail/wp-docs/2009-January/001890.html
[4] http://comox.textdrive.com/pipermail/wp-docs/2009-January/001862.html

== Installation ==

Copy this file to the wp-admin/import folder. It is not a plugin.


36 thoughts on “DITA to WordPress Import Tool”

twincascos on Saturday, 7 February 2009 at 13:50 said:
I’ve just tested this with the test site and as you probably know it worked quite well. Do you have plans to add any functionality to the plugin? I’m not sure what I would have to do to prepare a regular html site for this to work. The dita.list file with the demo site for example, how do I create one? is it needed? The demo site also is a frame set with toc, that’s easy enough to create for an html site in prep for import, but again, is it needed to be a frame set?
Also the dita.list file refs: user.input.dir=/home/mike/Desktop/DITA/DITA-OT1.4/samples
which I guess isn’t needed.
Lastly, is there any way that you could have the images moved to the wp upload dir, maybe all into a new folder called dita_import?
- mike on Sunday, 8 February 2009 at 14:13 said:
  Hi.
  I’m glad you were able to get the importer working.
  The dita.list file is simply an artifact of the DITA process; it is not used. Similarly, the index.html and toc.html files are generated by the DITA process to create a complete (if plain) web help system; they are not used by the importer either.
  The importer very specifically expects the XHTML output of the DITA process. The XHTML files must be valid XML files, which is why I specify XHTML, and I expect the URLs between pages must be relative ones. Otherwise I don’t think there is anything special. If things like the short description are not present, it will probably just carry on.
  As for the images, it would be best if they were added as WordPress attachments, but so far I haven’t looked into that. I may have a look later today at making it copy them to wp-content/uploads as a starting point.
Pingback: Merging Worlds: DITA and WordPress | I'd Rather Be Writing - Tom Johnson
chrys on Monday, 9 February 2009 at 16:42 said:
Thanks. Haven’t taken it out of the package you provided – but, I’m anxious to watch your magic.
Bob Kauten on Tuesday, 10 February 2009 at 07:08 said:
Hi Mike,
You’ve done a great job in creating a DITA-import utility. I watched it work in Tom Johnson’s video.
I’m learning to use DITA, and your work is a very valuable tool.
I’m wondering if there is sensitivity to the path of the web sample which you included. I’ve found that ditahelp.php can find the web sample folder if I place it in wp-content. In that case, I type ../wp-content/web in the DITA help directory text field. ditahelp.php does find the directory, but it hangs:
* Clearing staging table.
* Processing directory: ../wp-content/web
Should I place the web sample directory elsewhere? I’m using Yahoo webhosting. ditahelp.php doesn’t seem to recognize the DITA help directory path if I place it farther away. I don’t have access to the actual path on the yahoo server, which is why I expressed the path in relative terms.
Thank you for any advice you can offer. Again, great work.
Bob Kauten
- mike on Wednesday, 11 February 2009 at 10:26 said:
  Hi Bob,
  I saw your other comment about PHP versions being the problem. But you do raise an important point.
  The path in which you upload your DITA output must be accessible to the user or process the web server runs as. For shared hosting that pretty much means a publicly visible web directory, e.g. wp-content.
  But if you have more control or options, e.g. a dedicated server, a local machine, or an internal server, then you can be more flexible.
Bob Kauten on Tuesday, 10 February 2009 at 23:14 said:
Hi Mike,
I found the script.log for ditahelp.php, and it says:
“PHP Fatal error: Cannot instantiate non-existent class: recursivedirectoryiterator in /blog/wp-admin/import/ditahelp.php on line 360”
The recursivedirectoryiterator class was included in PHP 5.0, and Yahoo webhosting is running PHP Version 4.3.11. I believe that explains the difficulty.
Thanks,
Bob Kauten
- mike on Wednesday, 11 February 2009 at 10:21 said:
  Hi Bob,
  It does mention in the readme and the page above that it uses PHP 5. I’m afraid it will only work with that version. Though it is possible to do the directory walking a different way in PHP 4, the XML parsing stuff won’t work, so there’s no point.
Lisa Dyer on Monday, 23 February 2009 at 18:19 said:
Hi Mike,
Congratulations on creating a great tool for the DITA community! If you’re interested in distributing your DITA2Wordpress tool (or at least having it listed as a resource) under the open-source DITA2Wiki Project on SourceForge, please contact me at lisa dot dyer at lombardi dot com.
http://sourceforge.net/projects/dita2wiki/
The DITA2Wiki Project promotes best practices and tools for marrying DITA with wikis and other Web 2.0 apps. Distributions currently include the DITA2Confluence tool (which bypasses the DITA OT to generate wiki output directly from DITA).
Cheers,
– lisa
bal on Tuesday, 24 February 2009 at 08:27 said:
Hi,
I’m using this dita importer tool. Each and every step is matching with demo video. But finally content is not appearing in my blog. How do I rectify this?
Thanks in advance
Bal
- mike on Thursday, 12 March 2009 at 10:07 said:
  Hi Bal,
  The import tool creates pages. Do you have a theme which lists pages? Try switching to the default theme to see whether they appear. They should be in the sidebar on the home page.
Pingback: dita2wordpress – Import Tool installieren und anwenden » Ditalog 0.1
Pingback: Wordpress feat. DITA: Publication d'articles à partir de sources XML | Docster
Pingback: What are you doing with DITA? | DITA Chicks Blog
Wim Hooghwinkel on Wednesday, 2 November 2011 at 11:25 said:
Hi Mike,
I downloaded and tested your tool and it works as expected. I was wondering if there are any new developments or new insights in using this tool?
- Mike Little on Thursday, 3 November 2011 at 09:31 said:
  Hi Wim,
  No, I haven’t worked on it for years. I don’t work with DITA any more, so the need is not there.
Tom Johnson on Friday, 30 November 2012 at 22:28 said:
I just tried activating the plugin on a new WordPress site and activating it triggered an error. Is there any way you could update the plugin so that it still functions with WordPress? It is a valuable tool.
Tom
Tom Johnson on Monday, 3 December 2012 at 06:14 said:
To get the tool to work, first, recognize that it’s not a plugin.
Upload the ditahelp.php file to the wp-admin folder.
From that same folder, download admin.php
In the admin.php file, below this line: require_once(ABSPATH . 'wp-admin/includes/admin.php');
add this: require_once('ditahelp.php');
You will now see a Dita help import option in Tools in the WordPress dashboard.
Note that this tool doesn’t actually import DITA files. It imports the web help output from DITA. So if you’re exporting a DITA output from Flare, you’ll need to convert the DITA files to web help via the DITA Open Toolkit (which isn’t hard).
APC on Friday, 14 December 2012 at 06:21 said:
Hi Tom,
I tried modifying admin.php file as you have mentioned above.
But it does not work in my case. The Dashboard does not get updated and it throws an error. If you could send me your email id, I could send you a screenshot of the error that I receive.
Regards,
APC
- Mike Little on Friday, 14 December 2012 at 12:07 said:
  Tom and APC,
  You really shouldn’t modify core WordPress files. It leads to maintenance issues as well as the possibility of breaking your whole site.
  As I mentioned to Tom in a private email, I may find time to look at updating this code in the New Year. A lot of it will need re-writing as the way WordPress handles imports has been completely rewritten (so that it can be done via a plugin, which this code isn’t).
  - Tom Johnson on Tuesday, 1 January 2013 at 00:12 said:
    Mike, I would be excited to see you update this plugin. If it’s still in your plans for the early part of 2013, definitely keep us posted with comments on this thread. Thanks again for your efforts with this plugin.
    - APC on Wednesday, 9 January 2013 at 05:11 said:
      Hi Mike,
      Can you please tell us when are you planning to update the code of the ‘plugin’ (DITA to WordPress importer) so that it will be compatible with the latest WordPress version?
      Regards,
      APC
      - Mike Little on Wednesday, 9 January 2013 at 11:00 said:
        As I mentioned earlier, I have no firm plans to do it. If I can find the time, I will.
Pingback: How to Import the Webhelp Output from a Help Authoring Tool into WordPress | I'd Rather Be Writing
Tom Johnson on Monday, 21 January 2013 at 22:39 said:
Mike, I wrote a post on using this Ditahelp.php tool to import a webhelp output from a help authoring tool such as Madcap Flare into WordPress. The import tool you created is really quite cool. Thanks.
Pingback: How to Import the Webhelp Output from a Help Authoring Tool into WordPress » eHow TO...
Tallman Brown on Thursday, 10 April 2014 at 20:35 said:
I was trying to use this importer following Tom Johnson’s posted instructions, but ran into trouble when the importer fails to insert the images.
Here’s what it tells me:
Importing file /inetpub/wwwroot/[sitefolder]/wp-content/uploads/[contentfolder]/[contentsubfolder]/[page].html
need to create directory C:/inetpub/wwwroot/[sitefolder]/C:/inetpub/wwwroot/[sitefolder]/wp-admin/inetpub/wwwroot/[sitefolder]/wp-content/uploads/Skins/Default/Stylesheets/Images
It looks like your pathway variables in your ditahelp.php get messed up on a Windows IIS server where, instead of content being stored in a \public_html\ folder, they’re stored in an \inetpub\ folder. When it tries to write the image paths, it gets lost and can’t do it. This doesn’t seem to be a permissions issue.
My question: I know you’re not developing this importer anymore, but any ideas on how I can adjust your ditahelp.php pathway variables myself so the importer doesn’t break on image insert?
- Tom Johnson on Tuesday, 25 November 2014 at 17:22 said:
  I never tried the import tool on a Windows server. Sorry.
Karen Aidi on Tuesday, 13 May 2014 at 02:56 said:
I am currently authoring the online help for a complex advanced advertising system (enterprise software) in WordPress. I would like to be able to take various snippets of text like concepts or certain features or even entire topics (pages) and output them into a page document that can be re-used or re-purposed into Marketing documents, like a reports feature guide, or a user guide on a particular feature, or a configuration guide for a particular module, etc. Then, I’d like to be able to tie everything with a nice bow and PDF the output.
Is this possible or is it a pipe dream?
- Tom Johnson on Tuesday, 25 November 2014 at 17:21 said:
  WordPress’s strength isn’t with PDF output. If you want that, you should probably begin in DITA and output to HTML and PDF. You could import the HTML into WordPress following my instructions here: Import DITA into WordPress.
  There is another technique for single sourcing with WordPress that I described here: Using WP natively for single source publishing and conditional content.
tiru on Wednesday, 7 October 2015 at 11:15 said:
hii sir im trying to install your plugin dita to wordpress but it is not working. There is an error ( Fatal error: Call to undefined function register_importer()).
i’m using wordpress 4.3.1
- Mike Little on Monday, 12 October 2015 at 13:56 said:
  Unfortunately, WordPress changed the way imports work a few years ago, and this plugin no longer works. Sorry.
JGC on Wednesday, 9 May 2018 at 13:53 said:
Hello, it appears that it is still working nowadays and it’s actually brilliant.
Tested with your sample and only the toc file was imported.
After a couple of attempts just found out that pages are imported one by one: first toc then after clearing toc from the ftp admin and running again the dita importer page 2 is okay and on…
How can I get all the pages in only one import? Would it be possible that a loop is missing in the ditahelp.php?
Thanks for your help!
Mike Little on Monday, 14 May 2018 at 12:52 said:
Hi JGC,
I’m amazed to hear that it still works with current versions of WordPress. However, it does need to make two passes, because it cannot handle the page hierarchy in one pass (the child page may be loaded before the parent). It also figures out updates versus new pages in the second pass.
The details of how it works are documented above and in the readme file.
- JGC on Wednesday, 16 May 2018 at 12:06 said:
  Hello Mike thanks for your reply,
  Well two passes would be okay but the script just stops after importing page number 1 (see hosted image as my website).
  As a result every single page from the map needs to be imported one by one. (I’m not that php aware but i’ll try to dive into in the coming days.)
  Been facing hard times following another article saying that it works with webhelp (which is quite confusing with the ;chm export from dita ot).
  Please maintain this helpful post: It is still up-to-date thanks a lot cheers!
  - Mike Little on Friday, 1 June 2018 at 19:06 said:
    To be honest, I don’t even have a DITA environment set up any more.
    Unfortunately, I don’t have the time to look into this.
    But feel free to delve into the code, figure out how it works, and hopefully, you will find what is causing your issue. It’s all open source, GPL licensed. So go for it and if you fix it feel free to feed the changes back to me.
    One thing to try is my sample file. If it works with that, then it may be your specific DITA output. However, if it doesn’t work with my sample, then I doubt the code really works properly anymore.