Import HTML Module



June 04, 2007

The Import HTML module is a Drupal module that helps migrate content from an existing website into the Drupal content management system.

This module has the potential to save a lot of time and energy when importing an existing site. In many cases, migrating to a new website can mean copying and pasting large amounts of content, even though the pages are all formatted in a similar fashion. The import_html module can help automate much of this tedious work.

Testing of Two Websites

At OpenConcept, we focused on testing the Import HTML module on two main websites. The first website was a very straightforward, recently-designed website, with few pages and a consistent format. The second website was a more complex website, with a larger number of pages and less consistency.

In the first example, the simple website was imported without any major problems. Import_html did a good job of importing the information. It only missed one image, from the footer at the bottom of a page.

In the second example, the more complicated website was imported with partial success. In most cases, a small amount of CSS styling was all that was required to make the imported pages usable. In a few cases, the right-hand links were removed from the page. Elsewhere, images disappeared because the links to them were not quite right.

A number of errors appeared while running the module on the larger website.  These included:

  • 10 errors in the form: "line 113 column 132 - Error: is not recognized!", which appeared at the top of the page, above the Drupal output
  • 8 errors in the form: "* user warning", related to HTMLTidy failing to parse the code, or failing to read or parse particular files
  • 12 errors in the form " too long and had to be truncated"
  • 1 page that did not appear to work at all
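Since several of these errors came from HTMLTidy, one option is to find the problem files in the local copy before running the import. The following is a minimal Python sketch, not part of import_html. It assumes the `tidy` command-line tool is installed and relies on its documented exit status (0 clean, 1 warnings, 2 errors). The function names and the `run` parameter are illustrative.

```python
import subprocess
from pathlib import Path


def tidy_status(path, run=subprocess.run):
    """Return HTML Tidy's exit status for one file (0 clean, 1 warnings, 2 errors)."""
    # -e: report errors and warnings only; -q: suppress non-essential output.
    result = run(["tidy", "-e", "-q", str(path)], capture_output=True, text=True)
    return result.returncode


def files_with_errors(root, run=subprocess.run):
    """List .html/.htm files under root for which Tidy reports errors."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            if tidy_status(path, run=run) >= 2:
                bad.append(path)
    return bad
```

Calling `files_with_errors("/path/to/local/copy")` returns the pages worth fixing by hand first. The `run` argument exists only so the check can be exercised without Tidy installed.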

Overall Comments

Although a website needs some tweaking after it is imported, we found that the Import HTML module could be quite useful for bringing content over from an older website, and it could save a lot of repetitive copying and pasting. It is a great idea, and we hope it continues to improve as it is developed further.

There is still room for improvement. For example, importing images wasn't working properly for us. The module appears to copy the images to the correct directory, but the links it generates do not make them show up.

The import itself is extremely fast. For example, it took only a couple of seconds to import the larger website and all of its pages.

The module does a decent job of stripping the content out of tables and similar layout markup to create stripped-down, CSS-ready code. Sometimes it appears to be overzealous in removing columns, however.

It might make sense to do some pre-processing with other tools before importing via import_html. For example, you may want to make sure the right-hand links are preserved first. I imagine it's easier to do this pre-formatting before the content is in Drupal.
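As one example of such pre-processing, the image links in the local copy can be checked before the import, so that references that are already broken are not mistaken for import problems. This is a minimal Python sketch using only the standard library. It assumes the site has been copied to a local directory and that pages end in .htm or .html. The class and function names are illustrative.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(site_root):
    """Return (page, src) pairs whose local image file does not exist."""
    site_root = Path(site_root)
    missing = []
    for page in sorted(site_root.rglob("*.htm*")):
        if not page.is_file():
            continue
        collector = ImgSrcCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for src in collector.sources:
            parsed = urlparse(src)
            if parsed.scheme or parsed.netloc:
                continue  # external or data: image, nothing to check on disk
            rel = unquote(parsed.path)
            if not rel:
                continue
            if rel.startswith("/"):
                target = site_root / rel.lstrip("/")  # site-absolute path
            else:
                target = page.parent / rel  # relative to the page itself
            if not target.is_file():
                missing.append((page, src))
    return missing
```

Running `missing_images` on the local copy before the import, and again on a saved copy of the imported pages, gives a rough list of which image links broke during the import.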

Despite some of the difficulties found in using the module, it can still save time, particularly when importing a large site.

Process Notes

You first need to copy the website to a local directory on the server before running the Import HTML module on it.

We needed to change the permission settings on the files/imported directory to 775 in order to use the module (chmod 775 files/imported).
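If the import is part of a scripted migration, the same permission change can be applied and verified in code. This is a minimal Python sketch, assuming a Unix-like server. The function name is illustrative, and the path is whatever your Drupal installation uses for files/imported.

```python
import os
import stat


def ensure_mode(path, mode=0o775):
    """Set the permission bits on path to mode if they differ; return the resulting bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current != mode:
        os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, `ensure_mode("files/imported")` leaves the directory readable, writable and searchable by owner and group, and readable and searchable by everyone else.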

To delete a page once it has been imported, you need to delete both the page itself and its menu item. Even then, the imported file appears to remain in the files/imported directory on the server and needs to be deleted from there manually.
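The manual cleanup can be made less error-prone by listing the leftover files first. This is a minimal Python sketch, assuming you keep your own list of the relative paths that still belong to live pages. The module does not provide such a list, and the function names here are illustrative.

```python
from pathlib import Path


def leftover_files(imported_dir, keep):
    """Return files under imported_dir whose relative path is not in keep."""
    imported_dir = Path(imported_dir)
    keep = {Path(k).as_posix() for k in keep}
    return sorted(
        p
        for p in imported_dir.rglob("*")
        if p.is_file() and p.relative_to(imported_dir).as_posix() not in keep
    )


def delete_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True, only report what would be deleted."""
    selected = list(paths)
    if not dry_run:
        for p in selected:
            p.unlink()
    return selected
```

Review the output of `leftover_files("files/imported", keep=[...])` before passing it to `delete_files(..., dry_run=False)`, since anything not in the keep list is treated as a leftover.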

About The Author

Mike Gifford is the founder of OpenConcept Consulting Inc, which he started in 1999. Since then, he has been particularly active in developing and extending open source content management systems to allow people to get closer to their content. Before starting OpenConcept, Mike had worked for a number of national NGOs including Oxfam Canada and Friends of the Earth.