WordPress import with wpautop

While working on a large-ish post import, about 3,500 posts, I ran into a big defect in the WordPress import utility. The gist of the problem is that double newlines (aka paragraph breaks) get converted to single newlines. It doesn't happen every time, but it is common. What would have become an automatic <p>…</p> is turned into a <br /> instead, which completely screws up the formatting of a post. See “double line breaks changed to single line breaks after importing xml file” at WordPress.org for the gory details.

In a post in that thread, I volunteered a utility that uses WordPress’s own wpautop function to pre-wrap the content with the appropriate <p> tags before the XML file is fed to the import utility. The WordPress components are omitted from the code below, but I’ve stashed the entire fix-wordpress-export-wpautop.php file away for you to download later.

What this code does:
  • Read through the input on STDIN.
  • Convert Windows (\r\n) and classic Mac OS (\r) line endings to the one true *nix-style newline (\n).
  • For each post’s content, which is found between <content:encoded>…</content:encoded> XML tags in the input, accumulate that content spanning multiple lines into one buffer.
  • Call wpautop on that buffer to change the content so that paragraph tags are correctly inserted around the text blocks (and only the text blocks) that need them.
  • Write the whole file out to STDOUT.
I hope this utility helps someone. It took a little while to code, preceded by lots of confusion, investigation, and pondering.
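<?php

// wpautop() must already be defined at this point, e.g. copied in from
// WordPress's wp-includes/formatting.php (omitted here).

$accum  = false; // true while inside a multi-line <content:encoded> block
$buffer = '';

$open_re  = '/^\s*<content:encoded><!\[CDATA\[/';
$close_re = '/\]\]><\/content:encoded>\s*$/';

// Compare against false so a final line containing only "0" does not end the loop early.
while ( ( $line = fgets( STDIN ) ) !== false ) {
    // Normalize Windows (\r\n) and classic Mac OS (\r) line endings to \n.
    $line = preg_replace( '/\r\n?/', "\n", $line );

    $start = false;
    $end   = false;
    if ( preg_match( $open_re, $line ) ) {
        $line  = preg_replace( $open_re, '', $line );
        $start = true;
    }
    if ( preg_match( $close_re, $line ) ) {
        $line = preg_replace( $close_re, '', $line );
        $end  = true;
    }

    if ( $start && $end ) {
        // Content fits on one line: put back the wrapper that was stripped above.
        echo '<content:encoded><![CDATA[' . wpautop( $line ) . ']]></content:encoded>' . "\n";
    } elseif ( $start ) {
        // First line of a multi-line block: start accumulating.
        $accum  = true;
        $buffer = $line;
    } elseif ( $end ) {
        // Last line of the block: format the whole buffer and write it out.
        // The closing regex also removed the newline, so add it back.
        $accum   = false;
        $buffer .= $line;
        echo '<content:encoded><![CDATA[' . wpautop( $buffer ) . ']]></content:encoded>' . "\n";
    } elseif ( $accum ) {
        $buffer .= $line;
    } else {
        // Anything outside <content:encoded> passes through unchanged.
        echo $line;
    }
}

exit(0);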
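
If you want to try the idea on a small sample without WordPress loaded, here is a minimal sketch. It is not the line-by-line script above: it processes the whole export as one string with preg_replace_callback, so the full file has to fit in memory. The names fix_export and simple_autop are made up for illustration, and simple_autop is only a crude stand-in for wpautop, which handles many more cases.

```php
<?php

// Hypothetical stand-in for wpautop(), for experimenting only: wraps
// blank-line-separated blocks in <p> tags and turns the remaining single
// newlines into <br />.
function simple_autop( string $text ): string {
    $blocks = preg_split( '/\n\s*\n/', trim( $text ), -1, PREG_SPLIT_NO_EMPTY );
    $out    = '';
    foreach ( $blocks as $block ) {
        $out .= '<p>' . str_replace( "\n", "<br />\n", trim( $block ) ) . "</p>\n";
    }
    return $out;
}

// Hypothetical helper: run $autop over every <content:encoded> CDATA
// section in an export string and leave everything else untouched.
function fix_export( string $xml, callable $autop ): string {
    // Normalize Windows (\r\n) first, then classic Mac OS (\r) line endings.
    $xml = str_replace( array( "\r\n", "\r" ), "\n", $xml );

    $result = preg_replace_callback(
        '/<content:encoded><!\[CDATA\[(.*?)\]\]><\/content:encoded>/s',
        function ( array $m ) use ( $autop ) {
            return '<content:encoded><![CDATA[' . $autop( $m[1] ) . ']]></content:encoded>';
        },
        $xml
    );

    // preg_replace_callback() returns null on a PCRE error (e.g. backtrack limit).
    if ( null === $result ) {
        throw new RuntimeException( 'Regex failed: ' . preg_last_error() );
    }
    return $result;
}
```

Passing the formatter in as a callable means you can hand it 'wpautop' once WordPress's formatting functions are loaded, and the stand-in while experimenting.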

4 Responses

  1. Sorry if I’m being dense. This sounds like exactly what I’m looking for but I don’t understand how to use it. Where do I put the file and how do I feed it the XML file?

    1. Unzip the file into the same directory you put your XML file into, then run this on the command line:

      php fix-wordpress-export-wpautop.php < original-export-file.xml > new-export-file.xml

      Then run your import.
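
      Before running the import, it may be worth a quick check that no content blocks were lost or left unbalanced along the way. The sketch below only counts tags; it does not validate the XML. The helper names count_content_tags and exports_match are made up for illustration.

```php
<?php

// Hypothetical helper: count opening and closing <content:encoded> tags.
function count_content_tags( string $xml ): array {
    return array(
        'open'  => substr_count( $xml, '<content:encoded>' ),
        'close' => substr_count( $xml, '</content:encoded>' ),
    );
}

// Hypothetical helper: true when both files have the same tag counts
// and every opening tag has a closing tag.
function exports_match( string $before, string $after ): bool {
    $a = count_content_tags( $before );
    $b = count_content_tags( $after );
    return $a === $b && $b['open'] === $b['close'];
}

// Usage with the file names from the command above:
// var_dump( exports_match(
//     file_get_contents( 'original-export-file.xml' ),
//     file_get_contents( 'new-export-file.xml' )
// ) );
```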
