Any quick ways to strip XML of invalid characters?


Not really a WordPress question, but I've got a gig of XML data (a FileMaker export) that I'm trying to work with in PHP. This:

$xml = simplexml_load_file('tastingsdb_120000.xml');

gives me:

Warning: simplexml_load_file(): tastingsdb_120000.xml:1: parser error : PCDATA invalid Char value 5 in /home/lkrubner/dev/tastingnotes/lib/import_filemaker_database.php on line 84

Any quick suggestions for purging this string of invalid characters, so I can work with it as XML?


Also, I just tried the CDATA hack, and it didn't seem to work. I did this:

$bigStringToBeSavedToDisk = str_replace('<DATA>', '<DATA><![CDATA[ ', $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk = str_replace('</DATA>', ' ]]></DATA>', $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk .= '</document>';
file_put_contents("tastingsdb_".$i.".xml", $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk = '<document>';

got this:

Warning: simplexml_load_file(): tastingsdb_120000.xml:1: parser error : Sequence ']]>' not allowed in content in /home/lkrubner/dev/tastingnotes/lib/import_filemaker_database.php on line 86

Answers (1)


Maor Barazany answers:

HTML Tidy

It's a small command-line utility that takes invalid HTML or XML files and makes them valid.

See its documentation for all the available parameters.

Maor Barazany comments:

You can also try this one - the Remove-Invalid-XML-Characters script.
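
The linked script is not reproduced in this thread, so the following is only a minimal sketch of the general technique, not that script. It assumes the data is valid UTF-8 and removes every code point that the XML 1.0 "Char" rule forbids (which includes the "Char value 5" from the warning). The function name strip_invalid_xml_chars is made up for illustration.

```php
<?php
// Hypothetical helper (not the linked script): removes every code point that
// the XML 1.0 "Char" production forbids. Assumes $text is valid UTF-8.
function strip_invalid_xml_chars(string $text): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );
    if ($clean === null) {
        // preg_replace() returns null on failure, e.g. malformed UTF-8 with /u.
        throw new RuntimeException('preg_replace failed; input may not be valid UTF-8');
    }
    return $clean;
}
```

For a file small enough to fit in memory, the cleaned string could then be passed to simplexml_load_string() instead of calling simplexml_load_file() on the raw file.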

Maor Barazany comments:

The correct form of CDATA would be -
<![CDATA[data here]]>

What is the invalid markup in your XML? Maybe you can use str_replace to get it into the right CDATA format.
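
Two things are worth keeping in mind about the CDATA approach. First, XML 1.0 applies the same character rules inside a CDATA section as outside it, so wrapping the data cannot make a control character such as 0x05 legal. Second, the data itself must never contain "]]>", or the section ends early; that is one plausible cause of the "Sequence ']]>' not allowed in content" warning, though it cannot be confirmed without seeing the data. A minimal sketch of a safer wrapper, where cdata_wrap is a hypothetical name:

```php
<?php
// Hypothetical helper: wraps $text in a CDATA section, splitting any "]]>"
// inside the data across two sections so it cannot close the section early.
// It does NOT remove control characters; those stay invalid inside CDATA too.
function cdata_wrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}
```

A parser reads the two adjacent sections back as the original text, so the split is invisible to whatever consumes the XML.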

Lawrence Krubner comments:

I think I tried this:

<![CDATA[data here]]>

The amount of info is huge, which makes it tough to experiment.

Lawrence Krubner comments:

Is there an example of the command line use of Tidy?

Maor Barazany comments:

Try this one -

tidy -xml -o output.xml -utf8 -f error.log input.xml

-xml - treat the input as XML
-o output.xml - write the output to this file
-utf8 - optional; use UTF-8 for both input and output
-f error.log - optional; write errors and warnings to this file
input.xml - your input file

Lawrence Krubner comments:

The Remove-Invalid-XML-Characters script worked. Though, damn, running it on a 1.2 gigabyte file took 40 minutes. Luckily PHP scripts have no time limits on the command line. I was on the verge of switching over to Groovy to handle this project.
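
For files of this size, one option is to filter the data as a stream rather than loading it whole. The sketch below is an assumption-laden illustration, not the script used above: it assumes UTF-8 (or plain ASCII) input, where bytes below 0x20 never occur inside a multi-byte sequence, so it is safe to strip them chunk by chunk. It only drops the ASCII control bytes that XML 1.0 forbids and does not touch rarer invalid code points such as U+FFFE. The function name and the default chunk size are hypothetical, and no timing has been measured.

```php
<?php
// Hypothetical streaming filter: copies $in to $out in fixed-size chunks,
// dropping the ASCII control bytes that XML 1.0 forbids (everything below
// 0x20 except tab, line feed and carriage return). Returns the number of
// bytes removed. Assumes UTF-8 or ASCII input.
function strip_control_bytes_stream($in, $out, int $chunkSize = 1048576): int
{
    $removed = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing left to read
        }
        // Byte-level match (no /u flag), so chunk boundaries that fall in the
        // middle of a multi-byte character cause no trouble.
        $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $chunk, -1, $count);
        $removed += $count;
        fwrite($out, $clean);
    }
    return $removed;
}
```

Usage might look like strip_control_bytes_stream(fopen('tastingsdb_120000.xml', 'rb'), fopen('tastingsdb_120000.clean.xml', 'wb')), where the output filename is made up for illustration.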