Any quick ways to strip XML of invalid characters?

  • SOLVED

Not really a WordPress question, but I've got a gigabyte of XML data (a FileMaker export) that I'm trying to work with in PHP. This:


$xml = simplexml_load_file('tastingsdb_120000.xml');
print_r($xml);


gives me:


Warning: simplexml_load_file(): tastingsdb_120000.xml:1: parser error : PCDATA invalid Char value 5 in /home/lkrubner/dev/tastingnotes/lib/import_filemaker_database.php on line 84


Any quick suggestions for purging this file of invalid characters, so I can work with it as XML?

UPDATE:

I also tried the CDATA hack, but it didn't seem to work. I did this:

// Excerpt from inside a larger block: wrap each <DATA> value in CDATA,
// close the document, write this chunk to disk, then start the next chunk.
$bigStringToBeSavedToDisk = str_replace('<DATA>', '<DATA><![CDATA[ ', $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk = str_replace('</DATA>', ' ]]></DATA>', $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk .= '</document>';
file_put_contents("tastingsdb_".$i.".xml", $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk = '<document>';
}


I got this:


Warning: simplexml_load_file(): tastingsdb_120000.xml:1: parser error : Sequence ']]>' not allowed in content in /home/lkrubner/dev/tastingnotes/lib/import_filemaker_database.php on line 86

Answers (1)

2010-10-20

Maor Barazany answers:

[[LINK href="http://tidy.sourceforge.net/"]]HTML Tidy[[/LINK]]


It's a small command-line utility that takes invalid HTML or XML files and makes them valid.

See the documentation at the link for all the available parameters.


Maor Barazany comments:

You can also try this one - [[LINK href="http://www.phpedit.net/snippet/Remove-Invalid-XML-Characters"]]http://www.phpedit.net/snippet/Remove-Invalid-XML-Characters[[/LINK]]
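
For reference, the general idea behind that kind of cleanup can be sketched in a few lines of PHP. This is not the linked snippet itself; the function name strip_invalid_xml_chars is hypothetical, and the sketch assumes the input is valid UTF-8. It keeps only characters in the XML 1.0 "Char" range, so a control character such as 0x05 (the one in the warning above) is dropped.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the linked snippet).
// Keeps only characters allowed by the XML 1.0 "Char" production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Assumption: $text is valid UTF-8 (the /u modifier requires it).
function strip_invalid_xml_chars(string $text): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );
    if ($clean === null) {
        // preg_replace() returns null on failure, e.g. malformed UTF-8 input.
        throw new RuntimeException('Cleanup failed, PCRE error code ' . preg_last_error());
    }
    return $clean;
}
```

Usage would look something like strip_invalid_xml_chars(file_get_contents('input.xml')), where input.xml is a placeholder name. Note that this reads the whole file into memory, which may not be practical for very large exports.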


Maor Barazany comments:

The correct form of CDATA would be -
<![CDATA[data here]]>

What is the invalid markup in your XML? Maybe you can fix it by using str_replace to produce the correct CDATA format.
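
Two caveats about the CDATA route are worth keeping in mind. First, a CDATA section cannot itself contain the sequence ]]>, so any occurrence in the data has to be split across two sections. Second, CDATA only disables markup parsing; a character that XML 1.0 forbids outright, such as 0x05, is still invalid inside it. The exact cause of the ']]>' warning in the question cannot be determined from the post, so the following is only a minimal sketch of safe wrapping, with a hypothetical helper name (cdata_wrap).

```php
<?php
// Hypothetical helper: wraps arbitrary text in a CDATA section.
// Any literal "]]>" in the text is split so that the terminator never
// appears inside a single section: "]]>" becomes "]]" + "]]><![CDATA[" + ">".
// Note: this does NOT remove characters that XML 1.0 forbids (e.g. 0x05).
function cdata_wrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}
```

With this, each value would be wrapped individually before being placed between its <DATA> tags, rather than running str_replace over the tags of the whole document.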


Lawrence Krubner comments:

I think I tried this:

<![CDATA[data here]]>


The amount of info is huge, which makes it tough to experiment.


Lawrence Krubner comments:

Is there an example of the command line use of Tidy?


Maor Barazany comments:

Try this one:



tidy -xml -o output.xml -utf8 -f error.log input.xml


-xml - treat the input as XML
-o output.xml - specify the output file
-utf8 - optional, sets the encoding for both input and output
-f error.log - optional, writes errors to a file
input.xml - your input file


Lawrence Krubner comments:

The Remove-Invalid-XML-Characters script worked. Though, damn, running it on a 1.2 gigabyte file took 40 minutes. Luckily PHP scripts have no time limits on the command line. I was on the verge of switching over to Groovy to handle this project.
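
For files this large, one option is to filter in fixed-size chunks instead of loading everything at once, which keeps memory use flat. The sketch below is an untested-on-this-data illustration, not the script that was used: the function name and the 1 MB chunk size are assumptions, and it only removes the ASCII control bytes that XML 1.0 forbids (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F). That is safe to do chunk by chunk for UTF-8 or ASCII-compatible single-byte encodings, because those byte values never occur inside a multi-byte UTF-8 sequence. It does not catch other invalid input, such as malformed UTF-8.

```php
<?php
// Hypothetical helper: copies $inPath to $outPath, dropping the ASCII control
// bytes that XML 1.0 does not allow. Tab (0x09), LF (0x0A) and CR (0x0D) stay.
// Assumption: the file is UTF-8 or an ASCII-compatible single-byte encoding.
function strip_control_bytes_stream(string $inPath, string $outPath, int $chunkSize = 1048576): void
{
    $in = fopen($inPath, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $inPath for reading");
    }
    $out = fopen($outPath, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $outPath for writing");
    }

    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break; // read error: stop copying
        }
        // Byte-wise match (no /u modifier), so chunk boundaries cannot split a match.
        fwrite($out, preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $chunk));
    }

    fclose($in);
    fclose($out);
}
```

A call such as strip_control_bytes_stream('input.xml', 'clean.xml') (both file names are placeholders) would produce a cleaned copy that can then be passed to simplexml_load_file or a streaming parser.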