Not really a WordPress question, but I've got a gigabyte of XML data (a FileMaker export) that I'm trying to work with in PHP. This:
$xml = simplexml_load_file('tastingsdb_120000.xml');
print_r($xml);
gives me:
Warning: simplexml_load_file(): tastingsdb_120000.xml:1: parser error : PCDATA invalid Char value 5 in /home/lkrubner/dev/tastingnotes/lib/import_filemaker_database.php on line 84
Any quick suggestions for purging the invalid characters from this file, so I can work with it as XML?
UPDATE:
I also tried the CDATA hack, but it didn't seem to work. I did this:
$bigStringToBeSavedToDisk = str_replace('<DATA>', '<DATA><![CDATA[ ', $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk = str_replace('</DATA>', ' ]]></DATA>', $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk .= '</document>';
file_put_contents("tastingsdb_".$i.".xml", $bigStringToBeSavedToDisk);
$bigStringToBeSavedToDisk = '<document>';
}
and got this:
Warning: simplexml_load_file(): tastingsdb_120000.xml:1: parser error : Sequence ']]>' not allowed in content in /home/lkrubner/dev/tastingnotes/lib/import_filemaker_database.php on line 86
Maor Barazany answers:
[[LINK href="http://tidy.sourceforge.net/"]]tidy html[[/LINK]]
It's a small command-line utility that takes invalid HTML or XML files and makes them valid.
See the documentation inside the link for all the parameters available.
Maor Barazany comments:
You can also try this one - [[LINK href="http://www.phpedit.net/snippet/Remove-Invalid-XML-Characters"]]http://www.phpedit.net/snippet/Remove-Invalid-XML-Characters[[/LINK]]
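For readers who cannot reach the link, here is a minimal sketch of the general technique. It is not the code from the linked snippet, and the function name strip_invalid_xml_chars is made up for illustration. It assumes the input is valid UTF-8 and keeps only the characters that XML 1.0 allows (tab, line feed, carriage return, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).

```php
<?php
// Hypothetical helper (NOT the linked snippet's code): removes every
// character that XML 1.0 forbids. Assumes $text is valid UTF-8.
function strip_invalid_xml_chars(string $text): string
{
    // Negated class: anything outside the XML 1.0 "Char" ranges is dropped.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );

    // With the /u modifier, preg_replace() returns null on malformed UTF-8.
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    return $clean;
}

// "\x05" is the character the parser complained about ("Char value 5").
echo strip_invalid_xml_chars("Cabernet\x05 Sauvignon"), "\n"; // Cabernet Sauvignon
```

This version works on a whole string in memory, so on a very large export it would have to be applied record by record or chunk by chunk.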
Maor Barazany comments:
The correct form of CDATA would be -
<![CDATA[data here]]>
What is the invalid markup in your XML? Maybe you can fix it by using str_replace to produce correctly formed CDATA.
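A small sketch of wrapping text in CDATA safely, with two caveats. First, a literal ]]> inside the text would end the section early, so it has to be split across two sections. Second, CDATA does not make control characters legal: a character such as 0x05 is still rejected by an XML 1.0 parser inside a CDATA section, so this alone would not cure the original "invalid Char value 5" error. The function name wrap_in_cdata is hypothetical.

```php
<?php
// Hypothetical helper: wraps text in a CDATA section and splits any literal
// "]]>" into two adjacent sections so it cannot close the section early.
function wrap_in_cdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

echo wrap_in_cdata('a ]]> b'), "\n"; // <![CDATA[a ]]]]><![CDATA[> b]]>
```

A parser reads the two adjacent sections back as the original text, a ]]> b.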
Lawrence Krubner comments:
I think I tried this:
<![CDATA[data here]]>
The amount of info is huge, which makes it tough to experiment.
Lawrence Krubner comments:
Is there an example of the command line use of Tidy?
Maor Barazany comments:
Try this one -
tidy -xml -o output.xml -utf8 -f error.log input.xml
-xml - treat the input as XML
-o output.xml - specify output file
-utf8 - optional, to specify the encoding for both input & output
-f error.log - optional to write errors to file
input.xml - your input file
Lawrence Krubner comments:
The Remove-Invalid-XML-Characters script worked. Though, damn, running it on a 1.2 gigabyte file took 40 minutes. Luckily PHP scripts have no time limits on the command line. I was on the verge of switching over to Groovy to handle this project.
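For a file of this size, one option is to filter it as a stream instead of loading it all at once. The sketch below is an assumption-laden illustration, not the script used above, and it has not been timed against it. It assumes the file is UTF-8 (or a single-byte encoding), where the forbidden control bytes never occur inside a multi-byte character, so they can be removed chunk by chunk. It only removes the ASCII control bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F, not other invalid code points. The function name and the 1 MB default chunk size are made up for illustration.

```php
<?php
// Hypothetical helper: copies $source to $target in fixed-size chunks,
// dropping the ASCII control bytes that XML 1.0 forbids.
// Assumes UTF-8 or a single-byte encoding. Returns the number of bytes removed.
function strip_control_bytes_streaming(string $source, string $target, int $chunkSize = 1048576): int
{
    // Forbidden bytes: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F (tab, LF and CR stay).
    $forbidden = [];
    foreach (array_merge(range(0x00, 0x08), [0x0B, 0x0C], range(0x0E, 0x1F)) as $code) {
        $forbidden[] = chr($code);
    }

    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open the input or output file');
    }

    $removed = 0;
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // Each forbidden value is a single byte, so chunk boundaries are safe.
        $clean = str_replace($forbidden, '', $chunk, $count);
        $removed += $count;
        fwrite($out, $clean);
    }

    fclose($in);
    fclose($out);

    return $removed;
}
```

It could be called as strip_control_bytes_streaming('tastingsdb_120000.xml', 'tastingsdb_120000_clean.xml'), where the output filename is only an example, and the cleaned file then passed to simplexml_load_file().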