I recently worked a bit with Nokogiri to parse some XML, and chose the XML behind the map for the Craftsmanship Manifesto. The map is here and the XML behind the map can be found here. I put the code on GitHub, and you can find it here.
I first tried parsing the XML the textbook way: bin/run.first.parser.sh calls lib/first_parser.rb. It’s a mess. As far as I can remember, and from what I can gather from the code, you have to go through both the Element and the Text classes to get at a single element's content. The file is full of comments, as my exploratory code always is. Needing two classes to read one element just seems wrong.
I then looked into using XPath. bin/run.show.parser.sh calls lib/show_parser.rb, which is the example from the Nokogiri site. I was able to parse the map with bin/run.first.path.parser.sh, which calls lib/first_path_parser.rb, and with bin/run.path.parser.sh, which calls lib/path_parser.rb. I had a problem with namespaces. I first tried doc.remove_namespaces!, but I did not like the idea of disabling namespaces. The document's elements had no namespace prefix, so I prepended “xmlns:” to all the element names in my XPath expressions and got it to work.
Eventually I decided to try JSON. I found a gem called crack, which can convert an XML document to JSON. That experiment is in bin/run.crack.is.whack.sh, which calls lib/crack_is_whack.rb. I first tried it on a small version of the file with bin/run.first.json.attempt.sh, which calls lib/first_json_attempt.rb. The whole map is parsed and output to CSV with bin/run.json.parser.sh, which calls lib/json_parser.rb.
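The JSON-to-CSV step might look roughly like the sketch below. It uses only Ruby's standard json and csv libraries; the JSON string stands in for whatever the crack conversion produces, and its shape and field names are assumptions, not the structure lib/json_parser.rb actually handles.

```ruby
require 'json'
require 'csv'

# Hypothetical JSON standing in for the converted map; the real shape may differ.
json = <<~JSON
  {"kml": {"Document": {"Placemark": [
    {"name": "Chicago", "description": "Example entry one"},
    {"name": "Austin",  "description": "Example entry two"}
  ]}}}
JSON

data = JSON.parse(json)

# After parsing, the document is plain hashes and arrays, so dig can walk it.
placemarks = data.dig('kml', 'Document', 'Placemark')

# Build the CSV in memory: a header row, then one row per placemark.
csv_text = CSV.generate do |csv|
  csv << %w[name description]
  placemarks.each { |pm| csv << [pm['name'], pm['description']] }
end

puts csv_text
```

Once the data is hashes and arrays there are no node classes or namespaces to think about.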
For this kind of extraction, JSON was a lot easier to work with than XML.
Image from Aurora Consurgens, a 15th century manuscript housed at Central Library of Zurich. Image from e-Codices. This image is assumed to be allowed under Fair Use.