I would like to programmatically extract the pseudo-code and instruction encodings from the Power ISA specification, so that I can generate an instruction decoder and a reference to test against. Are there any plans to release the source of the ISA specification, or at least a format more amenable to parsing than PDF?
Currently, the only viable option seems to be writing a program that extracts text from the PDF with a custom parsing algorithm. That approach is very error-prone and full of special cases, because the published PDF doesn’t retain the original reading order, whitespace, or superscript/subscript information. Instead, it’s basically just a list of instructions to draw particular characters at particular positions in particular fonts.
Using existing tools to convert the PDF to text doesn’t really help, because the reconstructed text is still missing a lot of semantic information: subscripts, superscripts, bold, italics, and so on. Converting the PDF to HTML doesn’t help either, because the generated HTML is likewise just a list of instructions to draw particular characters at particular positions in particular fonts.
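To illustrate the kind of heuristic I mean, here is a minimal Python sketch. It assumes the glyphs have already been pulled out of the PDF as (character, position, font size, width) records. The `Glyph` type, the function names, the thresholds, and the sample coordinates are all made up for illustration and don't come from any PDF library or from the actual spec.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one drawn character (not from any PDF library)."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (y grows upward, as in PDF user space)
    size: float   # font size, in points
    width: float  # advance width, in points


def _is_small(g, body_size):
    # Assumption: super/subscripts are set noticeably smaller than body text.
    return g.size < 0.9 * body_size


def group_lines(glyphs, body_size, tol=4.0):
    """Cluster glyphs into lines by baseline, returned top of page first."""
    lines = []  # list of (baseline, [glyphs])
    # Body-size glyphs go first so they define each line's baseline;
    # small glyphs then attach to a line within `tol` points of their own baseline.
    ordered = sorted(glyphs, key=lambda g: (_is_small(g, body_size), -g.y))
    for g in ordered:
        match = next((ln for ln in lines if abs(ln[0] - g.y) <= tol), None)
        if match is None:
            lines.append((g.y, [g]))
        else:
            match[1].append(g)
    return sorted(lines, key=lambda ln: -ln[0])


def render_line(baseline, glyphs, body_size, shift=1.0):
    """Rebuild one line left to right, guessing spaces and super/subscripts."""
    out = []
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        # Guess a space wherever the horizontal gap is large enough.
        if prev_end is not None and g.x - prev_end > 0.25 * body_size:
            out.append(" ")
        if _is_small(g, body_size) and g.y > baseline + shift:
            out.append("^{" + g.char + "}")
        elif _is_small(g, body_size) and g.y < baseline - shift:
            out.append("_{" + g.char + "}")
        else:
            out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


def reconstruct(glyphs, body_size):
    lines = group_lines(glyphs, body_size)
    return "\n".join(render_line(b, gs, body_size) for b, gs in lines)


# Made-up glyphs, deliberately listed out of reading order.
sample = [
    Glyph("3", 12.0, 98.0, 6.0, 3.5),    # smaller and lowered: a subscript
    Glyph("1", 28.5, 100.0, 10.0, 5.0),
    Glyph("S", 6.0, 100.0, 10.0, 6.0),
    Glyph("+", 19.0, 100.0, 10.0, 6.0),
    Glyph("R", 0.0, 100.0, 10.0, 6.0),
]

print(reconstruct(sample, body_size=10.0))  # RS_{3} + 1
```

Even in this toy case, every threshold (`tol`, `shift`, the 0.9 size ratio, the 0.25 gap factor) is a guess that would need tuning per font and per page layout, which is exactly why this approach turns into a pile of special cases.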
