Releasing Public Source Code or XML/HTML/etc. for the Power ISA specification

I would like to programmatically extract the pseudo-code and instruction encodings from the Power ISA specification, so that I can generate an instruction decoder and a reference model to test against. Are there any plans to release the source of the ISA specification, or at least a format more amenable to parsing than PDF?

Currently, it seems the only viable option is to write a program that extracts text from the PDF with a custom parsing algorithm. That approach is very error-prone and full of special cases, because the published PDF doesn’t really retain the original reading order, whitespace, or superscript/subscript information. Instead, it’s basically just a list of instructions to draw particular characters at particular positions in particular fonts.

Using existing tools to convert the PDF to text doesn’t really help, because the reconstructed text is still missing a lot of semantic information (subscripts, superscripts, bold, italic, etc.). Converting the PDF to HTML doesn’t help either, because the generated HTML is also basically just a list of instructions to draw particular characters at particular positions in particular fonts.
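To make the problem concrete, here is a minimal sketch of the kind of heuristic a custom parser ends up needing, assuming the glyphs of one line have already been pulled out of the PDF as positioned records. The `Glyph` type, the `reconstruct_line` function, the tolerance value, and the sample data are all hypothetical and not taken from any existing tool or from the specification itself.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float         # left edge, in points
    baseline: float  # baseline height, in points (larger = higher on the page)
    size: float      # font size, in points

def reconstruct_line(glyphs, body_size, tol=0.5):
    """Rebuild one line of text, marking scripts as ^{...} and _{...}.

    Assumption: body text uses `body_size`, and smaller glyphs whose
    baseline is shifted up or down are superscripts or subscripts.
    """
    # Take the most common baseline among body-size glyphs as the line baseline.
    # (Exact float comparison is good enough for a sketch, not for real input.)
    body = [g.baseline for g in glyphs if g.size >= body_size - tol]
    base = max(set(body), key=body.count)

    out = []
    mode = ""  # "", "^" or "_" depending on the run currently open
    # Draw order is arbitrary, so restore reading order by x position.
    for g in sorted(glyphs, key=lambda g: g.x):
        small = g.size < body_size - tol
        if small and g.baseline > base + tol:
            new_mode = "^"
        elif small and g.baseline < base - tol:
            new_mode = "_"
        else:
            new_mode = ""
        if new_mode != mode:
            if mode:
                out.append("}")
            if new_mode:
                out.append(new_mode + "{")
            mode = new_mode
        out.append(g.char)
    if mode:
        out.append("}")
    return "".join(out)

# Hypothetical glyphs, deliberately listed out of reading order.
sample = [
    Glyph("y", 20.0, 100.0, 10.0),
    Glyph("2", 16.0, 104.0, 7.0),
    Glyph("x", 10.0, 100.0, 10.0),
    Glyph("i", 26.0, 98.0, 7.0),
]
print(reconstruct_line(sample, body_size=10.0))
```

Even this toy version depends on hand-tuned thresholds and says nothing about multi-column layout, missing spaces, or bold/italic detection, which is where the special cases pile up.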

Did you try Docling? It should have no issue parsing the file and extracting what you are looking for. I’ll let others reply about the source code itself. :wink:

Thanks for the suggestion, but I’m specifically looking for something reproducible that doesn’t rely on AI, so that I know the output is reliable, doesn’t have copyright issues, and doesn’t require an inordinate amount of computing power. Also, some of the datasets used to train the models that Docling uses say they are for research only.