Suggestions on which HTML parser library is recommended for OpenMRS

Hi all

For my GSoC project, Radiology Reporting Enhancement, I will need to parse HTML files. I googled with the search query

best java library for parsing html files

The first six results talked only about jsoup. It has very good documentation, and after playing with it for a while I think it will suit my needs. But I would like to know whether the community has a recommendation on what to use.

Also, I would like to know whether I could parse these HTML files using just the HTML Form Entry module, since we will be using it to create the HTML forms that serve as report templates. In particular, can it parse general HTML files that were not created with the module?

cc @teleivo @judy @mogoodrich @dkayiwa

We don’t have a particular HTML parsing library in OpenMRS.

The HTML Form Entry module actually works purely with XML, and just uses the javax.xml APIs (e.g. DocumentBuilderFactory) to parse it.
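
For illustration, parsing with javax.xml looks roughly like the sketch below. The class name `FormXmlSketch`, the helper `inputNames`, and the sample markup are hypothetical and not taken from the module's source; only the javax.xml and org.w3c.dom calls are standard JDK APIs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: NOT the HTML Form Entry module's actual code.
class FormXmlSketch {

    // Collects the "name" attribute of every <input> element.
    // The markup must be well-formed XML, otherwise parsing fails.
    static List<String> inputNames(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList inputs = doc.getElementsByTagName("input");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < inputs.getLength(); i++) {
                Element input = (Element) inputs.item(i);
                names.add(input.getAttribute("name"));
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Markup could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up template fragment, for illustration only.
        String form = "<htmlform><section>"
                + "<input name=\"findings\"/><input name=\"impression\"/>"
                + "</section></htmlform>";
        System.out.println(inputNames(form));
    }
}
```

This works only as long as every tag is closed and properly nested, which is the main limitation when the input is arbitrary HTML.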

@darius thanks for your quick reply. I guess I will go ahead with the jsoup library then, since the XML-only processing that the HTML Form Entry module does will not be sufficient for us: the report templates are in HTML5.

HTML is not strictly a subset of XML, but if it’s well formed (i.e. XHTML) and the structure is known, you can easily use any XML parser or binding library (JAXB, XStream, Dom4J …). I’ve heard that JSoup is used mostly for web scraping.
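
To make the "well formed" condition concrete, here is a minimal sketch that checks whether a template would survive a strict XML parser. The class and method names (`WellFormedCheckSketch`, `isWellFormedXml`) and the sample markup are made up for this example; it assumes only the JDK's built-in javax.xml parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheckSketch {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser available", e);
        }
    }

    public static void main(String[] args) {
        // XHTML style: every tag is closed.
        System.out.println(isWellFormedXml("<html><body><p>Findings</p><br/></body></html>"));
        // Typical HTML5 style: <p> and <br> are left unclosed.
        System.out.println(isWellFormedXml("<html><body><p>Findings<br></body></html>"));
    }
}
```

The first call should succeed and the second should be rejected, because HTML5 allows unclosed tags such as `<br>` that a strict XML parser treats as errors. A lenient HTML parser such as jsoup is designed to accept that kind of markup.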

I have used JSoup in web scraping apps before; it’s a good choice.

I’ve looked at how these work, and I think jsoup is far easier to use.