Automated Digitization of Paper Records

Hi everyone! I’m a medical student at Brown University and want to share a project that I’m working on with Dr. Hamish Fraser (@hamish), the Brown Center for Biomedical Informatics, and partners at the Mae Tao Clinic in Thailand.

We’re developing a software tool to streamline data collection from paper records by automating the digitization process. The application connects to a user’s USB webcam to capture images of filled-in records, and then uses optical mark recognition to automatically extract categorical information from the images (e.g. checkboxes, multiple-choice items, etc.). For freeform elements such as handwritten digits, the user is visually guided through the process of manual entry (we hope to automate this using optical character recognition in the next update). Here’s a short demo video, in this case configured for the COVID-19 Case Report Form used by the Rhode Island Department of Health.
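For readers unfamiliar with optical mark recognition, the core idea can be sketched in a few lines: given the known position of each checkbox on the form template, measure how much of that region is dark and call it "checked" above some threshold. This is only an illustrative sketch in plain Python on a tiny synthetic image, not the tool's actual code; the names `fill_ratio` and `read_checkboxes`, the template layout, and the threshold values are all assumptions made up for the example.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # grayscale rows, 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (top, left, height, width), in pixels


def fill_ratio(image: Image, box: Box, dark_below: int = 128) -> float:
    """Fraction of pixels inside `box` that are darker than `dark_below`."""
    top, left, height, width = box
    dark = sum(
        1
        for row in image[top:top + height]
        for pixel in row[left:left + width]
        if pixel < dark_below
    )
    return dark / (height * width)


def read_checkboxes(
    image: Image, template: Dict[str, Box], min_fill: float = 0.3
) -> Dict[str, bool]:
    """Mark a box as checked when enough of its area is inked (assumed threshold)."""
    return {name: fill_ratio(image, box) >= min_fill for name, box in template.items()}


# Tiny synthetic "scan": a white 6x12 page with one 4x4 box inked in.
page = [[255] * 12 for _ in range(6)]
for r in range(1, 5):
    for c in range(1, 5):
        page[r][c] = 0

# Hypothetical template: field name -> where its checkbox sits on the page.
template = {"fever": (1, 1, 4, 4), "cough": (1, 7, 4, 4)}
print(read_checkboxes(page, template))  # {'fever': True, 'cough': False}
```

A real pipeline would first need to align the captured image to the template (for example with OpenCV) before the box coordinates mean anything; that step is left out here.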

We’re interested to hear your thoughts and feedback. In particular:

  1. Do you see this as a potentially valuable tool in your data collection process?
  2. Would you be interested in trying out the software on paper records at your institution?

We’re eager to collaborate with anyone interested in exploring the value of adding automated digitization to their information workflow. If you share with us a template for a paper record of interest (JPEG, PDF, any format really), we can configure the application to parse its layout and start the discussion around testing the software retrospectively on a set of already completed forms.

For context, this software was originally designed for the Health Information System team at the Mae Tao Clinic to help them digitize the paper delivery records in their Reproductive Health department. The results of initial on-site testing were very promising, but in light of recent COVID-19-related developments our pilot phase is currently on hold.

We believe a tool like this could streamline data collection in a number of different settings, particularly in cases where manual entry is a burden and data quality is critical. Looking forward to starting the conversation with you all!


Hi Sud,

This does look interesting!

As a non-profit, we have quite a few historical records, many going back to the ’80s, that we have an ongoing project to digitize. Primarily we put them into PDFs and hand-enter some metadata into a scanning program. Something like this would be useful.

We are also looking at something along these lines so that when we provide healthcare in austere environments where we can’t always use technology the way we like, we can fall back on paper knowing we can collect the data into a database later.

Couple of thoughts.

Some additional ways to get documents into the system would be helpful. With large numbers of records, using something like a webcam could be time-consuming. Using sheet-fed scanners, or even something like the Shine Ultra, might be a little faster. It might also be helpful to feed it a batch of image or PDF files and have it process those.

We have thought about a similar process that would display the page image while doing the correction you have done, and then also save the page image as part of the file.

Is the program something you’ve developed locally, or are you using something like Microsoft’s Cognitive Machine Learning/AI services? I’ve just recently started looking at these, and they do have a form recognizer that is in Preview but should be going GA shortly.

These cognitive services can run in small containers (Docker type) and be used locally or as part of a larger system. Since they have several of these services available, you might even be able to chain them together to do image clean-up/enhancement before running the image through a form recognizer.

Very interesting. Will follow your progress!!

Ken Richards


@docken Thanks for your thoughts! To answer your question about development, we’ve developed this all locally using only open-source packages (Flask, D3, and OpenCV). I didn’t know about these Microsoft services - will give them a look!

I echo your thoughts around other ways to get documents into the system. The tool currently allows you to upload pre-scanned forms as PDF/JPEGs, albeit one form at a time. Batch image upload is one of the next items on our agenda. We’d be happy to take a look at some of the historical records you mentioned if you’re interested in testing it out!
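As a rough illustration of what batch intake could look like (this is not how the tool works today, just a sketch), one option is to point the application at a folder of pre-scanned files and run the existing single-form step on each. The names `find_scans` and `process_batch`, and the `process_one` callback standing in for the single-form step, are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List

# File types mentioned above; extend as needed.
SCAN_SUFFIXES = {".pdf", ".jpg", ".jpeg"}


def find_scans(folder: Path) -> List[Path]:
    """Return the scanned forms in `folder`, sorted by name for a stable order."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SCAN_SUFFIXES
    )


def process_batch(
    folder: Path, process_one: Callable[[Path], object]
) -> Dict[str, object]:
    """Run `process_one` on every scan; record failures instead of stopping."""
    results: Dict[str, object] = {}
    for path in find_scans(folder):
        try:
            results[path.name] = process_one(path)
        except Exception as error:  # keep going so one bad scan doesn't sink the batch
            results[path.name] = f"failed: {error}"
    return results
```

Recording failures per file rather than aborting seems like the friendlier choice for large historical batches, since problem pages can be reviewed by hand afterwards.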

The webcam feature was designed for point-of-care use in Mae Tao Clinic’s reproductive health ward, where community health workers would scan and save delivery records shortly after filling them out or in small batches during their downtime. The clinic is very much the kind of healthcare setting I think you’re describing, where the computers/database systems are a key part of their information workflow, but paper records are still indispensable given concerns around reliability (power, network connectivity, etc.).


I believe the Microsoft services charge on a per-page (or per-1,000-page) basis, though I think the price is reduced while in Preview. So an open-source solution might still be a better bet.

How is the lighting when using the USB camera? Was one of the thoughts I had, although a simple lighting setup just for this purpose wouldn’t be a bad idea. The overhead scanner I’m looking at for other projects I’m working on might be a viable solution.

I’ll look and see what forms I have. Being that they are patient data, there are restrictions with sharing them, but I might have some training documents I can dig up.

Last, although this discussion is in the OpenMRS forums, I didn’t want to assume. Do you use OpenMRS for the actual record storage once scanned? This would fit with what we’ve been looking at. It would give us a way to get historical records into the system, but also move forward with a digital record system.

We are currently reviewing our “Patient Data Ecosystem” and deciding how best to get past and future records in that same system.


Hi Ken,

I agree, an open-source option for this kind of task is conspicuously missing…just one of the underlying motivations for our project :slight_smile:

As expected the software is sensitive to lighting when using the USB camera, but thankfully this hasn’t presented an issue in our testing. At the clinic we had the camera set up at a designated station where the lighting and camera-to-page distance were fixed. Once we got the station set up, minor changes in ambient lighting and other environmental factors weren’t really an issue. Like you suggested, I think a simple lighting setup would probably do the trick.

Using OpenMRS directly for the actual record storage is one of our ultimate goals, but we’re not quite there yet. Currently the tool outputs to CSV, which can be configured to match the table structure of your OpenMRS database. From there it looks like people have used a number of creative solutions for importing CSV data, from using the Initializer Module to importing directly using SQL Workbench.
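To give a sense of what "configured to match the table structure" could mean in practice, here is a minimal sketch using Python's standard `csv` module: a mapping from form field names to target column names drives both the header and the row layout. The function `forms_to_csv` and every field and column name here are hypothetical placeholders, not real OpenMRS columns or the tool's actual configuration format.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical mapping: extracted form field -> column name in the target table.
COLUMN_MAP = {
    "record_id": "patient_ref",
    "fever": "has_fever",
    "cough": "has_cough",
}


def forms_to_csv(forms: Iterable[Dict[str, object]], column_map: Dict[str, str]) -> str:
    """Write extracted forms as CSV text, with columns renamed per `column_map`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(column_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    for form in forms:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow(
            {column: form.get(field, "") for field, column in column_map.items()}
        )
    return buffer.getvalue()


extracted = [{"record_id": "A-001", "fever": True, "cough": False}]
print(forms_to_csv(extracted, COLUMN_MAP))
```

Keeping the mapping in one place means the same extraction output could be pointed at a different table layout by swapping the mapping alone.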

Let us know if you dig up some training documents that you’d like to share! We’d be happy to configure the tool to your forms and scope out whether this could be a viable solution for you. Also, selfishly, a fresh batch of forms would help with our ongoing testing. If things seem promising we’d be happy to look into the best way to connect to your Patient Data Ecosystem, since enabling flexible connectivity will also be critical as we develop this into a more generalizable tool.



Thanks Sud!

We just embarked on a review of our Patient Data requirements and, as part of that, whether we want to move from our current EMR to something else. OpenMRS is one we are considering, and we have been discussing this exact same use case.

Currently we use primarily paper, and then manually enter some of the info into our EMR system. Having a way to minimize the manual entry would save us a lot of time. Also for those charts that we don’t necessarily need to collect specific data from, having the ability to have an image of those pages (or photos, X-rays, etc.) would be extremely helpful. Having an all EMR system would be preferable, but we are often spread out over a wide area, and network connectivity between the stations or locations within a larger hospital is not always easy to do.

I will definitely follow this, and will see what we have for practice forms!

  • Ken