CSV import of OPD and IPD data is taking a very long time

One of the implementations tried to import historical OPD and IPD data using the CSV import functionality.

The data to import is around 350,000 lines and Bahmni does not seem to be able to handle this. Below are a few statistics:

  1.  Import the whole file at once. This fails because the file size is bigger than the configured max upload size for Apache.
    
  2.  Split the file in two, each new file containing ~175,000 lines. This ran for three days before both OpenMRS and OpenELIS crashed with an out-of-memory exception.
    
  3.  Upload only the first 50,000 lines. This is still running after three days; by now OpenMRS uses 11.1 GB of RAM, OpenELIS 4.5 GB and MySQL 2.8 GB, and the process advances more and more slowly.
    

Did anyone face a similar issue? Do we have a recommended maximum size for imports?


Why not break out the data files logically?

@angshuonline: what do you mean by this? We want to automate the process. We could of course split the file even further, but then we would have to upload hundreds of files by hand.

The CSV file contains historical encounter data which we want to have available in the system, so that the physicians can consult it without having to go and search for the paper files.

I understand the importance of importing historical data. You wouldn’t need to convince me :slight_smile:

Without knowing the file contents and what you are trying to upload, it's difficult to give any suggestion for improvement. 50,000 lines is not that big, and unless we get to see some sort of representative data, I would just be guessing. Have you done any sort of profiling to figure out where the time is being spent?

We have used imports a lot in many implementations, and although I can't recall what the largest file was, we didn't come across this scenario. The import process will try to resolve all sorts of dependencies.

  • What are you trying to import?
  • Have you split the patient, encounter/obs and lab data into separate files?

In general, assuming data dependencies within the file are not an issue and this is mainly a matter of processing a large volume of data:

  1. Think of logically splitting this into multiple files. Obviously not one file per patient, but say one per manageable batch of patients. (If you are saying there would be hundreds of files, I would like to understand on what basis you are saying that.)
  • Assuming this is a one-time activity (historical data), then even if it requires uploading, say, 20 files, wouldn't it be better to work within the current capability of the system? Handling smaller files will also help you narrow down and fix problems when errors occur (and there will be errors!).
  • A simple script for automating the file submission can take away the manual uploading effort; see the sketch after this list.
  2. Programmatically enhance the current capability to handle huge files.
  • The program would probably have to find ways of splitting the content and managing chunks server side. It is hard to say what the optimum time would be, because it depends on the data; even how to split the content to manage the import effectively depends on the size of the file.
  • Unlike the current process, where AFAIK the import happens within the request-response cycle, we would have to rework it to run jobs in the background and make the necessary changes to track their progress on the front end.
  • Errors would need to be handled and reported effectively (including cases with dependencies).
  3. Check if there are some common stupid gotchas in the code!
  • If you provide a representative file, we can try to identify whether there are any obvious problems in the code.
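
To illustrate the automation idea in point 1, here is a minimal sketch in Python: it splits the large CSV into chunks that each repeat the header row and submits them one at a time. The upload URL, credentials and chunk size below are hypothetical placeholders, not Bahmni's actual API; they would have to be adapted to the instance in question.

```python
#!/usr/bin/env python3
"""Split a large CSV into smaller chunks and submit them one by one.

Sketch only: UPLOAD_URL, AUTH and CHUNK_SIZE are placeholders and must be
replaced with whatever the target Bahmni/OpenMRS instance actually expects.
"""
import csv
import io
import time

import requests  # third-party: pip install requests

UPLOAD_URL = "https://bahmni.example.org/upload-csv"  # placeholder, not the real endpoint
AUTH = ("admin", "secret")                            # placeholder credentials
CHUNK_SIZE = 5000                                     # rows per chunk; tune to what the server copes with


def split_csv(path, chunk_size=CHUNK_SIZE):
    """Yield (index, csv_text) chunks, each repeating the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        buf, idx = [], 0
        for row in reader:
            buf.append(row)
            if len(buf) >= chunk_size:
                yield idx, _to_csv(header, buf)
                buf, idx = [], idx + 1
        if buf:
            yield idx, _to_csv(header, buf)


def _to_csv(header, rows):
    """Serialise one chunk back to CSV text with the header on top."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def submit_all(path):
    """POST each chunk as a file upload, stopping at the first failure."""
    for idx, chunk in split_csv(path):
        files = {"file": (f"chunk_{idx:04d}.csv", chunk, "text/csv")}
        resp = requests.post(UPLOAD_URL, files=files, auth=AUTH)
        print(f"chunk {idx}: HTTP {resp.status_code}")
        resp.raise_for_status()  # stop on the first failed chunk so it can be inspected
        time.sleep(5)            # give the server room to finish processing before the next chunk


if __name__ == "__main__":
    submit_all("historical_opd_ipd.csv")
```

Submitting chunks serially, failing fast on the first error and pausing between requests keeps only one import in flight at a time and makes it obvious which chunk a failure belongs to.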

In the end, you will have to decide on the cost/benefit trade-off.