Optimize CSV export by improving how CSV headers are created. #2695
Comments
The implementation plan needs to cater for:
Example of nested repeats
Example nested repeats export nested_repeats_2024_11_29_08_00_34_821888.csv
|
When looping through the data to build the repeat columns, we also build the attachment URI. We can avoid looping through the submissions to build the repeat columns, but it seems we still need to loop through them to build the attachment URI. |
The idea for this issue is to eliminate the first loop, the one that builds the columns. We shouldn't have to loop through the submissions for that. |
Noted. After further investigation, I confirmed that the first loop creates the columns and the second loop relies on those columns already existing. We can only eliminate the first loop through this proposal. |
We need to keep track of the maximum number of responses for each repeat question across the submissions received. For every new submission, we'll check each repeat question, count its responses in the incoming submission, and compare that count with the current known maximum. We'll update the maximum if the incoming value is greater. The number of occurrences will be stored in a new Example
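A minimal sketch of the bookkeeping described above, assuming a parsed submission is a nested dict in which every repeat group is a list of dicts. The names `repeat_counts` and `update_register` are hypothetical, and a plain dict stands in for wherever the counts are actually stored.

```python
def repeat_counts(node: dict, path: str = "") -> dict:
    """Return the largest number of responses seen for each repeat in one submission.

    Assumption: a repeat group is a list of dicts. Nested repeats are keyed by
    a slash-joined path such as "children/toys". Non-repeat groups (plain
    dicts) are not descended into in this sketch.
    """
    counts = {}
    for key, value in node.items():
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            name = f"{path}/{key}" if path else key
            counts[name] = max(counts.get(name, 0), len(value))
            # A nested repeat can differ per parent instance; keep the largest.
            for item in value:
                for nested, n in repeat_counts(item, name).items():
                    counts[nested] = max(counts.get(nested, 0), n)
    return counts


def update_register(register: dict, submission: dict) -> dict:
    """Raise the stored maximum wherever the incoming submission exceeds it."""
    for name, n in repeat_counts(submission).items():
        if n > register.get(name, 0):
            register[name] = n
    return register
```

Called once per incoming submission, `update_register` only ever raises a stored value, so the register holds the per-repeat maximum without any pass over older submissions.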
When generating the CSV, we use Constraints
Solution: If the submission being deleted had the highest number of any of the repeat responses, we delete the key
Solution: When generating the CSV, we check if
Solution: We need to perform atomic updates at the database level and not at the application level
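One way to picture an update that is atomic at the database level is a single upsert statement that lets the database pick the larger value. The sketch below is illustrative only: SQLite (via Python's standard `sqlite3` module, SQLite 3.24 or later for `ON CONFLICT`) stands in for the project's real database, and the table and column names are hypothetical.

```python
import sqlite3


def make_register(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per (form, repeat group).
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS repeat_register (
            form_id     INTEGER NOT NULL,
            repeat_name TEXT    NOT NULL,
            max_count   INTEGER NOT NULL,
            PRIMARY KEY (form_id, repeat_name)
        )
        """
    )


def record_count(conn: sqlite3.Connection, form_id: int, repeat_name: str, count: int) -> None:
    # The comparison happens inside one SQL statement, so there is no
    # read-then-write gap in application code for concurrent writers to hit.
    with conn:
        conn.execute(
            """
            INSERT INTO repeat_register (form_id, repeat_name, max_count)
            VALUES (?, ?, ?)
            ON CONFLICT (form_id, repeat_name)
            DO UPDATE SET max_count = MAX(max_count, excluded.max_count)
            """,
            (form_id, repeat_name, count),
        )
```

The design point is that the application never reads the current maximum before writing; it only sends the incoming count and the database keeps whichever is greater.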
The worst case in terms of performance is that we'll still need to loop through the submissions if |
Why the Could we consider using the |
Could we also keep a count of the number of records/submissions that have the highest number of |
Sure, we can use |
I had proposed this, but on second thought, a submission that had the highest number of repeat responses can be edited to reduce the repeats. Hence editing a submission should be handled similarly to deletion. |
I think this will work. This is because there could be other submissions with the same number of repeats and there is no need for deletion unless the counter is 0 |
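The counter idea discussed in this exchange could look roughly like the sketch below. It is an illustration, not the project's code: `RepeatMax`, `on_added`, and `on_removed` are hypothetical names, and the register is a plain in-memory dict.

```python
from dataclasses import dataclass


@dataclass
class RepeatMax:
    max_count: int  # highest number of responses seen for this repeat
    holders: int    # how many submissions currently have exactly max_count


def on_added(register: dict, name: str, count: int) -> None:
    entry = register.get(name)
    if entry is None or count > entry.max_count:
        register[name] = RepeatMax(count, 1)
    elif count == entry.max_count:
        entry.holders += 1


def on_removed(register: dict, name: str, count: int) -> bool:
    """Return True when the key was dropped and the maximum must be recomputed."""
    entry = register.get(name)
    if entry is None or count < entry.max_count:
        return False  # the removed submission did not hold the maximum
    entry.holders -= 1
    if entry.holders == 0:
        del register[name]  # no submission is known to hold the old maximum
        return True
    return False
```

An edit that lowers a submission's repeat count can be treated as `on_removed` with the old count. If that call returns True, the maximum has to be rebuilt from the remaining submissions before new counts are recorded.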
We later decided not to re-create the entire register when a submission is deleted/updated, as this would be performance intensive. The assumption is that there is no harm in having extra columns, as the values will be |
@kelvin-muchiri this approach has been implemented. Columns left over from deleted/updated submissions are now exported with n/a values rather than being removed from the generated export. |
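The padding behaviour can be pictured with Python's standard `csv.DictWriter`, whose `restval` argument fills any column a row has no value for. This is only a sketch: the column names are made up for illustration and `write_rows` is a hypothetical helper.

```python
import csv
import io


def write_rows(headers: list, rows: list) -> str:
    """Write rows under a fixed header list, padding missing cells with n/a."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=headers, restval="n/a", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

If the header list still contains a column for a third repeat response that no remaining submission has, every row simply gets `n/a` in that column.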
The current process of CSV exports in Ona Data requires parsing all records to determine the column headers, specifically for repeat data. There is an opportunity to optimise this process so that not all records need to be parsed to determine the required columns.
All possible columns for a form in Ona Data can be determined by reading the XForm definition of the form in XML or JSON, or even from the source XLSForm. For repeat questions, the number of repeats can be evaluated whenever a submission is received. This number can be stored in an accessible manner for each repeat group and updated whenever a higher number is identified. During a CSV export, there should then be no need to parse through all submissions to determine the depth of a repeat group.
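As a rough illustration of that goal, the headers for a form could be generated from the form definition plus the stored maximum per repeat group, with no pass over the submissions. Everything named here is an assumption: `build_headers` is hypothetical, the `group[index]/field` column format is chosen for illustration, and only single-level repeats are handled.

```python
def build_headers(plain_fields: list, repeats: dict, register: dict) -> list:
    """Expand repeat groups into indexed columns using known maximum counts.

    plain_fields: non-repeat column names taken from the form definition
    repeats:      repeat group name -> list of field names inside the group
    register:     repeat group name -> highest number of responses seen so far
    """
    headers = list(plain_fields)
    for name, fields in repeats.items():
        # A repeat with no recorded responses contributes no columns.
        for i in range(1, register.get(name, 0) + 1):
            headers.extend(f"{name}[{i}]/{field}" for field in fields)
    return headers
```

For example, a `children` repeat with fields `cname` and `age` and a stored maximum of 2 yields four repeat columns after the plain ones.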