
Optimize CSV export by improving how CSV headers are created. #2695

Open
ukanga opened this issue Sep 6, 2024 · 13 comments · May be fixed by #2755

@ukanga
Member

ukanga commented Sep 6, 2024

The current CSV export process in Ona Data parses all records to determine all the column headers, specifically for repeat data. There is an opportunity to optimise this process so that not all records need to be parsed to determine the required columns.

All possible columns for a form in Ona Data can be determined by reading the form's XForm definition in XML or JSON, or even the source XLSForm. For repeat questions, the number of occurrences can be evaluated whenever a submission is received, and the stored maximum for each repeat group can be updated in an accessible place whenever a newer, higher number is identified. During a CSV export, there should then be no need to parse all submissions to determine the depth of a repeat group.
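For illustration, here is a minimal sketch of the idea (the function, data shapes, and the bracketed column style are hypothetical, not Ona Data's actual export code): given the repeat questions read from the form definition and a stored maximum occurrence count per repeat group, the repeat headers can be expanded without scanning any submissions.

# Minimal sketch, not the Ona Data implementation: expand repeat column
# headers from the form definition plus the known maximum repeat counts.
def expand_repeat_headers(repeat_fields, max_counts):
    """repeat_fields: {"hospital_repeat": ["hospital"]} read from the XForm/XLSForm.
    max_counts: {"hospital_repeat": 2} maintained as submissions arrive."""
    headers = []
    for repeat, fields in repeat_fields.items():
        for index in range(1, max_counts.get(repeat, 1) + 1):
            headers.extend(f"{repeat}[{index}]/{field}" for field in fields)
    return headers


# ['hospital_repeat[1]/hospital', 'hospital_repeat[2]/hospital']
print(expand_repeat_headers({"hospital_repeat": ["hospital"]}, {"hospital_repeat": 2}))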

@ukanga ukanga self-assigned this Sep 6, 2024
@ukanga ukanga added this to the Sprint 17 of 2024 milestone Sep 20, 2024
@kelvin-muchiri
Contributor

kelvin-muchiri commented Nov 29, 2024

The implementation plan needs to cater for:

  1. Unlimited repeats
  2. Fixed repeats
  3. Dynamic repeats
  4. Conditional repeats
  5. Nested repeats

Example of nested repeats

{
   "_id":2533748,
   "_tags":[
      
   ],
   "_uuid":"2f684d87-d3c9-4bb2-adab-6e18aaa8bf89",
   "_notes":[
      
   ],
   "_edited":false,
   "_status":"submitted_via_web",
   "_version":"202411290755",
   "_duration":"",
   "_xform_id":22511,
   "_attachments":[
      
   ],
   "_geolocation":[
      null,
      null
   ],
   "_media_count":0,
   "_total_media":0,
   "formhub/uuid":"50ac8510ca7f4a4b979b4104f8f5b9fa",
   "_submitted_by":"johndoe",
   "_date_modified":"2024-11-29T07:56:59.824285+00:00",
   "hospital_repeat":[
      {
         "hospital_repeat/hospital":"Aga Khan",
         "hospital_repeat/child_repeat":[
            {
               "hospital_repeat/child_repeat/sex":"male",
               "hospital_repeat/child_repeat/name":"Zakayo",
               "hospital_repeat/child_repeat/birthweight":3.3
            },
            {
               "hospital_repeat/child_repeat/sex":"female",
               "hospital_repeat/child_repeat/name":"Melania",
               "hospital_repeat/child_repeat/birthweight":3.5
            }
         ]
      },
      {
         "hospital_repeat/hospital":"Mama Lucy",
         "hospital_repeat/child_repeat":[
            {
               "hospital_repeat/child_repeat/sex":"female",
               "hospital_repeat/child_repeat/name":"Winnie",
               "hospital_repeat/child_repeat/birthweight":3.1
            }
         ]
      }
   ],
   "meta/instanceID":"uuid:2f684d87-d3c9-4bb2-adab-6e18aaa8bf89",
   "_submission_time":"2024-11-29T07:56:59.491726+00:00",
   "_xform_id_string":"nested_repeats",
   "_bamboo_dataset_id":"",
   "_media_all_received":true
}

Example nested repeats export

nested_repeats_2024_11_29_08_00_34_821888.csv

  1. The implementation should be backwards compatible with existing submissions
  2. The implementation should work with exports for different form versions

@kelvin-muchiri
Contributor

When looping through the data to build the repeat columns, we also build the attachment URI. We can avoid looping through the submissions to build the repeat columns, but it seems we still need to do it to build the attachment URI.

@kelvin-muchiri
Contributor

kelvin-muchiri commented Nov 29, 2024

I noticed we are currently parsing through the submissions twice:

  1. First
  2. Second

We could maybe focus on fixing this first as we work on the larger enhancement.

@ukanga
Member Author

ukanga commented Nov 29, 2024

The idea for this issue is to eliminate the first pass, the one that builds the columns. We shouldn't have to do it.

@kelvin-muchiri
Contributor

The idea for this issue is to eliminate the first pass, the one that builds the columns. We shouldn't have to do it.

Noted. After further investigation, I confirmed that the first loop creates the columns and the second loop relies on the columns already being created by the first. Only the first loop can be eliminated through this proposal.

@kelvin-muchiri
Contributor

kelvin-muchiri commented Dec 2, 2024

We need to keep track of the maximum number of responses for each repeat question from the submissions received.

For every new submission, we'll check the repeat questions. For each one of them, we'll check the number of responses in the incoming submission and compare with the current known maximum value. We'll update the current maximum value if the incoming value is greater.

The number of occurrences will be stored in a new metadata JSON field in the XFormVersion model.

Example

{
  "repeats_export_cols": {
      "hospital_repeat": 2,
      "child_repeat": 2
  }
}

When generating the CSV, we use repeats_export_cols to build the columns.
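As a rough sketch of that update step (the helpers and data shapes below are hypothetical, not the Ona Data implementation): for each incoming submission we count the occurrences of every repeat group, including nested ones, and raise the stored maximum wherever the incoming count is higher.

# Minimal sketch (hypothetical helpers): derive per-repeat counts from one
# submission and merge them into the stored maxima.
def count_repeats(record, counts=None):
    """Return the maximum occurrences of each repeat group in a submission dict."""
    counts = counts if counts is not None else {}
    for key, value in record.items():
        # Repeat answers arrive as lists of dicts, e.g. "hospital_repeat": [...]
        if isinstance(value, list) and value and isinstance(value[0], dict):
            name = key.split("/")[-1]  # "hospital_repeat/child_repeat" -> "child_repeat"
            counts[name] = max(counts.get(name, 0), len(value))
            for entry in value:  # recurse into nested repeats
                count_repeats(entry, counts)
    return counts


def merge_repeats_export_cols(stored, incoming):
    """Raise the stored maxima where the incoming submission has more occurrences."""
    return {
        name: max(stored.get(name, 0), count)
        for name, count in {**stored, **incoming}.items()
    }

For the nested-repeats submission shown earlier, count_repeats yields {"hospital_repeat": 2, "child_repeat": 2}, which matches the repeats_export_cols example above.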

Constraints

  1. repeats_export_cols should be updated when a submission is deleted or updated. When a submission is updated, we can follow the same process as if it were a new submission. For deletion, say the deleted submission had the highest number of hospital_repeat occurrences, how do we determine the successor? Do we traverse all submissions?

Solution: If the submission being deleted had the highest number of any of the repeat responses, we delete the repeats_export_cols key and let it be re-built the next time a CSV export is generated.

  2. Backward compatibility with existing submissions.

Solution: When generating the CSV, we check whether repeats_export_cols is present; if not, we build it.

  3. How do we guarantee atomicity and consistency when updating the counts? If two processes (e.g. two uWSGI or Celery workers) modify hospital_repeat at the same time, we will get race conditions.

Solution: We need to perform atomic updates at the database level, not at the application level:

from django.db import connection


def update_repeats_export_col(pk, repeat, incoming_max):
    """
    Atomically update a repeat's export number of columns.
    """
    with connection.cursor() as cursor:
        cursor.execute(
            """
            UPDATE logger_xformversion
            SET metadata = jsonb_set(
                metadata,
                %s,
                GREATEST(
                    COALESCE((metadata->>%s)::int, 0),
                    %s
                )::text::jsonb
            )
            WHERE id = %s
            """,
            # The first parameter is the jsonb path; the driver adapts the
            # Python list to the text[] that jsonb_set expects.
            [[repeat], repeat, incoming_max, pk],
        )
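For example (hypothetical wiring, reusing the count_repeats sketch from earlier in the thread), submission processing could then call:

# Raise the stored maxima for every repeat group found in the new submission.
for repeat, incoming_max in count_repeats(submission_json).items():
    update_repeats_export_col(xform_version.pk, repeat, incoming_max)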

The worst case in terms of performance is that we will still need to loop through the submissions if repeats_export_cols is not present, but this should rarely happen for a form.
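A minimal sketch of that fallback (names are hypothetical; rebuild stands in for the existing pass over the submissions):

def get_repeats_export_cols(xform_version, rebuild):
    """Use the stored counts when present; otherwise fall back to rebuild()."""
    metadata = xform_version.metadata or {}
    counts = metadata.get("repeats_export_cols")
    if counts:
        return counts
    # Worst case: derive the counts by looping through the submissions,
    # as the export does today.
    return rebuild(xform_version)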

@ukanga
Member Author

ukanga commented Dec 2, 2024

The number of occurrences will be stored in a new metadata JSON field in the XFormVersion model.

Why the XFormVersion and not the XForm model? We don't currently refer to the XFormVersion during an export.

Could we consider using the MetaData class instead, which we have historically used for any additional information about a form?

@ukanga
Member Author

ukanga commented Dec 2, 2024

  1. repeats_export_cols should be updated when a submission is deleted or updated. When a submission is updated, we can follow the same process as if it were a new submission. For deletion, say the deleted submission had the highest number of hospital_repeat occurrences, how do we determine the successor? Do we traverse all submissions?

Solution: If the submission being deleted had the highest number of any of the repeat responses, we delete the repeats_export_cols key and let it be re-built the next time a CSV export is generated.

Could we also keep a count of the number of records/submissions that have the highest number for any repeat? Then, during a deletion, we reduce the counter and only delete the key when the counter reaches zero?

@kelvin-muchiri
Contributor

kelvin-muchiri commented Dec 3, 2024

Could we consider using the MetaData class instead, which we have historically used for any additional information about a form?

Sure, we can use the MetaData model. I had chosen XFormVersion because the form schema is tied to a specific version; during an export, we would check the version being exported and use the repeat metadata for that version.

@kelvin-muchiri
Contributor

kelvin-muchiri commented Dec 3, 2024

When a submission is updated, we can follow the same process as if it were a new submission

I had proposed this, but on further thought, a submission that had the highest number of repeat responses can be edited to reduce the repeats. Hence, editing a submission should be handled similarly to deletion.

@kelvin-muchiri
Contributor

Could we also keep a count of the number of records/submissions that have the highest number for any repeat? Then, during a deletion, we reduce the counter and only delete the key when the counter reaches zero?

I think this will work, since there could be other submissions with the same number of repeats; there is no need to delete the key unless the counter is 0.
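A rough sketch of the counter idea (the register structure and names are hypothetical, and a later comment notes the final implementation took a simpler route): alongside each repeat's maximum we track how many submissions currently reach it, and only drop the stored maximum when that counter hits zero.

# Minimal sketch (hypothetical register structure), e.g.
# register = {"hospital_repeat": {"max": 2, "count": 3}}
def on_submission_deleted(register, deleted_counts):
    """deleted_counts: per-repeat occurrence counts of the deleted submission."""
    for name, count in deleted_counts.items():
        entry = register.get(name)
        if entry and count == entry["max"]:
            entry["count"] -= 1
            if entry["count"] <= 0:
                # No remaining submission is known to reach this maximum;
                # remove it so it is re-derived on the next export.
                register.pop(name)
    return register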

@kelvin-muchiri kelvin-muchiri linked a pull request Jan 7, 2025 that will close this issue
@kelvin-muchiri
Contributor

We later decided not to re-create the entire register if a submission is deleted/updated, as this would be performance intensive. The assumption is that there is no harm in having extra columns, since their values will be n/a.

@faith-mutua

We later decided not to re-create the entire register if a submission is deleted/updated, as this would be performance intensive. The assumption is that there is no harm in having extra columns, since their values will be n/a.

@kelvin-muchiri this approach has been implemented. Deleted/updated submissions are now downloaded as n/as and not removed from the generated export.
