It is common for hundreds or even thousands of clinical endpoints to be collected from individual subjects, yet events for most of those endpoints are rare. The challenge of analyzing such high-dimensional sparse data lies in balancing the analytical requirements of statistical inference against clinically meaningful interpretation of the outcomes of interest, both within and across endpoint categories. Lumping, or grouping similar rare events into a composite category, has the statistical advantages of increasing power, reducing the multiplicity burden, and avoiding competing-risk problems. However, excessive or inappropriate lumping jeopardizes the clinical meaning of interest and external validity. Splitting, or keeping each individual event at its most basic clinically meaningful category, overcomes the drawbacks of lumping but creates analytical issues of its own: inflated type II error, a larger multiplicity burden, competing risks, and a large proportion of endpoints with rare events. Lumping and splitting may seem diametrically opposed, but they are in fact complementary, and both are essential for high-dimensional data analysis. This paper describes the steps required for a lumping-and-splitting analysis and presents SAS® code that can be used to implement each step.
Shirley Lu, VAMHCS, Cooperative Studies Program
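As a minimal illustration of the lumping step described above, the following SAS sketch groups rare event terms into a composite category. The data set AE, the variables SUBJID and EVTTERM, and the 1% incidence cutoff are illustrative assumptions, not the paper's actual code.

/* Tabulate how often each event term occurs */
proc freq data=ae noprint;
   tables evtterm / out=evtfreq (keep=evtterm percent);
run;

/* Lump terms accounting for less than 1% of event records
   into a single composite category (illustrative cutoff)  */
data evtfreq;
   set evtfreq;
   length lumpterm $200;
   if percent < 1 then lumpterm = 'COMPOSITE: OTHER RARE EVENTS';
   else lumpterm = evtterm;
run;

/* Carry the lumped category back onto the event-level records */
proc sql;
   create table ae_lumped as
   select a.subjid, a.evtterm, b.lumpterm
   from ae as a left join evtfreq as b
   on a.evtterm = b.evtterm;
quit;

A splitting analysis would simply keep EVTTERM itself as the analysis category; the cutoff controls how aggressively events are lumped.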
In the regulatory world of patient safety and pharmacovigilance, whether during clinical trials or post-market surveillance, serious adverse events (SAEs) that affect participants must be collected and, if certain criteria are met, reported to the FDA and other regulatory authorities. SAEs are often entered into multiple databases by various users, resulting in possible data discrepancies and quality loss. Efforts have been made to reconcile the SAE data between databases, but there is no industry standard regarding the methodology or tool employed for this task. Some organizations still reconcile the data manually, with visual inspection and verbal verification. Not only is this laborious and error-prone, but it also becomes prohibitive once the data reach hundreds of records. We devised an efficient algorithm using SAS® to compare two data sources automatically. Our algorithm identifies matched, discrepant, and unpaired SAE records. Additionally, it employs a user-supplied list of synonyms to find non-identical but relevant matches. First, the two data sources are combined and sorted by key fields such as Subject ID, Onset Date, Stop Date, and Event Term. Record counts and Levenshtein edit distances are calculated within key-field groups to assist with sorting and matching. The combined record list is then fed into a DATA step that decides whether each record is paired or unpaired. For an unpaired record, a stub record with all fields set to '?' is generated as a placeholder for its missing partner. Each record is written to one of two data sets. The two data sets are then tagged and pulled into comparison logic built on hash objects, which enables field-by-field comparison and displays the discrepancies in a clean format. Identical fields or columns are cleared or removed for clarity. The result is a streamlined and user-friendly process that allows for fast and easy SAE reconciliation.
Zhou (Tom) Hui, AstraZeneca
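A minimal sketch of the matching flow described above, using SAS's COMPLEV function for the Levenshtein edit distance and a hash object for the field-by-field comparison. The data set names SRC1 and SRC2, the variable names, and the edit-distance cutoff of 3 are illustrative assumptions, not the author's production code; the synonym-list lookup is omitted for brevity.

/* Combine the two sources, tagging each record's origin */
data combined;
   set src1 (in=a) src2;
   length source $3;
   source = ifc(a, 'DB1', 'DB2');
run;

/* Sort by the key fields used for matching */
proc sort data=combined;
   by subjid onsetdt stopdt evtterm;
run;

/* Write each record to one of two data sets by origin */
data db1 db2;
   set combined;
   if source = 'DB1' then output db1;
   else output db2;
run;

/* Load DB2 into a hash keyed on subject and onset date, then probe
   it with each DB1 record. COMPLEV returns the Levenshtein edit
   distance, so near-identical event terms can still pair up.       */
data matched discrepant unpaired;
   if _n_ = 1 then do;
      declare hash h (dataset:
         "db2 (rename=(evtterm=evtterm2 stopdt=stopdt2))");
      h.defineKey('subjid', 'onsetdt');
      h.defineData('evtterm2', 'stopdt2');
      h.defineDone();
   end;
   length evtterm2 $200;
   set db1;
   call missing(evtterm2, stopdt2);
   if h.find() = 0 then do;
      editdist = complev(upcase(strip(evtterm)), upcase(strip(evtterm2)));
      if editdist = 0 and stopdt = stopdt2 then output matched;
      else if editdist <= 3 then output discrepant; /* relevant but non-identical */
      else output unpaired;
   end;
   else do;
      evtterm2 = '?';  /* stub placeholder for the missing partner */
      output unpaired;
   end;
run;

In the full workflow the unmatched DB2 records would be probed in the reverse direction as well, and identical columns would be cleared before the discrepancy report is displayed.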