However, in the course of this study, we noticed that gnomAD also has certain limitations, some of which were discussed in detail by the gnomAD research team.16 First, gnomAD currently has two data sets, V2 and V3, each with its own pros and cons. The larger one, V2, contains primarily exome data from 141,456 individuals, whereas V3 contains genomic data from 76,156 individuals. These structural differences may lead to materially different results. In the current study, among the 96 pathogenic variants in gnomAD, 55 appear in only one of the two data sets (V2 or V3), most of which (n = 46) are rare, with only a single allele observed. However, a few variants had a relatively large number of heterozygotes (up to 12) in V2 but were absent from V3. Among the 41 variants that appear in both data sets, a reasonable correlation coefficient of 0.70 was obtained. However, there are exceptions. For example, the status of c.802-8_810del17insGC as the most common CYP4V2 mutation, with a founder effect in the East Asian population, has been well established by multiple studies in patients.19,21,22 Its carrier frequency varies dramatically between the two versions (9.67 × 10⁻³ in gnomAD V3 vs. 1.61 × 10⁻³ in gnomAD V2, a ∼6-fold difference), which might stem from the nature of the variant (an indel) and its intronic location, both of which might be called more accurately in genome analysis. Indeed, c.802-8_810del17insGC is the most common CYP4V2 mutation in V3 in the East Asian population, a result that is consistent with its relatively high prevalence among patients with BCD. Second, despite the large sample size of gnomAD, many rare mutations are absent. In the current study, 60 of 108 mutations previously reported in patients with BCD are absent from gnomAD, and most (∼63%) of these variants were found in patients of East Asian origin. It is therefore reasonable to predict that the data presented here are more accurate for well-represented populations, such as the European population, but this gap is expected to narrow over time as gnomAD includes more samples with higher-quality data. On the other hand, gnomAD may capture mutations that have not been reported in patients, especially for rare diseases. Based on the current study, gnomAD contains 48 likely pathogenic variants that have not been reported in patients. This is not surprising, as new CYP4V2 mutations are continuously being reported in patients, even in the East Asian population, in which BCD has been well studied.19,22 As new mutations are discovered, reanalysis of genomic data can increase the rate of genetic diagnosis among patients who harbor mutations that have not been published previously.23 Third, the small sample size of some populations in gnomAD may result in the absence of mutations, especially rare ones. For example, V3 includes only 158 individuals from the Middle Eastern population, leading to a calculated genetic prevalence of zero for BCD, a value that does not reflect disease prevalence in this population, since mutations have been reported in the Arab-Muslim,21 Iranian,24 Jewish,25 and Lebanese26 populations.
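As a rough illustration of the first and third limitations, the following Python sketch reproduces the ∼6-fold difference from the two reported carrier frequencies and estimates how weakly a zero count among 158 individuals constrains the true allele frequency. It is not the method used in this study. The function names are hypothetical, the upper bound assumes that all 316 alleles were successfully genotyped and sampled independently, and the conversion to prevalence assumes Hardy-Weinberg equilibrium for a recessive disease.

```python
def fold_difference(freq_a: float, freq_b: float) -> float:
    """Return the ratio of the larger frequency to the smaller one."""
    high, low = max(freq_a, freq_b), min(freq_a, freq_b)
    if low <= 0:
        raise ValueError("both frequencies must be positive")
    return high / low


def zero_count_upper_bound(n_individuals: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on allele frequency q when 0 of 2N alleles carry a variant.

    Assumption: alleles are sampled independently, so the probability of
    seeing no carriers is (1 - q) ** (2N). Setting that probability equal to
    1 - confidence and solving for q gives the bound.
    """
    if n_individuals <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need n_individuals > 0 and 0 < confidence < 1")
    n_alleles = 2 * n_individuals  # diploid: two alleles per individual
    return 1.0 - (1.0 - confidence) ** (1.0 / n_alleles)


def recessive_prevalence(allele_freq: float) -> float:
    """Expected frequency of biallelic individuals under Hardy-Weinberg (q squared)."""
    return allele_freq ** 2


if __name__ == "__main__":
    # Carrier frequencies of c.802-8_810del17insGC quoted in the text (V3 vs. V2).
    ratio = fold_difference(9.67e-3, 1.61e-3)
    # Middle Eastern sample size in V3 quoted in the text.
    q_max = zero_count_upper_bound(158)
    print(f"fold difference: {ratio:.1f}")
    print(f"95% upper bound on combined allele frequency: {q_max:.4f}")
    print(f"implied upper bound on prevalence: 1 in {1 / recessive_prevalence(q_max):,.0f}")
```

Under these assumptions, observing no pathogenic alleles in 158 individuals remains compatible with a combined pathogenic allele frequency of up to roughly 0.9%, so a calculated prevalence of zero is better read as "not detected at this sample size" than as "absent from the population".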