返回列表 回复 发帖

从yfull YTree v5.06版本的更新,看yfull计算年龄的漏洞

本帖最后由 风虎云龙 于 2017-9-28 17:49 编辑

之前就发现了yfull计算年龄的这个漏洞,但是一直没机会去暴露,这次有了Y全序数据的加入计算,使得这个漏洞彻底暴露于众。
F155.png
2017-9-26 11:02

细心的爱好者会发现,yfull这次更新,所有有Y全序样本参与计算的类型,都比之前的年龄提高了,有些单倍群,比如F155yfull计算的年龄达到7000多年,高的离谱。那么我们就逐一分析其问题所在。
首先,是不是Y全序测试的质量有问题呢?
那我们就首先对比一下,Y全序测试与big Y FGC Y Elite、全基因组等四大测试数据之间的质量。


产品参数质量对比.png
2017-9-26 11:02


我们从上表看到,在ChrY BAM file size方面,Y全序产品是最大的,平均在1.5G大小,而big y 基本在0.5G,全基因组和FGC Y Elite大约在0.7G这样。这就说明了Y全序产品在ChrY原始数据的体积是最大的。


那体积大在哪,体积大无非是测的长、测的深。我们接着比下面的长度和深度指标。
我们从表中可用看到,在Length coverage方面,Y全序测试是优于big Y,而低于FGC Y Elite、全基因组的。


Length coverage for age(这个相当重要,是计算年龄区段的覆盖长度),Y全序测试是与big Y相似,而低于FGC Y Elite、全基因组的。


重点来了,在Mean depth coverageMedian depth coverage(平均覆盖深度)方面,Y全序测试是远远优于big Y FGC Y Elite、全基因组的,平均是他们的5倍以上。


基于上面的对比,我们可以看出Y全序的质量是全部优于big Y的。并不会因为质量较低带来计算困难。


那我们再看yfull计算年龄的方法。


计算方法.png
2017-9-26 11:02

年龄=(私有位点数量/年龄区段覆盖比例)*突变率+60


我们可以看出在年龄区段覆盖比例接近的情况下,私有位点的数量是计算年龄的最大变量。


那我们接着看yfull认定私有位点的方法:
https://www.yfull.com/faq/what-yfulls-age-estimation-methodology/


他们认定私有位点有两个条件,一是这个SNP要有大于等于3X深度(SNPs with only one or two "reads"are excluded.),只读到1次或2次的突变,他们认为是属于Ambiguous SNP,不参与年龄的计算。第二就是对突变要有质量要求,要求SNPs are excluded if the "readquality" is less than 90%.


符合上述两条标准的私有位点,然后再看他们是否在计算年龄的区间(8,473,821),如果在的话,才会真正参与年龄的计算。


Yfull对这三条的设定,我们可以看到,问题来了。也就是认定私有位点的第一条,yfull的算法是不看各种测试产品的测试参数,而硬性定下的标准,突变要3X以上才被认为是私有位点,这样就会出现不同测试产品,不同平均覆盖深度而带来的问题。


首先,平均覆盖深度低的产品,比如big Y FGC Y Elite、全基因组的产品,平均覆盖深度在30-40X,这样就会有一部分私有位点出现低于3X而未被确定为高质量的私有位点,而不参加年龄的计算。比如我的Big Y数据,有2个私有位点只有2X的覆盖深度而被定为Ambiguous SNP而没参与年龄计算,而通过我爸爸的全序数据,确定了这2个点是高质量私有位点。


Y全序产品的平均覆盖深度在200X,远远大于Big Y30X,这样就会出现yfull用同一认定私有位点的标准,用在不同产品中,出现了这种情况。
这个问题我在分析第一批Y全序数据的时候就发现了,并且摸索了一套根据不同覆盖深度设置认定私有位点的参数方法,经过我们设置的这套参数计算的第一批Y全序的年龄,可以看到非常稳定,可以直接和之前Yfullbig y的产品计算的年龄对比。


Yfull这次更新用了2个月时间,我还以为他们会发现他们计算年龄的这个bug,不过从这次更新看,他们似乎仍然没发现。
1

评分次数

F444交流群:293164533
Y高通测试报名QQ群:602612059
yfull在计算共祖时间上的问题,我们以前就有所意识。简单的说,就是只校正了覆盖率,而没有校正测试深度。这样测试深度较高的样本可能会出现更多的范围内的私有位点。表现为共祖时间更长。

    不过这次的样本依然存在让人疑虑的地方。这次上传到yfull的魔方数据除了F316下的样本之外,还有多个F845的样本。毫无意外的,F845的魔方样本整体上拉高了F845的共祖时间。然而,拉高的时间相对来说是有限的。早期在bigY和FGC检测的两例样本,使用的私有位点分别为41和49。而在魔方检测样本使用的私有位点为52-58之间。增加的比例并不大。

    但F813下这个魔方样本使用的私有位点为89个,对比早期在bigY的样本,只有13到26个,增长了6到4倍。YF10745的平均深度并没有比F845的几个样本更高,出现这么大的差别,不知道是否还有其他的原因?
QQ截图20170926112136.jpg
C-M130交流群:542136235
我是真心没想到啊,yfull的三兄弟,对这次数据的分析竟然没发现这个问题。可能是纯粹的机器跑的算法,对于同一个类型下一个计算出是12297年一个计算出是1785年,竟然没有设置警报,不进行人工分析排查原因。人数再少也不能缺少人工的基本审核异常的机制啊。
问题.png
F444交流群:293164533
Y高通测试报名QQ群:602612059
本帖最后由 风虎云龙 于 2017-9-27 07:51 编辑
yfull在计算共祖时间上的问题,我们以前就有所意识。简单的说,就是只校正了覆盖率,而没有校正测试深度。这样测试深度较高的样本可能会出现更多的范围内的私有位点。表现为共祖时间更长。

    不过这次的样本依然 ...
豢龙氏 发表于 2017-9-26 11:43
实际这个问题就是我上面提到的需要根据不同测试产品的测试参数,不但要做覆盖长度的修正,也要做个覆盖深度的修正。我们摸索到的修正系数,最后YF10745样本的私有位点只有19个。
私有.png
F444交流群:293164533
Y高通测试报名QQ群:602612059
4# 风虎云龙
I'd like to see the comb bed file of Y quanxu.
I think the most critical thing in calculating TMRCA of Y haplogroup is not the mutation rate, coverage or read depth, but the very comb bed file because the number of SNP's varies more differently. It filters out the fake SNPs in ampliconic or palindronic regions and decides the real SNP number of shared and private mutations. The comb bed file of BIGY is different from Y full's. I tried to apply Adamov's comb bed file(actually the same as Y full's)  to some Y haplogroups, but failed to replicate Y full's estimated TMRCA. As you may know, the comb bed file of Y full filters out not only fake SNP's, but also many real SNP's. I wonder how Y quanxu can resolve Type I,II error and make the exact estimation.
4# 风虎云龙
I'd like to see the comb bed file of Y quanxu.
I think the most critical thing in calculating TMRCA of Y haplogroup is not the mutation rate, coverage or read depth, but the very comb  ...
yayul 发表于 2017-9-27 10:38
我们这里讨论的私有位点的认定都是基于yfull确定的年龄梳理区说的,至于梳理区之外的,有些区域非常不稳定,基本不能用于构建树形和计算年龄,也存在部分区域比较稳定的,里面一些突变也可以多次验证。
F444交流群:293164533
Y高通测试报名QQ群:602612059
毛子的回复
Мы рассмотрим эту ситуацию подробно немного позже, но сейчас могу совершенно точно сказать, что образцы новой лаборатории имеют очень низкое качество. Сейчас мы пришли к выводу, что их интерпретация без предварительной обработки невозможна.
Причин низкого качества несколько. Назову некоторые из них:
1) псевдо coverage, то есть на одно нормальное чтение мы имеем как минимум два-три дупликата. Таким образом там где нам кажется есть покрытие 200X на самом деле может быть реальное покрытие 10X. При удалении дупликатов вес файла уменьшается с 1.7Gb до 500Mb!
2) Из-за пункта номер один может неверно работать наш алгоритм поиска приватных и определения их качества.
3) Файлы имеют очень грязное выравнивание, где "снипы" сотнями идут друг за другом с расстоянием в несколько нуклеотидов. При попытке повторного выравнивания мы увидели что в первоначальном файле из этой лаборатории на Y-хромосому были в большом количестве смаппированы риды с других хромосом. Даже с мито! Что говорить про X...
4) Фактическое покрытие в ширину ниже чем у BigY. Многие участки имеют покрытие 1-3 рида. И это не смотря на проблему с дупликатами.
5) Рваное покрытие в ширину. Что может быть следствием первых пунктов.
6) Разная длина ридов и большие soft clipped куски. Что вносит свою изюмину в эти файлы.
и некоторые другие проблемы.
Первую партию мы обрабатывали "как есть", что вероятно и внесло сильное искажение в подсчет возраста ветвей с образцами из новой лаборатории. Сейчас мы пытаемся "улучшить" некоторые образцы из нового батча, но не знаю что выйдет.
Даже при повторном выравнивании и удалении дупликатов в БАМах остается очень много мусора.
三界无安,犹如火宅。众苦充满,甚可怖畏
                            --------《法华经》
是以法从心生。名因法立
                      ------------《宗镜录》
GOOGLE 翻译
We will discuss this situation in detail a little later, but now I can say for sure that the samples of the new laboratory are of very low quality. Now we have come to the conclusion that their interpretation without preliminary processing is impossible.
There are several reasons for poor quality. I will mention some of them:
1) pseudo coverage, that is, for one normal reading, we have at least two or three duplicates. So where there seems to be a 200X cover, in fact there can be a real 10X coverage. When deleting duplicates, the file weight decreases from 1.7Gb to 500Mb!
2) Because of the item number one, our algorithm of searching for private ones and determining their quality may not work properly.
3) The files have a very dirty alignment, where "snaps" by hundreds go one after another with a distance of several nucleotides. When trying to re-align, we saw that in the original file from this laboratory on the Y-chromosome there were a large number of reads from other chromosomes. Even with mito! What to say about X ...
4) The actual coverage is wider than that of BigY. Many sites have a coverage of 1-3 rid. And this is despite the problem with duplicates.
5) Torn covering in width. What can be the consequence of the first points.
6) Different lengths of ridges and large soft clipped pieces. What brings its raisin to these files.
and some other problems.
We processed the first batch "as is", which probably caused a great distortion in the calculation of the age of branches with samples from the new laboratory. Now we are trying to "improve" some samples from the new batch, but I do not know what will happen.
Even with repeated alignment and removal of duplicates in BAM there is a lot of debris left.
三界无安,犹如火宅。众苦充满,甚可怖畏
                            --------《法华经》
是以法从心生。名因法立
                      ------------《宗镜录》
Ради интереса приведу полную статистику по повторному выравниванию одного из образцов.
Количество смаппированных ридов смотрите в третьем столбце.
Было:
chr1 249250621 0 0
chr2 243199373 0 0
chr3 198022430 0 0
chr4 191154276 0 0
chr5 180915260 0 0
chr6 171115067 0 0
chr7 159138663 0 0
chr8 146364022 0 0
chr9 141213431 0 0
chr10 135534747 0 0
chr11 135006516 0 0
chr12 133851895 0 0
chr13 115169878 0 0
chr14 107349540 0 0
chr15 102531392 0 0
chr16 90354753 0 0
chr17 81195210 0 0
chr18 78077248 0 0
chr19 59128983 0 0
chr20 63025520 0 0
chr21 48129895 0 0
chr22 51304566 0 0
chrX 155270560 0 0
chrY 59373566 27856173 11672
chrM 16569 1596 0
* 0 0 0

стало:
1 249250621 5098 2
2 243199373 4352 0
3 198022430 3155 0
4 191154276 3103 1
5 180915260 2767 0
6 171115067 2688 0
7 159138663 2467 0
8 146364022 2122 0
9 141213431 2104 0
10 135534747 2493 0
11 135006516 2192 0
12 133851895 2141 0
13 115169878 1679 0
14 107349540 1436 0
15 102531392 1760 0
16 90354753 1675 2
17 81195210 1328 0
18 78077248 1150 0
19 59128983 1140 0
20 63025520 1064 0
21 48129895 3801 4
22 51304566 808 0
X 155270560 32023 16
Y 59373566 27324933 10524
MT 16569 1607 0
GL000207.1 4262 0 0
GL000226.1 15008 28 0
GL000229.1 19913 887 0
GL000231.1 27386 744 0
GL000210.1 27682 4 0
GL000239.1 33824 8 0
GL000235.1 34474 4532 0
GL000201.1 36148 0 0
GL000247.1 36422 11 0
GL000245.1 36651 28 0
GL000197.1 37175 0 0
GL000203.1 37498 0 0
GL000246.1 38154 0 0
GL000249.1 38502 0 0
GL000196.1 38914 0 0
GL000248.1 39786 0 0
GL000244.1 39929 0 0
GL000238.1 39939 0 0
GL000202.1 40103 0 0
GL000234.1 40531 36883 9
GL000232.1 40652 17588 4
GL000206.1 41001 0 0
GL000240.1 41933 0 0
GL000236.1 41934 57 0
GL000241.1 42152 2 0
GL000243.1 43341 11 0
GL000242.1 43523 0 0
GL000230.1 43691 6 0
GL000237.1 45867 4 0
GL000233.1 45941 13 0
GL000204.1 81310 0 0
GL000198.1 90085 331 0
GL000208.1 92689 8 0
GL000191.1 106433 0 0
GL000227.1 128374 4 0
GL000228.1 129120 79 0
GL000214.1 137718 304 0
GL000221.1 155397 45 0
GL000209.1 159169 0 0
GL000218.1 161147 28203 13
GL000220.1 161802 20763 2
GL000213.1 164239 5 0
GL000211.1 166566 5841 2
GL000199.1 169874 8 0
GL000217.1 172149 429 0
GL000216.1 172294 3416 2
GL000215.1 172545 1 0
GL000205.1 174588 1837 1
GL000219.1 179198 18 0
GL000224.1 179693 86 0
GL000223.1 180455 4 0
GL000195.1 182896 2289 1
GL000212.1 186858 17 0
GL000222.1 186861 265 0
GL000200.1 187035 4 0
GL000193.1 189789 2 0
GL000194.1 191469 23307 15
GL000225.1 211173 5450 3
GL000192.1 547496 12 0
NC_007605 171823 0 0
hs37d5 35477943 304005 176
* 0 0 63
Это очень негативно сказывается на качестве поиска приватных снипов, так как понятно что если сравнивать с Y, то на ридах из X будет очень и очень много "новых снипов".
Как видим более полумиллиона ридов "размазало" по другим хромосомам.


三界无安,犹如火宅。众苦充满,甚可怖畏
                            --------《法华经》
是以法从心生。名因法立
                      ------------《宗镜录》
The source of the problem is not in methodology but as samples. We analyzed samples from different laboratories (FTDNA, FGC, BGI, YSEQ, VERITAS and others) and scientific samples of proper quality too (several hundred). At the same time, we excluded some scientific samples, for example 1000 genomes, Sardines, etc. because that they are of poor quality and give incorrect ages.

The problem in the "new" Chinese samples (from other lab), but this is not a problem of calculating age of YFull! This samples are of VERY poor quality, they have many short reads (less than 100bp) and all the "big" coverage in the TRASH regions. The calculation of the age ***X is not affected, here is the fact that Chinese samples have a larger number of private snps - this is because the filtration system does not cope with the flow of fake-trash variants. After previous batch we realized this and in the next batch we completely change and remapped the BAM files. We rewrite it and removed the short reads, as well as duplicate reads and etc .. With these samples there are big problems, they are inferior in quality than BigY, and even more so FGC. After all the changes their inflated coverage (and huge GB weight) becomes almost the same with the BigY.

There are several reasons for very poor quality... some of them:

"Pseudo-coverage", for one realible reading, we have at least two or three duplicates. So where we seen "200X", in the fact there can be a real "10X" coverage. When we deleting duplicates, the file size decreases from 1.7Gb to 500Mb! From this our algorithm of searching for private snps and determining their quality may not work properly.

The BAM files have a very bad alignment, there is we seen so named "variants" placed by hundreds go one after another with a distance of several nucleotides - this is aka "variety show". When we trying to re-alignment, we saw that in the original file from this new lab on the Y-chromosome there were a large number of reads from other chromosomes. Even from Mito!

The actual "length of coverage" is less than in BigY. Many sites have a coverage of 1-3X. And this is despite that they have the problem with duplicates.

Different lengths of reads and large soft clipped segments... and some other problems too.

We processed the first batch "as is" and we think that which probably caused a great variation in the age estimation of subclades for those chinese samples (from the new lab). Now we are trying to "improve" some raw files for the new batch, but I do not know what will happen.

Even with rerun of alignment and removal of duplicates there is a lot of "trash" left in the BAM file.

This graph (one of the "pure" regions - 9 Mb) shows a change in the coverage of Chinese samples before and after rebuilding, as well as in comparison with the samples of the BigY.



- BLUE (CN-sample before rebuilding)
- GREEN (CN-sample after rebuilding)
- RED (reference BigY sample)

If you notice we do not calculate age for some scientific samples, for example 1000 genomes, Sardines, etc. It's not just that .. the reason is that they are of poor quality and give incorrect ages. The method is described in detail in the article, and alas, it is sensitive to the quality of samples, otherwise it is impossible.

A detailed description of the methodology is here:
Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data
https://www.researchgate.net/pub ... ull_Sequencing_Data
(page 79-80)

We exclude 1-2 reads variants (from estimated) because we do not know whether this is a realible mutation or a false mutation because of errors of sequenced or of alignment or of assembly or contamination or other reasons, this applies in particular to "private" snps, but when we find the same variant for another sample (or samples) but already have a large depth of coverage in this position, then we accept such a variant and add it in the tree.

P.S. dear friends, what do you think... it's okay when "Yquanxu" copied "full" the YTree including with the web-design and publishes it without attribution and permission and uses our domain name (in other domain zone) and our brand name thereby misleading users?
Dear colleagues, do not try to duplicate or copy the age calculation algorithm .. it's not so simple as it seems to you - it's not a simple formula but complex calculations associated with all variants and all nodes.
4# 风虎云龙
I'd like to see the comb bed file of Y quanxu.
I think the most critical thing in calculating TMRCA of Y haplogroup is not the mutation rate, coverage or read depth, but the very comb  ...
yayul 发表于 2017-9-27 10:38
无法重复结果,正常。商业学术机构为了保护自己的学术成果和经济利益,会隐藏算法的关键部分。YFULL用来估算年龄的combBED区域长度是8467165,Adamov论文给出的combBED区域长度总和并不是这个值,两者似乎不完全对应
本帖最后由 李大山 于 2017-9-28 18:38 编辑

10# Roman_Centurion

Dear Roman,

First of all, we never use “Yfull” as our domain name. Previously, one enthusiastic customer linked “Yfull.net” to our website, we would be very grateful. It is one kind-hearted personal behavior but may confuse you a lot. If any offended, please don’t mind. At the beginning, we found that it is difficult for Chinese people to analysis their Y chromosome data and some people would like to upload their sequencing data to Yfull and obtain their Y report like the personal position on the Y tree. However, it is inconvenient, especially in the sequencing data transmission and the language barrier. Therefore, that is why we build our Chinese Y chromosome analysis platform and hopefully encourage more people to explore their ancestral information.


On the other hand, Yquanxu now apply the target sequencing by massively parallel sequencing in the self-designed non-recombining MSY regions, the region size is approximately 8Mbp. As your mentions, it is our mistake that previous provided bam is out of removing duplicated reads. But we all know that in Illumina sequencing platform, the more reads we sequence, the higher dup rate we get commonly, especially in this ultra-deep sequencing data. Here, based on your analysised sample, we do apply Picard’s MarkDuplicates to remove the PCR duplicates reads and plot the compare distribution of reads depth before and after the removing progress. We would like to say that the average reads depth is 111X across the target 8Mbp region, it is impossible to be “real 10X coverage”.

屏幕快照 2017-09-28 18.29.14.png
2017-9-28 18:29


Secondly, it is normal that the bam file contains short reads(less than 100bp) because the reads have been trimmed using the reads trimming tools called trimmomatic. Some poor quality bases could be cut off if below a threshold quality. At present, we set the command lines as “java -jar/usr/local/trimmomatic/bin/trimmomatic.jar PE $fq1 $fq2 $line.clean.fq1.gz$line.unpaired.fq1.gz $line.clean.fq2.gz $line.unpaired.fq2.gz LEADING:20  TRAILING:20 SLIDINGWINDOW:8:20 MINLEN:36”.


Besides,I think that it will be ambiguous if we use different quality control analysis pipeline and I am sure that it will lead the result vary widely. For instance, Yfull could remove the SNPs only one and two reads covered. But we should make sure that the variants should be at least 6X reads depth because we know our data is with deep sequencing. Some fake-trash variants generated by your QC process may be removed in our QC process. For this problem, we could like too pen more detailed technical progress and discuss the public acceptable and feasible QC process that can be suitable for various experimental sequencing data.


Allin all, we make 100% effort to meet the data quality not only in the experimental procedure but also the bioinformatics analysis. And we appreciate Yfull helping us improve our progress and please feel free to let us know if you have any other questions.
我就是那个热心肠的男人,毛子
15# ietf 你搞这个域名简直是多此一举。
1

评分次数

李大山, thanks for your reply. We really appreciate your openness and invitation to discuss. As well as the desire to

improve your product, and we in turn want to improve our analysis. When a more laboratories will launched NGS tests,

then these products will become available to a many peoples, thereby we will be able to better understand the history

of our ancestors.

We will study your comments of technical details and prepare our response and our recommendations on the assembly and

mapping of raw data that are based on our experience. We will send you an answer by e-mail.
有碰撞没关系,只要有一种开放、容纳及合作的态度,最后结果就是共同成长与进步。
急功近利者——戒!
给Y-full发邮件反映一下。
分子人类学基客联盟QQ群:463619584。文化,历史与族群QQ群:487327947。
中华源流探寻QQ群:493631709。微博:https://weibo.com/u/2608725421?is_all=1
YFULL更新了,然而F155问题并没有解决,不清楚是还在重新计算中,还是就这样了,已经打算提出撤回样本了。
返回列表
baidu
互联网 www.ranhaer.org