bioRxiv Bioinformatics Highlights at a Glance, May 2019
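As of last month, it has been a full year since 生信人 launched its monthly roundup of notable bioinformatics preprints on bioRxiv. About a year ago, this column covered an article in Nature's World View section by Tom Sheldon, a London-based science journalist. Sheldon argued that because preprints are not peer reviewed, they may contain more errors than formally published papers, and that those errors can be spread and amplified through preprints. He therefore urged the research community to take steps to tighten restrictions on preprint posting.
A year on, Nature has officially come out in defence of preprints. On May 15, Nature published an Editorial formally stating that Nature and its sister journals support preprints.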
 
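In fact, Nature commented on preprints as early as 1997, although at that time preprints were used mainly in physics. Today, Nature's editors believe it is time to voice support for preprints, a format that combines claiming priority of discovery, receiving input from peers, and quickly demonstrating research progress.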
By making early research findings accessible quickly and easily, preprints allow researchers to claim priority of discovery, receive community input and demonstrate evidence of progress for funders and others.
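The Editorial also says that Nature is updating its position on two issues that were previously somewhat ambiguous. First, authors may choose a licence for their preprints, and this will not affect how the manuscript is reviewed, although the choice of licence may limit how the research can be shared and disseminated. Second, authors may discuss preprint findings with the media, but should make clear that the results have not been peer reviewed. Nature's influence is beyond doubt. Still, it should not be forgotten that the long-established journal Genetics was probably the first biology journal to publicly declare support for preprints.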
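More than five years on, the preprint community, both users and servers, has grown rapidly and is thriving, as can be seen from a survey published last month in eLife of the more than 37,000 preprints posted to bioRxiv since its launch [1,2]. Preprints owe their present standing to the efforts of countless pioneers, and also to the voices of their critics. Their future depends on the joint efforts of the academic community.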
1. [Bioinformatics] At last: Google brings deep learning to gene function annotation, claiming large gains in prediction accuracy and speed
Using Deep Learning to Annotate the Protein Universe
Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate 1/3 of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database. Using the Pfam seed sequences we establish a rigorous benchmark assessment and find a dilated convolutional model that reduces the error of both BLASTp and pHMMs by a factor of nine. Using 80% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features it was not trained on such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space, allowing sequences from novel families to be accurately annotated. These results suggest deep learning models will be a core component of future protein function prediction tools.
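For readers unfamiliar with the term "dilated convolutional model" used in the abstract, the following is a minimal pure-Python sketch of a 1D dilated convolution. It is not the authors' architecture; the function names `dilated_conv1d` and `receptive_field`, the kernel weights, and the dilation schedule are all illustrative assumptions. It only shows the general idea that spacing the kernel taps apart lets a few stacked layers see a long stretch of sequence.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D dilated convolution (cross-correlation, as used in deep learning).

    Hypothetical helper for illustration: kernel taps are spaced `dilation`
    positions apart in the input.
    """
    # Number of input positions covered by a single output value.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + k * dilation]
                       for k, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated conv layers (illustrative)."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    print(dilated_conv1d(x, [1, 1, 1], dilation=1))  # sums of adjacent triples
    print(dilated_conv1d(x, [1, 1, 1], dilation=2))  # taps at positions 0, 2, 4
    print(receptive_field(3, [1, 2, 4, 8]))          # assumed doubling schedule
```

With an assumed kernel size of 3 and dilations doubling per layer, four layers already cover 31 input positions, which is one reason dilated convolutions are a common choice for long sequences.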
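BTW: the preprint drew wide attention online as soon as it was posted, including a good deal of skepticism. Professor Lars Juhl Jensen of the University of Copenhagen, Denmark, said that in selecting the test set the Google team overlooked the evolutionary relatedness of proteins belonging to the same family.
Sean Eddy, the author of HMMER, expressed a similar view and added that the paper's description of the speed of his software is seriously inaccurate.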
2. [Bioinformatics] Ra, a de novo assembler aimed at large genomes
Yet another de novo genome assembler (CC-BY-NC 4.0)
Advances in sequencing technologies have pushed the limits of genome assemblies beyond imagination. The sheer amount of long read data that is being generated enables the assembly for even the largest and most complex organism for which efficient algorithms are needed. We present a new tool, called Ra, for de novo genome assembly of long uncorrected reads. It is a fast and memory friendly assembler based on sequence classification and assembly graphs, developed with large genomes in mind. It is freely available at github/lbcbsci/ra.
3. [Bioinformatics] John Storey (Princeton University): how deep must sequencing be for an RNA-seq differential expression experiment to reach adequate statistical power?
Determining sufficient sequencing depth in RNA-Seq differential expression studies (CC-BY-ND 4.0)
RNA-Seq studies require a sufficient read depth to detect biologically important genes. Sequencing below this threshold will reduce statistical power while sequencing above will provide only marginal improvements in power and incur unnecessary sequencing costs. Although existing methodologies can help assess whether there is sufficient read depth, they are unable to guide how many additional reads should be sequenced to reach this threshold. We provide a new method called superSeq that models the relationship between statistical power and read depth. We apply the superSeq framework to 393 RNA-Seq experiments (1,021 total contrasts) in the Expression Atlas and find the model accurately predicts the increase in statistical power gained by increasing the read depth. Based on our analysis, we find that most published studies (> 70%) are undersequenced, i.e., their statistical power can be improved by increasing the sequencing read depth. In addition, the extent of saturation is highly dependent on statistical methodology: only 9.5%, 29.5%, and 26.6% of contrasts are saturated when using DESeq2, edgeR, and limma, respectively. Finally, we also find that there is no clear minimum per-transcript read depth to guarantee saturation for an entire technology. Therefore, our framework not only delineates key differences among methods and their impact on determining saturation, but will also be needed even as technology improves and the read depth of experiments increases. Researchers can thus use superSeq to calculate the read depth to achieve required statistical power while avoiding unnecessary sequencing costs.
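4. [Evolution] Suhua Shi's group (Sun Yat-sen University): using mangroves as an example, how much of the genome can be freely exchanged between species?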
Genes and the species concept - How much of the genomes can be exchanged? (CC-BY-NC-ND 4.0)
In the biological species concept, much of the genomes cannot be exchanged between species [1,2]. In the modern genic view, species are distinct as long as genes that delineate the morphological, ecological and reproductive differences remain distinct [2]. The rest (or the bulk) of the genomes should be freely interchangeable. The core of the species concept therefore demands finding out the full potential of introgressions between species. In a survey of two closely related mangrove species (Rhizophora mucronata and R. stylosa) on the coasts of the western Pacific and Indian oceans, we found that the genomes are well delineated in allopatry, echoing their morphological and ecological divergence. The two species are sympatric/parapatric in the Daintree River area of northeastern Australia. In sympatry, their genomes harbor 7,700 and 3,100 introgression blocks, respectively, with each block averaging about 3-4 Kb. These fine-grained and strongly-penetrant introgressions suggest that each species must have evolved many differentially-adaptive (and, hence, non-introgressable) genes that contribute to speciation. We identify 30 such genes, seven of which are about flower development, within small genomic islets with a mean size of 1.4 Kb. In sympatry, the species-specific genomic islets account for only a small fraction (< 15%) of the genomes while the rest appears …