Hc's Blog

Reading--SNP calling from HTS

Posted on 2020-05-13 Edited on 2020-05-19

A beginner guide to SNP calling from high-throughput DNA-sequencing data

2012, Human Genetics

The objective is to identify genetic variants such as single nucleotide polymorphisms (SNPs) from high-throughput DNA sequencing (HTS) data.
Pipeline:
1. Quality control
2. Mapping of short reads to the reference genome
3. Visualization and post-processing of the alignment, including base quality recalibration
4. SNP calling, along with filtering of SNP candidates
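
To make step 4 a little more concrete, here is a toy sketch of the idea behind SNP calling: at each reference position, count the read bases covering it and report a candidate when a non-reference base is frequent enough. The function name `call_snps`, the simplified pileup dictionary, and the thresholds `min_depth` and `min_alt_fraction` are all assumptions made for illustration. Production callers additionally use base and mapping qualities and probabilistic genotype models.

```python
from collections import Counter


def call_snps(reference, pileup, min_depth=4, min_alt_fraction=0.3):
    """Return (position, ref_base, alt_base, alt_fraction) for candidate SNPs.

    `pileup` maps a 0-based reference position to the string of read bases
    observed there (a hypothetical, simplified representation).
    """
    candidates = []
    for pos in sorted(pileup):
        bases = pileup[pos]
        depth = len(bases)
        if depth < min_depth:
            continue  # too few reads to support a call
        ref_base = reference[pos]
        counts = Counter(bases)
        # Keep only bases that differ from the reference.
        alt_counts = [(n, b) for b, n in counts.items() if b != ref_base]
        if not alt_counts:
            continue
        n_alt, alt_base = max(alt_counts)  # most frequent non-reference base
        fraction = n_alt / depth
        if fraction >= min_alt_fraction:
            candidates.append((pos, ref_base, alt_base, fraction))
    return candidates


reference = "ACGTAC"
pileup = {1: "CCCCC", 3: "TTGGGG", 4: "AA"}  # made-up example data
snps = call_snps(reference, pileup)
print(snps)
```

Position 4 is skipped because only two reads cover it, which is the kind of filtering the last pipeline step refers to.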

Reconfiguring the blog after switching computers

Posted on 2020-05-13 Edited on 2020-05-16

Yes, I have done it elegantly and efficiently.

Software installation

Install git
Install Node.js

COVID-19 pandemic research

Posted on 2020-05-06 Edited on 2020-10-22

Suspected close contacts as the pilot indicator of the growth trend of confirmed population during the COVID-19 pandemic: A simulation approach

Sisi Huang, Anding Zhu, Yan Wang, Yancong Xu, Lu Li, Dexing Kong

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7268865/

Abstract

Background:
Regarding to the actual situation of the new coronavirus disease 2019 epidemic, social factors should be taken into account and the increasing growth trend of confirmed populations needs to be explained. A proper model needs to be established, not only to simulate the epidemic, but also to evaluate the future epidemic situation and find a pilot indicator for the outbreak.

Methods:
The original susceptible-infectious-recover model is modified into the susceptible-infectious-quarantine-confirm-recover combined with social factors (SIDCRL) model, which combines the natural transmission with social factors such as external interventions and isolation. The numerical simulation method is used to imitate the change curve of the cumulative number of the confirmed cases and the number of cured patients. Furthermore, we investigate the relationship between the suspected close contacts (SCC) and the final outcome of the growth trend of confirmed cases with a simulation approach.

Results:
This article selects four representative countries, that is, China, South Korea, Italy, and the United States, and gives separate numerical simulations. The simulation results of the model fit the actual situation of the epidemic development and reasonable predictions are made. In addition, it is analyzed that the increasing number of SCC contributes to the epidemic outbreak and the prediction of the United States based on the population of the SCC highlights the importance of external intervention and active prevention measures.

Conclusions:
The simulation of the model verifies its reliability and stresses that observable variable SCC can be taken as a pilot indicator of the coronavirus disease 2019 pandemic.

**Keywords:** COVID-19, SIR model, social factors, numerical simulation, suspected close contacts, confirmed case, temporary hospital

The numerical simulation of the SIDCRL model fits the real data very well. The simulation results then indicate that a growing number of SCC contributes to the epidemic outbreak, which highlights the importance of external intervention and active prevention measures in all countries. The paper is well written; the new model and its simulation results are interesting both theoretically and practically, and they offer related researchers a new direction for investigating the COVID-19 epidemic.
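
The SIDCRL equations are not reproduced in this post, so the sketch below only simulates the classical susceptible-infectious-recover (SIR) model that the paper starts from, integrated with a simple forward Euler scheme. The function name `simulate_sir` and every parameter value are assumptions chosen for illustration, not values from the paper.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, steps_per_day=10):
    """Integrate the classical SIR equations with forward Euler.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    Returns one (S, I, R) tuple per day, including day 0.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative parameters only (assumed, not fitted to any country).
history = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=100)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("peak of infectious compartment on day", peak_day)
```

Extending this toward the model in the paper would mean adding compartments such as quarantined and confirmed cases, plus time-dependent terms for interventions.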

calculus_lecture

Posted on 2020-04-19

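Below is the handout for this calculus tutoring session: a few personal thoughts, plus some worked examples and notes on partial derivatives and Taylor expansion. The reference is Xie Huimin's 《数学分析习题讲义》 (Lecture Notes on Mathematical Analysis Exercises).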

Calculus handout 0419

Windows Subsystem for Linux

Posted on 2020-04-17

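After thinking it over, I still prefer Windows. Switching to a Mac is too expensive and not good value for money, and installing Linux as the main OS is not a good fit either. Windows has its advantages, and since I will not run large programs with Linux commands locally, something simple is enough; Linux does not need to be the main system.
I used to use a virtual machine, but it had one problem: syncing files was very inconvenient. Now I have tried WSL, the Linux subsystem under Windows, and at least the first impression is good.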

Tip: Shift + right-click opens a Linux terminal in the current directory. So handy!

trivial0321

Posted on 2020-03-21 Edited on 2021-10-24

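After almost three years at university, here are some personal tips that may be of some help to everyone, especially university students, haha. Of course, my ideas will not necessarily apply to everyone.
Some were shared with me by others, and some are my own habits. I think everything here is pretty down-to-earth, nothing fancy; a quick, relaxed skim might turn up something useful.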

Short Read Alignment

Posted on 2020-03-17

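How do we convert a genome into a representation that can quickly match millions of reads?

Here is one way, or rather one data structure: the FM index (Full-text index in Minute space), built on the Burrows-Wheeler transform (BWT).

After the reference sequence is transformed with the BWT, the FM index and LF (last-to-first) mapping enable fast matching of reads.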
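
As a rough illustration of how BWT, FM index and LF mapping fit together, here is a minimal exact-match sketch. It builds the suffix array by naive sorting, which is only suitable for tiny examples, and the function names (`bwt_and_suffix_array`, `build_fm_index`, `find`) are assumptions for this sketch rather than the API of any real aligner.

```python
def bwt_and_suffix_array(text):
    """Return the BWT of text + '$' and its suffix array (naive construction)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    bwt = "".join(text[k - 1] for k in sa)  # character preceding each suffix
    return bwt, sa


def build_fm_index(bwt):
    """Build the two FM-index tables used by LF mapping."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def find(pattern, first, occ, sa):
    """Backward search: return sorted start positions of exact matches."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]  # LF mapping of the range start
        hi = first[ch] + occ[ch][hi]  # LF mapping of the range end
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


bwt, sa = bwt_and_suffix_array("GATTACATTA")
first, occ = build_fm_index(bwt)
print(find("TTA", first, occ, sa))
```

Each character of the read narrows the suffix-array range in one step, so the search time depends on the read length rather than the genome length. Real aligners compress the `occ` table and sample the suffix array to save memory.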

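Given a reference genome and a set of reads, the goal is to find at least one "good" local alignment for each read, that is, a position of the read in the reference sequence.
What makes an alignment "good"?
- The fewer mismatches, the better
- A mismatch at a low-quality base is better than a mismatch at a high-quality base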
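
One simple way to encode both criteria is to score a candidate position by the sum of the Phred base qualities at its mismatching positions, so that lower is better. This is a sketch under that assumption; the function name `mismatch_penalty` and the example data are made up, and real aligners use more elaborate scoring.

```python
def mismatch_penalty(read, qualities, reference, offset):
    """Sum of Phred qualities at mismatching positions (lower is better)."""
    window = reference[offset:offset + len(read)]
    if len(window) < len(read):
        raise ValueError("read does not fit at this offset")
    return sum(q for r, q, g in zip(read, qualities, window) if r != g)


reference = "ACGTTCTT"
read = "ACTT"
qualities = [30, 30, 5, 30]  # third base was sequenced with low confidence

# Offsets 0 and 4 both have exactly one mismatch, but at different bases.
at_0 = mismatch_penalty(read, qualities, reference, 0)
at_4 = mismatch_penalty(read, qualities, reference, 4)
best = min(range(len(reference) - len(read) + 1),
           key=lambda off: mismatch_penalty(read, qualities, reference, off))
print(at_0, at_4, best)
```

Both candidate positions have a single mismatch, yet the one whose mismatch falls on the low-quality base gets the smaller penalty and is preferred.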

Comparative Genomic Analysis

Posted on 2020-03-16 Edited on 2021-10-24

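This post continues from the Markov chain part of the previous post, Global Alignment of Protein Sequence. Markov chains come up when studying stochastic processes or computer simulation; here the focus is on their application to gene sequences.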

Markov chains

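Label the outcomes of a series of trials, ordered in time, as a sequence of successive states:
This is also called a discrete-time Markov chain: a stochastic process describing transitions from state to state;
Markov property (memorylessness): the probability distribution of the next state is determined only by the current state and is independent of the events that came before it in the sequence;
Extended to continuous time, these are collectively called Markov processes.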

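The points to master here are:

1. The transition matrix formed by the state transition probabilities

Let a row vector $\pi$ hold the state probabilities and let a matrix $P$ be the transition matrix, with $P_{ij}$ the probability of moving from state $i$ to state $j$. The next-state probabilities are then $\pi P$; drawing the state transition diagram helps to see this. Each row of the matrix sums to $1$.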

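2. Stationary distribution

The iteration is $\pi^{(n+1)} = \pi^{(n)} P$, so $\pi^{(n)} = \pi^{(0)} P^n$. If $\pi^{(n)}$ converges as $n \to \infty$, the limit is the limiting distribution, written $\pi = \lim_{n \to \infty} \pi^{(0)} P^n$.
Perron-Frobenius theorem

If the transition probability matrix satisfies $P_{ij} > 0$ for all $i, j$, then

(1) $P$ has the eigenvalue $1$, and the corresponding left eigenvector $\pi$ is strictly positive and unique (up to scaling);
(2) if this eigenvector is normalized so that $\sum_i \pi_i = 1$, then furthermore $\lim_{n \to \infty} P^n = \mathbf{1}\pi$, where $\mathbf{1}$ is the all-ones column vector, so every row of $P^n$ converges to $\pi$.

Note that the condition for this left eigenvector to exist is weak; it can be said to always exist. However, it is not necessarily the limit of the iteration from an arbitrary initial vector. Compare the conditions of the Markov chain convergence theorem: the transition probability matrix is aperiodic and any two states communicate. Under those conditions $\pi$ is the unique non-negative solution of $\pi P = \pi$ with $\sum_i \pi_i = 1$, and it is called the stationary distribution of the Markov chain.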
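
The iteration toward a stationary distribution, and the caveat about periodic chains, can be sketched in a few lines of plain Python. The names `step` and `stationary` and the two example matrices are assumptions made for illustration.

```python
def step(dist, P):
    """One step of the chain: the row vector dist times the matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, start, tol=1e-12, max_iter=10000):
    """Iterate dist <- dist * P until it stops changing."""
    dist = list(start)
    for _ in range(max_iter):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("iteration did not converge")


# Aperiodic chain where all states communicate: the iteration converges.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P, [1.0, 0.0])
print(pi)

# Periodic chain: (0.5, 0.5) is a left eigenvector for eigenvalue 1,
# but starting from (1, 0) the distribution just flips back and forth.
periodic = [[0.0, 1.0],
            [1.0, 0.0]]
after_two_steps = step(step([1.0, 0.0], periodic), periodic)
print(after_two_steps)
```

For the first matrix the result can be checked by hand: solving $\pi P = \pi$ with $\pi_0 + \pi_1 = 1$ gives $\pi = (5/6, 1/6)$.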

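3. Detailed balance condition

If the transition matrix $P$ of an aperiodic Markov chain and a distribution $\pi$ satisfy

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j,$$

then $\pi$ is a stationary distribution of the chain. The equation above is called the detailed balance condition.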
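
A small numerical check makes the condition tangible. In this sketch $\pi$ is a probability vector and $P$ a transition matrix whose rows sum to 1; the helper names and the example matrices are assumptions for illustration. The second example shows that detailed balance is sufficient for stationarity but not necessary.

```python
def is_stationary(pi, P, tol=1e-12):
    """Check pi * P == pi componentwise."""
    n = len(P)
    return all(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) <= tol
               for j in range(n))


def satisfies_detailed_balance(pi, P, tol=1e-12):
    """Check pi[i] * P[i][j] == pi[j] * P[j][i] for every pair of states."""
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))


# Two-state chain: detailed balance holds, so pi is stationary.
P2 = [[0.9, 0.1],
      [0.5, 0.5]]
pi2 = [5 / 6, 1 / 6]
print(satisfies_detailed_balance(pi2, P2), is_stationary(pi2, P2))

# Three-state chain with a preferred direction of rotation: the uniform
# distribution is stationary, yet detailed balance fails.
P3 = [[0.5, 0.5, 0.0],
      [0.0, 0.5, 0.5],
      [0.5, 0.0, 0.5]]
pi3 = [1 / 3, 1 / 3, 1 / 3]
print(satisfies_detailed_balance(pi3, P3), is_stationary(pi3, P3))
```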

Next, let us look at its application to the evolution of DNA sequences.

Global Alignment of Protein Sequence

Posted on 2020-03-15 Edited on 2021-10-24

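This post covers the different BLAST methods, why amino acid sequences are used for alignment, how to handle gap penalties, and how dynamic programming finds the optimal global alignment, with traceback to recover the alignment itself. The same approach can be applied to semi-global and local alignment. Pay attention to the properties of the PAM matrix; the design of this scoring matrix is very thoughtful. Finally, the post notes that to know how to design a reasonable scoring system we need to look at DNA sequence evolution, and an evolving DNA sequence is in fact a Markov chain.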
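
The dynamic programming and traceback steps can be sketched as follows. To keep it short, this assumes a simple match/mismatch score and a linear gap penalty instead of a PAM matrix; the function name `global_align` and the default scores are illustrative choices, not the post's own parameters.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # (mis)match
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Traceback from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best_score, aligned_a, aligned_b = global_align("ACGT", "AGT")
print(best_score)
print(aligned_a)
print(aligned_b)
```

Swapping `sub` for a lookup in a substitution matrix such as PAM would turn this into a protein alignment. Semi-global and local alignment change only the initialization and where the traceback starts and stops.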

Gene structure

Posted on 2020-03-13 Edited on 2020-03-15

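This post introduces gene structure, including open reading frames (ORF), introns, exons, the coding sequence (CDS), untranslated regions (UTR), complementary DNA (cDNA), and the ribosome binding site (RBS).

In eukaryotes, a gene is transcribed into pre-mRNA, and splicing then removes the introns and keeps the exons to give the mature mRNA. The exon regions that are translated into protein form the CDS; the regions that are not translated are the 5' and 3' untranslated regions (UTR). DNA synthesized by reverse transcription from an mRNA or microRNA (miRNA) template is cDNA; it contains only exons (including the 5' UTR and 3' UTR) and no introns.

Intron and exon are defined with respect to transcription, while CDS and UTR are defined with respect to translation.

This gene structure information is stored in a GTF file.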
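
To show how exons and CDS relate in practice, here is a small sketch that parses GTF-style lines (nine tab-separated columns, 1-based inclusive coordinates) and compares total exon length with total CDS length; the difference is the UTR length. The gene, transcript, and coordinates are made up for the example, and `parse_gtf_line` and `total_length` are hypothetical helper names.

```python
def parse_gtf_line(line):
    """Parse one GTF line into a dict (9 tab-separated columns)."""
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")
    attrs = {}
    for item in attributes.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    return {"seqname": seqname, "feature": feature,
            "start": int(start), "end": int(end),
            "strand": strand, "attributes": attrs}


def total_length(records, feature):
    """Total length of all records of one feature type (inclusive coordinates)."""
    return sum(r["end"] - r["start"] + 1
               for r in records if r["feature"] == feature)


# Made-up records for one transcript with two exons.
attr = 'gene_id "G1"; transcript_id "T1";'
rows = [
    ("chr1", "example", "exon", "100", "200", ".", "+", ".", attr),
    ("chr1", "example", "exon", "300", "450", ".", "+", ".", attr),
    ("chr1", "example", "CDS", "150", "200", ".", "+", "0", attr),
    ("chr1", "example", "CDS", "300", "400", ".", "+", "0", attr),
]
records = [parse_gtf_line("\t".join(row)) for row in rows]

exon_len = total_length(records, "exon")  # spliced transcript length
cds_len = total_length(records, "CDS")    # translated part
utr_len = exon_len - cds_len              # 5' UTR plus 3' UTR
print(exon_len, cds_len, utr_len)
```

The intron is never listed explicitly: it is the gap between the two exon records, here positions 201 to 299.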