Introduction and References
Install and run
Quick examples
Options
Custom Annotations
cepip Perl version
FAQ
|
cepip: Context-dependent epigenomic weighting for regulatory variant prioritization
|
The majority of trait- and disease-associated variants identified by genome-wide association studies (GWASs) are located in regulatory regions. Because gene regulation is highly context-specific, it remains challenging to fine-map and prioritize functional regulatory variants in a particular cell/tissue type and to apply them to the detection of disease-associated genes. By connecting large-scale epigenome profiles to expression quantitative trait loci (eQTLs) in a wide range of human tissues/cell types, we identify a combination of several critical chromatin features that predict variant regulatory potential. We develop a joint likelihood framework to measure the regulatory probability of genetic variants in a context-dependent manner. We show that our model is superior to existing cell type-specific methods and exhibits significant GWAS signal enrichment. Using phenotypically relevant epigenomes to weight GWAS SNPs, we discover more disease-associated genes owing to regulatory changes and improve the statistical power of gene-based association tests.
|
References
1. Mulin Jun Li, Miaoxin Li, Zipeng Liu, Yan Bin, Zhicheng Pan, Dingge Ying, Jean-Pierre A. Kocher, Zhengyuan Xia, Pak Chung Sham, Jun S. Liu, Junwen Wang. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes. Genome Biology (2017) 18:52.
|
Install and run
Java Runtime Environment (JRE) version 6.0 or above is required for cepip. It can be downloaded from the Java web site.
Installing the JRE is very easy in Windows OS and Mac OS X.
In Linux, you have more work to do. Details of the installation can be found at http://www.java.com/en/download/help/linux_install.xml.
In Ubuntu, if you see an error message like "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException ...", please install the Sun Java Runtime Environment (JRE) first.
To install the Sun JRE on Ubuntu (10.04), use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
A detailed explanation of the above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-lynx.html.
For Mac OS, JRE 1.6 has been available at http://developer.apple.com/java/download/ since April 2008. Mac OS users may need to update Java to run cepip. A potential problem is that this update does not replace the existing installation of J2SE 5.0 or change the default version of Java. As on Linux, the JAVA_HOME environment variable has to be configured before cepip can start.
|
20170318 update: We have integrated the Java version of cepip into the KGGSeq package. It now supports efficient whole-genome scoring (this requires manually downloading the dbNCFP whole-genome annotation file). Please refer to the manual for package configuration and running parameters (--regulatory-causing-predict and --cell). The Perl version of cepip is still supported for easy execution on small lists of SNVs (< 10K) and for researchers who do not want to download the large program-dependent annotation files (please refer to the Perl version manual).
20161213 update: We introduce a Perl version of cepip for remote random access to allele-specific composite score annotations and cell type-specific epigenomic signal annotations. Users can run cepip quickly without downloading the huge annotation file (Tabix is required). However, this Perl version can be slow, so please run it on small inputs. First-time users should download the Perl package from cepip_PL_v0.12.zip (Mac OS X / Linux).
20161024 update: We allow users to customize their epigenomic annotations. Please download cepip.jar and replace the previous jar file. First-time users should download the full package from cepip.zip. We also provide a demo of custom annotations, custom_annotations.zip; please refer to the instructions on this site for how to use custom annotations.
|
Simply decompress the archive and run the following command.
java -Xms256m -Xmx1300m -jar ./cepip.jar <arguments>
The arguments -Xms256m and -Xmx1300m set the initial and maximum Java heap sizes for cepip to 256 megabytes and 1.3 gigabytes, respectively. Specifying a larger maximum heap size can speed up the analysis. A higher setting such as -Xmx2g or even -Xmx5g is required when there is a large number of variants, say 5 million. The value, however, should be less than the physical memory of the machine.
Note: <arguments> can be saved in a flat text file.
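The heap-size guidance above can be wrapped in a small helper. The following Python sketch is illustrative only and is not part of cepip: the function name build_cepip_command and its defaults are assumptions, and only the java, -Xms, -Xmx and -jar parts come from the command shown above.

```python
import shlex


def build_cepip_command(args, jar="./cepip.jar", xms="256m", xmx="1300m"):
    """Return the java command line as a list of tokens (hypothetical helper)."""
    # -Xms sets the initial heap size and -Xmx the maximum heap size.
    return ["java", f"-Xms{xms}", f"-Xmx{xmx}", "-jar", jar, *args]


if __name__ == "__main__":
    # Raise the maximum heap for a large input, as suggested above.
    cmd = build_cepip_command(["--vcf-file", "examples/example.vcf"], xmx="2g")
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once Java and the cepip package are in place.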
|
Quick example
A variants file is needed (example.vcf). This example uses the VCF variant format without genotypes; other formats are also supported.
Note: All files are included in the examples folder of cepip.
Run the command below:
java -Xms256m -Xmx1300m -jar cepip.jar examples/params.example.txt
We now walk through the parameter file "params.composite.example.txt" before going into the results. Lines starting with a hash sign (#) are comments. A detailed interpretation of each argument in the parameter file is given in the 'Options' part.
#one argument per line
#I.Environmental setting
#--no-resource-check \
#II. Specify the input files
--vcf-file examples/example.vcf \
#III. Output setting
--out example_result \
#IV. Filtering and Prioritization
--db-score dbncfp \
--regulatory-causing-predict all \
--cell GM12878 \
Part (I): Specify general environmental setting
Arguments in this part set the general cepip running environment, including the resource path and program updates.
Part (II): Specify the input files
Arguments in this part specify the input files, which can be in various data formats. They are compulsory for running cepip.
Part (III): Output setting
Arguments in this part set the output file name and the output types, several of which can be produced simultaneously by cepip.
Part (IV): Prioritization
Arguments in this part apply a set of models (selected regulatory variant scores, cell type definition, etc.) to enable better prioritization of regulatory variants.
Notes: Most of the above arguments are optional, so users can mask some lines with # or delete them. In this way, users can get a systematic view of the impact of each level or even each step. The parameter file is also easier to produce with the cepip command generator we provide.
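A parameter file like the one walked through above can also be generated programmatically. The Python sketch below is a minimal illustration and not part of cepip: the helper names format_param_lines and write_param_file are assumptions, and the trailing backslash on each line simply mirrors the layout of the example file (whether cepip requires it is not stated here).

```python
from pathlib import Path


def format_param_lines(options):
    """Return parameter-file lines, one argument per line, each with a trailing backslash."""
    lines = ["#one argument per line"]
    for flag, value in options:
        # A value of None means the flag takes no value.
        text = flag if value is None else f"{flag} {value}"
        lines.append(text + " \\")
    return lines


def write_param_file(path, options):
    """Write the formatted lines to a flat text file."""
    Path(path).write_text("\n".join(format_param_lines(options)) + "\n")


# The same arguments as in the example parameter file above.
example_options = [
    ("--vcf-file", "examples/example.vcf"),
    ("--out", "example_result"),
    ("--db-score", "dbncfp"),
    ("--regulatory-causing-predict", "all"),
    ("--cell", "GM12878"),
]
```

Calling write_param_file("my_params.txt", example_options) would produce a file that can be passed to cepip in place of the individual arguments; the file name is only a placeholder.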
|
NOTE
Currently, cepip compiles an integrative database of eight recent algorithms for functional prediction of noncoding variants: CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. To provide a universal format across all collected algorithms, cepip currently supports only biallelic variants from the 1000 Genomes Project phase 1.
VCF format WITHOUT sample genotypes
cepip can also accept VCF data without genotype information, but fewer functions are available for this type of data.
java -jar cepip.jar --vcf-file path/to/file1
In addition, cepip supports other variant and/or genotype formats that are popular in next-generation sequencing studies.
ANNOVAR format
java -jar cepip.jar --annovar-file path/to/file
NOTE
cepip can recognize an extended ANNOVAR format in which a header row and multiple columns for comments are allowed.
Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0
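To make the layout of the extended ANNOVAR example concrete, here is a minimal Python sketch of a reader for it. It is not cepip's own parser: the function name parse_extended_annovar is hypothetical, and the sketch assumes whitespace-delimited columns, a single header row, and that everything after the fifth column is a comment.

```python
def parse_extended_annovar(lines):
    """Parse whitespace-delimited ANNOVAR rows that begin with a header row."""
    rows = iter(lines)
    header = next(rows).split()
    records = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # The first five columns are chr, start, end, ref and alt.
        record = dict(zip(header[:5], fields[:5]))
        # Everything after the fifth column is kept as free-form comments.
        record["comments"] = fields[5:]
        records.append(record)
    return records


example = [
    "chr startpos endpos ref alt comment1 comment2 comment3",
    "1 69428 69428 T G T 92 129",
    "1 69476 69476 T C T 1 0",
]
records = parse_extended_annovar(example)
```

Each record keeps the five core fields under the header names and the remaining columns as a list, so extra comment columns never shift the variant fields.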
|
cepip can flexibly output different formats of the prioritization and annotation results for either final validation or further analysis by third-party tools.
Output file path and file name prefix: --out
java -jar cepip.jar --vcf-file path/to/file1 --out path/to/prefixname
Specify the path and prefix name of the outputs. It is "./cepip" by default.
Text format: This is the default
java -jar cepip.jar --vcf-file path/to/file1
By default, cepip outputs results in TEXT format in a file named cepip.flt.txt, in which the fields (or columns) are delimited by tabs.
Produce SeattleSeq input: --o-seattleseq
java -jar cepip.jar --vcf-file path/to/file1 --o-seattleseq
Generate an extra copy of output for the prioritized variants in SeattleSeq input format, which can be further annotated by SeattleSeq.
Produce ANNOVAR input: --o-annovar
java -jar cepip.jar --vcf-file path/to/file1 --o-annovar
Generate an extra copy of output for the prioritized variants in ANNOVAR input format, which can be further annotated by ANNOVAR.
Produce VCF input: --o-vcf
java -jar cepip.jar --vcf-file path/to/file1 --o-vcf
Generate an extra copy of output for the prioritized variants in VCF format, which can be further analyzed by other tools.
|
One attractive feature of cepip is that it incorporates our previous composite model (Mulin Jun Li et al. 2016. Bioinformatics. PRVCS), which combines functional impact scores from multiple algorithms (e.g., CADD and GWAVA) to predict whether variants are regulatory or not. In addition, cepip summarizes a set of consolidated chromatin features, using cell type-specific chromatin marks and a uniformly processed expression quantitative trait loci (eQTL) dataset, and can measure the probability of regulatory causality for given variants in a selected condition.
Note: cepip also supports context-dependent prioritization using cell type-specific epigenomic signatures.
Predict regulatory variants : --db-score dbncfp --regulatory-causing-predict all --cell
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell GM12878
Given a cell type, cepip uses a logit model to measure the probability of regulatory causality for given variants in the selected condition. It then combines the above composite model and the context-dependent model into a unified model to estimate the posterior probability of regulatory potential.
cepip can support 16 ENCODE cell types as follows:
Coding | Tissue | Description
A549 | Epithelium | epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), "This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male." - ATCC, newly promoted to tier 2: not in 2011 analysis.
CD14 | Monocytes | Monocytes-CD14+ are CD14-positive cells from human leukapheresis production, from donor RO 01746 (draw 1 ID is RO 01746, draw 2 ID is RO 01826), newly promoted to tier 2: not in 2011 analysis.
GM12878 | Blood | B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasian, Epstein-Barr Virus.
H1-hESC | Embryonic stem cell | embryonic stem cells.
HMEC | Breast | mammary epithelial cells.
HSMM | Muscle | skeletal muscle myoblasts.
HSMMT | Muscle | HSMM cell derived skeletal muscle myotubes cell line.
HUVEC | Vessel | umbilical vein endothelial cells.
HeLa-S3 | Cervix | cervical carcinoma.
HepG2 | Liver | hepatocellular carcinoma.
IMR90 | Lung | fetal lung fibroblasts, newly promoted to tier 2: not in 2011 analysis.
K562 | Blood | leukemia, "The continuous cell line K-562 was established by Lozzio and Lozzio from the pleural effusion of a 53-year-old female with chronic myelogenous leukemia in terminal blast crises." - ATCC.
NH-A | Brain | astrocytes (also called Astrocy).
NHDF | Skin | dermal fibroblasts from temple / breast
NHEK | Skin | epidermal keratinocytes.
NHLF | Lung | lung fibroblasts.
Note: By default, cepip uses GM12878 for context-dependent prioritization.
cepip also supports 127 RoadMap human reference epigenomes:
Note: Epigenomes of some specific marks are missing in some tissues/cells, which may affect the prediction. We will use imputed epigenomes to handle this missing data problem soon!
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell E116
Coding | Lineage Group | Epigenome Mnemonic | Epigenome Name | Anatomy | Type
E001 | ESC | ESC.I3 | ES-I3 Cell Line | ESC | CellLine
E002 | ESC | ESC.WA7 | ES-WA7 Cell Line | ESC | CellLine
E003 | ESC | ESC.H1 | H1 Cell Line | ESC | CellLine
E004 | ES-deriv | ESDR.H1.BMP4.MESO | H1 BMP4 Derived Mesendoderm Cultured Cells | ESC_DERIVED | CellLineDerived
E005 | ES-deriv | ESDR.H1.BMP4.TROP | H1 BMP4 Derived Trophoblast Cultured Cells | ESC_DERIVED | CellLineDerived
E006 | ES-deriv | ESDR.H1.MSC | H1 Derived Mesenchymal Stem Cells | ESC_DERIVED | CellLineDerived
E007 | ES-deriv | ESDR.H1.NEUR.PROG | H1 Derived Neuronal Progenitor Cultured Cells | ESC_DERIVED | CellLineDerived
E008 | ESC | ESC.H9 | H9 Cell Line | ESC | CellLine
E009 | ES-deriv | ESDR.H9.NEUR.PROG | H9 Derived Neuronal Progenitor Cultured Cells | ESC_DERIVED | CellLineDerived
E010 | ES-deriv | ESDR.H9.NEUR | H9 Derived Neuron Cultured Cells | ESC_DERIVED | CellLineDerived
E011 | ES-deriv | ESDR.CD184.ENDO | hESC Derived CD184+ Endoderm Cultured Cells | ESC_DERIVED | CellLineDerived
E012 | ES-deriv | ESDR.CD56.ECTO | hESC Derived CD56+ Ectoderm Cultured Cells | ESC_DERIVED | CellLineDerived
E013 | ES-deriv | ESDR.CD56.MESO | hESC Derived CD56+ Mesoderm Cultured Cells | ESC_DERIVED | CellLineDerived
E014 | ESC | ESC.HUES48 | HUES48 Cell Line | ESC | CellLine
E015 | ESC | ESC.HUES6 | HUES6 Cell Line | ESC | CellLine
E016 | ESC | ESC.HUES64 | HUES64 Cell Line | ESC | CellLine
E017 | IMR90 | LNG.IMR90 | IMR90 fetal lung fibroblasts Cell Line | LUNG | CellLine
E018 | iPSC | IPSC.15b | iPS-15b Cell Line | IPSC | CellLine
E019 | iPSC | IPSC.18 | iPS-18 Cell Line | IPSC | CellLine
E020 | iPSC | IPSC.20B | iPS-20b Cell Line | IPSC | CellLine
E021 | iPSC | IPSC.DF.6.9 | iPS DF 6.9 Cell Line | IPSC | CellLine
E022 | iPSC | IPSC.DF.19.11 | iPS DF 19.11 Cell Line | IPSC | CellLine
E023 | Mesench | FAT.MSC.DR.ADIP | Mesenchymal Stem Cell Derived Adipocyte Cultured Cells | FAT | CellLineDerived
E024 | ESC | ESC.4STAR | Cell Line | ESC | CellLine
E025 | Mesench | FAT.ADIP.DR.MSC | Adipose Derived Mesenchymal Stem Cell Cultured Cells | FAT | PrimaryCell
E026 | Mesench | STRM.MRW.MSC | Bone Marrow Derived Cultured Mesenchymal Stem Cells | NECTIVE | PrimaryCell
E027 | Epithelial | BRST.MYO | Breast Myoepithelial Primary Cells | BREAST | PrimaryCell
E028 | Epithelial | BRST.HMEC.35 | Breast variant Human Mammary Epithelial Cells (vHMEC) | BREAST | PrimaryCell
E029 | HSC & B-cell | BLD.CD14.PC | Primary monocytes from peripheral blood | BLOOD | PrimaryCell
E030 | HSC & B-cell | BLD.CD15.PC | Primary neutrophils from peripheral blood | BLOOD | PrimaryCell
E031 | HSC & B-cell | BLD.CD19.CPC | Primary B cells from cord blood | BLOOD | PrimaryCell
E032 | HSC & B-cell | BLD.CD19.PPC | Primary B cells from peripheral blood | BLOOD | PrimaryCell
E033 | Blood & T-cell | BLD.CD3.CPC | Primary T cells from cord blood | BLOOD | PrimaryCell
E034 | Blood & T-cell | BLD.CD3.PPC | Primary T cells from peripheral blood | BLOOD | PrimaryCell
E035 | HSC & B-cell | BLD.CD34.PC | Primary hematopoietic stem cells | BLOOD | PrimaryCell
E036 | HSC & B-cell | BLD.CD34.CC | Primary hematopoietic stem cells short term culture | BLOOD | PrimaryCell
E037 | Blood & T-cell | BLD.CD4.MPC | Primary T helper memory cells from peripheral blood 2 | BLOOD | PrimaryCell
E038 | Blood & T-cell | BLD.CD4.NPC | Primary T helper naive cells from peripheral blood | BLOOD | PrimaryCell
E039 | Blood & T-cell | BLD.CD4.CD25M.CD45RA.NPC | Primary T helper naive cells from peripheral blood | BLOOD | PrimaryCell
E040 | Blood & T-cell | BLD.CD4.CD25M.CD45RO.MPC | Primary T helper memory cells from peripheral blood 1 | BLOOD | PrimaryCell
E041 | Blood & T-cell | BLD.CD4.CD25M.IL17M.PL.TPC | Primary T helper cells PMA-I stimulated | BLOOD | PrimaryCell
E042 | Blood & T-cell | BLD.CD4.CD25M.IL17P.PL.TPC | Primary T helper 17 cells PMA-I stimulated | BLOOD | PrimaryCell
E043 | Blood & T-cell | BLD.CD4.CD25M.TPC | Primary T helper cells from peripheral blood | BLOOD | PrimaryCell
E044 | Blood & T-cell | BLD.CD4.CD25.CD127M.TREGPC | Primary T regulatory cells from peripheral blood | BLOOD | PrimaryCell
E045 | Blood & T-cell | BLD.CD4.CD25I.CD127.TMEMPC | Primary T cells effector/memory enriched from peripheral blood | BLOOD | PrimaryCell
E046 | HSC & B-cell | BLD.CD56.PC | Primary Natural Killer cells from peripheral blood | BLOOD | PrimaryCell
E047 | Blood & T-cell | BLD.CD8.NPC | Primary T killer naive cells from peripheral blood | BLOOD | PrimaryCell
E048 | Blood & T-cell | BLD.CD8.MPC | Primary T killer memory cells from peripheral blood | BLOOD | PrimaryCell
E049 | Mesench | STRM.CHON.MRW.DR.MSC | Mesenchymal Stem Cell Derived Chondrocyte Cultured Cells | NECTIVE | PrimaryCell
E050 | HSC & B-cell | BLD.MOB.CD34.PC.F | Primary hematopoietic stem cells G-CSF-mobilized Female | BLOOD | PrimaryCell
E051 | HSC & B-cell | BLD.MOB.CD34.PC.M | Primary hematopoietic stem cells G-CSF-mobilized Male | BLOOD | PrimaryCell
E052 | Myosat | MUS.SAT | Muscle Satellite Cultured Cells | MUSCLE | PrimaryCell
E053 | Neurosph | BRN.CRTX.DR.NRSPHR | Cortex derived primary cultured neurospheres | BRAIN | PrimaryCell
E054 | Neurosph | BRN.GANGEM.DR.NRSPHR | Ganglion Eminence derived primary cultured neurospheres | BRAIN | PrimaryCell
E055 | Epithelial | SKIN.PEN.FRSK.FIB.01 | Foreskin Fibroblast Primary Cells skin01 | SKIN | PrimaryCell
E056 | Epithelial | SKIN.PEN.FRSK.FIB.02 | Foreskin Fibroblast Primary Cells skin02 | SKIN | PrimaryCell
E057 | Epithelial | SKIN.PEN.FRSK.KER.02 | Foreskin Keratinocyte Primary Cells skin02 | SKIN | PrimaryCell
E058 | Epithelial | SKIN.PEN.FRSK.KER.03 | Foreskin Keratinocyte Primary Cells skin03 | SKIN | PrimaryCell
E059 | Epithelial | SKIN.PEN.FRSK.MEL.01 | Foreskin Melanocyte Primary Cells skin01 | SKIN | PrimaryCell
E061 | Epithelial | SKIN.PEN.FRSK.MEL.03 | Foreskin Melanocyte Primary Cells skin03 | SKIN | PrimaryCell
E062 | Blood & T-cell | BLD.PER.MONUC.PC | Primary mononuclear cells from peripheral blood | BLOOD | PrimaryCell
E063 | Adipose | FAT.ADIP.NUC | Adipose Nuclei | FAT | PrimaryTissue
E065 | Heart | VAS.AOR | Aorta | VASCULAR | PrimaryTissue
E066 | Other | LIV.ADLT | Liver | LIVER | PrimaryTissue
E067 | Brain | BRN.ANG.GYR | Brain Angular Gyrus | BRAIN | PrimaryTissue
E068 | Brain | BRN.ANT.CAUD | Brain Anterior Caudate | BRAIN | PrimaryTissue
E069 | Brain | BRN.CING.GYR | Brain Cingulate Gyrus | BRAIN | PrimaryTissue
E070 | Brain | BRN.GRM.MTRX | Brain Germinal Matrix | BRAIN | PrimaryTissue
E071 | Brain | BRN.HIPP.MID | Brain Hippocampus Middle | BRAIN | PrimaryTissue
E072 | Brain | BRN.INF.TMP | Brain Inferior Temporal Lobe | BRAIN | PrimaryTissue
E073 | Brain | BRN.DL.PRFRNTL.CRTX | Brain Dorsolateral Prefrontal Cortex | BRAIN | PrimaryTissue
E074 | Brain | BRN.SUB.NIG | Brain Substantia Nigra | BRAIN | PrimaryTissue
E075 | Digestive | GI.CLN.MUC | Colonic Mucosa | GI_COLON | PrimaryTissue
E076 | Sm. Muscle | GI.CLN.SM.MUS | Colon Smooth Muscle | GI_COLON | PrimaryTissue
E077 | Digestive | GI.DUO.MUC | Duodenum Mucosa | GI_DUODENUM | PrimaryTissue
E078 | Sm. Muscle | GI.DUO.SM.MUS | Duodenum Smooth Muscle | GI_DUODENUM | PrimaryTissue
E079 | Digestive | GI.ESO | Esophagus | S | PrimaryTissue
E080 | Other | ADRL.GLND.FET | Fetal Adrenal Gland | ADRENAL | PrimaryTissue
E081 | Brain | BRN.FET.M | Fetal Brain Male | BRAIN | PrimaryTissue
E082 | Brain | BRN.FET.F | Fetal Brain Female | BRAIN | PrimaryTissue
E083 | Heart | HRT.FET | Fetal Heart | HEART | PrimaryTissue
E084 | Digestive | GI.L.INT.FET | Fetal Intestine Large | GI_INTESTINE | PrimaryTissue
E085 | Digestive | GI.S.INT.FET | Fetal Intestine Small | GI_INTESTINE | PrimaryTissue
E086 | Other | KID.FET | Fetal Kidney | KIDNEY | PrimaryTissue
E087 | Other | PANC.ISLT | Pancreatic Islets | PANCREAS | PrimaryTissue
E088 | Other | LNG.FET | Fetal Lung | LUNG | PrimaryTissue
E089 | Muscle | MUS.TRNK.FET | Fetal Muscle Trunk | MUSCLE | PrimaryTissue
E090 | Muscle | MUS.LEG.FET | Fetal Muscle Leg | MUSCLE_LEG | PrimaryTissue
E091 | Other | PLCNT.FET | Placenta | PLACENTA | PrimaryTissue
E092 | Digestive | GI.STMC.FET | Fetal Stomach | GI_STOMACH | PrimaryTissue
E093 | Thymus | THYM.FET | Fetal Thymus | THYMUS | PrimaryTissue
E094 | Digestive | GI.STMC.GAST | Gastric | GI_STOMACH | PrimaryTissue
E095 | Heart | HRT.VENT.L | Left Ventricle | HEART | PrimaryTissue
E096 | Other | LNG | Lung | LUNG | PrimaryTissue
E097 | Other | OVRY | Ovary | OVARY | PrimaryTissue
E098 | Other | PANC | Pancreas | PANCREAS | PrimaryTissue
E099 | Other | PLCNT.AMN | Placenta Amnion | PLACENTA | PrimaryTissue
E100 | Muscle | MUS.PSOAS | Psoas Muscle | MUSCLE | PrimaryTissue
E101 | Digestive | GI.RECT.MUC.29 | Rectal Mucosa Donor 29 | GI_RECTUM | PrimaryTissue
E102 | Digestive | GI.RECT.MUC.31 | Rectal Mucosa Donor 31 | GI_RECTUM | PrimaryTissue
E103 | Sm. Muscle | GI.RECT.SM.MUS | Rectal Smooth Muscle | GI_RECTUM | PrimaryTissue
E104 | Heart | HRT.ATR.R | Right Atrium | HEART | PrimaryTissue
E105 | Heart | HRT.VNT.R | Right Ventricle | HEART | PrimaryTissue
E106 | Digestive | GI.CLN.SIG | Sigmoid Colon | GI_COLON | PrimaryTissue
E107 | Muscle | MUS.SKLT.M | Skeletal Muscle Male | MUSCLE | PrimaryTissue
E108 | Muscle | MUS.SKLT.F | Skeletal Muscle Female | MUSCLE | PrimaryTissue
E109 | Digestive | GI.S.INT | Small Intestine | GI_INTESTINE | PrimaryTissue
E110 | Digestive | GI.STMC.MUC | Stomach Mucosa | GI_STOMACH | PrimaryTissue
E111 | Sm. Muscle | GI.STMC.MUS | Stomach Smooth Muscle | GI_STOMACH | PrimaryTissue
E112 | Thymus | THYM | Thymus | THYMUS | PrimaryTissue
E113 | Other | SPLN | Spleen | SPLEEN | PrimaryTissue
E114 | ENCODE | LNG.A549.ETOH002.CNCR | A549 EtOH 0.02pct Lung Carcinoma Cell Line | LUNG | CellLine_Cancer
E115 | ENCODE | BLD.DND41.CNCR | Dnd41 TCell Leukemia Cell Line | BLOOD | CellLine_Cancer
E116 | ENCODE | BLD.GM12878 | GM12878 Lymphoblastoid Cell Line | BLOOD | CellLine
E117 | ENCODE | CRVX.HELAS3.CNCR | HeLa-S3 Cervical Carcinoma Cell Line | CERVIX | CellLine_Cancer
E118 | ENCODE | LIV.HEPG2.CNCR | HepG2 Hepatocellular Carcinoma Cell Line | LIVER | CellLine_Cancer
E119 | ENCODE | BRST.HMEC | HMEC Mammary Epithelial Primary Cells | BREAST | CellLine
E120 | ENCODE | MUS.HSMM | HSMM Skeletal Muscle Myoblasts Cell Line | MUSCLE | CellLine
E121 | ENCODE | MUS.HSMMT | HSMM cell derived Skeletal Muscle Myotubes Cell Line | MUSCLE | CellLine
E122 | ENCODE | VAS.HUVEC | HUVEC Umbilical Vein Endothelial Cells Cell Line | VASCULAR | CellLine
E123 | ENCODE | BLD.K562.CNCR | K562 Leukemia Cell Line | BLOOD | CellLine
E124 | ENCODE | BLD.CD14.MONO | Monocytes-CD14+ RO01746 Cell Line | BLOOD | CellLine
E125 | ENCODE | BRN.NHA | NH-A Astrocytes Cell Line | BRAIN | CellLine
E126 | ENCODE | SKIN.NHDFAD | NHDF-Ad Adult Dermal Fibroblast Primary Cells | SKIN | CellLine
E127 | ENCODE | SKIN.NHEK | NHEK-Epidermal Keratinocyte Primary Cells | SKIN | CellLine
E128 | ENCODE | LNG.NHLF | NHLF Lung Fibroblast Primary Cells | LUNG | CellLine
E129 | ENCODE | BONE.OSTEO | Osteoblast Primary Cells | BONE | CellLine
The option appends the cell type-specific regulatory potential (Cell_P) and the combined probability (Combine_P) to the output file.
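Once Combine_P is present in the tab-delimited output, variants can be ranked by it. The Python sketch below is only an illustration: the function name rank_by_column is hypothetical, the demo rows and the Chromosome and Position column names are made up, and only the Cell_P and Combine_P names come from the description above. It also assumes that missing values appear as an empty field or ".".

```python
import csv
import io


def rank_by_column(handle, column="Combine_P"):
    """Return rows of a tab-delimited table sorted by a numeric column, highest first."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Skip rows where the score is absent (assumed to be empty or ".").
    rows = [row for row in reader if row.get(column) not in (None, "", ".")]
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)


# Made-up rows for illustration only; real output has more columns.
demo = io.StringIO(
    "Chromosome\tPosition\tCell_P\tCombine_P\n"
    "1\t69428\t0.20\t0.35\n"
    "1\t69476\t0.80\t0.91\n"
)
ranked = rank_by_column(demo)
```

For a real run, the StringIO object would be replaced by an open handle on the cepip output file, after checking the actual header line.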
Note: cepip allows users to adjust which prediction scores enter the composite model, including CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. Together these are referred to here as the "dbncfp" database.
Predict regulatory variants: --db-score dbncfp --regulatory-causing-predict
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict 1,3,4,5,6,8,10,11
Use up to 8 existing algorithms, covering 11 available functional impact scores (listed below), to re-predict whether a single nucleotide variant (SNV) or indel is potentially regulatory. By default, cepip uses all algorithms (8 scores) for a combinatorial prediction. However, an iterative search shows that the best combination of 4 scores (CADD_cscore, GWAS3D, SuRFR, GWAVA_TSS) achieves top performance.
Figure: Receiver operating characteristic (ROC) curves and area under the curve (AUC) of individual scores and of the combined score from the composite model
Note: Figures from our paper.
Alternatively, one can fix the prediction to a specified subset or the full set of the 4 impact scores with an option such as --regulatory-causing-predict 1,6,8,10.
The coding of the functional impact scores used in the '--regulatory-causing-predict' option is listed below:
Coding | Method | Description
1 | CADD_CScore | "Raw" CADD scores come straight from the CADD model, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.
2 | CADD_PHRED | Since the CScores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a "normalized" and now externally comparable unit of analysis. CADD scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then "PHRED-scaled" those values by expressing the rank in order of magnitude terms rather than the precise rank itself.
3 | DANN | DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
4 | FunSeq | FunSeq filters mutations overlapping 1000 Genomes variants and then prioritizes those in regions under strong selection (sensitive and ultrasensitive), breaking TF motifs, and those associated with hubs. It can score the deleterious potential of variants in single or multiple genomes. The scores for each noncoding variant vary from 0 to 6, with 6 corresponding to maximum deleterious effect. When multiple tumor genomes are given as input, FunSeq also identifies recurrent mutations in the same element.
5 | FunSeq2 | FunSeq2 was originally designed to annotate and prioritize somatic alterations by integrating various resources from genomic and cancer studies. The framework consists of two components: (1) data context from uniformly processing large-scale datasets; and (2) a high-throughput variant prioritization pipeline. FunSeq2 can also be used to prioritize noncoding genetic variants.
6 | GWAS3D | GWAS3D systematically assesses the genetic variants that could affect regulatory elements, by integrating annotations from cell type-specific chromatin states, epigenetic modifications, sequence motifs and cross-species conservation. It combines the original GWAS signal, risk haplotype, binding affinity significance and conservation information to prioritize the leading variants, and infer the putative causal variant in LD with the leading variant.
7 | GWAVA_Region | GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set was composed of all 1KG variants in the 1 kb surrounding each of the HGMD variants.
8 | GWAVA_TSS | GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set was matched for distance to the nearest TSS genome-wide.
9 | GWAVA_Unmatched | GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. This control set was constructed from a random selection of SNVs from across the genome in order to sample overall background.
10 | SuRFR | SuRFR integrates functional annotation and prior biological knowledge to prioritise candidate functional variants by regression model. It introduces novel training and validation datasets that i) capture the regional heterogeneity of genomic annotation better than previously applied approaches, and ii) facilitate understanding of which annotations are most important for discriminating different classes of functionally relevant variants from background variants.
11 | FATHMM-MKL | FATHMM-MKL uses an MKL classifier to predict the functional consequences of both coding and non-coding sequence variants from various genomic annotations and weights the significance of each component annotation source.
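The coding table can be turned into a lookup for checking a --regulatory-causing-predict value before launching a run. This is a minimal Python sketch rather than anything shipped with cepip: the names SCORE_CODES and decode_score_selection are assumptions, and the sketch handles only comma-separated numeric codes (not the keyword "all").

```python
# Code-to-method mapping copied from the table above.
SCORE_CODES = {
    1: "CADD_CScore",
    2: "CADD_PHRED",
    3: "DANN",
    4: "FunSeq",
    5: "FunSeq2",
    6: "GWAS3D",
    7: "GWAVA_Region",
    8: "GWAVA_TSS",
    9: "GWAVA_Unmatched",
    10: "SuRFR",
    11: "FATHMM-MKL",
}


def decode_score_selection(selection):
    """Translate a string such as '1,6,8,10' into the list of method names."""
    names = []
    for token in selection.split(","):
        code = int(token.strip())
        if code not in SCORE_CODES:
            raise ValueError(f"unknown score code: {code}")
        names.append(SCORE_CODES[code])
    return names
```

For example, decode_score_selection("1,6,8,10") names the four-score combination mentioned above, which makes a mistyped code easy to spot before a long run.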
Note: For variants with a missing score for a specified method, the prediction uses the population mean of that method.
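The idea of substituting a mean for a missing score can be sketched as follows. This Python example is an assumption-laden illustration, not cepip's code: the function name fill_missing_with_mean is hypothetical, and it uses the mean of the scores observed in the input as a stand-in, whereas cepip uses a population mean for each method whose exact source is not described here.

```python
from statistics import mean


def fill_missing_with_mean(scores):
    """Replace None entries with the mean of the observed values."""
    observed = [s for s in scores if s is not None]
    if not observed:
        raise ValueError("no observed scores to average")
    fill = mean(observed)
    return [fill if s is None else s for s in scores]


# One method's scores for three variants; the second variant has no score.
filled = fill_missing_with_mean([1.0, None, 3.0])
```

Mean substitution keeps every variant scorable by all selected methods, at the cost of pulling variants with missing scores toward an average prediction.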
The option appends the corresponding scores of the selected prediction methods to the output file, as well as the Bayes factor (BF) and the composite probability (Composite_P).
|
cepip also supports user-defined custom annotations. We suggest that users prepare all chromatin features used in our prediction model: DNase, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me3, H3K79me2 and H3K27me3. If a mark cannot be provided, cepip still requires an empty file with the fixed name. To use custom annotations, please follow these instructions:
1. The annotation should be in sorted ENCODE narrowPeak format (https://genome.ucsc.edu/FAQ/FAQformat#format12);
2. The annotation file name should start with the tissue/cell type name (e.g. cellA) followed by "-[MarkName].narrowPeak.sorted", such as "cellA-DNase.narrowPeak.sorted";
3. Use gzip to compress the above narrowPeak file. The final annotation file for a given mark in a specific cell is "cellA-DNase.narrowPeak.sorted.gz";
4. All required mark gz files should be prepared for each custom tissue/cell type, including DNase, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me3, H3K79me2 and H3K27me3, even if some marks are not currently available (compress an empty narrowPeak file for unavailable marks);
5. Put all mark gz files into the cepip annotation resource folder, which is located at "[cepip path]/resources/hg19/all_cell_signal/";
6. We currently support only hg19.
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell customCellName
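The naming rules in the steps above can be checked with a short script. The Python sketch below is illustrative and not part of cepip: the names MARKS, expected_annotation_files and create_missing_placeholders are assumptions, and the placeholder step simply writes an empty gzip file for every mark that has no file yet, following the instruction to compress an empty narrowPeak file for unavailable marks.

```python
import gzip
from pathlib import Path

# The eight marks listed in the instructions above.
MARKS = ["DNase", "H3K4me1", "H3K4me2", "H3K4me3",
         "H3K36me3", "H3K9me3", "H3K79me2", "H3K27me3"]


def expected_annotation_files(cell):
    """Return the file names cepip expects for one custom cell type."""
    return [f"{cell}-{mark}.narrowPeak.sorted.gz" for mark in MARKS]


def create_missing_placeholders(folder, cell):
    """Create an empty gzip file for every mark that has no file yet."""
    folder = Path(folder)
    created = []
    for name in expected_annotation_files(cell):
        target = folder / name
        if not target.exists():
            with gzip.open(target, "wb"):
                pass  # closing the handle writes a valid, empty gzip stream
            created.append(name)
    return created
```

Running create_missing_placeholders on the all_cell_signal folder after copying the real mark files would leave a complete set of eight files for the custom cell type.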
|
Command line:
perl cepip_PL.pl -i examples/example.vcf -f vcf -t /usr/bin/tabix -r ftp://147.8.193.36/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz -s 1,3,4,5,6,8,10,11 -a ftp://147.8.193.36/cepip/cell_signal/ -c HepG2 -o cepip.out
By default, cepip uses Tabix for random access to the remote compiled dataset. You have to install Tabix and pass the program path to the script with "-t".
Options:
cepip_PL.pl -- Perl version of cepip for predicting cell type-specific regulatory variants
-h output help information to screen
-i the input variants file
-f the format of input file, supporting VCF and ANNOVAR format; default: vcf
-t the path of executable tabix program; default: /user/bin/tabix
-r the path of dbNCFP reference database; default: ftp://147.8.193.36/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz
-s the selected tool scores; default: 1,3,4,5,6,8,10,11
1 CADD_cscore
2 CADD_PHRED
3 DANN_score
4 FunSeq_score
5 FunSeq2_score
6 GWAS3D_score
7 GWAVA_region_score
8 GWAVA_TSS_score
9 GWAVA_unmatched_score
10 SuRFR_score
11 Fathmm_MKL_score
-p the causal distribution folder; default: resources/all_causal_distribution
-n the neutral distribution folder; default: resources/all_neutral_distribution
-a the path of tissue/cell type reference epigenome; default: ftp://147.8.193.36/cepip/cell_signal/
-c the dependent tissue/cell type; default: E116
-o the path of output file; default: cepip1.flt.txt
|
1) Why does cepip not read my VCF file?
If you use standard VCF output from the GATK pipeline, it usually contains variants on mitochondrial DNA. However, mitochondrial DNA is not annotated by the gene feature database. Therefore, cepip currently accepts only VCF files that exclude variants on ChrM.
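Mitochondrial records can be removed before running cepip. The following Python sketch shows one way to do it and is not part of cepip: the function name drop_mitochondrial and the set of chromosome spellings in MITO_NAMES are assumptions, and header lines (including any contig lines that mention the mitochondrial genome) are passed through unchanged.

```python
# Common spellings of the mitochondrial chromosome (an assumed list).
MITO_NAMES = {"chrM", "M", "MT", "chrMT"}


def drop_mitochondrial(vcf_lines):
    """Yield VCF lines, skipping records whose CHROM is mitochondrial."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep header lines untouched
            continue
        # CHROM is the first tab-separated field of a VCF record.
        chrom = line.split("\t", 1)[0]
        if chrom not in MITO_NAMES:
            yield line


demo_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t69428\t.\tT\tG\t.\t.\t.\n",
    "chrM\t150\t.\tC\tT\t.\t.\t.\n",
]
kept = list(drop_mitochondrial(demo_vcf))
```

In practice the input would be an open file handle, and the kept lines would be written to a new VCF that is then passed to --vcf-file.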
2) Does cepip support rare variants or somatic variants?
Our composite model supports only genetic variants from the 1000 Genomes Project phase 1, since we currently prefer to make all eight collected methods work well. We will support all possible SNVs in the human genome very soon. The context-dependent model can be applied to any variant in the human genome.
3) Can I use ANNOVAR and cepip together on my dataset?
cepip is quite flexible in interacting with other sequence-oriented analytical programs (including SAMtools, ANNOVAR, etc.). It can accept various input formats and output different kinds of sequence data. In the case of ANNOVAR, cepip can read ANNOVAR-formatted sequence variants and write the final remaining variants in ANNOVAR format.
4) Can I run cepip on my laptop? Is it time consuming to run a complete cepip process?
Normally, cepip runs well and fast with >=1 GB of RAM, so current laptops are certainly capable of running cepip. The whole process takes less than 10 minutes, excluding the initial download time.
5) How do I report errors or bugs in cepip?
You are welcome to write an email to mulin0424.li@gmail.com or limx54@yahoo.com.
|