VarNote Usage Documentation
Quick Tutorial
System Requirements
Java Runtime Environment (JRE) version 8.0 or above is required for VarNote. To check your Java version, open your terminal application and run the following command:
java -version
You are expected to see java version "1.8.x" or above. If not, you may need to update your version; see the Oracle Java website to download and install the latest JDK.
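As a small sketch of the check above, the shell function below extracts the major Java version from a version string such as the one printed by java -version. The function name java_major and the sample version strings are illustrative assumptions; the only rule encoded is the legacy "1.x" numbering, in which "1.8" means Java 8.

```shell
# Hypothetical helper (not part of VarNote): print the major Java version
# for a version string. Legacy strings look like "1.8.0_292" (major = 8);
# newer ones look like "11.0.2" (major = 11).
java_major() {
  v=${1%%[-+_]*}            # drop suffixes such as "_292" or "-ea"
  first=${v%%.*}            # text before the first dot
  if [ "$first" = "1" ]; then
    rest=${v#*.}            # e.g. "8.0" for "1.8.0"
    printf '%s\n' "${rest%%.*}"
  else
    printf '%s\n' "$first"
  fi
}
```

A result of 8 or higher from java_major meets the requirement stated above. The version string itself is the quoted value that java -version writes to standard error.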
Installation and Test Data
The VarNote command-line tools are provided as an executable Java program. You can download the latest release of the VarNote jar file from here and sample data for testing from here. The VarNote source code is available at https://github.com/mulinlab/VarNote.
NEW: Test data and demo scripts for the VarNote Prioritization and Toolkits functions can be downloaded from here.
Available Tools
USAGE: java -jar /path/to/VarNote.jar <program name> [-h]
Program Summary Table:
--------------------------------------------------------------------------------------
VarNote Index: Tools that generates index for the compressed database file.
Index To generate VarNote index (".vanno" and ".vanno.vi") for compressed (block gzip) annotation database file.
IndexInfo To query related information (such as header, format, meta information or sequence name) stored in the VarNote index file of each annotation database.
--------------------------------------------------------------------------------------
VarNote Query: Tools that quickly retrieve data lines from database(s).
Count To quickly count intersected records in annotation database(s).
RandomAccess To quickly retrieve (by independent random access) intersected records from indexed annotation database(s) given a genomic region like "chrN:beginPos-endPos".
Intersect To quickly retrieve (by random-sweep algorithm) intersected records from indexed annotation database(s) given query intervals/variants.
IntersectConfig Run VarNote Intersect program with a config file.
--------------------------------------------------------------------------------------
VarNote Annotation: Tools that identifies desired annotation fields from database(s).
Annotation To quickly extract (by random-sweep algorithm) desired annotation fields from indexed annotation database(s) given query intervals/variants.
AnnotationIntersectFile To quickly extract desired annotation fields from an existing VarNote intersection file.
AnnotationConfig Run VarNote Annotation program with a config file.
--------------------------------------------------------------------------------------
NEW VarNote Prioritization: To help researchers run genome-scale regulatory variant annotation and prioritization locally, and compare results under different parameters, several local pipelines are implemented that mirror the online version.
VarNote-REG To prioritize causal regulatory variants in the LD of each GWAS signal and provide combined ranking scores based on multiple tissue/cell type-specific prediction methods.
VarNote-PAT To prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
VarNote-CAN To prioritize likely cancer driver regulatory mutation given personal cancer genome profile.
--------------------------------------------------------------------------------------
NEW VarNote Toolkits: Several useful tools that ease the analysis of genetic sequencing data.
rsIDConversion To efficiently interconvert dbSNP ID/genomic position.
IndexRefGenotype To index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient linkage disequilibrium (LD) calculation.
LDPair To efficiently calculate LD for a single pair of variants.
LDVariant To efficiently calculate all LD variants for a variant.
LDBatch To efficiently calculate all LD variants for a list of variants.
Quick Start
1. Index Database
The first step after downloading VarNote is to index the annotation database files in the test data. The VarNote Index program will generate the index files (".vanno" and ".vanno.vi").
Note
- The annotation database file must be a TAB-delimited genome position file compressed by the bgzip program (http://www.htslib.org/doc/bgzip.html), with a name ending in .gz or .bgz.
- The annotation database file must be position-sorted (first by sequence name and then by leftmost coordinate).
- The index files should be used together with the original database file.
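Because indexing requires a position-sorted file, it can help to verify the order before compressing. The sketch below is one possible check, not part of VarNote: the function name is_position_sorted and the default column numbers (1 for sequence name, 2 for leftmost coordinate) are assumptions about your file layout. It accepts any file in which each sequence name forms one contiguous block and coordinates never decrease within a block.

```shell
# Hypothetical helper (not part of VarNote): exit 0 if a TAB-delimited file
# is position-sorted, non-zero otherwise.
# usage: is_position_sorted FILE [NAME_COLUMN] [BEGIN_COLUMN]
is_position_sorted() {
  awk -F '\t' -v c="${2:-1}" -v b="${3:-2}" '
    /^#/ { next }                          # skip header/comment lines
    {
      name = $c ""                         # force string comparison
      pos = $b + 0                         # force numeric comparison
      if (name != chrom) {
        if (name in seen) { bad = 1; exit }  # sequence name reappeared
        seen[name] = 1
        chrom = name
      } else if (pos < prev) {
        bad = 1; exit                      # coordinate went backwards
      }
      prev = pos
    }
    END { exit bad }
  ' "$1"
}
```

For a file that is already compressed, the same check can read from standard input, for example gzip -dc roadmap.sort.bed.gz | is_position_sorted - (bgzip output is gzip-compatible).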
Info
- VarNote also provides random-sweep searching on a Tabix index (for users who don't want to reindex a large database).
- This comes with a performance loss: using a Tabix index is about 10 times slower than using a VarNote index. We strongly suggest indexing the database with the VarNote Index program before querying.
Command line usage example:
# Move VarNote-XXX.jar into the test data folder and rename it to VarNote.jar.
mv VarNote-XXX.jar /path/to/test_data/VarNote.jar
cd /path/to/test_data
# List all programs of VarNote.
java -jar VarNote.jar
# Displays options specific to Index.
java -jar VarNote.jar Index
# Sort the annotation database file and compress it with the htslib bgzip program before indexing.
sort -k1,1 -k2,2n roadmap.bed > roadmap.sort.bed
/path/to/htslib-1.X/bgzip roadmap.sort.bed
# index VCF file format
java -jar VarNote.jar Index -I 1000G_p3.sort.vcf.gz
java -jar VarNote.jar Index -I cosmic.sort.vcf.gz
# index BED-like file format
java -jar VarNote.jar Index -I roadmap.sort.bed.gz
# index TAB file format
java -jar VarNote.jar Index -I:coordAllele dbscSNV.sort.tab.gz
java -jar VarNote.jar Index -I:tab,c=1,b=2,e=2,ref=3,alt=4 dbscSNV.sort.tab.gz
2. Fast counting of intersection
To quickly count data lines from the indexed databases that intersect with genomic features defined in the query file.
java -jar VarNote.jar Count -Q q2.sort.bed \
-D:db,tag=1000g 1000G_p3.sort.vcf.gz \
-D:db,tag=roadmap roadmap.sort.bed.gz
Result file: q2.sort.bed.count.gz
Info
- The -D option can be used multiple times to specify multiple annotation databases.
- A command written across multiple lines can also be written on a single line.
Note
- Count is 2+ times faster than other programs.
- The Count program only supports the VarNote index.
3. Data retrieval with single region
To quickly retrieve data lines from the indexed databases that intersect with the specified genomic region like "chr:beginPos-endPos"
#Query a genomic locus
java -jar VarNote.jar RandomAccess -Q 1:2298288-2298289 -D 1000G_p3.sort.vcf.gz
#Query a genomic region
java -jar VarNote.jar RandomAccess -Q 1:959100-959200 \
-D:db,tag=1000g 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,index=TBI cosmic.sort.vcf.gz
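To make the "chr:beginPos-endPos" region syntax concrete, here is a minimal sketch of a validator that splits a region string into its three parts. It is not VarNote's own parser: the function name parse_region and the rules it enforces (digits-only coordinates, begin not greater than end) are assumptions chosen for illustration.

```shell
# Hypothetical helper (not part of VarNote): split "chr:begin-end" into
# TAB-separated sequence name, begin and end. Returns non-zero if malformed.
parse_region() {
  region=$1
  chrom=${region%%:*}       # text before the first ":"
  range=${region#*:}        # text after the first ":"
  begin=${range%%-*}        # text before the first "-"
  end=${range#*-}           # text after the first "-"
  case "$region" in *:*-*) ;; *) return 1 ;; esac
  case "$begin$end" in ''|*[!0-9]*) return 1 ;; esac
  [ -n "$chrom" ] && [ -n "$begin" ] && [ -n "$end" ] || return 1
  [ "$begin" -le "$end" ] || return 1
  printf '%s\t%s\t%s\n' "$chrom" "$begin" "$end"
}
```

Running a check like this on a region before passing it to -Q can catch typos early, such as swapped coordinates.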
4. Intersection
To quickly retrieve data lines from the indexed databases that intersect with genomic features defined in the query file.
#Multiple databases using exact mode and different Index (1000g using VarNote and cosmic using TBI).
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz
#Multiple databases using exact mode within left join
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
-loj true
#Intersect mode (using 4 threads to run, default is 1)
java -jar VarNote.jar Intersect -Q q2.sort.bed --maxVariantLength 50000 \
-D roadmap.sort.bed.gz \
-T 4
#Multiple databases using different mode (roadmap using intersect mode and 1000g using exact mode)
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=roadmap,mode=0 roadmap.sort.bed.gz \
-T 4 \
-O /path/to/test_data/q1.twomode.overlap.gz
#Intersection using remote database with VarNote index. Querying remote database is relatively slow, please be patient
java -jar VarNote.jar Intersect -Q:tab,c=1,b=2,e=2,ref=3,alt=4 q3.sort.tab \
-D:db,tag=gnomAD,mode=1 http://202.113.53.226/VarNoteDB/VarNoteDB_AF_gnomAD_Genome.vcf.gz
Result files: q1.sort.vcf.overlap.gz q2.sort.bed.overlap.gz q1.twomode.overlap.gz q3.sort.tab.overlap.gz
Note
- Intersect mode: perform the common intersection operation according to the query and database formats.
- Exact match mode: force the program to consider only database records whose chromosome position exactly matches the corresponding chromosome position of the query.
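The difference between the two modes can be illustrated with a toy filter. This is only a sketch under stated assumptions, not VarNote's algorithm: the function name match_records is hypothetical, the database is a plain TAB-delimited file with sequence name, begin and end in the first three columns, and coordinates are treated as 1-based closed intervals. Mode 0 keeps any overlapping record; mode 1 keeps only records whose begin and end equal those of the query.

```shell
# Hypothetical illustration (not VarNote's implementation).
# usage: match_records MODE CHROM BEGIN END DBFILE
#   MODE 0 = intersect (any overlap), MODE 1 = exact (same begin and end)
match_records() {
  awk -F '\t' -v mode="$1" -v qc="$2" -v qb="$3" -v qe="$4" '
    ($1 "") != (qc "") { next }            # different sequence name
    mode == 0 && $2 + 0 <= qe + 0 && $3 + 0 >= qb + 0 { print; next }
    mode == 1 && $2 + 0 == qb + 0 && $3 + 0 == qe + 0 { print }
  ' "$5"
}
```

With a query of 1:150-150, a database interval 1:100-200 is reported in mode 0 but not in mode 1, while a database record at exactly 1:150-150 is reported in both.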
5. Annotation
Annotation extraction for a list of intervals or variants.
Please read the Annotation Extraction Rule first.
Note
- The user should define an annotation configuration file (set with option -A) to specify the fields to extract.
- If -A is not set, the program will first search for a configuration file named QueryFileName.annoc in the query folder.
- If no annotation configuration file is found, the program will extract all fields in the databases by default.
- If the query format is VCF, the default output format is VCF; if the query format is BED or TAB, the default output format is BED. The user can change the output format with the -OF option.
# Annotation without configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
-O ./q1.sort.vcf.allfields.anno.gz \
-T 4
# Annotation with configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
-A config/all_dbs.annoc
# Change output format with -OF
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-A config/all_dbs.annoc \
-O ./q1.sort.bed.anno.gz \
-OF BED
# Annotation using remote database with VarNote index
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=gnomAD,mode=1 http://202.113.53.226/VarNoteDB/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-A config/all_dbs.annoc \
-O ./q1.sort.vcf.remote.anno.gz
Result files: q1.sort.vcf.allfields.anno.gz q1.sort.vcf.anno.gz q1.sort.bed.anno.gz
6. Annotation upon intersection result
Annotate an existing VarNote intersection (OVERLAP) file.
Note
- Extracting information directly from an OVERLAP file can save a lot of time in the query step.
- The user can change the annotation configuration file to get different results, which is convenient and very fast.
java -jar VarNote.jar AnnotationIntersectFile -I q1.sort.vcf.overlap.gz -A config/all_dbs.annoc
Result files: q1.sort.vcf.anno.gz
7. Run with a config file
Run the Intersect or Annotation program with a config file.
java -jar VarNote.jar IntersectConfig -I config/intersect.full.options.confg
java -jar VarNote.jar IntersectConfig -I config/intersect.required.options.confg
java -jar VarNote.jar AnnotationConfig -I config/anno.full.options.confg
Result files: q1.sort.vcf.overlap.gz
8. VarNote Prioritization NEW
- disease-causal regulatory variants for GWAS results, VarNote-REG;
- pathogenic regulatory variants for rare inherited diseases, VarNote-PAT;
- driver regulatory variants for cancers, VarNote-CAN.
Note: Please download
# The data has been successfully tested on java version 8/9/11.
cd ./advanced_funs_test_data # make sure VarNote.jar is located in the advanced_funs_test_data folder
# open script to learn how to run the program for specific job.
bash script/01-index.sh # index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient LD calculation.
bash script/02-REG-genomic-Input.sh # prioritize causal regulatory variants in the LD of each GWAS signal with VCF input.
bash script/02-REG-genomic-Input-comma.sh # prioritize causal regulatory variants in the LD of each GWAS signal with variant information delimited by comma.
bash script/02-REG-rsID-Input.sh # prioritize causal regulatory variants in the LD of each GWAS signal with dbSNP variant list.
bash script/03-PAT.sh # prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
bash script/04-CAN.sh # prioritize likely cancer driver regulatory mutation given personal cancer genome profile.
Note: All databases in advanced_funs_test_data are for testing only. Please download the full versions and replace them before running on your own data.
9. VarNote Toolkits NEW
VarNote also implements several commonly used tools for efficiently processing genetic variant information, such as LD calculation and rsID conversion.
Note: Please download
# The data has been successfully tested on java version 8/9/11.
cd ./advanced_funs_test_data # make sure VarNote.jar is located in the advanced_funs_test_data folder
bash script/01-index.sh # index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient LD calculation.
bash script/05-LD.sh # efficiently calculate LD information for a list of variants.
bash script/06-rsID-Conversion.sh # efficiently interconvert dbSNP ID/genomic position.
Note: All databases in advanced_funs_test_data are for testing only. Please download the full versions and replace them before running on your own data.
File Format
File Overview
Most VarNote programs require the following file types as input:
- Query File: A file containing a list of variants to query.
- Annotation Database: An indexed annotation file, indexed by VarNote (recommended) or Tabix.
- Overlap File: The result of the Intersect program; it can be used to run AnnotationIntersectFile.
- Annotation Configuration File: A configuration file that defines the fields for data extraction in the Annotation program.
The table below summarizes the input files required for each program.
Index | Query | Annotation | |||||
---|---|---|---|---|---|---|---|
Index | IndexInfo | Count | RandomAccess | Intersect | Annotation | AnnotationIntersectFile | |
Query File | |||||||
Annotation Database | |||||||
Overlap File | |||||||
Annotation Configuration File |
File types supported for each file:
File Type | Plain Text | gzip(.gz, support original file size smaller than 4 Gb) | bgzip(.gz) |
---|---|---|---|
Query File | |||
Annotation Database | |||
Annotation Configuration File |
Note
- The annotation database file should be in bgzip format and position-sorted.
- The query file can be in plain text, gzip or bgzip format, and should be position-sorted.
- For compressed formats, bgzip is strongly recommended.
- For gzip, an original file size of up to 4 Gb is currently supported.
Query and Database File Format
Both query and annotation database files accept flexible types of genomic format, including the following:
1. VCF Format
See details about VCF format.
USAGE: java -jar /path/to/VarNote.jar <program name> -I xxx.vcf.gz
       java -jar /path/to/VarNote.jar <program name> -I:vcf xxx.vcf.gz
2. VCF-Like Format
The first five columns of VCF-Like format are the same as the VCF format and other columns are optional.
Actually VCF-Like format is a
USAGE: java -jar /path/to/VarNote.jar <program name> -I:vcfLike xxx.tab.gz
3. BED-Like Format
The first three column of BED-like format must be the
USAGE: java -jar /path/to/VarNote.jar <program name> -I:bed xxx.bed.gz
4. BED-Like Allele Format
The first five column of BED-like Allele format must be the
USAGE: java -jar /path/to/VarNote.jar <program name> -I:bedAllele xxx.tab.gz
5. Coord-Only Format
The first two column of Coord-Only format must be the
Actually Coord-Only format is a
USAGE: java -jar /path/to/VarNote.jar <program name> -I:coordOnly xxx.tab.gz
6. Coord-Allele Format
The first two column of Coord-Allele format must be the
Actually Coord-Allele format is a
USAGE: java -jar /path/to/VarNote.jar <program name> -I:coordAllele xxx.tab.gz
7. TAB Format
Tab-separated data file that contains genomic locations.
The TAB format should be used with attributes:
- Required: c, b, e
- Optional: ref, alt, 0, ci, sep
USAGE: java -jar /path/to/VarNote.jar <program name> -I:tab,c=1,b=4,e=5,0=true xxx.tab.gz
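The tag-and-attribute syntax used after -I: can be read as a tag followed by comma-separated key=value pairs. The sketch below only illustrates that syntax: the function name parse_tag_argument is hypothetical, it is not VarNote's own parser, and it does not interpret what each attribute means.

```shell
# Hypothetical helper (not part of VarNote): split the text that follows
# "-I:" into a tag line and one line per attribute.
parse_tag_argument() {
  spec=$1
  tag=${spec%%,*}                 # text before the first comma
  printf 'tag=%s\n' "$tag"
  case "$spec" in
    *,*) attrs=${spec#*,} ;;      # everything after the first comma
    *)   return 0 ;;              # tag only, no attributes
  esac
  old_ifs=$IFS
  IFS=,
  for kv in $attrs; do            # split the attribute list on commas
    printf '%s\n' "$kv"
  done
  IFS=$old_ifs
}
```

For the usage line above, parse_tag_argument "tab,c=1,b=4,e=5,0=true" prints tag=tab followed by c=1, b=4, e=5 and 0=true on separate lines.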
More Examples
# Index an annotation database of bgzip BED-like file.
java -jar VarNote.jar Index -I:bed database1.sorted.bed.gz
# Count a VCF file.
java -jar VarNote.jar Count -Q:vcf q1.sorted.vcf -D 1000G_p3.sort.vcf.gz
# Intersect
java -jar VarNote.jar Intersect -Q:coordAllele query.sort.tab -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz
VarNote Tools Documentation
Input File Options
The following options are relevant to input file
Option | Description |
---|---|
--input,-I:TagArgument (in --query-file,-Q:TagArgument (in |
TagArgument: arguments with tag and attributes Usage: -I: Possible tags: {vcf, vcfLike, bed, bedAllele, coordOnly, coordAllele, tab} Possible attributes for all tags: {sep, ci} Possible attributes for Attributes should be used with tag, here are the description of attributes: |
--has-header,-header:Boolean |
Indicate whether input file contains a header line for defining column names. If |
--header-path,-HP:File | Path of external file to include the header and meta lines. |
--d-files,-D: |
TagArgument: arguments with tag and attributes Usage: -I:db, Possible tags: {db} Possible attributes: {index, mode, tag}. Attributes should be used with tag, here are the description of attributes:
|
Global Options
The following standard options are relevant to most VarNote tools:
Option | Description |
---|---|
--log:Boolean | Whether to print log. Default value: true. Possible values: {true, false} |
--use-jdk-inflater,-UJI:Boolean | Use the JDK Inflater instead of the IntelInflater for reading index. Default value: false. Possible values: {true, false} |
--use-jdk-deflater,-UJD:Boolean | Use the JDK Deflater instead of the IntelDeflater for writing index. Default value: false. Possible values: {true, false} |
Tool-Specific Documentation
Below, you will find detailed documentation of all the options that are specific to each tool.
VarNote Index
Index
Building a VarNote index is a necessary step for fast retrieval. Since most existing annotation databases are indexed by Tabix, VarNote also provides random-sweep searching based on the Tabix index, so it can process existing annotation resources without re-indexing them. However, for large-scale and frequently used annotation datasets such as CADD, gnomAD and dbNSFP, we strongly suggest using the VarNote index system for better speed.
Usage example:
# Index an annotation database of VCF format.
java -jar VarNote.jar Index -I 1000G_p3.sort.vcf.gz
# Index an annotation database of BED-like format.
java -jar VarNote.jar Index -I roadmap.sort.bed.gz
# Index an annotation database of TAB format.
java -jar VarNote.jar Index -I:coordAllele dbscSNV.sort.tab.gz
java -jar VarNote.jar Index -I:tab,c=1,b=2,e=2,ref=3,alt=4 dbscSNV.sort.tab.gz
Arguments:
Option | Description |
---|---|
--input,-I:TagArgument | The path of the TAB-delimited genome position file compressed by the bgzip program. The file must be position-sorted, first by sequence name and then by leftmost coordinate.
Possible Tags: {vcf, vcfLike, bed, bedAllele, coordOnly, coordAllele, tab} Required. Please refer to File Format for more details. |
--has-header,-header:Boolean | Please refer to File Format for more details. |
--header-path,-HP:File | Please refer to File Format for more details. |
--out-folder,-O:String | Output directory. By default, the output files will be written into the same folder as the input file. Default value: null |
--skip,-S:Integer | Skip the first INT lines (including comment lines) in the data file. Default value: 0. |
Global Options: --use-jdk-deflater,-UJD:Boolean --log:Boolean |
IndexInfo
To query related information (such as header, format, meta information or sequence name) stored in the VarNote index file of each annotation database.
Usage example:
# Input database file must have been indexed by VarNote before query.
java -jar VarNote.jar IndexInfo -LC true -PH true -PM true -I 1000G_p3.sort.vcf.gz
java -jar VarNote.jar IndexInfo -I 1000G_p3.sort.vcf.gz -RH "CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO"
Arguments:
Option | Description |
---|---|
--input,-I:File | The indexed annotation database file, must be indexed by VarNote. Required. |
--list-chroms,-LC:Boolean | List the sequence names stored in the index file. Default value: false. Possible values: {true, false} |
--print-header,-PH:Boolean | Print header line(s). Default value: false. Possible values: {true, false} |
--print-format,-PF:Boolean | Print database format. Default value: false. Possible values: {true, false} |
--print-meta-data,-PM:Boolean | Print meta information lines. Default value: false. Possible values: {true, false} |
--reheader,-RH:String | Replace the column names with the given string. Column names should be separated by commas and the whole string enclosed in double quotation marks. |
Global Options: --log:Boolean |
VarNote Query
Count
To quickly count intersected records in annotation database(s).
Usage example:
java -jar VarNote.jar Count -Q q2.sort.bed \
-D:db,tag=1000g 1000G_p3.sort.vcf.gz \
-D:db,tag=roadmap roadmap.sort.bed.gz
Arguments:
Option | Description |
---|---|
--query-file,-Q:TagArgument | Path of query file (support Please refer File Format for more details |
--has-header,-header:Boolean | Please refer File Format for more details |
--header-path,-HP:File | Please refer File Format for more details |
--d-files,-D:TagArgument | Local path or http/ftp address of indexed annotation database(s). Either Please refer File Format for more details |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--thread,-T:Integer | Number of threads to use. Set to -1 to choose the thread number automatically from the available processors. Default value: 1. |
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean |
RandomAccess
To quickly retrieve (by independent random access) intersected records from indexed annotation database(s) given a genomic region like "chrN:beginPos-endPos".
Usage example:
java -jar VarNote.jar RandomAccess -Q 1:959100-959200 \
-D:db,tag=1000g 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,index=TBI cosmic.sort.vcf.gz
Arguments:
Option | Description |
---|---|
--q-region,-Q:String | Region specified as the format |
--d-files,-D:TagArgument | Local path or http/ftp address of indexed annotation database(s). Either Please refer File Format for more details |
--is-label,-L:Boolean | A flag to determine whether or not to print database name as the first column of the result. Default value: true. Possible values: {true, false} |
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean |
Intersect
To quickly retrieve (by random-sweep algorithm) intersected records from indexed annotation database(s) given query intervals/variants.
Usage example:
#Multiple databases using exact mode and different Index (1000g using VarNote and cosmic using TBI).
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz
#Multiple databases using exact mode within left join
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
-loj true
#Intersect mode (using 4 threads to run, default is 1)
java -jar VarNote.jar Intersect -Q q2.sort.bed --maxVariantLength 50000 \
-D roadmap.sort.bed.gz \
-T 4
Arguments:
Option | Description |
---|---|
--query-file,-Q:TagArgument | Path of query file (support Please refer File Format for more details |
--has-header,-header:Boolean | Please refer File Format for more details |
--header-path,-HP:File | Please refer File Format for more details |
--d-files,-D:TagArgument | Local path or http/ftp address of indexed annotation database(s). Either Please refer File Format for more details |
--thread,-T:Integer | Number of threads to use. Set to -1 to choose the thread number automatically from the available processors. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--out-mode,-OM:Integer | Output recording mode. 0 for "only output query records"; 1 for "only output matched database records"; 2 for "output both query records and matched database records". Default value: 2. |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--maxVariantLength,-MVL:Integer | The maximum length of a query interval/variant. Default value: 50. |
--allowLargeVariants,-ALV:Boolean | Whether or not to allow large query intervals/variants. Default value: false. Possible values: {true, false} |
--is-remove-comment,-RC:Boolean | A flag to determine whether or not to remove the comment lines(start with '@') in the
output file. Note that the comment lines are required for the VarNote |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean |
VarNote Annotation
Annotation
To quickly extract (by random-sweep algorithm) desired annotation fields from indexed annotation database(s) given query intervals/variants. It allows feature extraction using both interval-level overlap and variant-level exact matching. It also has an annotation mode supporting allele-specific variant annotation for SNV/Indel and region-specific annotation for large variations.
Usage example:
# Annotation without configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
-O ./q1.sort.vcf.allfields.anno.gz \
-T 4
# Annotation with configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
-A config/all_dbs.annoc
# Change output format with -OF
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
-D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
-A config/all_dbs.annoc \
-O ./q1.sort.bed.anno.gz \
-OF BED
Arguments:
Option | Description |
---|---|
--query-file,-Q:TagArgument | Path of query file (support Please refer File Format for more details |
--has-header,-header:Boolean | Please refer File Format for more details |
--header-path,-HP:File | Please refer File Format for more details |
--d-files,-D:TagArgument | Local path or http/ftp address of indexed annotation database(s). Either Please refer File Format for more details |
--anno-config,-A:File | Path of the annotation extraction configuration file. Without this option, the program annotates query variants with all information in the database(s).
If |
--thread,-T:Integer | Number of threads to use. Set to -1 to choose the thread number automatically from the available processors. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--out-format,-OF:AnnoOutFormatOutput | Output format. Default value: null. Possible values: {VCF, BED} |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
--force-overlap,-FO:Boolean | Force overlap mode. Force the program to omit REF and ALT matching and allele specific feature extraction. Default value: false. Possible values: {true, false} |
--vcf-header-for-bed,-VH:File | VCF output header file. This file is required when the format of query file is BED or TAB-delimited, but the format of annotation output is VCF. Default value: null. |
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean |
AnnotationIntersectFile
To quickly extract desired annotation fields from an existing VarNote intersection file, the
Usage example:
java -jar VarNote.jar AnnotationIntersectFile -I q1.sort.vcf.overlap.gz -A config/all_dbs.annoc
Arguments:
Option | Description |
---|---|
--input,-I:File | VarNote intersection file path. Required. |
--anno-config,-A:File | Path of the annotation extraction configuration file. Without this option, the program annotates query variants with all information in the database(s).
If |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--out-format,-OF:AnnoOutFormatOutput | Output format. Default value: null. Possible values: {VCF, BED} |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
--force-overlap,-FO:Boolean | Force overlap mode. Force the program to omit REF and ALT matching and allele specific feature extraction. Default value: false. Possible values: {true, false} |
--vcf-header-for-bed,-VH:File | VCF output header file. This file is required when the format of query file is BED or TAB-delimited, but the format of annotation output is VCF. Default value: null. |
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean |
Run with config file
IntersectConfig
Run VarNote Intersect program from a config file.
Usage example:
java -jar VarNote.jar IntersectConfig -I config/intersect.full.options.confg
java -jar VarNote.jar IntersectConfig -I config/intersect.required.options.confg
Arguments:
Option | Description |
---|---|
--input,-I:File | Config file path. Required. |
AnnotationConfig
Run VarNote Annotation program from a config file.
Usage example:
java -jar VarNote.jar AnnotationConfig -I config/anno.full.options.confg
Arguments:
Option | Description |
---|---|
--input,-I:File | Config file path. Required. |
VarNote Prioritization
VarNote-REG
To prioritize causal regulatory variants in the LD of each GWAS signal and provide combined ranking scores based on multiple tissue/cell type-specific prediction methods.
Pre-requirement:
Please download several dependent annotation databases (include
- VarNoteDB_FA_regBase_prediction (hg19/hg38)
- VarNoteDB_FP_Roadmap_127Epi (hg19/hg38)
- VarNoteDB_FA_CellTypeScore (hg19/hg38)
- VarNoteDB_Reference (hg19/hg38)
NOTE: VarNote supports remote databases (with a VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.
Usage example:
## -D:db {FitCons2, GenoSkylinePlus , FUNLDA , GenoNet } prediction scores are used for tissue/cell type-specific prioritization (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
## -F {FitCons2, GenoSkylinePlus, FUNLDA, GenoNet} are used to compute Combined Rank (required)
## -ID is used to specify the tissue/cell type (required)
## -LDE is used to indicate enabling LD extension, -LDE should be used with -B -LDC (optional)
## -GR and -G are used to annotate variant effect (optional)
java -jar VarNote.jar REG -Q:coordAllele $DIR/input/reg.demo.txt \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase_prediction.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=fitCons2 $DBPath/fitcons2.merge.gz \
-D:db,tag=genoSkylinePlus $DBPath/GenoSkylinePlus.merge.gz \
-D:db,tag=funlda $DBPath/FUN-LDA.merge.gz \
-D:db,tag=genoNet $DBPath/GenoNet_all.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-F FitCons2 \
-F GenoSkylinePlus \
-F FUNLDA \
-F GenoNet \
-ID E116 \
-LDE true \
-B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.chr1.vcf.gz.bit \
-LDC 0.9 \
-GR $DBPath/hg19_ensembl.ser \
-G hg19 \
-T 4
Arguments: * is required option
Option | Description |
---|---|
--query-file,-Q:TagArgument * | Path of query file. Please refer to File Format for more details. |
--has-header,-header:Boolean | Please refer to File Format for more details. |
--header-path,-HP:File | Please refer to File Format for more details. |
--d-files,-D:TagArgument * | Database files for the VarNote-REG program; the required databases are those marked (required) in the usage example above. This argument must be specified at least once. Required. Please refer to File Format for more details. |
--funcs,-F:RegFunc * | Context-dependent regulatory variant prediction model used to compute the combined rank. This argument must be specified at least once. Required. Possible values: {FitCons2, GenoSkylinePlus, FUNLDA, GenoNet} |
--CellID,-ID:CellType * | Tissue/Cell type to extract. Please refer to the EID column in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell IDs. Required. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129} |
--ldExtension,-LDE:Boolean | Set true to enable LD extension. Default value: true. Possible values: {true, false} |
--bitFile,-B:String | Compressed genotype file, such as 1000G (.bit and .bit.idx files, generated by the IndexRefGenotype program); required for the --ldExtension option. Default value: null. Please download the prebuilt bit files. Note: there are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population. |
--ldCutoff,-LDC:Double | LD cutoff (a float value in [0.5, 1]). Default value: 0.8. |
--ldWindow,-LDW:Integer | LD window in kilobases (KB), an integer value in [1, 1000]. Default value: 100. |
--geneRefFile,-GR:String | Path of gene annotation file (a file with the extension .ser). Default value: null. Please download the prebuilt gene reference files (.ser files). |
--Genome,-G:GenomeAssembly | Genome reference assembly version, required for --geneRefFile option. Default value: null. Possible values: {hg19, hg38} |
--thread,-T:Integer | Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
--maxVariantLength,-MVL:Integer | Maximum length of a query interval/variant. Default value: 50. |
--allowLargeVariants,-ALV:Boolean | Indicator of whether to allow large query intervals/variants. Default value: false. Possible values: {true, false} |
Global Options: --use-jdk-inflater, -UJI:Boolean --log:Boolean |
VarNote-PAT
To prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
Pre-requirement:
Please download the following dependent annotation databases:
- VarNoteDB_FA_regBase (hg19/hg38)
- VarNoteDB_AF_gnomAD_Genome (hg19/hg38)
- VarNoteDB_FP_Roadmap_127Epi (hg19/hg38)
- VarNoteDB_FA_dbNSFP (hg19/hg38)
- VarNoteDB_Reference (hg19/hg38)
NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.
Usage example:
## -Q indicates the path of WGS/WES data file (VCF or GZIP compressed VCF format) (required)
## -P indicates the path of pedigree file (required)
## -IM is used to indicate the mode of inheritance (required)
## -F is used to specify the selected scores to compute Combined Rank (required)
## -GR and -G are used to annotate variant effect (required)
## -D:db VarNoteDB_FA_regBase.gz is used to extract function scores (required)
## -D:db VarNoteDB_FP_Roadmap_127Epi.bed.gz is used to filter variant on tissue/cell type-specific epigenomic marks (required when --CellID is specified)
## -D:db VarNoteDB_AF_gnomAD_Genome.vcf.gz is used to filter allele frequency (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
java -jar VarNote.jar PAT -Q $DIR/input/pat.demo.filter.vcf.gz \
-P $DIR/input/pat.demo.ped \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=dbNSFP $DBPath/VarNoteDB_FA_dbNSFP.gz \
-D:db,tag=gnomad $DBPath/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-IM AUTOSOMAL_DOMINANT \
-GF $configPath/region_exclude.txt \
-GT REGION_EXCLUDE \
-VF $configPath/VF_exclude.txt \
-F Eigen \
-F FATHMM_MKL \
-F FATHMM_XF \
-F GenoCanyon \
-GR $DBPath/hg19_ensembl.ser \
-G hg19 \
-T 4
Arguments: * is required option
Option | Description |
---|---|
--query-file,-Q: * | Path of the WGS/WES VCF file (VCF or Gzip-compressed VCF format). Required. |
--pedigree,-P:File * | Path of the pedigree file; refer to the PLINK or GATK PED file format. Required. |
--d-files,-D:TagArgument * | Database files for the VarNote-PAT program; the required databases are those marked (required) in the usage example above. This argument must be specified at least once. Required. Please refer to File Format for more details. |
--inheritanceMode,-IM:ModeOfInheritance * | Mode of inheritance. Required. Possible values: {AUTOSOMAL_DOMINANT, AUTOSOMAL_RECESSIVE, X_RECESSIVE, X_DOMINANT, ANY} |
--funcs,-F:RegFunc * | Prediction model used to compute the combined rank. This argument must be specified at least once. Required. Possible values: {CADD, Eigen, FATHMM_MKL, FATHMM_XF, GenoCanyon, LINSIGHT, ReMM} |
--geneRefFile,-GA:String * | Path of gene annotation file (a file with the extension .ser). Required. Please download the prebuilt gene reference files (.ser files). |
--Genome,-G:GenomeAssembly * | Genome reference assembly version. Required. Possible values: {hg19, hg38} |
--variantEffectFile,-VF:File | File containing variant effects to exclude. Default value: null. |
--genomicFile,-GF:File | File containing genomic regions or genes to exclude/include. Default value: null. |
--regionFileType,-GT:GenomicRegionType | Types of region/gene to exclude/include, should be used with '-GF' option. Default value: null. Possible values: {GENE_INCLUDE, GENE_EXCLUDE, REGION_INCLUDE, REGION_EXCLUDE, ALL_GENE_INCLUDE, ALL_GENE_EXCLUDE} |
--distance,-DT:Integer | Distance to upstream and downstream of gene (kilobase, KB). Default value: 5. |
--afCutoff,-AC:Double | Cutoff of allele frequency to filter out germline variant. Default value: 0.005. |
--CellID,-ID:CellType | Tissue/Cell type to filter. Please refer to the EID column in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell IDs. Default value: null. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129} |
--CellMark,-CM:CellMark | Chromatin state to filter. '-CM' should be used with '-ID' option. This argument may be specified 0 or more times. Default value: null. Possible values: {DNase, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9me3} |
--GTQuality,-GQ:Integer | Min value of Genotyping Quality. Default value: 20. |
--VQuality,-VQD:Integer | Min value of Variant Confidence/Quality by Depth. Default value: 2. |
--thread,-T:Integer | Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
--maxVariantLength,-MVL:Integer | Maximum length of a query interval/variant. Default value: 50. |
--allowLargeVariants,-ALV:Boolean | Indicator of whether to allow large query intervals/variants. Default value: false. Possible values: {true, false} |
Global Options: --use-jdk-inflater, -UJI:Boolean --log:Boolean |
VarNote-CAN
To prioritize likely cancer driver regulatory mutation given personal cancer genome profile.
Pre-requirement:
Please download the following dependent annotation databases:
- VarNoteDB_FA_regBase (hg19/hg38)
- VarNoteDB_AF_gnomAD_Genome (hg19/hg38)
- VarNoteDB_FP_Roadmap_127Epi (hg19/hg38)
- VarNoteDB_TA_COSMIC_NonCoding (hg19/hg38)
- VarNoteDB_Reference (hg19/hg38)
NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.
Usage example:
## -D:db VarNoteDB_FA_regBase_prediction.gz is used to extract regBase-CAN score (required)
## -D:db VarNoteDB_FP_Roadmap_127Epi.bed.gz is used to filter variant on tissue/cell type-specific epigenomic marks (required when --CellID is specified)
## -D:db VarNoteDB_TA_COSMIC_NonCoding.vcf.gz is used to filter somatic mutation recurrence (required)
## -D:db VarNoteDB_AF_gnomAD_Genome.vcf.gz is used to filter allele frequency (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
## -GA and -G are used to annotate variant effect (required)
java -jar VarNote.jar CAN -Q $DIR/input/can.demo.vcf.gz \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase_prediction.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=cosmic $DBPath/VarNoteDB_TA_COSMIC_NonCoding.vcf.gz \
-D:db,tag=gnomad $DBPath/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-VF $configPath/VF_exclude.txt \
-GA $DBPath/hg19_ensembl.ser \
-G hg19
Arguments: * is required option
Option | Description |
---|---|
--query-file,-Q:TagArgument * | Path of query file. Please refer to File Format for more details. |
--has-header,-header:Boolean | Please refer to File Format for more details. |
--header-path,-HP:File | Please refer to File Format for more details. |
--d-files,-D:TagArgument * | Database files for the VarNote-CAN program; the required databases are those marked (required) in the usage example above. This argument must be specified at least once. Required. Please refer to File Format for more details. |
--geneRefFile,-GA:String * | Path of gene annotation file (a file with the extension .ser). Required. Please download the prebuilt gene reference files (.ser files). |
--Genome,-G:GenomeAssembly * | Genome reference assembly version. Required. Possible values: {hg19, hg38} |
--recurRate,-RR:Integer | Min value of recurrence rate of COSMIC. Default value: 1. |
--afCutoff,-AFC:Double | Cutoff of allele frequency to filter out germline variant. Default value: 0.005. |
--variantEffectFile,-VF:File | File containing variant effects to exclude. Default value: null. |
--genomicFile,-GF:File | File containing genomic regions or genes to exclude/include. Default value: null. |
--regionFileType,-GT:GenomicRegionType | Types of region/gene to exclude/include, should be used with '-GF' option. Default value: null. Possible values: {GENE_INCLUDE, GENE_EXCLUDE, REGION_INCLUDE, REGION_EXCLUDE, ALL_GENE_INCLUDE, ALL_GENE_EXCLUDE} |
--distance,-DT:Integer | Distance to upstream and downstream of gene (kilobase, KB). Default value: 5. |
--CellID,-ID:CellType | Tissue/Cell type to filter. Please refer to the EID column in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell IDs. Default value: null. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129} |
--CellMark,-CM:CellMark | Chromatin state to filter. '-CM' should be used with '-ID' option. This argument may be specified 0 or more times. Default value: null. Possible values: {DNase, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9me3} |
--thread,-T:Integer | Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
--maxVariantLength,-MVL:Integer | Maximum length of a query interval/variant. Default value: 50. |
--allowLargeVariants,-ALV:Boolean | Indicator of whether to allow large query intervals/variants. Default value: false. Possible values: {true, false} |
Global Options: --use-jdk-inflater, -UJI:Boolean --log:Boolean |
VarNote Toolkits
IndexRefGenotype
To index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient linkage disequilibrium (LD) calculation.
Usage example:
java -jar VarNote.jar IndexRefGenotype -I 1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz
Arguments: * is required option
Option | Description |
---|---|
--input,-I:File * | Path of VCF file with individual phased genotypes. Required. |
NOTE: VarNote provides indexed genotypes for 1000 Genomes Project phase 3; please download the indexed files from VarNoteDB_Reference/LD/ (hg19/hg38).
LDBatch
To efficiently calculate all LD variants for a list of variants.
Usage example:
java -jar VarNote.jar LDBatch -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -Q:coordAllele $DIR/input/reg.demo.txt
Arguments: * is required option
Option | Description |
---|---|
--bit-file,-B:File* | Bit file of indexed reference population genotypes. Required. Please download the prebuilt 1000G bit files. Note: there are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population. |
--query-file,-Q:TagArgument * | Path of query file. Please refer to File Format for more details. |
--has-header,-header:Boolean | Please refer to File Format for more details. |
--header-path,-HP:File | Please refer to File Format for more details. |
--ld-window-kb,-D:Integer | LD window (kilobase, KB). Default value: 100. |
--cutoff,-C:Double | LD cutoff. Default value: 0.8. |
--thread,-T:Integer | Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
--is-loj,-loj:Boolean | A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false} |
--is-zip,-Z:Boolean | A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false} |
Global Options: --help,-h:Boolean --log:Boolean |
LDVariant
To efficiently calculate all LD variants for a variant.
Usage example:
java -jar VarNote.jar LDVariant -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -Q 1:3325912-C-A
Arguments: * is required option
Option | Description |
---|---|
--bit-file,-B:File* | Bit file of indexed reference population genotypes. Required. Please download the prebuilt 1000G bit files. Note: there are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population. |
--query-loc,-Q:String* | Genomic feature specified as the format "chr:pos-ref-alt" for variant or "chr:beginPos-endPos" for region. Required. |
--is-region,-R:Boolean | Indicator whether the input genomic feature is a region or not. The program will enumerate all known variants for LD calculations if the input is a genomic region. Default value: false. Possible values: {true, false} |
--ld-window-kb,-D:Integer | LD window (kilobase, KB). Default value: 100. |
--cutoff,-C:Double | LD cutoff. Default value: 0.8. |
Global Options: --help,-h:Boolean |
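The two `--query-loc` notations described above can be told apart with a small parser. The following is only an illustrative Python sketch, not VarNote's own code; the function name `parse_query_loc` and the returned dictionary keys are assumptions made for this example.

```python
import re

# "chr:pos-ref-alt" for a variant, "chr:beginPos-endPos" for a region.
VARIANT_RE = re.compile(r"^([^:]+):(\d+)-([A-Za-z]+)-([A-Za-z]+)$")
REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_query_loc(text, is_region=False):
    """Parse a query location string into its components (hypothetical helper)."""
    pattern = REGION_RE if is_region else VARIANT_RE
    match = pattern.match(text.strip())
    if match is None:
        raise ValueError("unrecognized query location: " + text)
    if is_region:
        chrom, begin, end = match.groups()
        return {"chrom": chrom, "begin": int(begin), "end": int(end)}
    chrom, pos, ref, alt = match.groups()
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


print(parse_query_loc("1:3325912-C-A"))
print(parse_query_loc("1:3325000-3326000", is_region=True))
```

The `is_region` argument mirrors the `--is-region` flag: the same string is interpreted as a region only when the flag is set.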
LDPair
To efficiently calculate LD for a single pair of variants.
Usage example:
java -jar VarNote.jar LDPair -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -P1 1:3325912-C-A -P2 1:3326796-C-G
Arguments: * is required option
Option | Description |
---|---|
--bit-file,-B:File* | Bit file of indexed reference population genotypes. Required. Please download the prebuilt 1000G bit files. Note: there are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population. |
--P1,-P1:String* | SNP A (format chr:pos-ref-alt) Required. |
--P2,-P2:String* | SNP B (format chr:pos-ref-alt) Required. |
Global Options: --help,-h:Boolean |
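This documentation does not state which LD statistic LDPair reports. As a point of reference only, the sketch below computes r², a commonly used pairwise LD measure, from phased haplotypes coded as 0 (reference allele) and 1 (alternate allele). The function name `r_squared` and the input encoding are assumptions for illustration and are not taken from VarNote.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r^2 between two biallelic sites from phased haplotypes (0/1 lists)."""
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of the alternate allele at site A
    p_b = sum(hap_b) / n  # frequency of the alternate allele at site B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # undefined when either site is monomorphic
    return d * d / denom


print(r_squared([0, 1, 1, 0], [0, 1, 1, 0]))  # perfectly correlated sites
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent sites
```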
rsIDConversion
To efficiently interconvert dbSNP ID/genomic position.
Pre-requirement:
Please download the dependent annotation databases (see the --d-files option below).
NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.
Usage example:
java -jar VarNote.jar rsIDConversion -I $DIR/input/reg.demo.txt -F CoordAllele -M POS2SNP -D:db,tag=pos2snp $DBPath/pos2snp_b153.gz
java -jar VarNote.jar rsIDConversion -I $DIR/input/reg.demo.rsid.txt -M SNP2POS -D:db,tag=merge $DBPath/merged_b153.gz -D:db,tag=snp2pos $DBPath/snp2pos_b153.gz
Arguments: * is required option
Option | Description |
---|---|
--input,-I:File * | Path of input file to convert. Required. |
--d-files,-D:TagArgument * | SNP2POS requires the merged_b153.gz and snp2pos_b153.gz databases; POS2SNP requires the pos2snp_b153.gz database. Please download the databases from the VarNoteDB_Reference/rs2pos/ folder. This argument must be specified at least once. Required. Please refer to File Format for more details. |
--Mode,-M:Snp2PosMode * | Conversion mode: SNP2POS converts rsID to genomic position, and POS2SNP converts genomic position (with ref and alt) to rsID. Required. Possible values: {SNP2POS, POS2SNP} |
--FileType,-F:QuickFileType | File type of input file for POS2SNP Mode. Default value: null. Possible values: {VCF, VCFLike, CoordOnly, CoordAllele} |
--MatchRefAlt,-FM:Boolean | Force matching of ref and alt for POS2SNP Mode. Default value: false. Possible values: {true, false} |
--thread,-T:Integer | Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1. |
--out-file,-O:String | Output file path. By default, output file will be written into the same folder as the input file. Default value: null. |
Annotation Extraction Rule
1. Allele-specific extraction (using configuration file all_dbs.annoc)
Example 1: 1:869244 C|T matches exactly (by position) 2 features of 1000g, as follows:
#query 1 869244 rs575524849 C T 100 PASS .
1000g 1 869244 rs200586552 C CAG 100 PASS AC=321;AF=0.0640974;AN=5008;NS=2504;DP=10653;EAS_AF=0;AMR_AF=0.0187;AFR_AF=0.2322;EUR_AF=0.001;SAS_AF=0;VT=INDEL
1000g 1 869244 rs575524849 C T 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=10653;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=c|||;VT=SNP
By default, the Annotation program extracts only the feature matching 1:869244 C|T.
1 869244 rs575524849 C T 100 PASS .;1000g_AC=1;1000g_AF=0.000199681
The Annotation program extracts all features (omitting REF and ALT matching) when --force-overlap is set to true:
1 869244 rs575524849 C T 100 PASS .;1000g_AC=321,1;1000g_AF=0.0640974,0.000199681
Example 2: 1:1404746 C|A matches exactly (by position) 1 feature of 1000g, as follows:
#query 1 1404746 rs147265720 C A 100 PASS .
1000g 1 1404746 rs147265720 C A,T 100 PASS AC=119,11;AF=0.023762,0.00219649;AN=5008;NS=2504;DP=8984;EAS_AF=0,0;AMR_AF=0.0072,0.0159;AFR_AF=0.0862,0;EUR_AF=0,0;SAS_AF=0,0;AA=N|||;VT=SNP;MULTI_ALLELIC
By default, the Annotation program extracts only the information for the A allele.
1 1404746 rs147265720 C A 100 PASS .;1000g_AC=119;1000g_AF=0.023762
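The allele-specific behaviour in Example 2 can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not VarNote's code: the names `parse_info` and `extract_allele_specific` are hypothetical, and the sketch assumes that a field holding one comma-separated value per ALT allele is allele-specific.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if item:
            key, sep, value = item.partition("=")
            fields[key] = value if sep else True
    return fields


def extract_allele_specific(query_ref, query_alt, db_ref, db_alt, db_info, wanted):
    """Return the wanted INFO values for the database ALT allele matching the query."""
    alts = db_alt.split(",")
    if db_ref != query_ref or query_alt not in alts:
        return {}  # no REF/ALT match, so nothing is extracted by default
    idx = alts.index(query_alt)
    info = parse_info(db_info)
    result = {}
    for name in wanted:
        if name not in info:
            continue
        values = str(info[name]).split(",")
        # Assumption: one value per ALT allele means the field is allele-specific.
        result[name] = values[idx] if len(values) == len(alts) else str(info[name])
    return result


# Example 2 above: the query is C>A, the database record lists ALT "A,T".
print(extract_allele_specific(
    "C", "A", "C", "A,T",
    "AC=119,11;AF=0.023762,0.00219649;AN=5008;MULTI_ALLELIC",
    ["AC", "AF"],
))
```

For Example 1, the same function returns an empty result for the C>CAG record, because its ALT does not match the query allele T.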
2. Extraction from multiple databases (using configuration file all_dbs.annoc)
Example 1: 1:3646192 G|A matches exactly (by position) 13 features of 1000g and cosmic, as follows:
#query 1 3646192 rs1885867 G A 100 PASS .
1000g 1 3646192 rs1885867 G A 100 PASS AC=2992;AF=0.597444;AN=5008;NS=2504;DP=19582;EAS_AF=0.3284;AMR_AF=0.6628;AFR_AF=0.6059;EUR_AF=0.833;SAS_AF=0.5746;AA=g|||;VT=SNP;EX_TARGET
cosmic 1 3646192 COSV60700745 G A . . GENE=TP73_ENST00000378285;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1049+180G>A;AA=p.?;CNT=3
cosmic 1 3646192 COSV60700745 G A . . GENE=TP73_ENST00000603362;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
cosmic 1 3646192 COSV60700745 G A . . GENE=TP73_ENST00000354437;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
cosmic 1 3646192 COSV60700745 G A . . GENE=TP73_ENST00000604479;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
...
The Annotation program extracts only the following information:
1 3646192 rs1885867 G A 100 PASS .;1000g_AC=2992;1000g_AF=0.597444;cosmic_GENE=TP73_ENST00000378285,TP73_ENST00000603362,TP73_ENST00000354437,TP73_ENST00000604479,TP73_ENST00000357733,TP73,TP73_ENST00000378280,TP73_ENST00000346387,TP73_ENST00000604074,TP73_ENST00000378290,TP73_ENST00000378288;cosmic_STRAND=+,+,+,+,+,+,+,+,+,+,+
Example 2: 1:3318823 G|A matches exactly (by position) 9 features of 1000g and cosmic, as follows:
#query 1 3318823 rs534786798 G A 100 PASS .
1000g 1 3318823 rs534786798 G A 100 PASS AC=2;AF=0.000399361;AN=5008;NS=2504;DP=15458;EAS_AF=0;AMR_AF=0;AFR_AF=0.0015;EUR_AF=0;SAS_AF=0;AA=G|||;VT=SNP
cosmic 1 3318823 COSV54604891 G C . . GENE=PRDM16;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic 1 3318823 COSV54604891 G C . . GENE=PRDM16_ENST00000378398;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.680-532G>C;AA=p.?;CNT=1
cosmic 1 3318823 COSV54604891 G C . . GENE=PRDM16_ENST00000378391;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic 1 3318823 COSV54604891 G C . . GENE=PRDM16_ENST00000441472;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic 1 3318823 COSV54604891 G C . . GENE=PRDM16_ENST00000442529;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
...
The Annotation program extracts only the following information:
Default output
1 3318823 rs534786798 G A 100 PASS .;1000g_AC=2;1000g_AF=0.000399361
Users can set --force-overlap to true to omit REF and ALT matching.
Setting --force-overlap to true
1 3318823 rs534786798 G A 100 PASS .;1000g_AC=2;1000g_AF=0.000399361;cosmic_GENE=PRDM16,PRDM16_ENST00000378398,PRDM16_ENST00000378391,PRDM16_ENST00000441472,PRDM16_ENST00000442529,PRDM16_ENST00000511072,PRDM16_ENST00000514189;cosmic_STRAND=+,+,+,+,+,+,+
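The difference between the default output and the --force-overlap output can be reproduced with a short Python sketch. It is a simplified illustration, not VarNote's implementation: the function name `build_annotation`, the record tuples, and the `fields` mapping are assumptions for this example, and only two of the cosmic records are included.

```python
def build_annotation(query_ref, query_alt, records, fields, force_overlap=False):
    """Collect DB-prefixed INFO values from overlapping records.

    records: list of (db_name, ref, alt, info_dict) overlapping the query position.
    fields:  {db_name: [INFO keys to extract]}.
    """
    collected = {}
    for db, ref, alt, info in records:
        # Default mode keeps only records whose REF and ALT match the query.
        if not force_overlap and (ref, alt) != (query_ref, query_alt):
            continue
        for key in fields.get(db, []):
            if key in info:
                collected.setdefault(db + "_" + key, []).append(info[key])
    return ";".join(name + "=" + ",".join(values) for name, values in collected.items())


records = [
    ("1000g", "G", "A", {"AC": "2", "AF": "0.000399361"}),
    ("cosmic", "G", "C", {"GENE": "PRDM16", "STRAND": "+"}),
    ("cosmic", "G", "C", {"GENE": "PRDM16_ENST00000378398", "STRAND": "+"}),
]
fields = {"1000g": ["AC", "AF"], "cosmic": ["GENE", "STRAND"]}

print(build_annotation("G", "A", records, fields))
print(build_annotation("G", "A", records, fields, force_overlap=True))
```

In default mode the cosmic records are skipped because their ALT allele (C) differs from the query (A); with `force_overlap=True` their values are appended in record order.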
Position Resolving Rule
Since chromosome positions in both the query and the annotation database are critical elements,
VarNote parses the chromosome position of each record according to both file format and allele composition,
and finally transforms them to a 0-based, half-open coordinate system (start inclusive, end exclusive). The position resolving rules are shown below.
Note
The Ref column is used for position resolving in all programs. For example, "1 10177 . ACC A" in VCF format is parsed as 10176, 10179.
The Ref and Alt columns are used for allele-specific extraction in the Annotation programs.
VCF format, vcf is one-based
Data | Position Resolved | Description |
---|---|---|
1 10177 . A T ... | 10176, 10177 | SNV |
1 10177 . A ACC ... | 10176, 10177 | INS |
1 10177 . ACC A ... | 10176, 10179 | DEL, end=10176+3 |
1 10177 . GGCGCG TCCGCA ... | 10176, 10182 | |
1 10177 . CGCA TGCA,C ... | 10176, 10180 | |
1 10177 . GGGG G . . END=10180 | 10176, 10180 | Read END from INFO |
Structural variant with confidence interval of breakpoint | ||
1 869465 1 N <DEL> 1293.8 . SVTYPE=DEL;END=870217;CIPOS=-10,157;CIEND=-84,10; | 869454, 870227 | SVTYPE with END in INFO, beg=beg+CIPOS[0], end=end+CIEND[1], that is beg=869464-10, end=870217+10 |
1 1157791 4345_1 N N[4:76212291[ 0.0 . SVTYPE=BND;CIPOS=-8,8;CIEND=-9,6 | 1157782, 1157797 | SVTYPE, beg=beg+CIPOS[0], end=end+CIEND[1], that is beg=1157790-8, end=1157791+6 |
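The VCF rows above can be reproduced with the following Python sketch. It is an illustration inferred from the table, not VarNote's source code; the function names are hypothetical, and applying CIPOS/CIEND only when SVTYPE is present is an assumption.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if item:
            key, sep, value = item.partition("=")
            fields[key] = value if sep else True
    return fields


def resolve_vcf(pos, ref, info=""):
    """Return (beg, end) in 0-based, half-open coordinates for a 1-based VCF POS."""
    fields = parse_info(info)
    beg = pos - 1
    if "END" in fields:
        end = int(fields["END"])  # read END from INFO
    elif "SVTYPE" in fields:
        end = pos  # structural variant without END (e.g. BND)
    else:
        end = beg + len(ref)  # span of the REF allele
    if "SVTYPE" in fields:
        # Assumption: widen by the outer bounds of the confidence intervals.
        if "CIPOS" in fields:
            beg += int(fields["CIPOS"].split(",")[0])
        if "CIEND" in fields:
            end += int(fields["CIEND"].split(",")[1])
    return beg, end


print(resolve_vcf(10177, "ACC"))
print(resolve_vcf(869465, "N", "SVTYPE=DEL;END=870217;CIPOS=-10,157;CIEND=-84,10;"))
```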
BED-like Format, bed is zero-based
Data | Position Resolved | Description |
---|---|---|
1 10177 10178 . A T | 10177, 10178 | SNV |
1 10177 10178 . A ACC | 10177, 10178 | INS |
1 10177 10179 . ACC A | 10177, 10179 | DEL |
Tab Format, default is one-based
Data | Position Resolved | Description |
---|---|---|
1 10177 . A T | 10176, 10177 | SNV |
1 10177 . A ACC | 10176, 10177 | INS |
1 10177 . ACC A | 10176, 10179 | DEL, end=10176+3 |
1 10177 10178 . A ACC | 10176, 10178 | INS, read end from the 3rd column |
1 10177 10179 . ACC A | 10176, 10179 | DEL, read end from the 3rd column |
Tab Format, zero-based
Data | Position Resolved | Description |
---|---|---|
1 10177 . A T | 10177, 10178 | SNV |
1 10177 . A ACC | 10177, 10178 | INS |
1 10177 . ACC A | 10177, 10180 | DEL, end=10177+3 |
1 10177 10178 . A ACC | 10177, 10178 | INS, read end from the 3rd column |
1 10177 10179 . ACC A | 10177, 10179 | DEL, read end from the 3rd column |
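The BED-like and Tab rows above follow one pattern, which the Python sketch below captures. It is inferred from the tables rather than taken from VarNote's code, and the function name `resolve_tab` and its parameters are hypothetical.

```python
def resolve_tab(pos, ref, end=None, zero_based=False):
    """Return (beg, end) in 0-based, half-open coordinates for a Tab or BED-like record.

    pos:        value of the position (begin) column
    ref:        REF allele, used only when no end column is present
    end:        value of the end column, or None if the file has none
    zero_based: True for BED-like files and zero-based Tab files
    """
    beg = pos if zero_based else pos - 1
    if end is None:
        end = beg + len(ref)  # derive the end from the REF allele length
    return beg, end


print(resolve_tab(10177, "ACC"))                             # one-based Tab, DEL
print(resolve_tab(10177, "ACC", zero_based=True))            # zero-based Tab, DEL
print(resolve_tab(10177, "ACC", end=10179, zero_based=True)) # BED-like, DEL
```

When an end column is present it is used as-is in both the one-based and zero-based cases, which matches the "read end from the 3rd column" rows.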
Please cite VarNote as follows:
Huang D, Yi X, Zhou Y, Yao H, Xu H, Wang J, Zhang S, Nong W, Wang P, Shi L, Xuan C, Li M, Wang J, Li W, Kwan HS, Sham PC, Wang K, Li MJ*. Ultrafast and scalable variant annotation and prioritization with big functional genomics data. Genome Res. 2020 Dec;30(12):1789-1801.