VarNote Documentation

Quick Tutorial

System Requirements

Java Runtime Environment (JRE) version 8.0 or above is required for VarNote. To check your Java version, open a terminal and run the following command:

java -version

You are expected to see java version "1.8.x" or above. If not, you may need to update your version; see the Oracle Java website to download and install the latest JDK.
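The version check above can also be scripted. The following is a minimal sketch, not part of VarNote: the helper names java_major_version and check_java are hypothetical, and the sketch assumes the usual behaviour of `java -version`, which prints the quoted version string on the first line of stderr.

```shell
# Extract the major Java version from a version string such as
# "1.8.0_292" (legacy scheme, major version 8) or "11.0.2" (major version 11).
java_major_version() {
    local version="$1"
    # Legacy versions are written as 1.<major>.x, so strip a leading "1.".
    version="${version#1.}"
    # Keep only the part before the first ".", "_", "+" or "-".
    echo "${version%%[._+-]*}"
}

# Hypothetical wrapper: fails if the installed Java is older than version 8.
check_java() {
    local raw major
    raw="$(java -version 2>&1 | head -n 1 | sed -E 's/.*"([^"]+)".*/\1/')"
    major="$(java_major_version "$raw")"
    if [ "$major" -ge 8 ]; then
        echo "Java $raw is sufficient for VarNote"
    else
        echo "Java $raw is too old; version 8 or above is required" >&2
        return 1
    fi
}
```

Calling check_java before the first VarNote command gives an early, readable failure instead of a Java class-version error.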

Installation and Test Data

The VarNote command-line tools are provided as an executable Java program. You can download the latest release of the VarNote jar file from here and sample data for testing from here. The VarNote source code is available at https://github.com/mulinlab/VarNote.

NEW: Test data and demo scripts for the VarNote Prioritization/Toolkits functions can be downloaded from here.

Available Tools

USAGE: java -jar /path/to/VarNote.jar <program name> [-h]

Program Summary Table:
--------------------------------------------------------------------------------------
VarNote Index:                                  Tools that generates index for the compressed database file.
    Index                                       To generate VarNote index (".vanno" and ".vanno.vi") for compressed (block gzip) annotation database file.
    IndexInfo                                   To query related information (such as header, format, meta information or sequence name) stored in the VarNote index file of each annotation database.

--------------------------------------------------------------------------------------
VarNote Query:                                  Tools that quickly retrieve data lines from database(s).
    Count                                       To quickly count intersected records in annotation database(s).
    RandomAccess                                To quickly retrieve (by independent random access) intersected records from indexed annotation database(s) given a genomic region like "chrN:beginPos-endPos".
    Intersect                                   To quickly retrieve (by random-sweep algorithm) intersected records from indexed annotation database(s) given query intervals/variants.
    IntersectConfig                              Run VarNote Intersect program with a config file.

--------------------------------------------------------------------------------------
VarNote Annotation:                             Tools that identifies desired annotation fields from database(s).
    Annotation                                  To quickly extract (by random-sweep algorithm) desired annotation fields from indexed annotation database(s) given query intervals/variants.
    AnnotationIntersectFile                     To quickly extract desired annotation fields from an existing VarNote intersection file.
    AnnotationConfig                             Run VarNote Annotation program with a config file.

--------------------------------------------------------------------------------------
NEW VarNote Prioritization:                     To facilitate researchers to execute the genome-scale regulatory variant annotation and prioritization locally as well as compare results with different parameters, several local pipelines are implemented as online version.
    VarNote-REG                                 To prioritize causal regulatory variants in the LD of each GWAS signal and provide combined ranking scores based on multiple tissue/cell type-specific prediction methods.
    VarNote-PAT                                 To prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
    VarNote-CAN                                 To prioritize likely cancer driver regulatory mutation given personal cancer genome profile.

--------------------------------------------------------------------------------------
NEW VarNote Toolkits:                           Several useful tools that ease the analysis of genetic sequencing data.
    rsIDConversion                              To efficiently interconvert dbSNP ID/genomic position.
    IndexRefGenotype                            To index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient linkage disequilibrium (LD) calculation.
    LDPair                                      To efficiently calculate LD for a single pair of variants.
    LDVariant                                   To efficiently calculate all LD variants for a variant.
    LDBatch                                     To efficiently calculate all LD variants for a list of variants.

Quick Start

1. Index Database

The first step after downloading VarNote is to index the annotation database files in the test data. The VarNote Index program generates the index files .vanno and .vanno.vi for each annotation database.

Note

The annotation database file must be a TAB-delimited genome position file compressed by the bgzip program (http://www.htslib.org/doc/bgzip.html), with a file name ending in .gz or .bgz.
The annotation database file must be position-sorted (first by sequence name and then by leftmost coordinate).
The index files must be kept together with the original database file.
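Before indexing, it can be useful to confirm that a file really is position-sorted. The sketch below is an illustration rather than a VarNote feature: the function name is_position_sorted is hypothetical, it reuses the sort keys shown in this tutorial (-k1,1 -k2,2n), and it assumes that comment lines start with '#'.

```shell
# Return 0 if the non-comment lines of a TAB-delimited file are ordered by
# column 1 and then numerically by column 2, and non-zero otherwise.
is_position_sorted() {
    # sort -c only checks the order; -s disables the last-resort whole-line
    # comparison so that records with identical positions count as sorted.
    grep -v '^#' "$1" | sort -c -s -k1,1 -k2,2n 2>/dev/null
}
```

For an uncompressed file such as roadmap.sort.bed, `is_position_sorted roadmap.sort.bed && echo sorted` prints "sorted" only when the check passes. The check runs on plain text, so it should be applied before compressing with bgzip.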

Info

VarNote also provides random-sweep searching on a Tabix index (for users who do not want to re-index a large database).
This comes at a performance cost: queries using a Tabix index are about 10 times slower than queries using a VarNote index. We strongly suggest indexing the database with the VarNote Index program before querying.

Command line usage example:

# Move VarNote-XXX.jar into the test data folder and rename it to VarNote.jar.
mv VarNote-XXX.jar /path/to/test_data/VarNote.jar
cd /path/to/test_data

# List all programs of VarNote.
java -jar VarNote.jar

# Displays options specific to Index.
java -jar VarNote.jar Index

# Sort the annotation database file and compress it with the htslib bgzip program before indexing.
sort -k1,1 -k2,2n roadmap.bed > roadmap.sort.bed
/path/to/htslib-1.X/bgzip roadmap.sort.bed

# Index databases in VCF format
java -jar VarNote.jar Index -I 1000G_p3.sort.vcf.gz
java -jar VarNote.jar Index -I cosmic.sort.vcf.gz

# Index a database in BED-like format
java -jar VarNote.jar Index -I roadmap.sort.bed.gz

# Index a database in TAB format (using either the coordAllele preset or explicit tab attributes)
java -jar VarNote.jar Index -I:coordAllele dbscSNV.sort.tab.gz
java -jar VarNote.jar Index -I:tab,c=1,b=2,e=2,ref=3,alt=4 dbscSNV.sort.tab.gz

2. Fast counting of intersection

To quickly count data lines from the indexed databases that intersect with genomic features defined in the query file.

java -jar VarNote.jar Count -Q q2.sort.bed \
                            -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                            -D:db,tag=roadmap roadmap.sort.bed.gz

Result file: q2.sort.bed.count.gz

Info

The -D option can be used multiple times to specify multiple annotation databases.
A command split across multiple lines can also be written on a single line.
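When many databases are involved, the repeated -D options can be assembled programmatically. The sketch below only prints the command (a dry run) and does not invoke Java. The function name build_count_cmd and the tag=path argument convention are hypothetical conveniences, and the sketch assumes paths without spaces.

```shell
# Print a Count command with one -D option per "tag=path" argument.
build_count_cmd() {
    local query="$1" cmd spec tag path
    shift
    cmd="java -jar VarNote.jar Count -Q $query"
    for spec in "$@"; do
        tag="${spec%%=*}"    # text before the first "="
        path="${spec#*=}"    # text after the first "="
        cmd="$cmd -D:db,tag=$tag $path"
    done
    printf '%s\n' "$cmd"
}

# Reproduces the Count example above on a single line.
build_count_cmd q2.sort.bed 1000g=1000G_p3.sort.vcf.gz roadmap=roadmap.sort.bed.gz
```

Printing the command first makes it easy to review before running it.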

Note

Count is at least 2 times faster than the other programs.
The Count program only supports the VarNote index.

3. Data retrieval with single region

To quickly retrieve data lines from the indexed databases that intersect with a specified genomic region such as "chr:beginPos-endPos".

#Query a genomic locus
java -jar VarNote.jar RandomAccess -Q 1:2298288-2298289 -D 1000G_p3.sort.vcf.gz

#Query a genomic region
java -jar VarNote.jar RandomAccess -Q 1:959100-959200 \
                                  -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                                  -D:db,tag=cosmic,index=TBI cosmic.sort.vcf.gz

The output is printed to the console.
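The region string follows the pattern chr:beginPos-endPos. The sketch below shows one way to validate such a string in a wrapper script before passing it to RandomAccess. The function name parse_region is hypothetical, and the check that beginPos does not exceed endPos is an added sanity check, not documented VarNote behaviour.

```shell
# Split "chr:beginPos-endPos" into three TAB-separated fields,
# or fail with a message on stderr if the string is malformed.
parse_region() {
    local region="$1" chrom range begin end
    if ! printf '%s\n' "$region" | grep -Eq '^[^:]+:[0-9]+-[0-9]+$'; then
        echo "malformed region: $region" >&2
        return 1
    fi
    chrom="${region%%:*}"   # sequence name before the colon
    range="${region#*:}"    # "beginPos-endPos"
    begin="${range%-*}"
    end="${range#*-}"
    if [ "$begin" -gt "$end" ]; then
        echo "beginPos is greater than endPos: $region" >&2
        return 1
    fi
    printf '%s\t%s\t%s\n' "$chrom" "$begin" "$end"
}
```

For example, `parse_region 1:959100-959200` prints the sequence name, start and end as three TAB-separated fields.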

4. Intersection

To quickly retrieve data lines from the indexed databases that intersect with genomic features defined in the query file.

#Multiple databases using exact mode and different Index(1000g using VarNote and cosmic using TBI).
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz

#Multiple databases using exact mode with left outer join
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                                -loj true

#Intersect mode (using 4 threads to run, default is 1)
java -jar VarNote.jar Intersect -Q q2.sort.bed --maxVariantLength 50000 \
                                -D roadmap.sort.bed.gz \
                                -T 4

#Multiple databases using different mode(roadmap using intersect mode and 1000g using exact mode)
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=roadmap,mode=0 roadmap.sort.bed.gz \
                                -T 4 \
                                -O /path/to/test_data/q1.twomode.overlap.gz

#Intersection using a remote database with a VarNote index. Querying a remote database is relatively slow, so please be patient.
java -jar VarNote.jar Intersect -Q:tab,c=1,b=2,e=2,ref=3,alt=4 q3.sort.tab \
                                -D:db,tag=gnomAD,mode=1 http://202.113.53.226/VarNoteDB/VarNoteDB_AF_gnomAD_Genome.vcf.gz

Result files: q1.sort.vcf.overlap.gz q2.sort.bed.overlap.gz q1.twomode.overlap.gz q3.sort.tab.overlap.gz

Note

Intersect mode: performs the common intersection operation according to the query and database formats.
Exact match mode: forces the program to consider only database records whose chromosome position exactly matches the corresponding chromosome position of the query.
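The difference between the two modes can be illustrated with a small interval comparison. This is a simplified sketch, not VarNote's implementation: the function name match_kind is hypothetical, coordinates are assumed to be closed and 1-based, and VarNote's actual behaviour also depends on the query and database formats (see the Position Resolving Rule).

```shell
# Classify a database interval (db, de) against a query interval (qb, qe).
#   exact   : same start and end, the kind of hit exact match mode keeps
#   overlap : intervals share at least one position, as in intersect mode
#   none    : no shared position
match_kind() {
    local qb="$1" qe="$2" db="$3" de="$4"
    if [ "$qb" -eq "$db" ] && [ "$qe" -eq "$de" ]; then
        echo exact
    elif [ "$qb" -le "$de" ] && [ "$db" -le "$qe" ]; then
        echo overlap
    else
        echo none
    fi
}
```

Under these assumptions, a query SNV at position 100 and a database record covering 50 to 150 would be reported in intersect mode (mode=0) but not in exact match mode (mode=1).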

5. Annotation

Annotation extraction for a list of intervals or variants.
!!! Please read the Annotation Extraction Rule first.

Note

Define an annotation configuration file (set with option -A) to specify the fields to extract.
If -A is not set, the program first searches the query folder for a configuration file named QueryFileName.annoc.
If no annotation configuration file is found, the program extracts all fields in the databases by default.
If the query format is VCF, the default output format is VCF; if the query format is BED or TAB, the default output format is BED. The output format can be changed with the -OF option.
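The lookup order described in this note can be sketched as follows. This mirrors the documented rule rather than VarNote's source code: the function name resolve_annoc and the ALL_FIELDS marker are hypothetical, and the sketch assumes that QueryFileName.annoc means the full query file name with .annoc appended.

```shell
# Print the annotation configuration that would apply to a query file:
#   1. the file given with -A, if any;
#   2. otherwise <query file>.annoc next to the query file, if it exists;
#   3. otherwise ALL_FIELDS (hypothetical marker for "extract all fields").
resolve_annoc() {
    local query="$1" explicit="${2:-}"
    if [ -n "$explicit" ]; then
        echo "$explicit"
    elif [ -f "${query}.annoc" ]; then
        echo "${query}.annoc"
    else
        echo "ALL_FIELDS"
    fi
}
```

Running `resolve_annoc q1.sort.vcf` in the test data folder shows whether an implicit q1.sort.vcf.annoc would be picked up when -A is omitted.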

# Annotation without configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -O ./q1.sort.vcf.allfields.anno.gz \
                            -T 4

# Annotation with configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -A config/all_dbs.annoc

# Change output format with -OF
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -A config/all_dbs.annoc  \
                            -O ./q1.sort.bed.anno.gz \
                            -OF BED

# Annotation using remote database with VarNote index
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=gnomAD,mode=1 http://202.113.53.226/VarNoteDB/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
                            -A config/all_dbs.annoc  \
                            -O ./q1.sort.vcf.remote.anno.gz

Result files: q1.sort.vcf.allfields.anno.gz q1.sort.vcf.anno.gz q1.sort.bed.anno.gz q1.sort.vcf.remote.anno.gz

6. Annotation upon intersection result

Annotate an OVERLAP file, which is the result file of the Intersect program.

Note

Extracting information directly from an OVERLAP file saves a lot of time because the query step is skipped.
Users can change the annotation configuration file to get different results, which is convenient and very fast.

java -jar VarNote.jar AnnotationIntersectFile -I q1.sort.vcf.overlap.gz -A config/all_dbs.annoc

Result files: q1.sort.vcf.anno.gz

7. Run with a config file

Run Intersect or Annotation program with all options defined in a configuration file.

java -jar VarNote.jar IntersectConfig -I config/intersect.full.options.confg
java -jar VarNote.jar IntersectConfig -I config/intersect.required.options.confg
java -jar VarNote.jar AnnotationConfig -I config/anno.full.options.confg

Result files: q1.sort.vcf.overlap.gz

8. VarNote Prioritization ^NEW

VarNote now provides three local pipelines to filter, annotate and prioritize:

disease-causal regulatory variants for GWAS results, VarNote-REG;
pathogenic regulatory variants for rare inherited diseases, VarNote-PAT;
driver regulatory variants for cancers, VarNote-CAN.

Note: Please first download the test data and demo scripts for the VarNote Prioritization functions from here. All of these procedures require VarNote version 1.2.0 or above.

# The data has been successfully tested on java version 8/9/11.

cd ./advanced_funs_test_data  # make sure VarNote.jar is located in the advanced_funs_test_data folder

# Open each script to learn how to run the program for a specific job.
bash script/01-index.sh   # index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient LD calculation.
bash script/02-REG-genomic-Input.sh   # prioritize causal regulatory variants in the LD of each GWAS signal with VCF input.
bash script/02-REG-genomic-Input-comma.sh   # prioritize causal regulatory variants in the LD of each GWAS signal with variant information delimited by comma.
bash script/02-REG-rsID-Input.sh    # prioritize causal regulatory variants in the LD of each GWAS signal with dbSNP variant list.
bash script/03-PAT.sh     # prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
bash script/04-CAN.sh     # prioritize likely cancer driver regulatory mutation given personal cancer genome profile.

!!!! Note: All databases in advanced_funs_test_data are for testing only. Please download the full versions and replace the test databases before analyzing your own data.

9. VarNote Toolkits ^NEW

VarNote also implements several commonly-used tools for efficiently processing genetic variant information, such as format conversion, LD calculation, etc.

Note: Please first download the test data and demo scripts for the VarNote Prioritization/Toolkits functions from here. All of these procedures require VarNote version 1.2.0 or above.

# The data has been successfully tested on java version 8/9/11.

cd ./advanced_funs_test_data  # make sure VarNote.jar is located in the advanced_funs_test_data folder
bash script/01-index.sh   # index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient LD calculation.
bash script/05-LD.sh      # efficiently calculate LD information for a list of variants.
bash script/06-rsID-Conversion.sh # efficiently interconvert dbSNP ID/genomic position.

!!!! Note: All databases in advanced_funs_test_data are for testing only. Please download the full versions and replace the test databases before analyzing your own data.

File Format

File Overview

Most VarNote programs require the following file types as input:

Query File:
A file containing a list of variants or intervals to query.
Annotation Database:
An indexed annotation file, indexed by VarNote (recommended) or Tabix.
Overlap File:
The result file of the Intersect program, which can be used as input to AnnotationIntersectFile.
Annotation Configuration File:
A configuration file that defines the fields to extract in the Annotation program.

The table below summarizes the input files required for each program.

	Index		Query			Annotation
	Index	IndexInfo	Count	RandomAccess	Intersect	Annotation	AnnotationIntersectFile
Query File
Annotation Database
Overlap File
Annotation Configuration File

Supported file types for each input file:

File Type	Plain Text	gzip(.gz, support original file size smaller than 4 Gb)	bgzip(.gz)
Query File
Annotation Database
Annotation Configuration File

Note

The annotation database file must be in bgzip format and position-sorted.
The query file can be in plain text, gzip or bgzip format, and should be position-sorted.
For compressed files, bgzip is strongly recommended.
For gzip, an original (uncompressed) file size of up to 4 GB is currently supported.
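Because gzip and bgzip files share the .gz extension, it can help to check which one a file actually is. The sketch below inspects the first 14 bytes for the BGZF header layout (gzip magic bytes 1f 8b, deflate method 08, the extra-field flag 04, and a "BC" subfield at byte offsets 12 and 13). The function name is_bgzip is hypothetical, and the check assumes a file written by the standard bgzip program, which places the "BC" subfield first.

```shell
# Return 0 if the file starts with a BGZF (bgzip) block header.
is_bgzip() {
    local hex
    # Dump the first 14 bytes as a continuous lowercase hex string.
    hex="$(head -c 14 "$1" | od -An -tx1 | tr -d ' \n')"
    case "$hex" in
        # 1f8b0804, then 8 bytes of MTIME/XFL/OS/XLEN, then "BC" (42 43).
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}
```

A plain gzip file fails this check, which signals that it should be recompressed with bgzip before it is used as an annotation database.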

Query and Database File Format

Both query and annotation database files accept a flexible range of genomic formats as input, including VCF, VCF-Like, BED-Like, BED-Like Allele, Coord-Only, Coord-Allele and TAB.

1. VCF Format

See details about VCF format.

USAGE: java -jar /path/to/VarNote.jar <program name> -I xxx.vcf.gz
       java -jar /path/to/VarNote.jar <program name> -I:vcf xxx.vcf.gz

2. VCF-Like Format

The first five columns of VCF-Like format are the same as those of VCF format; other columns are optional.
VCF-Like format is in effect a TAB format with preset parameters: c=1,b=2,e=2,ref=4,alt=5,0=false

USAGE: java -jar /path/to/VarNote.jar <program name> -I:vcfLike xxx.tab.gz

3. BED-Like Format

The first three columns of BED-Like format must be CHROM, START and END; other columns are optional.

USAGE: java -jar /path/to/VarNote.jar <program name> -I:bed xxx.bed.gz

4. BED-Like Allele Format

The first five columns of BED-Like Allele format must be CHROM, START, END, REF and ALT; other columns are optional.

USAGE: java -jar /path/to/VarNote.jar <program name> -I:bedAllele xxx.tab.gz

5. Coord-Only Format

The first two columns of Coord-Only format must be CHROM and POS; other columns are optional.
Coord-Only format is in effect a TAB format with preset parameters: c=1,b=2,e=2,0=false

USAGE: java -jar /path/to/VarNote.jar <program name> -I:coordOnly xxx.tab.gz

6. Coord-Allele Format

The first four columns of Coord-Allele format must be CHROM, POS, REF and ALT; other columns are optional.
Coord-Allele format is in effect a TAB format with preset parameters: c=1,b=2,e=2,ref=3,alt=4,0=false

USAGE: java -jar /path/to/VarNote.jar <program name> -I:coordAllele xxx.tab.gz

7. TAB Format

Tab-separated data file that contains genomic locations.
The TAB format should be used with attributes:

Required: c,b,e
Optional: ref,alt,0,ci,sep

USAGE: java -jar /path/to/VarNote.jar <program name> -I:tab,c=1,b=4,e=5,0=true xxx.tab.gz
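The preset formats described above can be read as shorthand for explicit TAB attributes. The sketch below collects the three documented equivalences in one lookup. The function name expand_preset is hypothetical; the attribute strings are copied from the format descriptions above.

```shell
# Print the explicit TAB tag that corresponds to a preset format name.
expand_preset() {
    case "$1" in
        vcfLike)     echo "tab,c=1,b=2,e=2,ref=4,alt=5,0=false" ;;
        coordOnly)   echo "tab,c=1,b=2,e=2,0=false" ;;
        coordAllele) echo "tab,c=1,b=2,e=2,ref=3,alt=4,0=false" ;;
        *)
            echo "no documented TAB equivalent for: $1" >&2
            return 1
            ;;
    esac
}
```

For instance, `-I:coordAllele` and `-I:$(expand_preset coordAllele)` describe the same column layout. This is a convenient starting point when a file deviates from a preset in only one column.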

More Examples

# Index an annotation database of bgzip BED-like file.
java -jar VarNote.jar Index -I:bed database1.sorted.bed.gz

# Count database records that intersect a VCF query file.
java -jar VarNote.jar Count -Q:vcf q1.sorted.vcf -D database1.sorted.bed.gz

# Intersect
java -jar VarNote.jar Intersect -Q:coordAllele query.sort.tab -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz

VarNote Tools Documentation

Input File Options

The following options are relevant to input files:

Option	Description
--input,-I:TagArgument (in Index program) --query-file,-Q:TagArgument (in Count, RandomAccess, Intersect, Annotation program)	TagArgument: an argument with a tag and attributes.
	Usage: -I:tag,attr1=XXX,attr2=XXX... /path/to/file
	Possible tags: {vcf, vcfLike, bed, bedAllele, coordOnly, coordAllele, tab}
	Possible attributes for all tags: {sep, ci}
	Possible attributes for the tab tag: {c, b, e, ref, alt, 0}
	Attributes must be used with a tag. Descriptions of the attributes:
	c - column of the sequence name (1-based, required for tab)
	b - column of the start chromosomal position (1-based, required for tab)
	e - column of the end chromosomal position (1-based, required for tab)
	ref - column of the reference allele (optional)
	alt - column of the alternative allele (optional)
	0 - specifies that positions in the data file are 0-based rather than 1-based (optional)
	ci - specifies a new comment indicator instead of "##" (used with the bed or tab tag, optional)
	sep - specifies the character that separates fields in the file; possible values: {TAB, COMMA}
	!! Options b, e, ref, alt and 0 are relevant to the position resolving rule.
--has-header,-header:Boolean	Indicates whether the input file contains a header line that defines column names. If -header is included, the first line below the comment lines is treated as the column header line. The header line must start with '#' or have no indicator.
--header-path,-HP:File	Path of an external file that provides the header and meta lines.
--d-files,-D:TagArgument	TagArgument: an argument with a tag and attributes.
	Usage: -D:db,attr1=XXX,attr2=XXX... /path/to/file
	Possible tags: {db}
	Possible attributes: {index, mode, tag}
	Attributes must be used with a tag. Descriptions of the attributes:
	index - indicates which index to use, VarNote or Tabix. Default value: "VarNote". Possible values: {VarNote, TBI}. Optional.
	mode - mode of intersection. Default value: "0". Possible values: {0, 1, 2}. Optional.
	0: Intersect mode, performs the common intersection operation according to the query and database formats;
	1: Exact match mode, forces the program to consider only database records whose chromosome position exactly matches the corresponding chromosome position of the query;
	2: Full close mode, forces the program to report database records that overlap both endpoints of the query interval regardless of the original query and database formats.
	tag - a label used to rename the database in the output file. By default, the program uses the original file name as the tag for the database. Optional.

Global Options

The following standard options are relevant to most VarNote tools:

Option	Description
--log:Boolean	Whether to print log. Default value: true. Possible values: {true, false}
--use-jdk-inflater,-UJI:Boolean	Use the JDK Inflater instead of the IntelInflater for reading index. Default value: false. Possible values: {true, false}
--use-jdk-deflater,-UJD:Boolean	Use the JDK Deflater instead of the IntelDeflater for writing index. Default value: false. Possible values: {true, false}

Tool-Specific Documentation

Below, you will find detailed documentation of all the options that are specific to each tool.

VarNote Index

Index

The VarNote Index function is a necessary step that builds the index system for fast retrieval. Since most existing annotation databases are indexed by Tabix, VarNote also provides random-sweep searching based on the Tabix index, so VarNote can process existing annotation resources without re-indexing them. However, for large-scale and frequently used annotation datasets such as CADD, gnomAD and dbNSFP, we strongly suggest using the VarNote index system for greater speed.

Usage example:

# Index an annotation database of VCF format.
java -jar VarNote.jar Index -I 1000G_p3.sort.vcf.gz

# Index an annotation database of BED-like format.
java -jar VarNote.jar Index -I roadmap.sort.bed.gz

# Index an annotation database of TAB format.
java -jar VarNote.jar Index -I:coordAllele dbscSNV.sort.tab.gz
java -jar VarNote.jar Index -I:tab,c=1,b=2,e=2,ref=3,alt=4 dbscSNV.sort.tab.gz

Arguments:

Option	Description
--input,-I:TagArgument	The path of the TAB-delimited genome position file compressed by the bgzip program. The file must be position-sorted, first by sequence name and then by leftmost coordinate. Possible tags: {vcf, vcfLike, bed, bedAllele, coordOnly, coordAllele, tab}. Required. Please refer to File Format for more details.
--has-header,-header:Boolean	Please refer to File Format for more details.
--header-path,-HP:File	Please refer to File Format for more details.
--out-folder,-O:String	Output directory. By default, the output files are written to the same folder as the input file. Default value: null.
--skip,-S:Integer	Skip the first INT lines (including comment lines) in the data file. Default value: 0.
Global Options: --use-jdk-deflater,-UJD:Boolean --log:Boolean

IndexInfo

To query related information (such as header, format, meta information or sequence name) stored in the VarNote index file of each annotation database.

Usage example:

# Input database file must have been indexed by VarNote before query.
java -jar VarNote.jar IndexInfo -LC true -PH true -PM true -I 1000G_p3.sort.vcf.gz

java -jar VarNote.jar IndexInfo -I 1000G_p3.sort.vcf.gz -RH "CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO"

Arguments:

Option	Description
--input,-I:File	The indexed annotation database file, must be indexed by VarNote. Required.
--list-chroms,-LC:Boolean	List the sequence names stored in the index file. Default value: false. Possible values: {true, false}
--print-header,-PH:Boolean	Print header line(s). Default value: false. Possible values: {true, false}
--print-format,-PF:Boolean	Print database format. Default value: false. Possible values: {true, false}
--print-meta-data,-PM:Boolean	Print meta information lines. Default value: false. Possible values: {true, false}
--reheader,-RH:String	Replace the column names with the given string. Column names should be separated by commas and enclosed in double quotation marks.
Global Options: --log:Boolean

VarNote Query

Count

To quickly count intersected records in annotation database(s).

Usage example:

java -jar VarNote.jar Count -Q q2.sort.bed \
                            -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                            -D:db,tag=roadmap roadmap.sort.bed.gz

Arguments:

Option	Description
--query-file,-Q:TagArgument	Path of the query file (plain text or compressed files, including gzip and block gzip, are supported). Required. Please refer to File Format for more details.
--has-header,-header:Boolean	Please refer to File Format for more details.
--header-path,-HP:File	Please refer to File Format for more details.
--d-files,-D:TagArgument	Local path or http/ftp address of indexed annotation database(s). Either the VarNote index or the Tabix index should be in the same location as the database file(s). This argument must be specified at least once. Required. Please refer to File Format for more details.
--out-file,-O:String	Output file path. By default, the output file is written to the same folder as the input file. Default value: null.
--thread,-T:Integer	Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1.
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean

RandomAccess

To quickly retrieve (by independent random access) intersected records from indexed annotation database(s) given a genomic region like "chrN:beginPos-endPos".

Usage example:

java -jar VarNote.jar RandomAccess -Q 1:959100-959200 \
                                  -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                                  -D:db,tag=cosmic,index=TBI cosmic.sort.vcf.gz

Arguments:

Option	Description
--q-region,-Q:String	Region specified in the format chrN:beginPos-endPos. Required.
--d-files,-D:TagArgument	Local path or http/ftp address of indexed annotation database(s). Either the VarNote index or the Tabix index should be in the same location as the database file(s). This argument must be specified at least once. Required. Please refer to File Format for more details.
--is-label,-L:Boolean	A flag to determine whether to print the database name as the first column of the result. Default value: true. Possible values: {true, false}
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean

Intersect

To quickly retrieve (by random-sweep algorithm) intersected records from indexed annotation database(s) given query intervals/variants.

Usage example:

#Multiple databases using exact mode and different Index(1000g using VarNote and cosmic using TBI).
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz

#Multiple databases using exact mode with left outer join
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                                -loj true

#Intersect mode (using 4 threads to run, default is 1)
java -jar VarNote.jar Intersect -Q q2.sort.bed --maxVariantLength 50000 \
                                -D roadmap.sort.bed.gz \
                                -T 4

Arguments:

Option	Description
--query-file,-Q:TagArgument	Path of the query file (plain text or compressed files, including gzip and block gzip, are supported). Required. Please refer to File Format for more details.
--has-header,-header:Boolean	Please refer to File Format for more details.
--header-path,-HP:File	Please refer to File Format for more details.
--d-files,-D:TagArgument	Local path or http/ftp address of indexed annotation database(s). Either the VarNote index or the Tabix index should be in the same location as the database file(s). This argument must be specified at least once. Required. Please refer to File Format for more details.
--thread,-T:Integer	Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, the output file is written to the same folder as the input file. Default value: null.
--out-mode,-OM:Integer	Output recording mode. 0: output only query records; 1: output only matched database records; 2: output both query records and matched database records. Default value: 2.
--is-loj,-loj:Boolean	A flag to determine whether to use the left outer join mode. The left outer join mode reports every query record regardless of whether it has intersected records. Default value: false. Possible values: {true, false}
--maxVariantLength,-MVL:Integer	The maximum length of a query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean	Whether to allow large query intervals/variants. Default value: false. Possible values: {true, false}
--is-remove-comment,-RC:Boolean	A flag to determine whether to remove the comment lines (starting with '@') from the output file. Note that the comment lines are required by the VarNote Annotation program. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean
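Since --maxVariantLength defaults to 50 and the Intersect example above passes 50000 for the BED query, it can help to inspect the query file before choosing a value. The sketch below reports the longest interval in a BED-Like file. The function name max_interval_length is hypothetical, and measuring length as END minus START is an assumption; VarNote's own definition of variant length may differ.

```shell
# Print the largest END - START difference among the non-comment lines
# of a TAB-delimited BED-Like file (columns: CHROM, START, END, ...).
max_interval_length() {
    awk -F '\t' '
        !/^#/ && NF >= 3 {
            len = $3 - $2
            if (len > max) max = len
        }
        END { print max + 0 }
    ' "$1"
}
```

If the reported value exceeds the --maxVariantLength setting in use, the setting can be raised accordingly or --allowLargeVariants considered.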

VarNote Annotation

Annotation

To quickly extract (by random-sweep algorithm) desired annotation fields from indexed annotation database(s) given query intervals/variants. It allows feature extraction using both interval-level overlap and variant-level exact matching. It also has an annotation mode supporting allele-specific variant annotation for SNV/Indel and region-specific annotation for large variations.

Usage example:

# Annotation without configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -O ./q1.sort.vcf.allfields.anno.gz \
                            -T 4

# Annotation with configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -A config/all_dbs.annoc

# Change output format with -OF
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -A config/all_dbs.annoc  \
                            -O ./q1.sort.bed.anno.gz \
                            -OF BED

Arguments:

Option	Description
--query-file,-Q:TagArgument	Path of query file (supports plain text or compressed files, including gzip and block gzip). Required. Please refer to File Format for more details.
--has-header,-header:Boolean	Please refer to File Format for more details.
--header-path,-HP:File	Please refer to File Format for more details.
--d-files,-D:TagArgument	Local path or http/ftp address of indexed annotation database(s). Either the VarNote index or the Tabix index should be in the same location as the database file(s). This argument must be specified at least once. Required. Please refer to File Format for more details.
--anno-config,-A:File	Path of annotation extraction configuration file. Without a configuration, the program annotates query variants with all information in the database(s). If -A is not defined, the program automatically searches for a file named /path/to/query/query_file + .annoc in the folder of the query file. Default value: null.
--thread,-T:Integer	Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--out-format,-OF:AnnoOutFormatOutput	Output format. Default value: null. Possible values: {VCF, BED}
--is-loj,-loj:Boolean	A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--force-overlap,-FO:Boolean	Force overlap mode. Force the program to omit REF and ALT matching and allele specific feature extraction. Default value: false. Possible values: {true, false}
--vcf-header-for-bed,-VH:File	VCF output header file. This file is required when the format of query file is BED or TAB-delimited, but the format of annotation output is VCF. Default value: null.
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean
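
The effect of the --is-loj flag described above can be illustrated with a small sketch. This is not VarNote's implementation; the function name report and the in-memory data layout are hypothetical and only show the difference between the default mode and the 'left outer join' mode.

```python
def report(queries, hits, left_outer_join=False):
    """Return (query, matched_records) rows.

    queries: list of query identifiers, in input order (hypothetical layout).
    hits:    dict mapping a query identifier to its matched database records.
    """
    rows = []
    for query in queries:
        matched = hits.get(query, [])
        if matched:
            rows.append((query, matched))
        elif left_outer_join:
            # Left outer join: keep the query even though nothing intersected it.
            rows.append((query, []))
    return rows


if __name__ == "__main__":
    queries = ["1:869244", "1:999999"]
    hits = {"1:869244": ["1000g record"]}
    print(report(queries, hits))                        # default mode
    print(report(queries, hits, left_outer_join=True))  # --is-loj true
```

In the default mode only queries with at least one intersected record are reported; with the flag set, every query record appears in the output.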

AnnotationIntersectFile

To quickly extract desired annotation fields from an existing VarNote intersection file (the OVERLAP file).

Usage example:

java -jar VarNote.jar AnnotationIntersectFile -I q1.sort.vcf.overlap.gz -A config/all_dbs.annoc

Arguments:

Option	Description
--input,-I:File	VarNote intersection file path. Required.
--anno-config,-A:File	Path of annotation extraction configuration file. Without a configuration, the program annotates query variants with all information in the database(s). If -A is not defined, the program automatically searches for a file named /path/to/query/query_file + .annoc in the folder of the query file. Default value: null.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--out-format,-OF:AnnoOutFormatOutput	Output format. Default value: null. Possible values: {VCF, BED}
--is-loj,-loj:Boolean	A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--force-overlap,-FO:Boolean	Force overlap mode. Force the program to omit REF and ALT matching and allele specific feature extraction. Default value: false. Possible values: {true, false}
--vcf-header-for-bed,-VH:File	VCF output header file. This file is required when the format of query file is BED or TAB-delimited, but the format of annotation output is VCF. Default value: null.
Global Options: --use-jdk-inflater,-UJI:Boolean --log:Boolean

Run with config file

IntersectConfig

Run VarNote Intersect program from a config file.

Usage example:

java -jar VarNote.jar IntersectConfig -I config/intersect.full.options.confg
java -jar VarNote.jar IntersectConfig -I config/intersect.required.options.confg

Arguments:

Option	Description
--input,-I:File	Config file path. Required.

AnnotationConfig

Run VarNote Annotation program from a config file.

Usage example:

java -jar VarNote.jar AnnotationConfig -I config/anno.full.options.confg

Arguments:

Option	Description
--input,-I:File	Config file path. Required.

VarNote Prioritization

VarNote-REG

To prioritize causal regulatory variants in the LD of each GWAS signal and provide combined ranking scores based on multiple tissue/cell type-specific prediction methods.

Pre-requirement:
Please download the following annotation databases (including the .gz, .gz.vanno, and .gz.vanno.vi files) required by VarNote-REG:

VarNoteDB_FA_regBase_prediction (hg19/hg38)
VarNoteDB_FP_Roadmap_127Epi (hg19/hg38)
VarNoteDB_FA_CellTypeScore (hg19/hg38)
VarNoteDB_Reference (hg19/hg38)

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) passed as http/ftp URLs; however, the speed depends heavily on the network environment.

Usage example:

## -D:db {FitCons2, GenoSkylinePlus , FUNLDA , GenoNet } prediction scores are used for tissue/cell type-specific prioritization (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
## -F {FitCons2, GenoSkylinePlus, FUNLDA, GenoNet} are used to compute Combined Rank (required)
## -ID is used to specify the tissue/cell type (required)
## -LDE is used to indicate enabling LD extension, -LDE should be used with -B -LDC (optional)
## -GR and -G are used to annotate variant effect (optional)

java -jar VarNote.jar REG -Q:coordAllele $DIR/input/reg.demo.txt \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase_prediction.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=fitCons2 $DBPath/fitcons2.merge.gz \
-D:db,tag=genoSkylinePlus $DBPath/GenoSkylinePlus.merge.gz \
-D:db,tag=funlda $DBPath/FUN-LDA.merge.gz \
-D:db,tag=genoNet $DBPath/GenoNet_all.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-F FitCons2 \
-F GenoSkylinePlus \
-F FUNLDA \
-F GenoNet \
-ID E116 \
-LDE true \
-B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.chr1.vcf.gz.bit \
-LDC 0.9 \
-GR $DBPath/hg19_ensembl.ser \
-G hg19 \
-T 4

Arguments: * is required option

Option	Description
--query-file,-Q:TagArgument *	Path of query file (supports plain text or compressed files, including gzip and block gzip). Required. Please refer to File Format for more details.
--has-header,-header:Boolean	Please refer to File Format for more details.
--header-path,-HP:File	Please refer to File Format for more details.
--d-files,-D:TagArgument *	The following databases are required for the VarNote-REG program: VarNoteDB_FA_regBase_prediction.gz, VarNoteDB_FP_Roadmap_127Epi.bed.gz, fitcons2.merge.gz, GenoSkylinePlus.merge.gz, FUN-LDA.merge.gz, GenoNet_all.gz, pos2snp_b153.gz. Please download the databases (including .gz, .gz.vanno, .gz.vanno.vi) from here. This argument must be specified at least once. Required. Please refer to File Format for more details.
--funcs,-F:RegFunc *	Context-dependent regulatory variant prediction model used to compute the combined rank. This argument must be specified at least once. Required. Possible values: {FitCons2, GenoSkylinePlus, FUNLDA, GenoNet}
--CellID,-ID:CellType *	Tissue/Cell type to extract. Please refer EID in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell IDs Required. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129} Required
--ldExtension,-LDE:Boolean	Set true to enable LD extension. Default value: true. Possible values: {true, false}
--bitFile,-B:String	Compressed genotype file, such as 1000G (.bit and .bit.idx files, generated by the IndexRefGenotype program); required for the --ldExtension option. Default value: null. Please download the prebuilt bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/. Note: files are provided for 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--ldCutoff,-LDC:Double	LD cutoff (a float value in [0.5, 1]). Default value: 0.8.
--ldWindow,-LDW:Integer	LD window in kilobases (an integer value in [1, 1000]). Default value: 100.
--geneRefFile,-GR:String	Path of gene annotation file (a file with the extension .ser). Default value: null. Please download the prebuilt gene reference files (.ser files), including Ensembl, RefSeq, and UCSC, from VarNoteDB_Reference/Gene/.
--Genome,-G:GenomeAssembly	Genome reference assembly version, required for --geneRefFile option. Default value: null. Possible values: {hg19, hg38}
--thread,-T:Integer	Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean	A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--maxVariantLength,-MVL:Integer	Indicator of the max length of query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean	Indicator to allow large query intervals/variants or not Default value: false. Possible values: {true, false}
Global Options: --use-jdk-inflater, -UJI:Boolean --log:Boolean

VarNote-PAT

To prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.

Pre-requirement:
Please download the following annotation databases (including the .gz, .gz.vanno, and .gz.vanno.vi files) required by VarNote-PAT:

VarNoteDB_FA_regBase (hg19/hg38)
VarNoteDB_AF_gnomAD_Genome (hg19/hg38)
VarNoteDB_FP_Roadmap_127Epi (hg19/hg38)
VarNoteDB_FA_dbNSFP (hg19/hg38)
VarNoteDB_Reference (hg19/hg38)

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) passed as http/ftp URLs; however, the speed depends heavily on the network environment.

Usage example:

## -Q indicates the path of WGS/WES data file (VCF or GZIP compressed VCF format) (required)
## -P indicates the path of pedigree file (required)
## -IM is used to indicate the mode of inheritance  (required)
## -F is used to specify the selected scores to compute Combined Rank (required)
## -GR and -G are used to annotate variant effect (required)
## -D:db VarNoteDB_FA_regBase.gz is used to extract function scores (required)
## -D:db VarNoteDB_FP_Roadmap_127Epi.bed.gz is used to filter variant on tissue/cell type-specific epigenomic marks (required when --CellID is specified)
## -D:db VarNoteDB_AF_gnomAD_Genome.vcf.gz is used to filter allele frequency (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)

java -jar VarNote.jar PAT -Q $DIR/input/pat.demo.filter.vcf.gz \
-P $DIR/input/pat.demo.ped \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=dbNSFP $DBPath/VarNoteDB_FA_dbNSFP.gz \
-D:db,tag=gnomad $DBPath/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-IM AUTOSOMAL_DOMINANT \
-GF $configPath/region_exclude.txt \
-GT REGION_EXCLUDE \
-VF $configPath/VF_exclude.txt \
-F Eigen \
-F FATHMM_MKL \
-F FATHMM_XF \
-F GenoCanyon \
-GR $DBPath/hg19_ensembl.ser \
-G hg19 \
-T 4

Arguments: * is required option

Option	Description
--query-file,-Q: *	The path of WGS/WES VCF file (VCF or Gzip compressed VCF format) Required
--pedigree,-P:File *	The path of pedigree file, refer to PLINK or GATK PED file format. Required
--d-files,-D:TagArgument *	The following databases are required for the VarNote-PAT program: VarNoteDB_FA_regBase.gz, VarNoteDB_FP_Roadmap_127Epi.bed.gz, VarNoteDB_FA_dbNSFP.gz, VarNoteDB_AF_gnomAD_Genome.vcf.gz, pos2snp_b153.gz. Please download the databases (including .gz, .gz.vanno, .gz.vanno.vi) from here. This argument must be specified at least once. Required. Please refer to File Format for more details.
--inheritanceMode,-IM:ModeOfInheritance *	Mode of inheritance. Required. Possible values: {AUTOSOMAL_DOMINANT, AUTOSOMAL_RECESSIVE, X_RECESSIVE, X_DOMINANT, ANY}
--funcs,-F:RegFunc *	Prediction model used to compute the combined rank. This argument must be specified at least once. Required. Possible values: {CADD, Eigen, FATHMM_MKL, FATHMM_XF, GenoCanyon, LINSIGHT, ReMM}
--geneRefFile,-GA:String *	Path of gene annotation file (a file with the extension .ser). Required. Please download the prebuilt gene reference files (.ser files), including Ensembl, RefSeq, and UCSC, from VarNoteDB_Reference/Gene/.
--Genome,-G:GenomeAssembly *	Genome reference assembly version. Required. Possible values: {hg19, hg38}
--variantEffectFile,-VF:File	File contains variant effect to exclude. Default value: null.
--genomicFile,-GF:File	File contains genomic region or gene to exclude/include. Default value: null.
--regionFileType,-GT:GenomicRegionType	Types of region/gene to exclude/include, should be used with '-GF' option. Default value: null. Possible values: {GENE_INCLUDE, GENE_EXCLUDE, REGION_INCLUDE, REGION_EXCLUDE, ALL_GENE_INCLUDE, ALL_GENE_EXCLUDE}
--distance,-DT:Integer	Distance to upstream and downstream of gene (kilobase, KB). Default value: 5.
--afCutoff,-AC:Double	Cutoff of allele frequency to filter out germline variant. Default value: 0.005.
--CellID,-ID:CellType	Tissue/Cell type to filter. Please refer EID in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell ID Default value: null. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129}
--CellMark,-CM:CellMark	Chromatin state to filter. '-CM' should be used with '-ID' option. This argument may be specified 0 or more times. Default value: null. Possible values: {DNase, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9me3}
--GTQuality,-GQ:Integer	Min value of Genotyping Quality. Default value: 20.
--VQuality,-VQD:Integer	Min value of Variant Confidence/Quality by Depth. Default value: 2.
--thread,-T:Integer	Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean	A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--maxVariantLength,-MVL:Integer	Indicator of the max length of query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean	Indicator to allow large query intervals/variants or not Default value: false. Possible values: {true, false}
Global Options: --use-jdk-inflater, -UJI:Boolean --log:Boolean
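
The pedigree file passed with -P follows the PLINK/GATK PED layout. As a minimal sketch of what that layout contains, the snippet below reads the six leading PED columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) and lists the affected individuals, using the PLINK convention that phenotype 2 means affected. The names PedRecord, parse_ped and affected_individuals are hypothetical, the sample trio is made up for illustration, and skipping '#' lines is an assumption rather than documented VarNote behaviour.

```python
from typing import List, NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str   # "0" means the father is not in the file
    maternal_id: str   # "0" means the mother is not in the file
    sex: str           # PLINK convention: 1 = male, 2 = female
    phenotype: str     # PLINK convention: 1 = unaffected, 2 = affected


def parse_ped(text: str) -> List[PedRecord]:
    """Parse whitespace-delimited PED text, keeping the six leading columns."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: skip blanks/comments
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"PED line has fewer than 6 columns: {line!r}")
        records.append(PedRecord(*fields[:6]))
    return records


def affected_individuals(records: List[PedRecord]) -> List[str]:
    """Return IDs of individuals whose phenotype column is 2 (affected)."""
    return [r.individual_id for r in records if r.phenotype == "2"]


if __name__ == "__main__":
    demo = """FAM1 father 0 0 1 1
FAM1 mother 0 0 2 2
FAM1 child father mother 1 2
"""
    print(affected_individuals(parse_ped(demo)))
```

Checking the affected/unaffected assignment this way before a run can help confirm that the chosen --inheritanceMode is plausible for the family.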

VarNote-CAN

To prioritize likely cancer driver regulatory mutation given personal cancer genome profile.

Pre-requirement:
Please download the following annotation databases (including the .gz, .gz.vanno, and .gz.vanno.vi files) required by VarNote-CAN:

VarNoteDB_FA_regBase (hg19/hg38)
VarNoteDB_AF_gnomAD_Genome (hg19/hg38)
VarNoteDB_FP_Roadmap_127Epi (hg19/hg38)
VarNoteDB_TA_COSMIC_NonCoding (hg19/hg38)
VarNoteDB_Reference (hg19/hg38)

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) passed as http/ftp URLs; however, the speed depends heavily on the network environment.

Usage example:

## -D:db VarNoteDB_FA_regBase_prediction.gz is used to extract regBase-CAN score (required)
## -D:db VarNoteDB_FP_Roadmap_127Epi.bed.gz is used to filter variant on tissue/cell type-specific epigenomic marks (required when --CellID is specified)
## -D:db VarNoteDB_TA_COSMIC_NonCoding.vcf.gz is used to filter somatic mutation recurrence (required)
## -D:db VarNoteDB_AF_gnomAD_Genome.vcf.gz is used to filter allele frequency (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
## -GA and -G are used to annotate variant effect (required)

java -jar VarNote.jar CAN -Q $DIR/input/can.demo.vcf.gz \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase_prediction.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=cosmic $DBPath/VarNoteDB_TA_COSMIC_NonCoding.vcf.gz \
-D:db,tag=gnomad $DBPath/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-VF $configPath/VF_exclude.txt \
-GA $DBPath/hg19_ensembl.ser \
-G hg19

Arguments: * is required option

Option	Description
--query-file,-Q:TagArgument *	Path of query file (support plain text or compressed file, including gzip and block gzip).Required. Please refer File Format for more details
--has-header,-header:Boolean	Please refer File Format for more details
--header-path,-HP:File	Please refer File Format for more details
--d-files,-D:TagArgument *	The following databases are required for the VarNote-CAN program: VarNoteDB_FA_regBase_prediction.gz, VarNoteDB_FP_Roadmap_127Epi.bed.gz, VarNoteDB_TA_COSMIC_NonCoding.vcf.gz, VarNoteDB_AF_gnomAD_Genome.vcf.gz, pos2snp_b153.gz. Please download the databases (including .gz, .gz.vanno, .gz.vanno.vi) from here. This argument must be specified at least once. Required. Please refer to File Format for more details.
--geneRefFile,-GA:String *	Path of gene annotation file (a file with the extension .ser). Required. Please download the prebuilt gene reference files (.ser files), including Ensembl, RefSeq, and UCSC, from VarNoteDB_Reference/Gene/.
--Genome,-G:GenomeAssembly *	Genome reference assembly version. Possible values: {hg19, hg38} Required
--recurRate,-RR:Integer	Min value of recurrence rate of COSMIC. Default value: 1.
--afCutoff,-AFC:Double	Cutoff of allele frequency to filter out germline variant. Default value: 0.005.
--variantEffectFile,-VF:File	File contains variant effect to exclude. Default value: null.
--genomicFile,-GF:File	File contains genomic region or gene to exclude/include. Default value: null.
--regionFileType,-GT:GenomicRegionType	Types of region/gene to exclude/include, should be used with '-GF' option. Default value: null. Possible values: {GENE_INCLUDE, GENE_EXCLUDE, REGION_INCLUDE, REGION_EXCLUDE, ALL_GENE_INCLUDE, ALL_GENE_EXCLUDE}
--distance,-DT:Integer	Distance to upstream and downstream of gene (kilobase, KB). Default value: 5.
--CellID,-ID:CellType	Tissue/Cell type to filter. Please refer EID in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell ID Default value: null. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129}
--CellMark,-CM:CellMark	Chromatin state to filter. '-CM' should be used with '-ID' option. This argument may be specified 0 or more times. Default value: null. Possible values: {DNase, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9me3}
--thread,-T:Integer	Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean	A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--maxVariantLength,-MVL:Integer	Indicator of the max length of query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean	Indicator to allow large query intervals/variants or not Default value: false. Possible values: {true, false}
Global Options: --use-jdk-inflater, -UJI:Boolean --log:Boolean

VarNote Toolkits

IndexRefGenotype

To index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient linkage disequilibrium (LD) calculation.

Usage example:

java -jar VarNote.jar IndexRefGenotype -I 1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz

Arguments: * is required option

Option	Description
--input,-I:File *	Path of VCF file with individual phased genotypes. Required.

NOTE: VarNote provides indexed genotypes for the 1000 Genomes Project phase 3. Please download the indexed files from VarNoteDB_Reference/LD/ (hg19/hg38).

LDBatch

To efficiently calculate all LD variants for a list of variants.

Usage example:

java -jar VarNote.jar LDBatch -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -Q:coordAllele $DIR/input/reg.demo.txt

Arguments: * is required option

Option	Description
--bit-file,-B:File*	Bit file of indexed reference population genotypes. Required. Please download the prebuilt 1000G bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/. Note: files are provided for 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--query-file,-Q:TagArgument *	Path of query file (support plain text or compressed file, including gzip and block gzip).Required. Please refer File Format for more details
--has-header,-header:Boolean	Please refer File Format for more details
--header-path,-HP:File	Please refer File Format for more details
--ld-window-kb,-D:Integer	LD window (kilobase, KB). Default value: 100.
--cutoff,-C:Double	LD cutoff (default: 0.8). Default value: 0.8.
--thread,-T:Integer	Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean	A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean	A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
Global Options: --help,-h:Boolean --log:Boolean

LDVariant

To efficiently calculate all LD variants for a variant.

Usage example:

java -jar VarNote.jar LDVariant -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -Q 1:3325912-C-A

Arguments: * is required option

Option	Description
--bit-file,-B:File*	Bit file of indexed reference population genotypes. Required. Please download the prebuilt 1000G bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/. Note: files are provided for 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--query-loc,-Q:String*	Genomic feature specified as the format "chr:pos-ref-alt" for variant or "chr:beginPos-endPos" for region. Required.
--is-region,-R:Boolean	Indicator whether the input genomic feature is a region or not. The program will enumerate all known variants for LD calculations if the input is a genomic region. Default value: false. Possible values: {true, false}
--ld-window-kb,-D:Integer	LD window (kilobase, KB). Default value: 100.
--cutoff,-C:Double	LD cutoff. Default value: 0.8.
Global Options: --help,-h:Boolean
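
The two --query-loc layouts described above ("chr:pos-ref-alt" for a variant, "chr:beginPos-endPos" for a region) can be told apart as in the following sketch. The function parse_query_loc and its return layout are hypothetical, and the region coordinates in the demo are made up for illustration; VarNote's own parser may validate input differently.

```python
def parse_query_loc(loc, is_region=False):
    """Split a query location string into its parts.

    Variant form: "chr:pos-ref-alt"      (e.g. "1:3325912-C-A")
    Region form:  "chr:beginPos-endPos"  (used when is_region is True)
    """
    chrom, sep, rest = loc.partition(":")
    if not sep or not chrom or not rest:
        raise ValueError(f"expected 'chr:...', got {loc!r}")
    parts = rest.split("-")
    if is_region:
        if len(parts) != 2:
            raise ValueError(f"expected 'chr:beginPos-endPos', got {loc!r}")
        begin, end = int(parts[0]), int(parts[1])
        if begin > end:
            raise ValueError("beginPos must not exceed endPos")
        return {"chrom": chrom, "begin": begin, "end": end}
    if len(parts) != 3:
        raise ValueError(f"expected 'chr:pos-ref-alt', got {loc!r}")
    return {"chrom": chrom, "pos": int(parts[0]), "ref": parts[1], "alt": parts[2]}


if __name__ == "__main__":
    print(parse_query_loc("1:3325912-C-A"))
    print(parse_query_loc("1:3325000-3326000", is_region=True))
```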

LDPair

To efficiently calculate LD for a single pair of variants.

Usage example:

java -jar VarNote.jar LDPair -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -P1 1:3325912-C-A -P2 1:3326796-C-G

Arguments: * is required option

Option	Description
--bit-file,-B:File*	Bit file of indexed reference population genotypes. Required. Please download the prebuilt 1000G bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/. Note: files are provided for 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--P1,-P1:String*	SNP A (format chr:pos-ref-alt) Required.
--P2,-P2:String*	SNP B (format chr:pos-ref-alt) Required.
Global Options: --help,-h:Boolean
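
For readers unfamiliar with pairwise LD, the sketch below computes the widely used r² statistic from phased haplotypes, which is why the reference genotypes must be phased. This documentation does not state which LD statistic VarNote reports or applies its cutoff to, so treating it as r² is an assumption; the function name ld_r2 and the 0/1 haplotype encoding are likewise hypothetical.

```python
def ld_r2(hap_a, hap_b):
    """Compute r^2 between two biallelic variants from phased haplotypes.

    hap_a, hap_b: equal-length sequences with one entry per haplotype,
                  0 for the reference allele and 1 for the alternate allele.
    Returns None when either variant is monomorphic (r^2 is undefined).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n                      # ALT frequency at variant A
    p_b = sum(hap_b) / n                      # ALT frequency at variant B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


if __name__ == "__main__":
    print(ld_r2([0, 1, 0, 1], [0, 1, 0, 1]))  # identical haplotype patterns
    print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent
```

Under this assumption, a pair would pass a cutoff such as 0.8 when the returned value is at least 0.8.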

rsIDConversion

To efficiently interconvert dbSNP ID/genomic position.

Pre-requirement:
Please download the following annotation databases (including the .gz, .gz.vanno, and .gz.vanno.vi files) required by rsIDConversion:

VarNoteDB_Reference/rs2pos (hg19/hg38)

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) passed as http/ftp URLs; however, the speed depends heavily on the network environment.

Usage example:

java -jar VarNote.jar rsIDConversion -I $DIR/input/reg.demo.txt -F CoordAllele -M POS2SNP -D:db,tag=pos2snp $DBPath/pos2snp_b153.gz
java -jar VarNote.jar rsIDConversion -I $DIR/input/reg.demo.rsid.txt -M SNP2POS -D:db,tag=merge $DBPath/merged_b153.gz -D:db,tag=snp2pos $DBPath/snp2pos_b153.gz

Arguments: * is required option

Option	Description
--input,-I:File *	Path of input file to convert. Required.
--d-files,-D:TagArgument *	SNP2POS requires the merged_b153.gz and snp2pos_b153.gz databases; POS2SNP requires the pos2snp_b153.gz database. Please download the databases from the VarNoteDB_Reference/rs2pos/ folder. This argument must be specified at least once. Required. Please refer to File Format for more details.
--Mode,-M:Snp2PosMode *	SNP2POS converts rsID to genomic position, and POS2SNP converts genomic position (with ref and alt) to rsID. Required. Possible values: {SNP2POS, POS2SNP}
--FileType,-F:QuickFileType	File type of input file for POS2SNP mode. Default value: null. Possible values: {VCF, VCFLike, CoordOnly, CoordAllele}
--MatchRefAlt,-FM:Boolean	Force matching of ref and alt for POS2SNP mode. Default value: false. Possible values: {true, false}
--thread,-T:Integer	Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String	Output file path. By default, output file will be written into the same folder as the input file. Default value: null.

Annotation Extraction Rule

1. Allele-specific extraction (using configuration file all_dbs.annoc)

Example 1: 1:869244 C|T matches 2 features of 1000g exactly by position, as follows:

#query	1	869244	rs575524849	C	T	100	PASS	.
1000g	1	869244	rs200586552	C	CAG	100	PASS	AC=321;AF=0.0640974;AN=5008;NS=2504;DP=10653;EAS_AF=0;AMR_AF=0.0187;AFR_AF=0.2322;EUR_AF=0.001;SAS_AF=0;VT=INDEL
1000g	1	869244	rs575524849	C	T	100	PASS	AC=1;AF=0.000199681;AN=5008;NS=2504;DP=10653;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=c|||;VT=SNP

By default, the Annotation program extracts only the 1:869244 C|T record.

1 869244 rs575524849 C T 100 PASS .;1000g_AC=1;1000g_AF=0.000199681

When --force-overlap is set to true, the Annotation program extracts all features (omitting REF and ALT matching).

1 869244 rs575524849 C T 100 PASS .;1000g_AC=321,1;1000g_AF=0.0640974,0.000199681

Example 2: 1:1404746 C|A matches 1 feature of 1000g exactly by position, as follows:

#query	1	1404746	rs147265720	C	A	100	PASS	.
1000g	1	1404746	rs147265720	C	A,T	100	PASS	AC=119,11;AF=0.023762,0.00219649;AN=5008;NS=2504;DP=8984;EAS_AF=0,0;AMR_AF=0.0072,0.0159;AFR_AF=0.0862,0;EUR_AF=0,0;SAS_AF=0,0;AA=N|||;VT=SNP;MULTI_ALLELIC

By default, the Annotation program extracts only the information for the A allele.

1 1404746 rs147265720 C A 100 PASS .;1000g_AC=119;1000g_AF=0.023762
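
The allele-specific behaviour in the two examples above can be sketched as follows. This is an illustration under stated assumptions, not VarNote's code: the helper names parse_info and extract_allele_specific are hypothetical, and an INFO field is treated as per-allele simply when it carries one comma-separated value per ALT allele.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def extract_allele_specific(query_ref, query_alt, db_ref, db_alt, db_info, fields):
    """Return the requested INFO fields for the ALT allele matching the query.

    An empty dict is returned when REF differs or the query ALT is not
    among the database record's ALT alleles.
    """
    alts = db_alt.split(",")
    if db_ref != query_ref or query_alt not in alts:
        return {}
    index = alts.index(query_alt)
    info = parse_info(db_info)
    result = {}
    for name in fields:
        if name not in info:
            continue
        value = info[name]
        if value is True:            # flag field, nothing to split
            result[name] = True
            continue
        values = value.split(",")
        # Assumption: one value per ALT allele means the field is per-allele.
        result[name] = values[index] if len(values) == len(alts) else value
    return result


if __name__ == "__main__":
    info = "AC=119,11;AF=0.023762,0.00219649;AN=5008"
    print(extract_allele_specific("C", "A", "C", "A,T", info, ["AC", "AF"]))
    print(extract_allele_specific("C", "T", "C", "CAG", "AC=321", ["AC"]))
```

The first call mirrors Example 2 (only the values for the A allele are kept); the second mirrors the indel record in Example 1, which is skipped because its ALT does not match the query.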

2. Extraction from multiple databases (using configuration file all_dbs.annoc)

Example 1: 1:3646192 G|A matches 13 features of 1000g and cosmic exactly by position, as follows:

#query	1	3646192	rs1885867	G	A	100	PASS	.
1000g	1	3646192	rs1885867	G	A	100	PASS	AC=2992;AF=0.597444;AN=5008;NS=2504;DP=19582;EAS_AF=0.3284;AMR_AF=0.6628;AFR_AF=0.6059;EUR_AF=0.833;SAS_AF=0.5746;AA=g|||;VT=SNP;EX_TARGET
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000378285;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1049+180G>A;AA=p.?;CNT=3
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000603362;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000354437;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000604479;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
...

The Annotation program extracts the following information.

1	3646192	rs1885867	G	A	100	PASS	.;1000g_AC=2992;1000g_AF=0.597444;cosmic_GENE=TP73_ENST00000378285,TP73_ENST00000603362,TP73_ENST00000354437,TP73_ENST00000604479,TP73_ENST00000357733,TP73,TP73_ENST00000378280,TP73_ENST00000346387,TP73_ENST00000604074,TP73_ENST00000378290,TP73_ENST00000378288;cosmic_STRAND=+,+,+,+,+,+,+,+,+,+,+

Example 2: 1:3318823 G|A matches 9 features of 1000g and cosmic exactly by position, as follows:

#query	1	3318823	rs534786798	G	A	100	PASS	.
1000g	1	3318823	rs534786798	G	A	100	PASS	AC=2;AF=0.000399361;AN=5008;NS=2504;DP=15458;EAS_AF=0;AMR_AF=0;AFR_AF=0.0015;EUR_AF=0;SAS_AF=0;AA=G|||;VT=SNP
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000378398;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.680-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000378391;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000441472;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000442529;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
...

The annotation program extracts only the following information:

Default output
1	3318823	rs534786798	G	A	100	PASS	.;1000g_AC=2;1000g_AF=0.000399361

Users can set --force-overlap to true to skip REF and ALT matching.

Setting --force-overlap to true
1	3318823	rs534786798	G	A	100	PASS	.;1000g_AC=2;1000g_AF=0.000399361;cosmic_GENE=PRDM16,PRDM16_ENST00000378398,PRDM16_ENST00000378391,PRDM16_ENST00000441472,PRDM16_ENST00000442529,PRDM16_ENST00000511072,PRDM16_ENST00000514189;cosmic_STRAND=+,+,+,+,+,+,+
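
The difference between the two outputs can be sketched in Python as follows. This is an illustrative model only, not VarNote's code: merge_hits and its arguments are hypothetical names, and the sketch assumes that values from several records of one database are joined with commas in record order, as the outputs above suggest.

```python
def merge_hits(query_ref, query_alt, hits, keys, force_overlap=False):
    """Merge selected INFO fields from position-matched database records.

    hits: list of (db_name, ref, alt, info_dict) tuples at the query position.
    keys: dict mapping db_name to the list of fields to extract.
    """
    collected = {}
    for db, ref, alt, info in hits:
        allele_match = ref == query_ref and query_alt in alt.split(",")
        # Without force_overlap, records whose alleles differ are skipped.
        if not (allele_match or force_overlap):
            continue
        for key in keys.get(db, []):
            if key in info:
                collected.setdefault(f"{db}_{key}", []).append(info[key])
    return ";".join(f"{name}={','.join(vals)}" for name, vals in collected.items())


hits = [
    ("1000g", "G", "A", {"AC": "2", "AF": "0.000399361"}),
    ("cosmic", "G", "C", {"GENE": "PRDM16", "STRAND": "+"}),
    ("cosmic", "G", "C", {"GENE": "PRDM16_ENST00000378398", "STRAND": "+"}),
]
keys = {"1000g": ["AC", "AF"], "cosmic": ["GENE", "STRAND"]}
print(merge_hits("G", "A", hits, keys))
print(merge_hits("G", "A", hits, keys, force_overlap=True))
```

In the first call the cosmic records are dropped because their ALT allele (C) differs from the query ALT (A); in the second call they are kept because only the position is compared.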

Position Resolving Rule

Because chromosome positions in both the query and the annotation database are critical, VarNote parses the position of each record according to both the file format and the allele composition, and then converts it to a 0-based, half-open coordinate system (start inclusive, end exclusive). The position resolving rules are listed below.

Note

The REF column is used for position resolving in all programs. For example, "1 10177 . ACC A" is parsed as (10176, 10179).

The REF and ALT columns are used for allele-specific extraction in the Annotation programs.

VCF format, vcf is one-based

Data	Position Resolved	Description
1 10177 . A T ...	10176, 10177	SNV
1 10177 . A ACC ...	10176, 10177	INS
1 10177 . ACC A ...	10176, 10179	DEL, end=10176+3
1 10177 . GGCGCG TCCGCA ...	10176, 10182
1 10177 . CGCA TGCA,C ...	10176, 10180
1 10177 . GGGG G . . END=10180	10176, 10180	Read END from INFO
Structural variant with confidence interval of breakpoint
1 869465 1 N <DEL> 1293.8 . SVTYPE=DEL;END=870217;CIPOS=-10,157;CIEND=-84,10;	869454, 870227	SVTYPE with END in INFO, beg=beg+CIPOS[0], end=end+CIEND[1], that is beg=869464-10, end=870217+10
1 1157791 4345_1 N N[4:76212291[ 0.0 . SVTYPE=BND;CIPOS=-8,8;CIEND=-9,6	1157782, 1157797	SVTYPE, beg=beg+CIPOS[0], end=end+CIEND[1], that is beg=1157790-8, end=1157791+6
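
The VCF rules in the table above can be reproduced with a short Python sketch. It is a reading of the table rather than VarNote's source code: resolve_vcf is a hypothetical name, and the sketch assumes that CIPOS and CIEND are applied only when SVTYPE is present in INFO.

```python
def resolve_vcf(pos, ref, info=""):
    """Return (beg, end) in 0-based, half-open coordinates for a VCF record."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    beg = pos - 1  # VCF POS is one-based
    # END in INFO takes precedence; otherwise the REF length sets the end.
    end = int(fields["END"]) if "END" in fields else beg + len(ref)
    if "SVTYPE" in fields:
        # Widen the interval by the outer confidence bounds of the breakpoints.
        if "CIPOS" in fields:
            beg += int(fields["CIPOS"].split(",")[0])
        if "CIEND" in fields:
            end += int(fields["CIEND"].split(",")[1])
    return beg, end


print(resolve_vcf(10177, "A"))                    # SNV
print(resolve_vcf(10177, "ACC"))                  # DEL
print(resolve_vcf(10177, "GGGG", "END=10180"))    # END read from INFO
print(resolve_vcf(869465, "N", "SVTYPE=DEL;END=870217;CIPOS=-10,157;CIEND=-84,10;"))
```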

BED-like Format, bed is zero-based

Data	Position Resolved	Description
1 10177 10178 . A T	10177, 10178	SNV
1 10177 10178 . A ACC	10177, 10178	INS
1 10177 10179 . ACC A	10177, 10179	DEL

Tab Format, default is one-based

Data	Position Resolved	Description
1 10177 . A T	10176, 10177	SNV
1 10177 . A ACC	10176, 10177	INS
1 10177 . ACC A	10176, 10179	DEL, end=10176+3
1 10177 10178 . A ACC	10176, 10178	INS, end read from the third column
1 10177 10179 . ACC A	10176, 10179	DEL, end read from the third column

Tab Format, zero-based

Data	Position Resolved	Description
1 10177 . A T	10177, 10178	SNV
1 10177 . A ACC	10177, 10178	INS
1 10177 . ACC A	10177, 10180	DEL, end=10177+3
1 10177 10178 . A ACC	10177, 10178	INS, end read from the third column
1 10177 10179 . ACC A	10177, 10179	DEL, end read from the third column
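
The two TAB tables differ only in how the begin column is interpreted, which the following Python sketch tries to capture. The name resolve_tab and its parameters are hypothetical; the sketch assumes, as the tables suggest, that an explicit end column is used unchanged in both modes and that the REF length is used only when no end column exists.

```python
def resolve_tab(beg, ref, end=None, zero_based=False):
    """Return (start, stop) in 0-based, half-open coordinates for a TAB record."""
    # A one-based begin is shifted down by one; a zero-based begin is kept.
    start = beg if zero_based else beg - 1
    # Without an end column, the REF length determines the end.
    stop = end if end is not None else start + len(ref)
    return start, stop


print(resolve_tab(10177, "ACC"))                         # one-based, no end column
print(resolve_tab(10177, "ACC", zero_based=True))        # zero-based, no end column
print(resolve_tab(10177, "ACC", end=10179))              # one-based, end column
print(resolve_tab(10177, "ACC", end=10179, zero_based=True))
```

Under these assumptions, BED-like records behave like the zero-based case with an end column.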

Please cite VarNote as follows:

Huang D, Yi X, Zhou Y, Yao H, Xu H, Wang J, Zhang S, Nong W, Wang P, Shi L, Xuan C, Li M, Wang J, Li W, Kwan HS, Sham PC, Wang K, Li MJ*. Ultrafast and scalable variant annotation and prioritization with big functional genomics data. Genome Res. 2020 Dec;30(12):1789-1801.
