Quick Tutorial

System Requirements

VarNote requires Java Runtime Environment (JRE) version 8.0 or above. To check your Java version, open a terminal and run the following command:

java -version

The output should report java version "1.8.x" or above. If it does not, update Java; see the Oracle Java website to download and install the latest JDK.
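The version check can also be scripted. The following is a minimal sketch, not part of VarNote: the function names are hypothetical, and it assumes the usual version string layouts ("1.8.0_201" for Java 8, "11.0.2" for Java 11).

```python
import re


def java_major_version(version_output: str) -> int:
    """Return the major Java version found in the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: "1.8.0_201" means Java 8. Modern scheme: "11.0.2" means Java 11.
    if first == 1 and second is not None:
        return int(second)
    return first


def meets_requirement(version_output: str, minimum: int = 8) -> bool:
    """True if the reported Java version is at least `minimum`."""
    return java_major_version(version_output) >= minimum
```

Note that `java -version` prints to standard error, so a script that captures the output (for example with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`) should read that stream.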

Installation and Test Data

The VarNote command-line tools are distributed as an executable Java JAR file. You can download the latest release of the VarNote JAR file from here and sample data for testing from here. The VarNote source code is available at https://github.com/mulinlab/VarNote.

NEW: Test data and a demo script for the VarNote Prioritization/Toolkits functions can be downloaded from here.

Available Tools

USAGE: java -jar /path/to/VarNote.jar <program name> [-h]

Program Summary Table:
--------------------------------------------------------------------------------------
VarNote Index:                                  Tools that generates index for the compressed database file.
    Index                                       To generate VarNote index (".vanno" and ".vanno.vi") for compressed (block gzip) annotation database file.
    IndexInfo                                   To query related information (such as header, format, meta information or sequence name) stored in the VarNote index file of each annotation database.

--------------------------------------------------------------------------------------
VarNote Query:                                  Tools that quickly retrieve data lines from database(s).
    Count                                       To quickly count intersected records in annotation database(s).
    RandomAccess                                To quickly retrieve (by independent random access) intersected records from indexed annotation database(s) given a genomic region like "chrN:beginPos-endPos".
    Intersect                                   To quickly retrieve (by random-sweep algorithm) intersected records from indexed annotation database(s) given query intervals/variants.
    IntersectConfig                              Run VarNote Intersect program with a config file.

--------------------------------------------------------------------------------------
VarNote Annotation:                             Tools that identifies desired annotation fields from database(s).
    Annotation                                  To quickly extract (by random-sweep algorithm) desired annotation fields from indexed annotation database(s) given query intervals/variants.
    AnnotationIntersectFile                     To quickly extract desired annotation fields from an existing VarNote intersection file.
    AnnotationConfig                             Run VarNote Annotation program with a config file.

--------------------------------------------------------------------------------------
VarNote Prioritization (NEW):                   Several local pipelines, matching the online version, let researchers run genome-scale regulatory variant annotation and prioritization locally and compare results obtained with different parameters.
    VarNote-REG                                 To prioritize causal regulatory variants in the LD of each GWAS signal and provide combined ranking scores based on multiple tissue/cell type-specific prediction methods.
    VarNote-PAT                                 To prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
    VarNote-CAN                                 To prioritize likely cancer driver regulatory mutation given personal cancer genome profile.

--------------------------------------------------------------------------------------
VarNote Toolkits (NEW):                         Several useful tools that ease the analysis of genetic sequencing data.
    rsIDConversion                              To efficiently interconvert dbSNP ID/genomic position.
    IndexRefGenotype                            To index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient linkage disequilibrium (LD) calculation.
    LDPair                                      To efficiently calculate LD for a single pair of variants.
    LDVariant                                   To efficiently calculate all LD variants for a variant.
    LDBatch                                     To efficiently calculate all LD variants for a list of variants.

                    

Quick Start

1. Index Database

The first step after downloading VarNote is to index the annotation database files in the test data. The VarNote Index program generates the index files (.vanno and .vanno.vi) for each annotation database.

Note

  • The annotation database file must be a TAB-delimited genome position file compressed with the bgzip program (http://www.htslib.org/doc/bgzip.html); its name ends with .gz or .bgz.
  • The annotation database file must be position-sorted (first by sequence name and then by leftmost coordinate).
  • The index files should be used together with the original database file.
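Before compressing and indexing a database, the sort requirement above can be checked with a few lines of code. This is a minimal sketch under stated assumptions, not VarNote's own validation: the function name and parameters are hypothetical, and it only verifies that records of each sequence are contiguous and ordered by start position (it does not enforce any particular order of the sequence names).

```python
def is_position_sorted(lines, chrom_col=1, start_col=2, comment_prefix="#"):
    """Check that TAB-delimited records are grouped by sequence name
    and ordered by start coordinate within each sequence.

    chrom_col and start_col are 1-based column numbers.
    """
    seen = set()
    prev_chrom, prev_start = None, None
    for line in lines:
        if not line.strip() or line.startswith(comment_prefix):
            continue  # skip blank, comment and header lines
        fields = line.rstrip("\n").split("\t")
        chrom = fields[chrom_col - 1]
        start = int(fields[start_col - 1])
        if chrom != prev_chrom:
            if chrom in seen:
                return False  # a sequence name reappears after another one
            seen.add(chrom)
        elif start < prev_start:
            return False  # start coordinates go backwards within a sequence
        prev_chrom, prev_start = chrom, start
    return True
```

For a plain-text file, `is_position_sorted(open("roadmap.sort.bed"))` would run the check line by line without loading the whole file.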

Info

  • VarNote also provides random-sweep searching on a Tabix index, for users who do not want to re-index a large database.
  • Using a Tabix index is about 10 times slower than using a VarNote index, so we strongly suggest indexing the database with the VarNote Index program before querying.

Command line usage example:

# Move VarNote-XXX.jar into the test data folder and rename it to VarNote.jar.
mv VarNote-XXX.jar /path/to/test_data/VarNote.jar
cd /path/to/test_data

# List all programs of VarNote.
java -jar VarNote.jar

# Displays options specific to Index.
java -jar VarNote.jar Index

# Sort the annotation database file and compress it with the htslib bgzip program before indexing.
sort -k1,1 -k2,2n roadmap.bed > roadmap.sort.bed
/path/to/htslib-1.X/bgzip roadmap.sort.bed

# Index a VCF-format file
java -jar VarNote.jar Index -I 1000G_p3.sort.vcf.gz
java -jar VarNote.jar Index -I cosmic.sort.vcf.gz

# Index a BED-like file
java -jar VarNote.jar Index -I roadmap.sort.bed.gz

# Index a TAB-format file (the coordAllele tag is a preset for c=1,b=2,e=2,ref=3,alt=4,0=false)
java -jar VarNote.jar Index -I:coordAllele dbscSNV.sort.tab.gz
java -jar VarNote.jar Index -I:tab,c=1,b=2,e=2,ref=3,alt=4 dbscSNV.sort.tab.gz

2. Fast counting of intersection

Count quickly counts the data lines in the indexed databases that intersect the genomic features defined in the query file.

java -jar VarNote.jar Count -Q q2.sort.bed \
                            -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                            -D:db,tag=roadmap roadmap.sort.bed.gz

Result file: q2.sort.bed.count.gz

Info

  • The -D option can be given multiple times to specify multiple annotation databases.
  • A command split over multiple lines can also be written on a single line.

Note

  • Count is at least 2 times faster than the other programs.
  • The Count program only supports the VarNote index.

3. Data retrieval with single region

RandomAccess quickly retrieves the data lines in the indexed databases that intersect a specified genomic region such as "chr:beginPos-endPos".

# Query a genomic locus
java -jar VarNote.jar RandomAccess -Q 1:2298288-2298289 -D 1000G_p3.sort.vcf.gz

# Query a genomic region
java -jar VarNote.jar RandomAccess -Q 1:959100-959200 \
                                  -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                                  -D:db,tag=cosmic,index=TBI cosmic.sort.vcf.gz

The results are printed to the console.
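When regions are generated by a script, it can help to validate the "chr:beginPos-endPos" string before passing it to -Q. The sketch below is illustrative only; the function name is hypothetical and the strictness of the pattern (no thousands separators, begin not greater than end) is an assumption rather than VarNote's documented behavior.

```python
import re

# Sequence name, a colon, then two integer positions separated by a hyphen.
REGION_PATTERN = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_region(region: str):
    """Split a region such as "1:959100-959200" into (chrom, begin, end)."""
    match = REGION_PATTERN.match(region.strip())
    if match is None:
        raise ValueError(f"not a chr:beginPos-endPos region: {region!r}")
    chrom = match.group(1)
    begin, end = int(match.group(2)), int(match.group(3))
    if begin > end:
        raise ValueError(f"begin is greater than end in region: {region!r}")
    return chrom, begin, end
```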

4. Intersection

Intersect quickly retrieves the data lines in the indexed databases that intersect the genomic features defined in the query file.

# Multiple databases in exact match mode with different indexes (1000g uses the VarNote index, cosmic uses TBI).
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz

# Multiple databases in exact match mode with left outer join
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                                -loj true

# Intersect mode (running with 4 threads; the default is 1)
java -jar VarNote.jar Intersect -Q q2.sort.bed --maxVariantLength 50000 \
                                -D roadmap.sort.bed.gz \
                                -T 4

# Multiple databases in different modes (roadmap in intersect mode, 1000g in exact match mode)
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=roadmap,mode=0 roadmap.sort.bed.gz \
                                -T 4 \
                                -O /path/to/test_data/q1.twomode.overlap.gz

# Intersection using a remote database with a VarNote index. Querying a remote database is relatively slow, so please be patient.
java -jar VarNote.jar Intersect -Q:tab,c=1,b=2,e=2,ref=3,alt=4 q3.sort.tab \
                                -D:db,tag=gnomAD,mode=1 http://202.113.53.226/VarNoteDB/VarNoteDB_AF_gnomAD_Genome.vcf.gz

Result files: q1.sort.vcf.overlap.gz  q2.sort.bed.overlap.gz  q1.twomode.overlap.gz  q3.sort.tab.overlap.gz

Note

  • Intersect mode: performs the common intersection operation according to the query and database formats.
  • Exact match mode: forces the program to consider only database records whose chromosome position exactly matches the corresponding chromosome position of the query.
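The difference between the two modes can be illustrated on toy records. This is a conceptual sketch only and not VarNote's implementation: the Record type and function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and "exact match" is assumed here to mean that both begin and end coincide.

```python
from typing import NamedTuple


class Record(NamedTuple):
    chrom: str
    begin: int  # 1-based, inclusive (assumption for this sketch)
    end: int    # 1-based, inclusive (assumption for this sketch)


def intersect_match(query: Record, db: Record) -> bool:
    """mode=0 style test: the two intervals share at least one position."""
    return query.chrom == db.chrom and query.begin <= db.end and db.begin <= query.end


def exact_match(query: Record, db: Record) -> bool:
    """mode=1 style test: the database position equals the query position."""
    return query.chrom == db.chrom and query.begin == db.begin and query.end == db.end


def select(query: Record, database, mode: int):
    """Return the database records reported for `query` under the given mode."""
    test = exact_match if mode == 1 else intersect_match
    return [record for record in database if test(query, record)]
```

With a query at 1:100 and database records at 1:100 and 1:90-110, mode 0 reports both records while mode 1 reports only the first.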

5. Annotation

Annotation extracts annotation fields for a list of intervals or variants.
Important: please read the Annotation Extraction Rule first.

Note

  • Define an annotation configuration file (set with option -A) to specify the fields to extract.
  • If -A is not set, the program first searches the query folder for a configuration file named QueryFileName.annoc.
  • If no annotation configuration file is found, the program extracts all fields in the databases by default.
  • If the query format is VCF, the default output format is VCF; if the query format is BED or TAB, the default output format is BED. The output format can be changed with the -OF option.
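The lookup order for the annotation configuration file described above can be sketched as follows. The function name is hypothetical and the code is only an illustration of the documented rule, assuming the default file name is the query file path with ".annoc" appended.

```python
from pathlib import Path
from typing import Optional


def resolve_annotation_config(query_file: str, explicit: Optional[str] = None) -> Optional[Path]:
    """Return the configuration file to use, or None to extract all fields.

    1. A path given with -A wins.
    2. Otherwise look for <query_file>.annoc next to the query file.
    3. Otherwise return None, meaning every database field is extracted.
    """
    if explicit is not None:
        return Path(explicit)
    candidate = Path(query_file + ".annoc")
    if candidate.is_file():
        return candidate
    return None
```

For the test data, `resolve_annotation_config("q1.sort.vcf")` would pick up `q1.sort.vcf.annoc` if that file exists in the same folder.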

# Annotation without configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -O ./q1.sort.vcf.allfields.anno.gz \
                            -T 4

# Annotation with configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -A config/all_dbs.annoc

# Change output format with -OF
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -A config/all_dbs.annoc  \
                            -O ./q1.sort.bed.anno.gz \
                            -OF BED

# Annotation using remote database with VarNote index
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=gnomAD,mode=1 http://202.113.53.226/VarNoteDB/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
                            -A config/all_dbs.annoc  \
                            -O ./q1.sort.vcf.remote.anno.gz

Result files: q1.sort.vcf.allfields.anno.gz  q1.sort.vcf.anno.gz  q1.sort.bed.anno.gz 

6. Annotation upon intersection result

Annotate an OVERLAP file, which is the result file of the Intersect program.

Note

  • Extracting information directly from an OVERLAP file saves a lot of time because the query step is skipped.
  • The annotation configuration file can be changed to obtain different results, which is convenient and very fast.

java -jar VarNote.jar AnnotationIntersectFile -I q1.sort.vcf.overlap.gz -A config/all_dbs.annoc

Result file: q1.sort.vcf.anno.gz

7. Run with a config file

Run Intersect or Annotation program with all options defined in a configuration file.

java -jar VarNote.jar IntersectConfig -I config/intersect.full.options.confg
java -jar VarNote.jar IntersectConfig -I config/intersect.required.options.confg
java -jar VarNote.jar AnnotationConfig -I config/anno.full.options.confg
                    

Result files: q1.sort.vcf.overlap.gz 

8. VarNote Prioritization (NEW)

VarNote now provides three local pipelines to filter, annotate and prioritize:
  • disease-causal regulatory variants for GWAS results, VarNote-REG;
  • pathogenic regulatory variants for rare inherited diseases, VarNote-PAT;
  • driver regulatory variants for cancers, VarNote-CAN.

Note: Please first download the test data and demo script for the VarNote Prioritization functions from here. All of these procedures require VarNote version 1.2.0 or above.

# The test data has been run successfully on Java versions 8, 9 and 11.

cd ./advanced_funs_test_data  # make sure VarNote.jar is located in the advanced_funs_test_data folder

# Open each script to learn how to run the program for a specific job.
bash script/01-index.sh   # index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient LD calculation.
bash script/02-REG-genomic-Input.sh   # prioritize causal regulatory variants in the LD of each GWAS signal with VCF input.
bash script/02-REG-genomic-Input-comma.sh   # prioritize causal regulatory variants in the LD of each GWAS signal with variant information delimited by comma.
bash script/02-REG-rsID-Input.sh    # prioritize causal regulatory variants in the LD of each GWAS signal with dbSNP variant list.
bash script/03-PAT.sh     # prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.
bash script/04-CAN.sh     # prioritize likely cancer driver regulatory mutation given personal cancer genome profile.

Note: All the databases in advanced_funs_test_data are for testing only. Please download the full versions and replace them before analyzing your own data.

9. VarNote Toolkits (NEW)

VarNote also implements several commonly-used tools for efficiently processing genetic variant information, such as format conversion, LD calculation, etc.

Note: Please first download the test data and demo script for the VarNote Prioritization/Toolkits functions from here. All of these procedures require VarNote version 1.2.0 or above.

# The test data has been run successfully on Java versions 8, 9 and 11.

cd ./advanced_funs_test_data  # make sure VarNote.jar is located in the advanced_funs_test_data folder
bash script/01-index.sh   # index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient LD calculation.
bash script/05-LD.sh      # efficiently calculate LD information for a list of variants.
bash script/06-rsID-Conversion.sh # efficiently interconvert dbSNP ID/genomic position.

Note: All the databases in advanced_funs_test_data are for testing only. Please download the full versions and replace them before analyzing your own data.

File Format

File Overview

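Most VarNote programs require one or more of the following file types as input: query file, annotation database, overlap file and annotation configuration file.

The table below summarizes the input files used by each program.

                               Index             Query                              Annotation
File                           Index  IndexInfo  Count  RandomAccess  Intersect     Annotation  AnnotationIntersectFile
-----------------------------------------------------------------------------------------------------------------------
Query File                     -      -          yes    -             yes           yes         -
Annotation Database            yes    yes        yes    yes           yes           yes         -
Overlap File                   -      -          -      -             -             -           yes
Annotation Configuration File  -      -          -      -             -             yes         yes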

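Supported file types for each file:

File                           Plain Text  gzip (.gz; original file size smaller than 4 Gb)  bgzip (.gz)
-------------------------------------------------------------------------------------------------------
Query File                     yes         yes                                               yes
Annotation Database            no          no                                                yes
Annotation Configuration File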

Note

  • The annotation database file must be in bgzip format and position-sorted.
  • The query file can be plain text, gzip or bgzip, and must be position-sorted.
  • For compressed files, bgzip is strongly recommended.
  • For gzip, an original (uncompressed) file size of up to 4 Gb is currently supported.
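Because gzip and bgzip files both commonly end in .gz, the file name alone does not tell them apart. The sketch below, which is not part of VarNote and uses a hypothetical function name, inspects the header bytes instead: a BGZF (bgzip) block is a gzip member whose extra field carries a "BC" subfield of length 2.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if `header` (the first bytes of a file) starts a BGZF block."""
    if len(header) < 12:
        return False
    magic1, magic2, method, flags = header[0], header[1], header[2], header[3]
    # gzip magic bytes, deflate method, and the FEXTRA flag (bit 2) must be set.
    if (magic1, magic2, method) != (0x1F, 0x8B, 8) or not flags & 0x04:
        return False
    # XLEN (length of the extra field) follows the fixed 10-byte gzip header.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True  # found the BGZF block-size subfield
        pos += 4 + slen
    return False
```

In practice, reading the first few dozen bytes of the file in binary mode and passing them to `is_bgzf` is enough, since bgzip writes the "BC" subfield at the start of every block.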

Query and Database File Format

Both query and annotation database files accept a flexible range of genomic formats as input: VCF, VCF-Like, BED-Like, BED-Like Allele, Coord-Only, Coord-Allele and TAB.

1. VCF Format

See details about VCF format.

USAGE: java -jar /path/to/VarNote.jar <program name> -I xxx.vcf.gz
       java -jar /path/to/VarNote.jar <program name> -I:vcf xxx.vcf.gz

2. VCF-Like Format

The first five columns of the VCF-Like format are the same as in the VCF format; the other columns are optional.
VCF-Like is effectively a TAB format with preset attributes: c=1,b=2,e=2,ref=4,alt=5,0=false

USAGE: java -jar /path/to/VarNote.jar <program name> -I:vcfLike xxx.tab.gz

3. BED-Like Format

The first three columns of the BED-Like format must be CHROM, START and END; the other columns are optional.

USAGE: java -jar /path/to/VarNote.jar <program name> -I:bed xxx.bed.gz

4. BED-Like Allele Format

The first five columns of the BED-Like Allele format must be CHROM, START, END, REF and ALT; the other columns are optional.

USAGE: java -jar /path/to/VarNote.jar <program name> -I:bedAllele xxx.tab.gz

5. Coord-Only Format

The first two columns of the Coord-Only format must be CHROM and POS; the other columns are optional.
Coord-Only is effectively a TAB format with preset attributes: c=1,b=2,e=2,0=false

USAGE: java -jar /path/to/VarNote.jar <program name> -I:coordOnly xxx.tab.gz

6. Coord-Allele Format

The first four columns of the Coord-Allele format must be CHROM, POS, REF and ALT; the other columns are optional.
Coord-Allele is effectively a TAB format with preset attributes: c=1,b=2,e=2,ref=3,alt=4,0=false

USAGE: java -jar /path/to/VarNote.jar <program name> -I:coordAllele xxx.tab.gz

7. TAB Format

A tab-separated data file that contains genomic locations.
The TAB format must be used with attributes:

  • Required: c, b, e
  • Optional: ref, alt, 0, ci, sep

USAGE: java -jar /path/to/VarNote.jar <program name> -I:tab,c=1,b=4,e=5,0=true xxx.tab.gz
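To make the relationship between the preset formats and the generic TAB format concrete, the sketch below expands a tag specification (the text after "-I:" or "-Q:") into its attributes. It is illustrative only: the function and the PRESETS table are hypothetical names, the preset values are copied from the format descriptions above, and whether VarNote lets explicit attributes override a preset is an assumption.

```python
# Preset attributes quoted from the VCF-Like, Coord-Only and Coord-Allele descriptions.
PRESETS = {
    "vcfLike": {"c": "1", "b": "2", "e": "2", "ref": "4", "alt": "5", "0": "false"},
    "coordOnly": {"c": "1", "b": "2", "e": "2", "0": "false"},
    "coordAllele": {"c": "1", "b": "2", "e": "2", "ref": "3", "alt": "4", "0": "false"},
}


def parse_tag_argument(spec: str):
    """Split "tag,attr1=X,attr2=Y" into (tag, attributes)."""
    tag, *pairs = spec.split(",")
    attrs = dict(PRESETS.get(tag, {}))  # start from the preset, if any
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"attribute without a value: {pair!r}")
        attrs[key] = value  # assumption: explicit attributes override presets
    if tag == "tab":
        missing = [key for key in ("c", "b", "e") if key not in attrs]
        if missing:
            raise ValueError(f"tab format requires attributes: {missing}")
    return tag, attrs
```

For example, `parse_tag_argument("coordAllele")` and `parse_tag_argument("tab,c=1,b=2,e=2,ref=3,alt=4")` yield the same column assignments, which is why the two Index commands for dbscSNV.sort.tab.gz are interchangeable.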

More Examples

# Index an annotation database of bgzip BED-like file.
java -jar VarNote.jar Index -I:bed database1.sorted.bed.gz

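# Count the database records intersecting a VCF query file (Count takes the query with -Q and needs at least one -D database).
java -jar VarNote.jar Count -Q:vcf q1.sorted.vcf -D database1.sorted.bed.gz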

# Intersect
java -jar VarNote.jar Intersect -Q:coordAllele query.sort.tab -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz 

VarNote Tools Documentation

Input File Options

The following options are relevant to input files.

Option    Description
--input,-I:TagArgument (in the Index program)
--query-file,-Q:TagArgument (in the Count, Intersect and Annotation programs)
TagArgument: an argument with a tag and attributes
Usage: -I:tag,attr1=XXX,attr2=XXX...  /path/to/file

Possible tags: {vcf, vcfLike, bed, bedAllele, coordOnly, coordAllele, tab}
Possible attributes for all tags: {sep, ci}
Possible attributes for tab tag: {c, b, e, ref, alt, 0}

Attributes must be used together with a tag. The attributes are:
c - column of the sequence name (1-based, required for tab)
b - column of the start chromosomal position (1-based, required for tab)
e - column of the end chromosomal position (1-based, required for tab)
ref - column of the reference allele (optional)
alt - column of the alternative allele (optional)
0 - specifies that positions in the data file are 0-based rather than 1-based (optional)
ci - specifies a new comment indicator instead of "##" (used with the bed or tab tag, optional)
sep - specifies the character that separates fields in the file; possible values: {TAB, COMMA}

Note: The attributes b, e, ref, alt and 0 are relevant to the position resolving rule.
--has-header,-header:Boolean Indicates whether the input file contains a header line defining the column names.
If -header is set, the first line below the comment lines is treated as the column header line. The header line must start with '#' or have no indicator.
--header-path,-HP:File Path of an external file containing the header and meta lines.
--d-files,-D:TagArgument TagArgument: an argument with a tag and attributes
Usage: -D:db,attr1=XXX,attr2=XXX...  /path/to/file

Possible tags: {db}
Possible attributes: {index, mode, tag}.

Attributes must be used together with the tag. The attributes are:
index - which index to use, VarNote or Tabix. Default value: "VarNote". Possible values: {VarNote, TBI}. Optional.
mode - mode of intersection. Default value: "0". Possible values: {0, 1, 2}. Optional.
  • 0: Intersect mode, performs the common intersection operation according to the query and database formats;
  • 1: Exact match mode, forces the program to consider only database records whose chromosome position exactly matches the corresponding chromosome position of the query;
  • 2: Full close mode, forces the program to report database records that overlap both endpoints of the query interval, regardless of the original query and database formats.
tag - a label to rename the database in the output file. By default, the program uses the original file name as the tag for the database. Optional.

Global Options

The following standard options are relevant to most VarNote tools:

Option    Description
--log:Boolean Whether to print log. Default value: true. Possible values: {true, false}
--use-jdk-inflater,-UJI:Boolean Use the JDK Inflater instead of the IntelInflater for reading index. Default value: false. Possible values: {true, false}
--use-jdk-deflater,-UJD:Boolean Use the JDK Deflater instead of the IntelDeflater for writing index. Default value: false. Possible values: {true, false}

Tool-Specific Documentation

Below, you will find detailed documentation of all the options that are specific to each tool.

VarNote Index

Index

Indexing is a necessary step that builds the index system used for fast retrieval. Since most existing annotation databases are indexed with Tabix, VarNote also provides random-sweep searching based on a Tabix index, so existing annotation resources can be processed without re-indexing them. However, for large-scale and frequently used annotation datasets such as CADD, gnomAD and dbNSFP, we strongly suggest using the VarNote index system for better speed.

Usage example:

# Index an annotation database of VCF format.
java -jar VarNote.jar Index -I 1000G_p3.sort.vcf.gz

# Index an annotation database of BED-like format.
java -jar VarNote.jar Index -I roadmap.sort.bed.gz

# Index an annotation database of TAB format.
java -jar VarNote.jar Index -I:coordAllele dbscSNV.sort.tab.gz
java -jar VarNote.jar Index -I:tab,c=1,b=2,e=2,ref=3,alt=4 dbscSNV.sort.tab.gz

Arguments:

Option    Description
--input,-I:TagArgument The path of the TAB-delimited genome position file compressed with the bgzip program. The file must be position-sorted, first by sequence name and then by leftmost coordinate. Possible tags: {vcf, vcfLike, bed, bedAllele, coordOnly, coordAllele, tab}. Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--out-folder,-O:String Output directory. By default, the output files are written into the same folder as the input file. Default value: null.
--skip,-S:Integer Skip the first INT lines (including comment lines) in the data file. Default value: 0.
Global Options: --use-jdk-deflater,-UJD:Boolean  --log:Boolean

IndexInfo

To query related information (such as header, format, meta information or sequence name) stored in the VarNote index file of each annotation database.

Usage example:

# Input database file must have been indexed by VarNote before query.
java -jar VarNote.jar IndexInfo -LC true -PH true -PM true -I 1000G_p3.sort.vcf.gz

java -jar VarNote.jar IndexInfo -I 1000G_p3.sort.vcf.gz -RH "CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO"

Arguments:

Option    Description
--input,-I:File The annotation database file, which must already be indexed by VarNote. Required.
--list-chroms,-LC:Boolean List the sequence names stored in the index file. Default value: false. Possible values: {true, false}
--print-header,-PH:Boolean Print header line(s). Default value: false. Possible values: {true, false}
--print-format,-PF:Boolean Print database format. Default value: false. Possible values: {true, false}
--print-meta-data,-PM:Boolean Print meta information lines. Default value: false. Possible values: {true, false}
--reheader,-RH:String Replace the column names with the given names. The column names must be separated by commas and enclosed in double quotation marks.
Global Options: --log:Boolean

VarNote Query

Count

To quickly count intersected records in annotation database(s).

Usage example:

java -jar VarNote.jar Count -Q q2.sort.bed \
                            -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                            -D:db,tag=roadmap roadmap.sort.bed.gz

Arguments:

Option    Description
--query-file,-Q:TagArgument Path of the query file (plain text or compressed, including gzip and block gzip). Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--d-files,-D:TagArgument Local path or http/ftp address of the indexed annotation database(s). Either a VarNote index or a Tabix index must be in the same location as the database file(s). This argument must be specified at least once. Required.
Please refer to File Format for more details.
--out-file,-O:String Output file path. By default, the output file is written into the same folder as the input file. Default value: null.
--thread,-T:Integer Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1.
Global Options: --use-jdk-inflater,-UJI:Boolean  --log:Boolean

RandomAccess

To quickly retrieve (by independent random access) intersected records from indexed annotation database(s) given a genomic region like "chrN:beginPos-endPos".

Usage example:

java -jar VarNote.jar RandomAccess -Q 1:959100-959200 \
                                  -D:db,tag=1000g 1000G_p3.sort.vcf.gz \
                                  -D:db,tag=cosmic,index=TBI cosmic.sort.vcf.gz

Arguments:

Option    Description
--q-region,-Q:String Region specified in the format chrN:beginPos-endPos. Required.
--d-files,-D:TagArgument Local path or http/ftp address of the indexed annotation database(s). Either a VarNote index or a Tabix index must be in the same location as the database file(s). This argument must be specified at least once. Required.
Please refer to File Format for more details.
--is-label,-L:Boolean A flag that determines whether to print the database name as the first column of the result. Default value: true. Possible values: {true, false}
Global Options: --use-jdk-inflater,-UJI:Boolean  --log:Boolean

Intersect

To quickly retrieve (by random-sweep algorithm) intersected records from indexed annotation database(s) given query intervals/variants.

Usage example:

# Multiple databases in exact match mode with different indexes (1000g uses the VarNote index, cosmic uses TBI).
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz

# Multiple databases in exact match mode with left outer join
java -jar VarNote.jar Intersect -Q q1.sort.vcf \
                                -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                                -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                                -loj true

# Intersect mode (running with 4 threads; the default is 1)
java -jar VarNote.jar Intersect -Q q2.sort.bed --maxVariantLength 50000 \
                                -D roadmap.sort.bed.gz \
                                -T 4

Arguments:

Option    Description
--query-file,-Q:TagArgument Path of the query file (plain text or compressed, including gzip and block gzip). Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--d-files,-D:TagArgument Local path or http/ftp address of the indexed annotation database(s). Either a VarNote index or a Tabix index must be in the same location as the database file(s). This argument must be specified at least once. Required.
Please refer to File Format for more details.
--thread,-T:Integer Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, the output file is written into the same folder as the input file. Default value: null.
--out-mode,-OM:Integer Output recording mode: 0 for "only output query records"; 1 for "only output matched database records"; 2 for "output both query records and matched database records". Default value: 2.
--is-loj,-loj:Boolean A flag that determines whether to use the left outer join mode. The 'left outer join' mode reports every query record, whether or not it has intersected records. Default value: false. Possible values: {true, false}
--maxVariantLength,-MVL:Integer The maximum length of a query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean Whether to allow large query intervals/variants. Default value: false. Possible values: {true, false}
--is-remove-comment,-RC:Boolean A flag that determines whether to remove the comment lines (starting with '@') from the output file. Note that the comment lines are required by the VarNote Annotation program. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag that determines whether to compress the output results. Default value: true. Possible values: {true, false}
Global Options: --use-jdk-inflater,-UJI:Boolean  --log:Boolean
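
The effect of -loj can be shown on toy data. The following is a conceptual sketch and not VarNote's algorithm: the function name is hypothetical, records are plain (chrom, begin, end) tuples with inclusive coordinates (an assumption), and a simple nested scan stands in for the indexed search.

```python
def intersect(queries, database, left_outer_join=False):
    """Pair each query with the database records it overlaps.

    Without left outer join, queries with no hits are dropped;
    with it, every query is reported, possibly with an empty hit list.
    """
    results = []
    for query in queries:
        hits = [
            record for record in database
            if record[0] == query[0] and record[1] <= query[2] and query[1] <= record[2]
        ]
        if hits or left_outer_join:
            results.append((query, hits))
    return results
```

With two queries of which only the first overlaps a database record, the default call returns one pair, while `left_outer_join=True` returns both queries and an empty hit list for the second.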

VarNote Annotation

Annotation

To quickly extract (by random-sweep algorithm) desired annotation fields from indexed annotation database(s) given query intervals/variants. It allows feature extraction using both interval-level overlap and variant-level exact matching. It also has an annotation mode supporting allele-specific variant annotation for SNV/Indel and region-specific annotation for large variations.

Usage example:

# Annotation without configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -O ./q1.sort.vcf.allfields.anno.gz \
                            -T 4

# Annotation with configuration file
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -D:db,tag=cosmic,mode=1,index=TBI cosmic.sort.vcf.gz \
                            -A config/all_dbs.annoc

# Change output format with -OF
java -jar VarNote.jar Annotation -Q q1.sort.vcf \
                            -D:db,tag=1000g,mode=1 1000G_p3.sort.vcf.gz \
                            -A config/all_dbs.annoc  \
                            -O ./q1.sort.bed.anno.gz \
                            -OF BED

Arguments:

Option	Description
--query-file,-Q:TagArgument Path of query file (plain text or compressed file, including gzip and block gzip). Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--d-files,-D:TagArgument Local path or http/ftp address of indexed annotation database(s). Either a VarNote index or a Tabix index should be in the same location as the database file(s). This argument must be specified at least once. Required.
Please refer to File Format for more details.
--anno-config,-A:File Path of the annotation extraction configuration file. Without this option, the program annotates query variants with all information in the database(s). If -A is not defined, the program automatically searches for a file named /path/to/query/query_file + .annoc in the folder of the query file. Default value: null.
--thread,-T:Integer Number of threads to use. Set to -1 to use the number of available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, the output file is written into the same folder as the input file. Default value: null.
--out-format,-OF:AnnoOutFormat Output format. Default value: null. Possible values: {VCF, BED}
--is-loj,-loj:Boolean A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record, whether or not it has intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--force-overlap,-FO:Boolean Force overlap mode. Forces the program to omit REF and ALT matching and allele-specific feature extraction. Default value: false. Possible values: {true, false}
--vcf-header-for-bed,-VH:File VCF output header file. This file is required when the format of the query file is BED or TAB-delimited but the format of the annotation output is VCF. Default value: null.
Global Options: --use-jdk-inflater,-UJI:Boolean  --log:Boolean
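
The effect of the --is-loj flag can be pictured with a small sketch. The Python below is illustrative only and is not VarNote's implementation; the function names `overlaps` and `report` and the toy intervals are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, beg, end), 0-based half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def report(queries: List[Interval], db: List[Interval],
           loj: bool = False) -> Dict[Interval, List[Interval]]:
    """Map each reported query to its intersecting database records.

    With loj=False only queries with at least one hit are reported.
    With loj=True every query is reported, possibly with an empty hit list.
    """
    result: Dict[Interval, List[Interval]] = {}
    for q in queries:
        hits = [d for d in db if overlaps(q, d)]
        if hits or loj:
            result[q] = hits
    return result


if __name__ == "__main__":
    queries = [("1", 100, 101), ("1", 500, 501)]
    db = [("1", 90, 110)]
    print(report(queries, db))            # only the first query is reported
    print(report(queries, db, loj=True))  # both queries are reported
```

With the default (false), a query that hits nothing disappears from the output; with left outer join it is kept with no annotation, which is convenient when the output must have one line per input record.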

AnnotationIntersectFile

To quickly extract desired annotation fields from an existing VarNote intersection file (the OVERLAP file).

Usage example:

java -jar VarNote.jar AnnotationIntersectFile -I q1.sort.vcf.overlap.gz -A config/all_dbs.annoc

Arguments:

Option	Description
--input,-I:File VarNote intersection file path. Required.
--anno-config,-A:File Path of the annotation extraction configuration file. Without this option, the program annotates query variants with all information in the database(s). If -A is not defined, the program automatically searches for a file named /path/to/query/query_file + .annoc in the folder of the query file. Default value: null.
--out-file,-O:String Output file path. By default, the output file is written into the same folder as the input file. Default value: null.
--out-format,-OF:AnnoOutFormat Output format. Default value: null. Possible values: {VCF, BED}
--is-loj,-loj:Boolean A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports every query record, whether or not it has intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--force-overlap,-FO:Boolean Force overlap mode. Forces the program to omit REF and ALT matching and allele-specific feature extraction. Default value: false. Possible values: {true, false}
--vcf-header-for-bed,-VH:File VCF output header file. This file is required when the format of the query file is BED or TAB-delimited but the format of the annotation output is VCF. Default value: null.
Global Options: --use-jdk-inflater,-UJI:Boolean  --log:Boolean

Run with config file

IntersectConfig

Run VarNote Intersect program from a config file.

Usage example:

java -jar VarNote.jar IntersectConfig -I config/intersect.full.options.confg
java -jar VarNote.jar IntersectConfig -I config/intersect.required.options.confg

Arguments:

Option	Description
--input,-I:File Config file path. Required.

AnnotationConfig

Run VarNote Annotation program from a config file.

Usage example:

java -jar VarNote.jar AnnotationConfig -I config/anno.full.options.confg

Arguments:

Option	Description
--input,-I:File Config file path. Required.

VarNote Prioritization

VarNote-REG

To prioritize causal regulatory variants in the LD of each GWAS signal and provide combined ranking scores based on multiple tissue/cell type-specific prediction methods.

Pre-requirement:
Please download the dependent annotation databases (including .gz, .gz.vanno, .gz.vanno.vi) for VarNote-REG.

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.

Usage example:

## -D:db {FitCons2, GenoSkylinePlus , FUNLDA , GenoNet } prediction scores are used for tissue/cell type-specific prioritization (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
## -F {FitCons2, GenoSkylinePlus, FUNLDA, GenoNet} are used to compute Combined Rank (required)
## -ID is used to specify the tissue/cell type (required)
## -LDE is used to indicate enabling LD extension, -LDE should be used with -B -LDC (optional)
## -GR and -G are used to annotate variant effect (optional)

java -jar VarNote.jar REG -Q:coordAllele $DIR/input/reg.demo.txt \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase_prediction.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=fitCons2 $DBPath/fitcons2.merge.gz \
-D:db,tag=genoSkylinePlus $DBPath/GenoSkylinePlus.merge.gz \
-D:db,tag=funlda $DBPath/FUN-LDA.merge.gz \
-D:db,tag=genoNet $DBPath/GenoNet_all.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-F FitCons2 \
-F GenoSkylinePlus \
-F FUNLDA \
-F GenoNet \
-ID E116 \
-LDE true \
-B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.chr1.vcf.gz.bit \
-LDC 0.9 \
-GR $DBPath/hg19_ensembl.ser \
-G hg19 \
-T 4

Arguments: * is required option

Option	Description
--query-file,-Q:TagArgument * Path of query file (plain text or compressed file, including gzip and block gzip). Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--d-files,-D:TagArgument *

The following databases are required for VarNote-REG program:

  • VarNoteDB_FA_regBase_prediction.gz
  • VarNoteDB_FP_Roadmap_127Epi.bed.gz
  • fitcons2.merge.gz
  • GenoSkylinePlus.merge.gz
  • FUN-LDA.merge.gz
  • GenoNet_all.gz
  • pos2snp_b153.gz
Please download the databases (including .gz, .gz.vanno, .gz.vanno.vi) from here.

This argument must be specified at least once. Required.
Please refer to File Format for more details.
--funcs,-F:RegFunc * Context-dependent regulatory variant prediction model used to compute the combined rank. This argument must be specified at least once. Required. Possible values: {FitCons2, GenoSkylinePlus, FUNLDA, GenoNet}
--CellID,-ID:CellType * Tissue/Cell type to extract. Please refer EID in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell IDs Required. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129}
--ldExtension,-LDE:Boolean Set true to enable LD extension. Default value: true. Possible values: {true, false}
--bitFile,-B:String Compressed genotypes file, such as 1000G (.bit and .bit.idx files, generated by the IndexRefGenotype program); required for the --ldExtension option. Default value: null.

Please download the prebuilt bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/.
Note: There are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--ldCutoff,-LDC:Double LD cutoff (a float value in [0.5, 1]). Default value: 0.8.
--ldWindow,-LDW:Integer LD Window (an integer value between [1, 1000] kilobase (KB), default is 100KB). Default value: 100.
--geneRefFile,-GR:String Path of gene annotation file (a file with the extension .ser). Default value: null.

Please download the prebuilt gene reference files (.ser files), including Ensembl, RefSeq and UCSC, from VarNoteDB_Reference/Gene/.
--Genome,-G:GenomeAssembly Genome reference assembly version, required for --geneRefFile option. Default value: null. Possible values: {hg19, hg38}
--thread,-T:Integer Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--maxVariantLength,-MVL:Integer Indicator of the max length of query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean Indicator to allow large query intervals/variants or not Default value: false. Possible values: {true, false}
Global Options: --use-jdk-inflater, -UJI:Boolean  --log:Boolean

VarNote-PAT

To prioritize candidate pathogenic regulatory variants from whole genome sequencing variants of inherited diseases.

Pre-requirement:
Please download the dependent annotation databases (including .gz, .gz.vanno, .gz.vanno.vi) for VarNote-PAT.

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.

Usage example:

## -Q indicates the path of WGS/WES data file (VCF or GZIP compressed VCF format) (required)
## -P indicates the path of pedigree file (required)
## -IM is used to indicate the mode of inheritance  (required)
## -F is used to specify the selected scores to compute Combined Rank (required)
## -GR and -G are used to annotate variant effect (required)
## -D:db VarNoteDB_FA_regBase.gz is used to extract function scores (required)
## -D:db VarNoteDB_FP_Roadmap_127Epi.bed.gz is used to filter variant on tissue/cell type-specific epigenomic marks (required when --CellID is specified)
## -D:db VarNoteDB_AF_gnomAD_Genome.vcf.gz is used to filter allele frequency (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)

java -jar VarNote.jar PAT -Q $DIR/input/pat.demo.filter.vcf.gz \
-P $DIR/input/pat.demo.ped \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=dbNSFP $DBPath/VarNoteDB_FA_dbNSFP.gz \
-D:db,tag=gnomad $DBPath/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-IM AUTOSOMAL_DOMINANT \
-GF $configPath/region_exclude.txt \
-GT REGION_EXCLUDE \
-VF $configPath/VF_exclude.txt \
-F Eigen \
-F FATHMM_MKL \
-F FATHMM_XF \
-F GenoCanyon \
-GR $DBPath/hg19_ensembl.ser \
-G hg19 \
-T 4

Arguments: * is required option

Option	Description
--query-file,-Q: * The path of the WGS/WES VCF file (VCF or gzip-compressed VCF format).
Required
--pedigree,-P:File * The path of the pedigree file; refer to the PLINK or GATK PED file format.
Required
--d-files,-D:TagArgument *

The following databases are required for VarNote-PAT program:

  • VarNoteDB_FA_regBase.gz
  • VarNoteDB_FP_Roadmap_127Epi.bed.gz
  • VarNoteDB_FA_dbNSFP.gz
  • VarNoteDB_AF_gnomAD_Genome.vcf.gz
  • pos2snp_b153.gz
Please download the databases (including .gz, .gz.vanno, .gz.vanno.vi) from here.

This argument must be specified at least once. Required.
Please refer to File Format for more details.
--inheritanceMode,-IM:ModeOfInheritance * Mode of inheritance. Required. Possible values: {AUTOSOMAL_DOMINANT, AUTOSOMAL_RECESSIVE, X_RECESSIVE, X_DOMINANT, ANY}
--funcs,-F:RegFunc * Prediction model used to compute the combined rank. This argument must be specified at least once. Required. Possible values: {CADD, Eigen, FATHMM_MKL, FATHMM_XF, GenoCanyon, LINSIGHT, ReMM}
--geneRefFile,-GA:String *Path of gene annotation file (a file with the extension .ser).
Required

Please download the prebuilt gene reference files (.ser files), including Ensembl, RefSeq and UCSC, from VarNoteDB_Reference/Gene/.
--Genome,-G:GenomeAssembly *Genome reference assembly version. Possible values: {hg19, hg38}
Required
--variantEffectFile,-VF:File File contains variant effect to exclude. Default value: null.
--genomicFile,-GF:File File contains genomic region or gene to exclude/include. Default value: null.
--regionFileType,-GT:GenomicRegionType Types of region/gene to exclude/include, should be used with '-GF' option. Default value: null. Possible values: {GENE_INCLUDE, GENE_EXCLUDE, REGION_INCLUDE, REGION_EXCLUDE, ALL_GENE_INCLUDE, ALL_GENE_EXCLUDE}
--distance,-DT:Integer Distance to upstream and downstream of gene (kilobase, KB). Default value: 5.
--afCutoff,-AC:Double Cutoff of allele frequency to filter out germline variant. Default value: 0.005.
--CellID,-ID:CellType Tissue/Cell type to filter. Please refer EID in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell ID Default value: null. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129}
--CellMark,-CM:CellMark Chromatin state to filter. '-CM' should be used with '-ID' option. This argument may be specified 0 or more times. Default value: null. Possible values: {DNase, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9me3}
--GTQuality,-GQ:Integer Min value of Genotyping Quality. Default value: 20.
--VQuality,-VQD:Integer Min value of Variant Confidence/Quality by Depth. Default value: 2.
--thread,-T:Integer Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--maxVariantLength,-MVL:Integer Indicator of the max length of query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean Indicator to allow large query intervals/variants or not Default value: false. Possible values: {true, false}
Global Options: --use-jdk-inflater, -UJI:Boolean  --log:Boolean

VarNote-CAN

To prioritize likely cancer driver regulatory mutations given a personal cancer genome profile.

Pre-requirement:
Please download the dependent annotation databases (including .gz, .gz.vanno, .gz.vanno.vi) for VarNote-CAN.

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.

Usage example:

## -D:db VarNoteDB_FA_regBase_prediction.gz is used to extract regBase-CAN score (required)
## -D:db VarNoteDB_FP_Roadmap_127Epi.bed.gz is used to filter variant on tissue/cell type-specific epigenomic marks (required when --CellID is specified)
## -D:db VarNoteDB_TA_COSMIC_NonCoding.vcf.gz is used to filter somatic mutation recurrence (required)
## -D:db VarNoteDB_AF_gnomAD_Genome.vcf.gz is used to filter allele frequency (required)
## -D:db pos2snp_b153.gz is used to convert genomic position to rsID (required)
## -GA and -G are used to annotate variant effect (required)

java -jar VarNote.jar CAN -Q $DIR/input/can.demo.vcf.gz \
-D:db,tag=regbase $DBPath/VarNoteDB_FA_regBase_prediction.gz \
-D:db,tag=roadmap $DBPath/VarNoteDB_FP_Roadmap_127Epi.bed.gz \
-D:db,tag=cosmic $DBPath/VarNoteDB_TA_COSMIC_NonCoding.vcf.gz \
-D:db,tag=gnomad $DBPath/VarNoteDB_AF_gnomAD_Genome.vcf.gz \
-D:db,tag=pos2snp $DBPath/pos2snp_b153.gz \
-VF $configPath/VF_exclude.txt \
-GA $DBPath/hg19_ensembl.ser \
-G hg19

Arguments: * is required option

Option	Description
--query-file,-Q:TagArgument * Path of query file (plain text or compressed file, including gzip and block gzip). Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--d-files,-D:TagArgument *

The following databases are required for VarNote-CAN program:

  • VarNoteDB_FA_regBase_prediction.gz
  • VarNoteDB_FP_Roadmap_127Epi.bed.gz
  • VarNoteDB_TA_COSMIC_NonCoding.vcf.gz
  • VarNoteDB_AF_gnomAD_Genome.vcf.gz
  • pos2snp_b153.gz
Please download the databases (including .gz, .gz.vanno, .gz.vanno.vi) from here.

This argument must be specified at least once. Required.
Please refer to File Format for more details.
--geneRefFile,-GA:String *Path of gene annotation file (a file with the extension .ser).
Required

Please download the prebuilt gene reference files (.ser files), including Ensembl, RefSeq and UCSC, from VarNoteDB_Reference/Gene/.
--Genome,-G:GenomeAssembly *Genome reference assembly version. Possible values: {hg19, hg38}
Required
--recurRate,-RR:Integer Min value of recurrence rate of COSMIC. Default value: 1.
--afCutoff,-AFC:Double Cutoff of allele frequency to filter out germline variant. Default value: 0.005.
--variantEffectFile,-VF:File File contains variant effect to exclude. Default value: null.
--genomicFile,-GF:File File contains genomic region or gene to exclude/include. Default value: null.
--regionFileType,-GT:GenomicRegionType Types of region/gene to exclude/include, should be used with '-GF' option. Default value: null. Possible values: {GENE_INCLUDE, GENE_EXCLUDE, REGION_INCLUDE, REGION_EXCLUDE, ALL_GENE_INCLUDE, ALL_GENE_EXCLUDE}
--distance,-DT:Integer Distance to upstream and downstream of gene (kilobase, KB). Default value: 5.
--CellID,-ID:CellType Tissue/Cell type to filter. Please refer EID in https://github.com/mdozmorov/genomerunner_web/wiki/Roadmap-cell-types for valid Cell ID Default value: null. Possible values: {E001, E002, E003, E004, E005, E006, E007, E008, E009, E010, E011, E012, E013, E014, E015, E016, E017, E018, E019, E020, E021, E022, E023, E024, E025, E026, E027, E028, E029, E030, E031, E032, E033, E034, E035, E036, E037, E038, E039, E040, E041, E042, E043, E044, E045, E046, E047, E048, E049, E050, E051, E052, E053, E054, E055, E056, E057, E058, E059, E061, E062, E063, E065, E066, E067, E068, E069, E070, E071, E072, E073, E074, E075, E076, E077, E078, E079, E080, E081, E082, E083, E084, E085, E086, E087, E088, E089, E090, E091, E092, E093, E094, E095, E096, E097, E098, E099, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111, E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127, E128, E129}
--CellMark,-CM:CellMark Chromatin state to filter. '-CM' should be used with '-ID' option. This argument may be specified 0 or more times. Default value: null. Possible values: {DNase, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9me3}
--thread,-T:Integer Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
--maxVariantLength,-MVL:Integer Indicator of the max length of query interval/variant. Default value: 50.
--allowLargeVariants,-ALV:Boolean Indicator to allow large query intervals/variants or not Default value: false. Possible values: {true, false}
Global Options: --use-jdk-inflater, -UJI:Boolean  --log:Boolean

VarNote Toolkits

IndexRefGenotype

To index the whole-genome genotypes (such as 1000 Genomes Project genotype VCF file) for efficient linkage disequilibrium (LD) calculation.

Usage example:

java -jar VarNote.jar IndexRefGenotype -I 1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz

Arguments: * is required option

Option	Description
--input,-I:File * Path of VCF file with individual phased genotypes. Required.

NOTE: VarNote provides indexed genotypes for 1000 Genomes Project phase 3; please download the indexed files from VarNoteDB_Reference/LD/ (hg19/hg38).

LDBatch

To efficiently calculate all LD variants for a list of variants.

Usage example:

java -jar VarNote.jar LDBatch -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -Q:coordAllele $DIR/input/reg.demo.txt

Arguments: * is required option

Option	Description
--bit-file,-B:File * Bit file of indexed reference population genotypes. Required.
Please download the prebuilt 1000G bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/.
Note: There are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--query-file,-Q:TagArgument * Path of query file (plain text or compressed file, including gzip and block gzip). Required.
Please refer to File Format for more details.
--has-header,-header:Boolean Please refer to File Format for more details.
--header-path,-HP:File Please refer to File Format for more details.
--ld-window-kb,-D:Integer LD window (kilobase, KB). Default value: 100.
--cutoff,-C:Double LD cutoff (default: 0.8). Default value: 0.8.
--thread,-T:Integer Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, output file will be written into the same folder as the input file. Default value: null.
--is-loj,-loj:Boolean A flag to determine whether or not to use the left outer join mode. The 'left outer join' mode reports each of query record regardless of whether containing intersected records. Default value: false. Possible values: {true, false}
--is-zip,-Z:Boolean A flag to determine whether or not to compress the output results. Default value: true. Possible values: {true, false}
Global Options: --help,-h:Boolean  --log:Boolean

LDVariant

To efficiently calculate all LD variants for a variant.

Usage example:

java -jar VarNote.jar LDVariant -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -Q 1:3325912-C-A 

Arguments: * is required option

Option	Description
--bit-file,-B:File * Bit file of indexed reference population genotypes. Required.
Please download the prebuilt 1000G bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/.
Note: There are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--query-loc,-Q:String * Genomic feature specified in the format "chr:pos-ref-alt" for a variant or "chr:beginPos-endPos" for a region. Required.
--is-region,-R:Boolean Indicates whether the input genomic feature is a region. The program will enumerate all known variants for LD calculations if the input is a genomic region. Default value: false. Possible values: {true, false}
--ld-window-kb,-D:Integer LD window (kilobase, KB). Default value: 100.
--cutoff,-C:Double LD cutoff. Default value: 0.8.
Global Options: --help,-h:Boolean
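
The two -Q formats described above ("chr:pos-ref-alt" and "chr:beginPos-endPos") can be told apart mechanically. The following Python is a minimal sketch of such a parser, not VarNote's own code; the names `QueryLoc` and `parse_query_loc` are hypothetical, and restricting alleles to the letters A, C, G, T and N is an assumption of this example.

```python
import re
from typing import NamedTuple, Optional


class QueryLoc(NamedTuple):
    chrom: str
    begin: int
    end: Optional[int]   # set only for regions
    ref: Optional[str]   # set only for variants
    alt: Optional[str]   # set only for variants


# Assumption: alleles are plain nucleotide strings.
_VARIANT = re.compile(r"^([^:]+):(\d+)-([ACGTNacgtn]+)-([ACGTNacgtn]+)$")
_REGION = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_query_loc(text: str, is_region: bool = False) -> QueryLoc:
    """Parse a -Q value as a variant (default) or, like -R true, as a region."""
    text = text.strip()
    match = (_REGION if is_region else _VARIANT).match(text)
    if match is None:
        kind = "chr:beginPos-endPos" if is_region else "chr:pos-ref-alt"
        raise ValueError(f"expected {kind}, got {text!r}")
    if is_region:
        chrom, begin, end = match.groups()
        return QueryLoc(chrom, int(begin), int(end), None, None)
    chrom, pos, ref, alt = match.groups()
    return QueryLoc(chrom, int(pos), None, ref, alt)


if __name__ == "__main__":
    print(parse_query_loc("1:3325912-C-A"))
    print(parse_query_loc("1:3325912-3326796", is_region=True))
```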

LDPair

To efficiently calculate LD for a single pair of variants.

Usage example:

java -jar VarNote.jar LDPair -B $DBPath/1kg.phase3.v5.shapeit2.eur.hg19.vcf.gz.bit -P1 1:3325912-C-A -P2 1:3326796-C-G 

Arguments: * is required option

Option	Description
--bit-file,-B:File * Bit file of indexed reference population genotypes. Required.
Please download the prebuilt 1000G bit files (including .bit and .bit.idx) from VarNoteDB_Reference/LD/.
Note: There are 5 populations (EUR, EAS, AFR, SAS, AMR); please download the files for the correct population.
--P1,-P1:String * SNP A (format chr:pos-ref-alt). Required.
--P2,-P2:String * SNP B (format chr:pos-ref-alt). Required.
Global Options: --help,-h:Boolean
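
For intuition about what an LD value for a pair of variants measures, the sketch below computes the textbook r² statistic from phased haplotypes. It is not VarNote's bit-file implementation, which this document does not describe; the function name `ld_r2` and the 0/1 haplotype encoding (1 = alternate allele) are assumptions, as is reading the LD cutoff as a threshold on r².

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """r^2 between two biallelic variants from phased haplotypes.

    hap_a[i] and hap_b[i] are the alleles (0 = ref, 1 = alt) carried by
    haplotype i at variant A and variant B respectively.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and equal in length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # alt frequency at A
    p_b = sum(hap_b) / n                                   # alt frequency at B
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n    # alt-alt haplotype frequency
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a variant is monomorphic")
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated pair
    print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # independent pair
```

Under that reading, a cutoff of 0.8 would keep only pairs whose r² is at least 0.8.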

rsIDConversion

To efficiently interconvert dbSNP ID/genomic position.

Pre-requirement:
Please download the dependent annotation databases (including .gz, .gz.vanno, .gz.vanno.vi) for rsIDConversion:

  • VarNoteDB_Reference/rs2pos (hg19/hg38)

NOTE: VarNote supports remote databases (with VarNote index or Tabix index) by passing an http/ftp URL; however, the speed relies heavily on the network environment.

Usage example:

java -jar VarNote.jar rsIDConversion -I $DIR/input/reg.demo.txt -F CoordAllele -M POS2SNP -D:db,tag=pos2snp $DBPath/pos2snp_b153.gz
java -jar VarNote.jar rsIDConversion -I $DIR/input/reg.demo.rsid.txt -M SNP2POS -D:db,tag=merge $DBPath/merged_b153.gz -D:db,tag=snp2pos $DBPath/snp2pos_b153.gz

Arguments: * is required option

Option	Description
--input,-I:File * Path of input file to convert. Required.
--d-files,-D:TagArgument * SNP2POS requires the merged_b153.gz and snp2pos_b153.gz databases;
POS2SNP requires the pos2snp_b153.gz database.
Please download the databases from the VarNoteDB_Reference/rs2pos/ folder.

This argument must be specified at least once. Required.
Please refer to File Format for more details.
--Mode,-M:Snp2PosMode * Conversion mode: SNP2POS converts rsID to genomic position, and POS2SNP converts genomic position (with ref and alt) to rsID. Required. Possible values: {SNP2POS, POS2SNP}
--FileType,-F:QuickFileType File type of the input file for POS2SNP mode. Default value: null. Possible values: {VCF, VCFLike, CoordOnly, CoordAllele}
--MatchRefAlt,-FM:Boolean Force matching of ref and alt for POS2SNP mode. Default value: false. Possible values: {true, false}
--thread,-T:Integer Number of used threads. Sets thread to -1 to get thread number by available processors automatically. Default value: 1.
--out-file,-O:String Output file path. By default, output file will be written into the same folder as the input file. Default value: null.

Annotation Extraction Rule

1. Allele-specific extraction (using configuration file all_dbs.annoc)

Example 1: 1:869244 C|T exactly matches, by position, 2 features of 1000g as follows:

#query	1	869244	rs575524849	C	T	100	PASS	.
1000g	1	869244	rs200586552	C	CAG	100	PASS	AC=321;AF=0.0640974;AN=5008;NS=2504;DP=10653;EAS_AF=0;AMR_AF=0.0187;AFR_AF=0.2322;EUR_AF=0.001;SAS_AF=0;VT=INDEL
1000g	1	869244	rs575524849	C	T	100	PASS	AC=1;AF=0.000199681;AN=5008;NS=2504;DP=10653;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=c|||;VT=SNP

By default, the Annotation program extracts only the feature matching 1:869244 C|T.

1	869244	rs575524849	C	T	100	PASS	.;1000g_AC=1;1000g_AF=0.000199681

The Annotation program extracts all features (omitting REF and ALT matching) when --force-overlap is set to true.

1	869244	rs575524849	C	T	100	PASS	.;1000g_AC=321,1;1000g_AF=0.0640974,0.000199681

Example 2: 1:1404746 C|A exactly matches, by position, 1 feature of 1000g as follows:

#query	1	1404746	rs147265720	C	A	100	PASS	.
1000g	1	1404746	rs147265720	C	A,T	100	PASS	AC=119,11;AF=0.023762,0.00219649;AN=5008;NS=2504;DP=8984;EAS_AF=0,0;AMR_AF=0.0072,0.0159;AFR_AF=0.0862,0;EUR_AF=0,0;SAS_AF=0,0;AA=N|||;VT=SNP;MULTI_ALLELIC

By default, the Annotation program extracts only the information for the A allele.

1	1404746	rs147265720	C	A	100	PASS	.;1000g_AC=119;1000g_AF=0.023762
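
The two examples above can be reproduced with a short sketch. This Python is illustrative only and is not how VarNote is implemented; the function names `parse_info` and `extract` are hypothetical. It also makes one simplifying assumption: an INFO value is treated as per-allele when it has as many comma-separated entries as the record has ALT alleles, whereas a real VCF parser would consult the header's Number=A declaration.

```python
from typing import Dict, List, Sequence, Tuple

# (chrom, 1-based pos, ref, comma-separated alts, INFO string)
Record = Tuple[str, int, str, str, str]


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into key/value pairs, ignoring bare flags."""
    out: Dict[str, str] = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
    return out


def extract(query: Tuple[str, int, str, str], db_records: Sequence[Record],
            tag: str, fields: Sequence[str], force_overlap: bool = False) -> str:
    """Collect the requested INFO fields from records at the query position."""
    q_chrom, q_pos, q_ref, q_alt = query
    collected: Dict[str, List[str]] = {f: [] for f in fields}
    for chrom, pos, ref, alts, info in db_records:
        if chrom != q_chrom or pos != q_pos:
            continue
        alt_list = alts.split(",")
        values = parse_info(info)
        if force_overlap:
            # Position match only: keep every record's values unchanged.
            for f in fields:
                if f in values:
                    collected[f].append(values[f])
        elif ref == q_ref and q_alt in alt_list:
            # Allele-specific: keep only the entry for the matching ALT.
            idx = alt_list.index(q_alt)
            for f in fields:
                if f in values:
                    parts = values[f].split(",")
                    per_allele = len(parts) == len(alt_list)
                    collected[f].append(parts[idx] if per_allele else values[f])
    return ";".join(f"{tag}_{f}={','.join(v)}" for f, v in collected.items() if v)


if __name__ == "__main__":
    db = [
        ("1", 869244, "C", "CAG", "AC=321;AF=0.0640974;AN=5008;VT=INDEL"),
        ("1", 869244, "C", "T", "AC=1;AF=0.000199681;AN=5008;VT=SNP"),
        ("1", 1404746, "C", "A,T", "AC=119,11;AF=0.023762,0.00219649;AN=5008"),
    ]
    print(extract(("1", 869244, "C", "T"), db, "1000g", ["AC", "AF"]))
    print(extract(("1", 869244, "C", "T"), db, "1000g", ["AC", "AF"], force_overlap=True))
    print(extract(("1", 1404746, "C", "A"), db, "1000g", ["AC", "AF"]))
```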

2. Extraction from multiple databases (using configuration file all_dbs.annoc)

Example 1: 1:3646192 G|A exactly matches, by position, 13 features of 1000g and cosmic as follows:

#query	1	3646192	rs1885867	G	A	100	PASS	.
1000g	1	3646192	rs1885867	G	A	100	PASS	AC=2992;AF=0.597444;AN=5008;NS=2504;DP=19582;EAS_AF=0.3284;AMR_AF=0.6628;AFR_AF=0.6059;EUR_AF=0.833;SAS_AF=0.5746;AA=g|||;VT=SNP;EX_TARGET
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000378285;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1049+180G>A;AA=p.?;CNT=3
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000603362;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000354437;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
cosmic	1	3646192	COSV60700745	G	A	.	.	GENE=TP73_ENST00000604479;STRAND=+;LEGACY_ID=COSN28766204;SNP;CDS=c.1196+180G>A;AA=p.?;CNT=3
...

The Annotation program extracts the following information:

1	3646192	rs1885867	G	A	100	PASS	.;1000g_AC=2992;1000g_AF=0.597444;cosmic_GENE=TP73_ENST00000378285,TP73_ENST00000603362,TP73_ENST00000354437,TP73_ENST00000604479,TP73_ENST00000357733,TP73,TP73_ENST00000378280,TP73_ENST00000346387,TP73_ENST00000604074,TP73_ENST00000378290,TP73_ENST00000378288;cosmic_STRAND=+,+,+,+,+,+,+,+,+,+,+

Example 2: 1:3318823 G|A exactly matches, by position, 9 features of 1000g and cosmic as follows:

#query	1	3318823	rs534786798	G	A	100	PASS	.
1000g	1	3318823	rs534786798	G	A	100	PASS	AC=2;AF=0.000399361;AN=5008;NS=2504;DP=15458;EAS_AF=0;AMR_AF=0;AFR_AF=0.0015;EUR_AF=0;SAS_AF=0;AA=G|||;VT=SNP
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000378398;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.680-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000378391;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000441472;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
cosmic	1	3318823	COSV54604891	G	C	.	.	GENE=PRDM16_ENST00000442529;STRAND=+;LEGACY_ID=COSN22481089;CDS=c.677-532G>C;AA=p.?;CNT=1
...

The Annotation program extracts the following information:

Default output
1	3318823	rs534786798	G	A	100	PASS	.;1000g_AC=2;1000g_AF=0.000399361

Users can set --force-overlap to true to omit REF and ALT matching.

Setting --force-overlap to true
1	3318823	rs534786798	G	A	100	PASS	.;1000g_AC=2;1000g_AF=0.000399361;cosmic_GENE=PRDM16,PRDM16_ENST00000378398,PRDM16_ENST00000378391,PRDM16_ENST00000441472,PRDM16_ENST00000442529,PRDM16_ENST00000511072,PRDM16_ENST00000514189;cosmic_STRAND=+,+,+,+,+,+,+

Position Resolving Rule

Because chromosome positions in both the query and the annotation databases are critical, VarNote parses the position of each record according to both file format and allele composition, and transforms it to a 0-based, half-open coordinate system (start inclusive, end exclusive). The position resolving rules are shown below.

Note

The Ref column is used for position resolving in all programs. For example: "1   10177   .   ACC   A" will be parsed as (10176, 10179).

The Ref and Alt columns are used for allele-specific extraction in the Annotation programs.

VCF format, vcf is one-based

Data	Position Resolved	Description
1   10177   .   A   T  ...	10176, 10177	SNV
1   10177   .   A   ACC  ...	10176, 10177	INS
1   10177   .   ACC   A  ...	10176, 10179	DEL, end=10176+3
1   10177   .   GGCGCG   TCCGCA  ...	10176, 10182
1   10177   .   CGCA   TGCA,C  ...	10176, 10180
1   10177   .   GGGG   G   .   .   END=10180	10176, 10180	Read END from INFO
Structural variant with confidence interval of breakpoint
1   869465   1   N   <DEL>   1293.8   .   SVTYPE=DEL;END=870217;CIPOS=-10,157;CIEND=-84,10;	869454, 870227	SVTYPE with END in INFO, beg=beg+CIPOS[0], end=end+CIEND[1], that is beg=869464-10, end=870217+10
1   1157791   4345_1   N   N[4:76212291[   0.0   .   SVTYPE=BND;CIPOS=-8,8;CIEND=-9,6	1157782, 1157797	SVTYPE, beg=beg+CIPOS[0], end=end+CIEND[1], that is beg=1157790-8, end=1157791+6
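
The VCF rules in the table can be condensed into a few lines. The Python below is a sketch inferred from the rows of this table, not VarNote's source code; the function name `resolve_vcf` and its signature are hypothetical.

```python
from typing import Tuple


def resolve_vcf(pos: int, ref: str, info: str = "") -> Tuple[int, int]:
    """Resolve a VCF record to a 0-based, half-open (beg, end) interval.

    pos  : 1-based POS column
    ref  : REF column
    info : INFO column (may be empty)
    """
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    beg = pos - 1                               # VCF is one-based
    if "END" in fields:
        end = int(fields["END"])                # read END from INFO
    else:
        end = beg + len(ref)                    # otherwise span the REF allele
    if "SVTYPE" in fields:
        # Widen by the outer bounds of the breakpoint confidence intervals.
        if "CIPOS" in fields:
            beg += int(fields["CIPOS"].split(",")[0])
        if "CIEND" in fields:
            end += int(fields["CIEND"].split(",")[1])
    return beg, end


if __name__ == "__main__":
    print(resolve_vcf(10177, "A"))      # SNV or INS
    print(resolve_vcf(10177, "ACC"))    # DEL
    print(resolve_vcf(869465, "N", "SVTYPE=DEL;END=870217;CIPOS=-10,157;CIEND=-84,10;"))
```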

BED-like Format, bed is zero-based

Data	Position Resolved	Description
1   10177   10178   .   A   T	10177, 10178	SNV
1   10177   10178   .   A   ACC	10177, 10178	INS
1   10177   10179   .   ACC   A	10177, 10179	DEL

Tab Format, default is one-based

Data	Position Resolved	Description
1   10177   .   A   T	10176, 10177	SNV
1   10177   .   A   ACC	10176, 10177	INS
1   10177   .   ACC   A	10176, 10179	DEL, end=10176+3
1   10177   10178   .   A   ACC	10176, 10178	INS, read end from the 3rd column
1   10177   10179   .   ACC   A	10176, 10179	DEL, read end from the 3rd column

Tab Format, zero-based

Data	Position Resolved	Description
1   10177   .   A   T	10177, 10178	SNV
1   10177   .   A   ACC	10177, 10178	INS
1   10177   .   ACC   A	10177, 10180	DEL, end=10177+3
1   10177   10178   .   A   ACC	10177, 10178	INS, read end from the 3rd column
1   10177   10179   .   ACC   A	10177, 10179	DEL, read end from the 3rd column
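
The Tab-format rules for both coordinate conventions can be summarised in a single function. As before, this Python is a sketch inferred from the tables rather than VarNote's implementation; `resolve_tab` and its parameters are hypothetical names.

```python
from typing import Optional, Tuple


def resolve_tab(pos: int, ref: str, end: Optional[int] = None,
                zero_based: bool = False) -> Tuple[int, int]:
    """Resolve a TAB-delimited record to a 0-based, half-open (beg, end).

    pos        : begin position as written in the file
    ref        : reference allele
    end        : value of the optional end column, used unchanged when present
    zero_based : False for the default one-based files, True for zero-based
    """
    beg = pos if zero_based else pos - 1
    if end is None:
        end = beg + len(ref)    # no end column: span the reference allele
    return beg, end


if __name__ == "__main__":
    print(resolve_tab(10177, "ACC"))                               # one-based DEL
    print(resolve_tab(10177, "A", end=10178))                      # one-based, end column
    print(resolve_tab(10177, "ACC", zero_based=True))              # zero-based DEL
    print(resolve_tab(10177, "ACC", end=10179, zero_based=True))   # zero-based, end column
```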

Please cite VarNote as follows:

Huang D, Yi X, Zhou Y, Yao H, Xu H, Wang J, Zhang S, Nong W, Wang P, Shi L, Xuan C, Li M, Wang J, Li W, Kwan HS, Sham PC, Wang K, Li MJ*. Ultrafast and scalable variant annotation and prioritization with big functional genomics data. Genome Res. 2020 Dec;30(12):1789-1801.