Breakmer

A method to identify structural variation from sequencing data in target regions

View the Project on GitHub a-bioinformatician/BreaKmer

BreaKmer

A method to identify genomic structural variation in target regions/genes from reference-aligned high-throughput sequence data. It uses a “kmer” strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications, and translocations at base-pair resolution.

Installation

The following are required for installation:

Download the python scripts and run the command:

python setup.py install

Use appropriate commands for installing locally:

python setup.py install --user

Using the setup.py script for installation should setup the required python module dependencies for appropriate usage. If the install script is not used, the two modules need to be downloaded and installed if not already:

BreaKmer has currently been installed and tested on:

Usage

List the available command line parameters.

python <PATH_TO_BREAKMER_DIR>/breakmer.py -h

Analyze all the target genes specified in the targets bed file.

python <path to breakmer scripts>/breakmer.py run <options> -c <path to breakmer configuration file>

Prepare the reference data files before starting the analysis.

python <path to breakmer scripts>/breakmer.py prepare_reference_data <options> -c <path to breakmer configuration file>

Start the blat server on a specific host and port.

python <path to breakmer scripts>/breakmer.py start_blat_server <options> -c <path to breakmer configuration file>

Analyze a subset of genes specified in a file.

python <path to breakmer scripts>/breakmer.py -g <file containing list of target genes to analyze> <path to breakmer configuration file>

Requirements

Programs

When these programs are installed, the paths to the binaries can be either be specified in the BreaKmer configuration file or put in the path (e.g. export PATH=$PATH:/path/to/binary).

Sequence data

Reference data

Other files

Configuration file

# Required parameters
analysis_name=<sample_id, string value that all the output files will contain>
targets_bed_file=<path to bed file containing locations of target regions>
sample_bam_file=<path to sorted and indexed sample bam file>
analysis_dir=<path to analysis directory, location where analysis and final output files will be located>
reference_data_dir=<path to where reference files will/are stored for each of the targeted genes analyzed> 
cutadapt=<path to cutadapt binary v1.5, i.e. /usr/bin/cutadapt-1.5/bin/cutadapt> 
cutadapt_config_file=<path to cutadapt configuration file> 
jellyfish=<path to Jellyfish binary, i.e. /usr/bin/jellyfish>
blat=<path to blat binary, i.e. /usr/bin/blat>
gfserver=<path to gfServer binary, i.e. /usr/bin/gfServer>
gfclient=<path to gfClient binary, i.e. /usr/bin/gfClient>
fatotwobit=<path to faToTwoBit binary, i.e. /usr/bin/faToTwoBit>
reference_fasta=<path to whole genome reference fasta file, one file with all records>
gene_annotation_file=<path to gene annotation file>
kmer_size=15

# Optional parameters
other_regions_file=<path to bed file containing coordinates for targeted unannotated cluster regions if they exist, such as IGH, IGK> 
normal_bam_file=<path to normal bam file, can be used to filter germline events with matched-normal sample>

Input file formats

15      45003745        45003811        B2M     exon
15      45007621        45007922        B2M     exon
15      45008527        45008540        B2M     exon
14   22090057        23021075        TRA
7    141998851       142510972       TRB
column-number content values/format
1 chromosome name chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
2 annotation source {ENSEMBL,HAVANA}
3 feature-type  {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4 genomic start location  integer-value (1-based)
5 genomic end location  integer-value
6 score (not used)  .
7 genomic strand  {+,-}
8 genomic phase (for CDS features)  {0,1,2,.}
9 additional information as key-value pairs see below

chr21   HAVANA  transcript      10862622        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  exon    10862622        10862667        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  CDS     10862622        10862667        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  start_codon     10862622        10862624        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  exon    10862751        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  CDS     10862751        10863064        .       +       2       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  stop_codon      10863065        10863067        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  UTR     10863065        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
...

BreaKmer parameters

Program Mode Parameter Description Default
run Perform structural variant detection analysis NA
-h, --help Show the help message and exit NA
--log_level The level of logging detail to record while program is running. DEBUG
--indel_size The minimum indel size (in base pairs) to keep and output. 15
--trl_sr_thresh Split read support threshold for translocation events (i.e., the minimum number of split reads required to keep a translocation event). 2
--indel_sr_thresh Split read support threshold for indels 5
--rearr_sr_thresh Split read support threshold for general rearrangements, inversions, tandem duplications, or other unclassified rearrangements. 2
--rearr_min_seg_len Minimum length (in base pairs) of a realignment portion of an assembled contig sequence to consider for filtering a rearrangement event (inversions, tandem duplications, other unclassified rearrangements). 30
--trl_min_seg_len Minimum length (in base pairs) of a realignment portion of an assembled contig sequence to consider for filtering a translocation event. 25
--align_thresh Threshold for minimum read alignment during assembly. 0.9
--no_output_header Suppress column headers in output files. False
--discread_only_thresh The number of discordant read pairs in a cluster to output without evidence from a split read event. 2
--generate_image Generate a plot containing the assembled reads used to create a contig sequence with a called variant along with genes and transcripts associated with the event - requires gene_annotation_file in configuration file to be set with GTF file containing transcript annotations. False
--hostname The hostname for the blat server. localhost
-g, --gene_list File containing the names of specific regions to analyze, these target names must match the names in the bed file defining the target regions to analyze. None
-f, --filter_list A file containing specific variant events to filter out. None
-n, --nprocessors The number of processors to use for analysis. 1
-s, --start_blat_sever Option to indicate that a blat server needs to be started before analysis begins. A random part number and the localhost will be used if neither is specified. False
-k, --keep_blat_server Option to keep the blat server running after the analysis completes. False
-p, --port_number The port number for the blat server to either start on or already is running on. None
-c, --config The configuration filename that contains additional parameters. None
start_blat_server Start the blat server prior to performing the analysis NA
-h, --help Show the help message and exit. NA
--hostname The hostname for the blat server. localhost
-p, --port_number The port number for the blat server to either start on or already is running on. None
-c, --config The configuration filename that contains additional parameters. None
prepare_reference_data Extract the reference sequence from all the input target regions and store them in files to access during analysis. NA
-h, --help Show the help message and exit NA
-g, --gene_list File containing the names of specific regions to analyze, these target names must match the names in the bed file defining the target regions to analyze. None
-c, --config The configuration filename that contains additional parameters. None
-n, --nprocessors The number of processors to use for analysis. 1

Output files and formats

Logging

Target data

Final output