A method to identify structural variation from sequencing data in target regions
A method to identify genomic structural variation in target regions/genes from reference-aligned high-throughput sequence data. It uses a “kmer” strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications, and translocations at base-pair resolution.
The following are required for installation:
Download the python scripts and run the command:
python setup.py install
Use appropriate commands for installing locally:
python setup.py install --user
Using the setup.py script for installation should setup the required python module dependencies for appropriate usage. If the install script is not used, the two modules need to be downloaded and installed if not already:
BreaKmer has currently been installed and tested on:
List the available command line parameters.
python <PATH_TO_BREAKMER_DIR>/breakmer.py -h
Analyze all the target genes specified in the targets bed file.
python <path to breakmer scripts>/breakmer.py run <options> -c <path to breakmer configuration file>
Prepare the reference data files before starting the analysis.
python <path to breakmer scripts>/breakmer.py prepare_reference_data <options> -c <path to breakmer configuration file>
Start the blat server on a specific host and port.
python <path to breakmer scripts>/breakmer.py start_blat_server <options> -c <path to breakmer configuration file>
Analyze a subset of genes specified in a file.
python <path to breakmer scripts>/breakmer.py -g <file containing list of target genes to analyze> <path to breakmer configuration file>
When these programs are installed, the paths to the binaries can be either be specified in the BreaKmer configuration file or put in the path (e.g. export PATH=$PATH:/path/to/binary).
# Required parameters
analysis_name=<sample_id, string value that all the output files will contain>
targets_bed_file=<path to bed file containing locations of target regions>
sample_bam_file=<path to sorted and indexed sample bam file>
analysis_dir=<path to analysis directory, location where analysis and final output files will be located>
reference_data_dir=<path to where reference files will/are stored for each of the targeted genes analyzed>
cutadapt=<path to cutadapt binary v1.5, i.e. /usr/bin/cutadapt-1.5/bin/cutadapt>
cutadapt_config_file=<path to cutadapt configuration file>
jellyfish=<path to Jellyfish binary, i.e. /usr/bin/jellyfish>
blat=<path to blat binary, i.e. /usr/bin/blat>
gfserver=<path to gfServer binary, i.e. /usr/bin/gfServer>
gfclient=<path to gfClient binary, i.e. /usr/bin/gfClient>
fatotwobit=<path to faToTwoBit binary, i.e. /usr/bin/faToTwoBit>
reference_fasta=<path to whole genome reference fasta file, one file with all records>
gene_annotation_file=<path to gene annotation file>
kmer_size=15
# Optional parameters
other_regions_file=<path to bed file containing coordinates for targeted unannotated cluster regions if they exist, such as IGH, IGK>
normal_bam_file=<path to normal bam file, can be used to filter germline events with matched-normal sample>
15 45003745 45003811 B2M exon
15 45007621 45007922 B2M exon
15 45008527 45008540 B2M exon
14 22090057 23021075 TRA
7 141998851 142510972 TRB
column-number content values/format
1 chromosome name chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
2 annotation source {ENSEMBL,HAVANA}
3 feature-type {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
4 genomic start location integer-value (1-based)
5 genomic end location integer-value
6 score (not used) .
7 genomic strand {+,-}
8 genomic phase (for CDS features) {0,1,2,.}
9 additional information as key-value pairs see below
chr21 HAVANA transcript 10862622 10863067 . + . gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA exon 10862622 10862667 . + . gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA CDS 10862622 10862667 . + 0 gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA start_codon 10862622 10862624 . + 0 gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA exon 10862751 10863067 . + . gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA CDS 10862751 10863064 . + 2 gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA stop_codon 10863065 10863067 . + 0 gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21 HAVANA UTR 10863065 10863067 . + . gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
...
Program Mode | Parameter | Description | Default |
---|---|---|---|
run | Perform structural variant detection analysis | NA | |
-h, --help | Show the help message and exit | NA | |
--log_level | The level of logging detail to record while program is running. | DEBUG | |
--indel_size | The minimum indel size (in base pairs) to keep and output. | 15 | |
--trl_sr_thresh | Split read support threshold for translocation events (i.e., the minimum number of split reads required to keep a translocation event). | 2 | |
--indel_sr_thresh | Split read support threshold for indels | 5 | |
--rearr_sr_thresh | Split read support threshold for general rearrangements, inversions, tandem duplications, or other unclassified rearrangements. | 2 | |
--rearr_min_seg_len | Minimum length (in base pairs) of a realignment portion of an assembled contig sequence to consider for filtering a rearrangement event (inversions, tandem duplications, other unclassified rearrangements). | 30 | |
--trl_min_seg_len | Minimum length (in base pairs) of a realignment portion of an assembled contig sequence to consider for filtering a translocation event. | 25 | |
--align_thresh | Threshold for minimum read alignment during assembly. | 0.9 | |
--no_output_header | Suppress column headers in output files. | False | |
--discread_only_thresh | The number of discordant read pairs in a cluster to output without evidence from a split read event. | 2 | |
--generate_image | Generate a plot containing the assembled reads used to create a contig sequence with a called variant along with genes and transcripts associated with the event - requires gene_annotation_file in configuration file to be set with GTF file containing transcript annotations. | False | |
--hostname | The hostname for the blat server. | localhost | |
-g, --gene_list | File containing the names of specific regions to analyze, these target names must match the names in the bed file defining the target regions to analyze. | None | |
-f, --filter_list | A file containing specific variant events to filter out. | None | |
-n, --nprocessors | The number of processors to use for analysis. | 1 | |
-s, --start_blat_sever | Option to indicate that a blat server needs to be started before analysis begins. A random part number and the localhost will be used if neither is specified. | False | |
-k, --keep_blat_server | Option to keep the blat server running after the analysis completes. | False | |
-p, --port_number | The port number for the blat server to either start on or already is running on. | None | |
-c, --config | The configuration filename that contains additional parameters. | None | |
start_blat_server | Start the blat server prior to performing the analysis | NA | |
-h, --help | Show the help message and exit. | NA | |
--hostname | The hostname for the blat server. | localhost | |
-p, --port_number | The port number for the blat server to either start on or already is running on. | None | |
-c, --config | The configuration filename that contains additional parameters. | None | |
prepare_reference_data | Extract the reference sequence from all the input target regions and store them in files to access during analysis. | NA | |
-h, --help | Show the help message and exit | NA | |
-g, --gene_list | File containing the names of specific regions to analyze, these target names must match the names in the bed file defining the target regions to analyze. | None | |
-c, --config | The configuration filename that contains additional parameters. | None | |
-n, --nprocessors | The number of processors to use for analysis. | 1 |