| extractTranscriptsFromGenome {GenomicFeatures} | R Documentation |
extractTranscriptsFromGenome extracts the transcript
sequences from a BSgenome data package using the transcript
information (exon boundaries) stored in a TranscriptDb
or GRangesList object.
extractTranscripts extracts a set of transcripts
from a single DNA sequence.
Related utilities:
transcriptWidths to get the lengths of the transcripts
(called the "widths" in this context) based on the boundaries
of their exons.
transcriptLocs2refLocs converts transcript-based
locations into reference-based (aka chromosome-based or genomic)
locations.
extractTranscriptsFromGenome(genome, txdb, use.names=TRUE)
extractTranscripts(x,
exonStarts=list(), exonEnds=list(), strand=character(0),
reorder.exons.on.minus.strand=FALSE)
## Related utilities:
transcriptWidths(exonStarts=list(), exonEnds=list())
transcriptLocs2refLocs(tlocs,
exonStarts=list(), exonEnds=list(), strand=character(0),
reorder.exons.on.minus.strand=FALSE)
genome |
A BSgenome object.
See the |
txdb |
A TranscriptDb object or a GRangesList object. |
use.names |
|
x |
A DNAString or MaskedDNAString object. |
exonStarts, exonEnds |
The starts and ends of the exons, respectively. Each argument can be a list of integer vectors,
an IntegerList object,
or a character vector where each element is a
comma-separated list of integers.
In addition, the lists represented by |
strand |
A character vector of the same length as |
reorder.exons.on.minus.strand |
|
tlocs |
A list of integer vectors of the same length as |
For extractTranscriptsFromGenome: A named
DNAStringSet object with one element per transcript.
When txdb is a GRangesList object, elements
in the output align with elements in the input (txdb), and they
have the same names.
For extractTranscripts: A DNAStringSet object with one
element per transcript.
For transcriptWidths: An integer vector with one element per
transcript.
For transcriptLocs2refLocs: A list of integer vectors of the same
shape as tlocs.
H. Pages
available.genomes,
TranscriptDb-class,
GRangesList-class,
DNAStringSet-class
library(BSgenome.Hsapiens.UCSC.hg18) # load the genome
## ---------------------------------------------------------------------
## A. USING extractTranscriptsFromGenome() WITH A TranscriptDb OBJECT
## ---------------------------------------------------------------------
txdb_file <- system.file("extdata", "UCSC_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadFeatures(txdb_file)
txseqs <- extractTranscriptsFromGenome(Hsapiens, txdb)
txseqs
## ---------------------------------------------------------------------
## B. USING extractTranscriptsFromGenome() WITH A GRangesList OBJECT
## ---------------------------------------------------------------------
## A GRangesList object containing exons grouped by transcripts gives
## the same result as above:
exbytx <- exonsBy(txdb, by="tx", use.names=TRUE)
txseqs2 <- extractTranscriptsFromGenome(Hsapiens, exbytx)
## A sanity check:
stopifnot(identical(unname(sapply(width(exbytx), sum)), width(txseqs2)))
## CDSs grouped by transcripts (this extracts only the translated parts
## of the transcripts):
cds <- extractTranscriptsFromGenome(Hsapiens, cdsBy(txdb))
## ---------------------------------------------------------------------
## C. GOING FROM TRANSCRIPT-BASED TO REFERENCE-BASED LOCATIONS
## ---------------------------------------------------------------------
## Get the reference-based locations of the first 4 (5' end)
## and last 4 (3' end) nucleotides in each transcript:
tlocs <- lapply(width(txseqs2), function(w) c(1:4, (w-3):w))
tx_strand <- sapply(strand(exbytx), runValue)
## Note that, because of how we made them, 'tlocs', 'start(exbytx)',
## 'end(exbytx)' and 'tx_strand' have the same length, and, for any
## valid positional index, elements at this position are corresponding
## to each other. This is how transcriptLocs2refLocs() expects them
## to be!
rlocs <- transcriptLocs2refLocs(tlocs, start(exbytx), end(exbytx),
tx_strand, reorder.exons.on.minus.strand=TRUE)
## ---------------------------------------------------------------------
## D. EXTRACTING WORM TRANSCRIPTS ZC101.3 AND F37B1.1
## ---------------------------------------------------------------------
## Transcript ZC101.3 (is on + strand):
## Exons starts/ends relative to transcript:
rstarts1 <- c(1, 488, 654, 996, 1365, 1712, 2163, 2453)
rends1 <- c(137, 578, 889, 1277, 1662, 1870, 2410, 2561)
## Exons starts/ends relative to chromosome:
starts1 <- 14678410 + rstarts1
ends1 <- 14678410 + rends1
## Transcript F37B1.1 (is on - strand):
## Exons starts/ends relative to transcript:
rstarts2 <- c(1, 325)
rends2 <- c(139, 815)
## Exons starts/ends relative to chromosome:
starts2 <- 13611188 - rends2
ends2 <- 13611188 - rstarts2
exon_starts <- list(as.integer(starts1), as.integer(starts2))
exon_ends <- list(as.integer(ends1), as.integer(ends2))
library(BSgenome.Celegans.UCSC.ce2)
## Both transcripts are on chrII:
chrII <- Celegans$chrII
transcripts <- extractTranscripts(chrII,
exonStarts=exon_starts,
exonEnds=exon_ends,
strand=c("+","-"))
## Same as 'width(transcripts)':
transcriptWidths(exonStarts=exon_starts, exonEnds=exon_ends)
transcriptLocs2refLocs(list(c(1:6, 135:140, 1555:1560),
c(1:6, 137:142, 625:630)),
exonStarts=exon_starts,
exonEnds=exon_ends,
strand=c("+","-"))
## A sanity check:
ref_locs <- transcriptLocs2refLocs(list(1:1560, 1:630),
exonStarts=exon_starts,
exonEnds=exon_ends,
strand=c("+","-"))
stopifnot(chrII[ref_locs[[1]]] == transcripts[[1]])
stopifnot(complement(chrII)[ref_locs[[2]]] == transcripts[[2]])