wig¶
This module parses wig files (the wig file specifications are available at https://wiki.nci.nih.gov/display/tcga/wiggle+format+specification and http://genome.ucsc.edu/goldenPath/help/wiggle.html). The wig files handled by this module differ slightly from the canonical specification: they can encode coverage on both the forward and the reverse strand. A positive coverage score means the coverage is on the forward strand; a negative score means it is on the reverse strand.
The WigParser and helpers¶
The craw.wig.WigParser
parses the wig file. It reads the file line by line,
tests the category of each line (trackLine, declarationLine or dataLine) and calls the right method
to parse the line and build the genome object.
- The classes
craw.wig.VariableChunk
and craw.wig.FixedChunk
are not kept in the final data model; they are only used to parse the data lines and convert the wig file information (step, span) into a coverage for each position.
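The line-by-line dispatch described above can be sketched as follows. This is only an illustration, not the craw.wig.WigParser code: the function name classify_line is hypothetical, and it assumes that comment lines start with '#', as in the UCSC wiggle format. Blank lines are not treated specially here.

```python
def classify_line(line):
    """Return 'comment', 'track', 'declaration' or 'data' for one wig line.

    Hypothetical helper: the real parser exposes is_comment_line,
    is_track_line, is_declaration_line and is_data_line instead.
    """
    stripped = line.strip()
    if stripped.startswith('#'):          # assumption: '#' marks a comment
        return 'comment'
    if stripped.startswith('track'):
        return 'track'
    if stripped.startswith(('variableStep', 'fixedStep')):
        return 'declaration'
    return 'data'


lines = [
    '# a comment',
    'track type=wiggle_0 name="demo"',
    'fixedStep chrom=chrI start=1 step=10 span=5',
    '12.5',
    '-3',
]
kinds = [classify_line(line) for line in lines]
print(kinds)
```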
The data model to handle the wig information¶
A craw.wig.Genome
object contains craw.wig.Chromosome
objects (each chromosome is unique, and chromosome names are unique).
Each chromosome contains the coverage for both strands.
To get the coverage for a region or a position, access it with indices or slices, as with a traditional
Python list, tuple, and so on. Slicing returns two lists:
the first is the coverage of the region on the forward strand,
the second is the coverage on the reverse strand.
By default, chromosomes are initialized with a coverage of 0.0 at every position.
All information specified in the track line is stored as a dict in the infos
attribute of craw.wig.Genome.
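To make the shape of this data model concrete, here is a minimal stand-in written with plain Python lists. SketchChromosome and SketchGenome are hypothetical names invented for this illustration; they only mimic the behaviour described above (two strands, 0.0 by default, slicing returns two lists, infos is a dict) and are not the craw.wig classes.

```python
class SketchChromosome:
    """Hypothetical stand-in for a chromosome holding two strands."""

    def __init__(self, name, size=10):
        self.name = name
        # every position starts with a coverage of 0.0 on both strands
        self.forward = [0.0] * size
        self.reverse = [0.0] * size

    def __getitem__(self, pos):
        # works for an int (two floats) and for a slice (two lists)
        return [self.forward[pos], self.reverse[pos]]


class SketchGenome:
    """Hypothetical stand-in: chromosomes indexed by name, plus infos."""

    def __init__(self):
        self.infos = {}
        self._chromosomes = {}

    def add(self, chrom):
        # a chromosome with the same name is silently replaced
        self._chromosomes[chrom.name] = chrom

    def __getitem__(self, name):
        return self._chromosomes[name]


genome = SketchGenome()
genome.infos['type'] = 'wiggle_0'
chrom = SketchChromosome('chrI', size=10)
chrom.forward[2:4] = [11.0, 11.0]
chrom.reverse[3] = 30.0
genome.add(chrom)

forward, reverse = genome['chrI'][0:5]
print(forward)
print(reverse)
```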
wig API reference¶
- class
craw.wig.
Chromosome
(name, size=1000000)[source]¶Handle chromosomes. A chromosome has a name and contains
Chunk
objects (forward and reverse)
__getitem__
(pos)[source]¶
- Parameters
pos – a position or a slice (0-based). If pos is a slice, the left index is excluded
- Returns
the coverage at this position or corresponding to this slice.
- Return type
a list of 2 lists of float [[float, ..], [float, ..]]
- Raises
IndexError – if pos is not in the coverage or one bound of the slice is outside the coverage
__init__
(name, size=1000000)[source]¶
- Parameters
name (str) –
size – the default size of the chromosome. Each time we try to set a value beyond the current size of the chromosome, the chromosome size is doubled. This protects the machine against memory swapping if the user provides a wig file with very big chromosomes.
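The doubling strategy mentioned for size can be sketched on a plain list as follows. The helper names grown_size and set_with_growth are hypothetical, and the sketch leaves out the free-memory check that the real class performs through _estimate_memory and _extend.

```python
def grown_size(current_size, wanted_pos):
    """Return the size reached by doubling until wanted_pos (0-based) fits."""
    if current_size <= 0:
        raise ValueError('current_size must be positive')
    new_size = current_size
    while wanted_pos >= new_size:
        new_size *= 2
    return new_size


def set_with_growth(coverage, pos, value, fill=0.0):
    """Set coverage[pos], extending the list with fill when it is too short."""
    new_size = grown_size(len(coverage), pos)
    coverage.extend([fill] * (new_size - len(coverage)))
    coverage[pos] = value
    return coverage


cov = [0.0] * 4
set_with_growth(cov, 9, 7.0)   # 4 -> 8 -> 16 positions
print(len(cov), cov[9])
```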
__setitem__
(pos, value)[source]¶
- Parameters
pos (int or
slice
object) – the position (0-based) to set
value (float or iterable of float) – value to assign
- Raises
ValueError – when pos is a slice and value does not have the same length as the slice
TypeError – when pos is a slice and value is not iterable
IndexError – if pos is not in the coverage or one bound of the slice is outside the coverage
__weakref__
¶list of weak references to the object (if defined)
_estimate_memory
(col_nb, mem_per_col)[source]¶
- Parameters
col_nb (int) – the number of columns of the new array or of the extension
mem_per_col (int) – the memory needed to create or extend an array with one column and 2 rows filled with 0.0
- Returns
the estimate of the free memory available after creating or extending the chromosome
- Return type
int
_extend
(size=1000000, fill=0.0)[source]¶Extend this chromosome by size and fill the new positions with fill.
- Parameters
size (int) – the size (in bp) by which to extend the chromosome.
fill (float or nan) – the default value to fill the chromosome with.
- Raises
MemoryError – if the chromosome extension would exceed the free memory.
- class
craw.wig.
Chunk
(**kwargs)[source]¶Represent the data following a declaration line. A Chunk contains sparse coverage data for a region of one chromosome on both strands, plus the data contained in the declaration line.
__init__
(**kwargs)[source]¶
- Parameters
kwargs (dictionary) – the key/value pairs found on a declaration line
__weakref__
¶list of weak references to the object (if defined)
is_fixed_step
()[source]¶This is an abstract method; it must be implemented in the inheriting class.
- Returns
True if it is a fixed chunk of data, False otherwise
- Return type
boolean
parse_data_line
(line, chrom, strand_type)[source]¶Parse a line of data and append the results to the corresponding strand. This is an abstract method; it must be implemented in the inheriting class.
- Parameters
line (string) – line of data to parse (trailing white space must be stripped)
chrom (
Chromosome
object.) – the chromosome to add coverage data to
strand_type (string '+' , '-', 'mixed') – which kind of wig is being parsed: forward, reverse, or mixed strand
- class
craw.wig.
FixedChunk
(**kwargs)[source]¶The FixedChunk objects handle the data of a ‘fixedStep’ declaration line and its coverage data
__init__
(**kwargs)[source]¶
- Parameters
kwargs (dictionary) – the key/value pairs found on a declaration line
parse_data_line
(line, chrom, strand_type)[source]¶Parse a line of data following a fixedStep declaration. Add the result to the corresponding strand (forward if the coverage value is positive, reverse otherwise).
- Parameters
line (string) – line of data to parse (trailing white space must be stripped)
chrom (
Chromosome
object.) – the chromosome to add coverage data to
strand_type (string '+' , '-', 'mixed') – which kind of wig is being parsed: forward, reverse, or mixed strand
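As an illustration of how fixedStep data lines could be turned into per-position coverage, here is a minimal sketch. The function expand_fixed_step is hypothetical and works on plain lists rather than on a Chromosome. It assumes that index 0 of the result corresponds to the start attribute of the declaration line and that reverse-strand coverage is stored as an absolute value.

```python
def expand_fixed_step(step, span, values):
    """Return (forward, reverse) coverage lists for one fixedStep chunk.

    Hypothetical helper: index 0 of each list is the chunk start.
    """
    length = (len(values) - 1) * step + span if values else 0
    forward = [0.0] * length
    reverse = [0.0] * length
    for i, value in enumerate(values):
        offset = i * step
        # positive score -> forward strand, negative score -> reverse strand
        target = forward if value >= 0 else reverse
        for j in range(offset, offset + span):
            target[j] = abs(float(value))
    return forward, reverse


# three data lines following e.g. "fixedStep chrom=chrI start=1 step=3 span=2"
forward, reverse = expand_fixed_step(step=3, span=2, values=[5, -4, 6])
print(forward)
print(reverse)
```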
- class
craw.wig.
Genome
[source]¶A genome is made of chromosomes and some metadata, called infos
__delitem__
(name)[source]¶remove a chromosome from this genome
- Parameters
name (string) – the name of the chromosome to remove
- Returns
None
__getitem__
(name)[source]¶
- Parameters
name (string) – the name of the chromosome to retrieve
- Returns
the chromosome corresponding to the name.
- Return type
Chromosome
object.
__weakref__
¶list of weak references to the object (if defined)
add
(chrom)[source]¶Add a chromosome to a genome. If a chromosome with the same name already exists, the previous one is silently replaced by this one.
- Parameters
chrom (
Chromosome
object.) – a chromosome to add to this genome
- Raises
TypeError – if chrom is not a
Chromosome
object.
- class
craw.wig.
VariableChunk
(**kwargs)[source]¶The VariableChunk objects handle the data of a ‘variableStep’ declaration line and its coverage data
If the data contain negative values, this indicates that the coverage is on the reverse strand. The chunk starts at the smallest position and ends at the highest position, whichever strand these positions are on. This means that when the chunk is converted into coverage, the missing positions are filled with 0.0.
For instance:
variableStep chrom=chr3 span=2
10 11
20 22
20 -30
25 -50
will give coverages starting at position 10 and ending at 26 for both strands, with the following coverage values
for = [11.0, 11.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 22.0, 22.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rev = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 30.0, 30.0, 0.0, 0.0, 0.0, 50.0, 50.0]
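The conversion in this example can be reproduced with a short sketch. The function expand_variable_step is hypothetical and works on plain lists; it is not the VariableChunk implementation, only a way to check the arithmetic of the example (span of 2, chunk running from the smallest position to the highest position plus span minus 1).

```python
def expand_variable_step(span, data_lines):
    """Return (start, end, forward, reverse) for one variableStep chunk.

    Hypothetical helper; data_lines is an iterable of 'position value'.
    """
    records = []
    for line in data_lines:
        pos_field, value_field = line.split()
        records.append((int(pos_field), float(value_field)))
    start = min(pos for pos, _ in records)
    end = max(pos for pos, _ in records) + span - 1
    length = end - start + 1
    # positions without data keep the default coverage of 0.0
    forward = [0.0] * length
    reverse = [0.0] * length
    for pos, value in records:
        target = forward if value >= 0 else reverse
        for i in range(pos - start, pos - start + span):
            target[i] = abs(value)
    return start, end, forward, reverse


start, end, forward, reverse = expand_variable_step(
    2, ['10 11', '20 22', '20 -30', '25 -50'])
print(start, end)
print(forward)
print(reverse)
```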
parse_data_line
(line, chrom, strand_type)[source]¶Parse a line of data following a variableStep declaration. Add the result to the corresponding strand (forward if the coverage value is positive, reverse otherwise).
- Parameters
line (string) – line of data to parse (trailing white space must be stripped)
chrom (
Chromosome
object.) – the chromosome to add coverage data to
strand_type (string '+' , '-', 'mixed') – which kind of wig is being parsed: forward, reverse, or mixed strand
- Raises
ValueError – if strand_type is not one of ‘mixed’, ‘-’, ‘+’
- exception
craw.wig.
WigError
[source]¶Handle errors related to wig parsing
__weakref__
¶list of weak references to the object (if defined)
- class
craw.wig.
WigParser
(mixed_wig='', for_wig='', rev_wig='')[source]¶Class to parse files in wig format. At the end of parsing it returns a
Genome
object.
__init__
(mixed_wig='', for_wig='', rev_wig='')[source]¶
- Parameters
mixed_wig (string) –
The path of the wig file to parse. The wig file encodes both strands:
The positive coverage values are for the forward strand
The negative coverage values are for the reverse strand
This parameter is incompatible with the for_wig and rev_wig parameters.
for_wig (string) – The path of the wig file to parse. The wig file encodes the forward strand only. This parameter is incompatible with the mixed_wig parameter.
rev_wig (string) – The path of the wig file to parse. The wig file encodes the reverse strand only. This parameter is incompatible with the mixed_wig parameter.
__weakref__
¶list of weak references to the object (if defined)
- static
is_comment_line
(line)[source]¶
- Parameters
line (string) – line to parse.
- Returns
True if line is a comment line. False otherwise.
- Return type
boolean
is_data_line
(line)[source]¶
- Parameters
line – line to parse.
- Returns
True if it’s a data line, False otherwise
is_declaration_line
(line)[source]¶A single line, beginning with one of the identifiers variableStep or fixedStep, followed by attribute/value pairs for instance:
fixedStep chrom=chrI start=1 step=10 span=5
- Parameters
line (string) – line to parse.
- Returns
True if line is a declaration line. False otherwise.
- Return type
boolean
- static
is_track_line
(line)[source]¶A track line begins with the identifier track and is followed by attribute/value pairs, for instance:
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off
- Parameters
line (string) – line to parse.
- Returns
True if line is a track line. False otherwise.
- Return type
boolean
parse
()[source]¶Open a wig file and parse it. Read the wig file line by line, check the type of each line and call the corresponding method according to the type of the line:
- comment
- track
- declaration
- data
See
- https://wiki.nci.nih.gov/display/tcga/wiggle+format+specification
- http://genome.ucsc.edu/goldenPath/help/wiggle.html
for the wig specifications. This parser does not fully follow these specifications: when a score is negative, it means that the coverage is on the reverse strand. As a result, some positions can appear twice in one block of declaration (what I call a chunk).
- Returns
a Genome coverage corresponding to the wig files (mixed strands in one wig file, or two separate wig files)
- Return type
Genome
object
parse_data_line
(line, strand_type)[source]¶
- Parameters
line (string) – line to parse. It must not be a comment line, a track line or a declaration line.
- Raises
ValueError – if strand_type is not one of ‘mixed’, ‘-’, ‘+’
parse_declaration_line
(line)[source]¶Get the corresponding chromosome, creating one if necessary, and set the current_chunk and current_chromosome.
- Parameters
line – line to parse. The method
is_declaration_line()
must return True with this line.
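A declaration line such as the fixedStep example shown for is_declaration_line() can be split into its kind and its attribute/value pairs along these lines. The function parse_declaration is a hypothetical helper for illustration only; the conversion of start, step and span to int is an assumption based on the wig format, and the sketch does not create chromosomes or chunks.

```python
def parse_declaration(line):
    """Split a declaration line into (kind, attributes).

    Hypothetical helper, not the WigParser.parse_declaration_line method.
    """
    kind, *pairs = line.split()
    if kind not in ('fixedStep', 'variableStep'):
        raise ValueError(f'not a declaration line: {line!r}')
    attrs = dict(pair.split('=', 1) for pair in pairs)
    # assumption: these attributes are integers in the wig format
    for key in ('start', 'step', 'span'):
        if key in attrs:
            attrs[key] = int(attrs[key])
    return kind, attrs


kind, attrs = parse_declaration('fixedStep chrom=chrI start=1 step=10 span=5')
print(kind, attrs)
```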
parse_track_line
(line, strand_type='')[source]¶Fill the genome infos with the information found on the track line.
- Parameters
line – line to parse. The method
is_track_line()
must return True with this line.
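One possible way to turn a track line into a dict of infos is sketched below with the standard library shlex module, which keeps quoted values containing spaces together. The function parse_track is hypothetical and is not the WigParser.parse_track_line implementation; how the real parser handles quoting and the strand_type argument is not shown here.

```python
import shlex


def parse_track(line):
    """Return the attribute/value pairs of a track line as a dict.

    Hypothetical helper for illustration only.
    """
    tokens = shlex.split(line)   # honours double-quoted values
    if not tokens or tokens[0] != 'track':
        raise ValueError(f'not a track line: {line!r}')
    return dict(token.split('=', 1) for token in tokens[1:])


infos = parse_track(
    'track type=wiggle_0 name="fixedStep" '
    'description="fixedStep format" visibility=full autoScale=off'
)
print(infos)
```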