wig

This module parses wig files (the wig file specifications are available at https://wiki.nci.nih.gov/display/tcga/wiggle+format+specification and http://genome.ucsc.edu/goldenPath/help/wiggle.html). The wig files handled by this module differ slightly from the canonical specifications, as they can specify coverage on both the forward and the reverse strand: a positive coverage score means the coverage is on the forward strand, a negative score means it is on the reverse strand.

The WigParser and helpers

The craw.wig.WigParser class parses the wig file. It reads the file line by line, tests the category of each line (trackLine, declarationLine or dataLine) and calls the appropriate method to parse the line and build the genome object.
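
The dispatch described above can be sketched as follows. This is a minimal, self-contained illustration and not the craw.wig.WigParser implementation: the helper name classify_line, the "empty" category and the exact tests (prefix matching, '#' for comment lines) are assumptions.

```python
def classify_line(line):
    """Return the category of a wig line (hypothetical helper, not part of craw)."""
    stripped = line.strip()
    if not stripped:
        return "empty"
    if stripped.startswith("#"):
        # Assumption: comment lines start with '#'.
        return "comment"
    if stripped.startswith("track"):
        return "track"
    if stripped.startswith(("variableStep", "fixedStep")):
        return "declaration"
    # Anything else is treated as a data line.
    return "data"


lines = [
    'track type=wiggle_0 name="demo"',
    "fixedStep chrom=chrI start=1 step=10 span=5",
    "11",
    "-22",
]
categories = [classify_line(line) for line in lines]
```

A parser built this way would call a different parsing method for each category, in the order in which the lines appear in the file.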

The classes craw.wig.VariableChunk and craw.wig.FixedChunk are not kept in the final data model; they are only used to parse the data lines and convert the wig file information (step, span) into a coverage for each position.

The data model to handle the wig information

A craw.wig.Genome object contains craw.wig.Chromosome objects (each chromosome is unique, as is its name). Each chromosome contains the coverage for both strands. To get the coverage for a region or a position, access it with an index or a slice, as with a traditional Python list or tuple. Slicing returns two lists: the first is the coverage of the region on the forward strand, the second the coverage on the reverse strand. By default, chromosomes are initialized with a coverage of 0.0 at every position.
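
The access pattern can be illustrated with a toy class. This is only a sketch under stated assumptions: ToyChromosome is a hypothetical stand-in rather than craw.wig.Chromosome, it uses standard Python slice semantics (start included, stop excluded), and the way a single position is returned is a guess.

```python
class ToyChromosome:
    """Toy stand-in for craw.wig.Chromosome (hypothetical, for illustration only)."""

    def __init__(self, name, size=10):
        self.name = name
        # Both strands start with a coverage of 0.0 at every position.
        self._forward = [0.0] * size
        self._reverse = [0.0] * size

    def __len__(self):
        return len(self._forward)

    def __getitem__(self, pos):
        if isinstance(pos, int):
            if not 0 <= pos < len(self):
                raise IndexError(f"position {pos} is out of coverage")
            # Assumption: a single position is returned as two one-element lists.
            pos = slice(pos, pos + 1)
        return [self._forward[pos], self._reverse[pos]]


chrom = ToyChromosome("chrI")
chrom._forward[2:4] = [5.0, 5.0]  # fill the toy data directly
chrom._reverse[3] = 7.0

# Slicing yields one list per strand: forward first, then reverse.
forward, reverse = chrom[1:5]
```

Here forward holds the forward-strand coverage for positions 1 to 4 and reverse holds the reverse-strand coverage for the same positions.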

All the information specified in the track line is stored as a dict in the infos attribute of craw.wig.Genome.
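
As an illustration, a track line can be turned into such a dict as follows. This is a sketch, not the library's parsing code: the helper name track_infos is hypothetical, and using shlex to handle quoted values is an assumption about how one might do it.

```python
import shlex


def track_infos(line):
    """Convert a track line into a dict of its attribute/value pairs (hypothetical helper)."""
    # shlex.split keeps quoted values such as description="fixedStep format" in one piece.
    fields = shlex.split(line)
    if not fields or fields[0] != "track":
        raise ValueError(f"not a track line: {line!r}")
    return dict(field.split("=", 1) for field in fields[1:])


infos = track_infos(
    'track type=wiggle_0 name="fixedStep" description="fixedStep format" '
    "visibility=full autoScale=off"
)
```

With this sketch, infos["description"] is the string "fixedStep format" with the quotes removed.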

wig API reference

class craw.wig.Chromosome(name, size=1000000)[source]

Handle chromosomes. A chromosome has a name and contains Chunk objects (forward and reverse).

__getitem__(pos)[source]
Parameters

pos – a position or a slice (0-based). If pos is a slice, the left index is excluded.

Returns

the coverage at this position or corresponding to this slice.

Return type

a list of 2 list of float [[float,..],[float, ..]]

Raises

IndexError – if pos is outside the coverage, or if one bound of the slice is outside the coverage

__init__(name, size=1000000)[source]
Parameters
  • name (str) –

  • size – the default size of the chromosome. Each time a value is set at a position beyond the end of the chromosome, the chromosome size is doubled. This protects the machine against memory swapping if the user provides a wig file with very big chromosomes.
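
The doubling strategy can be sketched as below. This is an illustration only: grown_size is a hypothetical helper, and doubling repeatedly until the position fits is an assumption, since the description above does not say how a position far beyond the current end is handled.

```python
def grown_size(current_size, pos):
    """Return the size needed to hold 0-based position pos, doubling as needed (hypothetical)."""
    if current_size <= 0:
        raise ValueError("current_size must be positive")
    size = current_size
    # Assumption: keep doubling until the position fits in the chromosome.
    while pos >= size:
        size *= 2
    return size


unchanged = grown_size(1_000_000, 999_999)    # position already fits
doubled_once = grown_size(1_000_000, 1_000_000)
doubled_more = grown_size(1_000_000, 4_500_000)
```

Growing geometrically keeps the number of reallocations small compared with extending by one position at a time.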

__len__()[source]
Returns

the actual length of the chromosome

Return type

int

__setitem__(pos, value)[source]
Parameters
  • pos (int or slice object) – the position (0-based) at which to set the value

  • value (float or iterable of float) – value to assign

Raises
  • ValueError – when pos is a slice and value does not have the same length as the slice

  • TypeError – when pos is a slice and value is not iterable

  • IndexError – if pos is outside the coverage, or if one bound of the slice is outside the coverage

_estimate_memory(col_nb, mem_per_col)[source]
Parameters
  • col_nb (int) – the number of columns of the new array or of the extension

  • mem_per_col (int) – the memory needed to create or extend an array with one column and 2 rows filled with 0.0

Returns

the estimated free memory available after creating or extending the chromosome

Return type

int

_extend(size=1000000, fill=0.0)[source]

Extend this chromosome by size and fill the new positions with fill.

Parameters
  • size (int) – the size (in bp) by which to extend the chromosome

  • fill (float or nan) – the default value used to fill the chromosome

Raises

MemoryError – if the chromosome extension could exceed the free memory.

class craw.wig.Chunk(**kwargs)[source]

Represents the data following a declaration line. A Chunk contains sparse coverage data for a region of one chromosome on both strands, plus the data contained in the declaration line.

__init__(**kwargs)[source]
Parameters

kwargs (dictionary) – the key/value pairs found on a declaration line

is_fixed_step()[source]

This is an abstract method; it must be implemented in the inheriting class.

Returns

True if it is a fixed chunk of data, False otherwise

Return type

boolean

parse_data_line(line, chrom, strand_type)[source]

Parse a line of data and append the results to the corresponding strand. This is an abstract method; it must be implemented in the inheriting class.

Parameters
  • line (string) – line of data to parse (trailing whitespace must be stripped)

  • chrom (Chromosome object.) – the chromosome to add coverage data to

  • strand_type (string '+' , '-', 'mixed') – the kind of wig file being parsed: forward, reverse, or mixed strand

class craw.wig.FixedChunk(**kwargs)[source]

The FixedChunk objects handle the data of a ‘fixedStep’ declaration line and its coverage data

__init__(**kwargs)[source]
Parameters

kwargs (dictionary) – the key/value pairs found on a declaration line

is_fixed_step()[source]
Returns

True

Return type

boolean

parse_data_line(line, chrom, strand_type)[source]

Parse a line of data following a fixedStep declaration. Add the result to the corresponding strand (forward if the coverage value is positive, reverse otherwise).

Parameters
  • line (string) – line of data to parse (trailing whitespace must be stripped)

  • chrom (Chromosome object.) – the chromosome to add coverage data to

  • strand_type (string '+' , '-', 'mixed') – the kind of wig file being parsed: forward, reverse, or mixed strand
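
How a fixedStep chunk expands into per-position coverage can be sketched as follows. This is not the FixedChunk code: expand_fixed_step is a hypothetical helper, positions are kept exactly as written in the wig file (no conversion to 0-based indices is attempted), and a score of 0 is arbitrarily assigned to the forward strand.

```python
def expand_fixed_step(start, step, span, values):
    """Expand fixedStep values into {position: coverage} dicts, one per strand (hypothetical)."""
    forward, reverse = {}, {}
    for i, value in enumerate(values):
        first = start + i * step
        # Mixed-strand convention: negative scores belong to the reverse strand.
        target = forward if value >= 0 else reverse
        for pos in range(first, first + span):
            target[pos] = abs(value)
    return forward, reverse


# Data lines following "fixedStep chrom=chrI start=1 step=10 span=5"
forward, reverse = expand_fixed_step(start=1, step=10, span=5, values=[11.0, -22.0, 33.0])
```

In this example the forward strand receives coverage at positions 1 to 5 and 21 to 25, while the reverse strand receives coverage at positions 11 to 15.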

class craw.wig.Genome[source]

A genome is made of chromosomes and some metadata, called infos

__delitem__(name)[source]

remove a chromosome from this genome

Parameters

name (string) – the name of the chromosome to remove

Returns

None

__getitem__(name)[source]
Parameters

name (string) – the name of the chromosome to retrieve

Returns

the chromosome corresponding to the name.

Return type

Chromosome object.

__init__()[source]

add(chrom)[source]

Add a chromosome to the genome. If a chromosome with the same name already exists, the previous one is silently replaced by this one.

Parameters

chrom (Chromosome object.) – a chromosome to add to this genome

Raises

TypeError – if chrom is not a Chromosome object.
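
The replace-on-add behaviour can be sketched with a dict keyed by chromosome name. ToyGenome and ToyChromosome are hypothetical stand-ins written for this illustration; the internal storage of craw.wig.Genome is not documented here and may differ.

```python
class ToyChromosome:
    """Minimal named chromosome (hypothetical stand-in)."""

    def __init__(self, name):
        self.name = name


class ToyGenome:
    """Dict-backed container mimicking the documented Genome behaviour (hypothetical)."""

    def __init__(self):
        self.infos = {}
        self._chromosomes = {}

    def add(self, chrom):
        if not isinstance(chrom, ToyChromosome):
            raise TypeError("chrom must be a chromosome object")
        # A chromosome with the same name is silently replaced.
        self._chromosomes[chrom.name] = chrom

    def __getitem__(self, name):
        return self._chromosomes[name]

    def __delitem__(self, name):
        del self._chromosomes[name]


genome = ToyGenome()
first = ToyChromosome("chrI")
second = ToyChromosome("chrI")
genome.add(first)
genome.add(second)  # replaces first without any warning
```

After the second call, genome["chrI"] refers to second, and passing anything other than a chromosome object raises TypeError.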

class craw.wig.VariableChunk(**kwargs)[source]

The VariableChunk objects handle the data of a ‘variableStep’ declaration line and its coverage data

If the data contains negative values, this indicates that the coverage is on the reverse strand. The chunk starts at the smallest position and ends at the highest position, whichever strand these positions are on. This means that when the chunk is converted into coverage, the missing positions are filled with 0.0.

for instance:

variableStep chrom=chr3 span=2
10 11
20 22
20 -30
25 -50

will give coverages starting at position 10 and ending at position 26 for both strands, with the following coverage values:

for = [11.0, 11.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 22.0, 22.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rev = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 30.0, 30.0, 0.0, 0.0, 0.0, 50.0, 50.0]

is_fixed_step()[source]
Returns

False

Return type

boolean

parse_data_line(line, chrom, strand_type)[source]

Parse a line of data following a variableStep declaration. Add the result to the corresponding strand (forward if the coverage value is positive, reverse otherwise).

Parameters
  • line (string) – line of data to parse (trailing whitespace must be stripped)

  • chrom (Chromosome object.) – the chromosome to add coverage data to

  • strand_type (string '+' , '-', 'mixed') – the kind of wig file being parsed: forward, reverse, or mixed strand

Raises

ValueError – if strand_type is not one of 'mixed', '-' or '+'
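
The variableStep example given in the class description (chrom=chr3, span=2) can be reproduced with the short sketch below. It is not the VariableChunk implementation: variable_step_coverage is a hypothetical helper that only handles the mixed-strand case and returns plain lists covering the chunk from its smallest to its highest position.

```python
def variable_step_coverage(data_lines, span):
    """Convert variableStep data lines into per-strand coverage lists (hypothetical)."""
    records = []
    for line in data_lines:
        pos_field, score_field = line.split()
        records.append((int(pos_field), float(score_field)))

    # The chunk spans from the smallest position to the highest one, on both strands.
    start = min(pos for pos, _ in records)
    end = max(pos for pos, _ in records) + span - 1
    forward = [0.0] * (end - start + 1)
    reverse = [0.0] * (end - start + 1)

    for pos, score in records:
        # Negative scores are stored, as absolute values, on the reverse strand.
        strand = forward if score >= 0 else reverse
        for offset in range(pos - start, pos - start + span):
            strand[offset] = abs(score)
    return start, end, forward, reverse


start, end, forward, reverse = variable_step_coverage(
    ["10 11", "20 22", "20 -30", "25 -50"], span=2
)
```

Position 20 appears twice in the data, once per strand, which is why both lists carry a value at the corresponding offsets.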

exception craw.wig.WigError[source]

Handles errors related to wig parsing

class craw.wig.WigParser(mixed_wig='', for_wig='', rev_wig='')[source]

Class to parse files in wig format. At the end of parsing, it returns a Genome object.

__init__(mixed_wig='', for_wig='', rev_wig='')[source]
Parameters
  • mixed_wig (string) –

    The path of the wig file to parse. The wig file encodes both strands:

    • Positive coverage values are for the forward strand

    • Negative coverage values are for the reverse strand

    This parameter is incompatible with the for_wig and rev_wig parameters.

  • for_wig (string) – The path of the wig file to parse. The wig file encodes the forward strand only. This parameter is incompatible with the mixed_wig parameter.

  • rev_wig (string) – The path of the wig file to parse. The wig file encodes the reverse strand only. This parameter is incompatible with the mixed_wig parameter.

static is_comment_line(line)[source]
Parameters

line (string) – line to parse.

Returns

True if line is a comment line. False otherwise.

Return type

boolean

is_data_line(line)[source]
Parameters

line – line to parse.

Returns

True if it’s a data line, False otherwise

is_declaration_line(line)[source]

A single line beginning with one of the identifiers variableStep or fixedStep, followed by attribute/value pairs, for instance:

fixedStep chrom=chrI start=1 step=10 span=5

Parameters

line (string) – line to parse.

Returns

True if line is a declaration line. False otherwise.

Return type

boolean

static is_track_line(line)[source]

A track line begins with the identifier track, followed by attribute/value pairs, for instance:

track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off

Parameters

line (string) – line to parse.

Returns

True if line is a track line. False otherwise.

Return type

boolean

parse()[source]

Open a wig file and parse it. Read the wig file line by line, check the type of each line and call the corresponding method according to that type:

  • comment

  • track

  • declaration

  • data

See https://wiki.nci.nih.gov/display/tcga/wiggle+format+specification and http://genome.ucsc.edu/goldenPath/help/wiggle.html for the wig specifications. This parser does not fully follow these specifications: when a score is negative, it means that the coverage is on the reverse strand. As a result, some positions can appear twice in one declaration block (called a chunk here).

Returns

a Genome coverage corresponding to the wig files (mixed strands in one wig file, or two separate wig files)

Return type

Genome object

parse_data_line(line, strand_type)[source]
Parameters

line (string) – line to parse. It must not be a comment line, a track line or a declaration line.

Raises

ValueError – if strand_type is not one of 'mixed', '-' or '+'

parse_declaration_line(line)[source]

Get the corresponding chromosome, creating one if necessary, and set the current_chunk and current_chromosome.

Parameters

line – line to parse. The method is_declaration_line() must return True with this line.

parse_track_line(line, strand_type='')[source]

Fill the genome infos with the information found on the track line.

Parameters

line – line to parse. The method is_track_line() must return True with this line.