This file describes the data provided and the format that a data file must follow.



1. The eight data files store the read counts for the top 100 most highly expressed genes, along with the surrounding sequence for each nucleotide.

   The files are named like An.csv, where:

   A--name of the dataset. w: Wold, b: Burge, g: Grimmond

   n--the number of the sub-dataset.
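
   As a minimal sketch of how this naming scheme can be decoded, the snippet below splits a file name into its dataset and sub-dataset number. The helper parse_data_name and the example name w1.csv are assumptions made for illustration; they are not part of the package.

```r
# Dataset letters as listed above.
dataset_names <- c(w = "Wold", b = "Burge", g = "Grimmond")

# parse_data_name is a hypothetical helper, not part of the package.
parse_data_name <- function(file) {
  base <- basename(file)
  # One dataset letter, then the sub-dataset number, then ".csv".
  m <- regmatches(base, regexec("^([wbg])([0-9]+)\\.csv$", base))[[1]]
  if (length(m) != 3) stop("unexpected file name: ", base)
  list(dataset = unname(dataset_names[m[2]]), sub = as.integer(m[3]))
}

# "w1.csv" is an assumed example name following the An.csv pattern.
parse_data_name("w1.csv")
```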



2. gene_names.txt stores the names of the top 100 genes in each sub-dataset.



3. Each data file should be read and passed to the package using:

   oriData <- read.csv(file)

   data <- expData(oriData, llen, rlen)



4. A data file must follow the required format exactly; the data files provided serve as examples.



5. The package itself does not verify that the format is correct. Please check this yourself before using it.



6. The details of the format are:

   

   The file should have four columns named index, tag, seq, and count.

   index identifies the gene from which this count comes. Typically it is a run of 1s, then a run of 2s, and so on.

   tag is an integer. 0 means the count is used; any other value means the count is discarded. In our files, -2 marks the UTR part and -1 marks the further 100 bp. You may use any integer other than 0 to mark discarded counts.

   seq is the nucleotide at this position. It must be a capital A, C, G, or T. No other characters, lowercase letters, or missing values are accepted. If only a few values are missing, you can substitute T (or A, C, G) for them; this should not change the result significantly.

   count is the number of reads starting at this position.
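
   Because the package does not check the format, a quick check of the four columns before calling expData may help. The following is only a sketch: check_format is a hypothetical helper, not part of the package, and the requirement that count be non-negative is an assumption based on it being a read count.

```r
# check_format is a hypothetical helper, not part of the package.
# It returns a character vector of problems (empty if none were found).
check_format <- function(oriData) {
  required <- c("index", "tag", "seq", "count")
  missing_cols <- setdiff(required, names(oriData))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }
  problems <- character(0)
  if (any(is.na(oriData[required]))) {
    problems <- c(problems, "missing values are not accepted")
  }
  tag <- oriData$tag
  if (!is.numeric(tag) || any(tag != round(tag), na.rm = TRUE)) {
    problems <- c(problems, "tag must be an integer")
  }
  # Only capital A, C, G, T are accepted.
  if (!all(as.character(oriData$seq) %in% c("A", "C", "G", "T"))) {
    problems <- c(problems, "seq must be a capital A, C, G, or T")
  }
  count <- oriData$count
  # Assumption: a read count cannot be negative.
  if (!is.numeric(count) || any(count < 0, na.rm = TRUE)) {
    problems <- c(problems, "count must be a non-negative number")
  }
  problems
}

# Toy data with one deliberate error: a lowercase nucleotide.
toy <- data.frame(index = c(1, 1, 1), tag = c(-1, 0, -1),
                  seq = c("A", "c", "G"), count = c(0, 5, 2))
check_format(toy)
```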

   

   For each gene (or each group of positions that have the same level of expression, such as an exon or isoform), a distinct index should be used. Each gene (or group) may include positions on both strands (as in data generated by Illumina) or on a single strand (as in data generated by ABI).

   Within each gene (or group), the positions should be in 5 prime to 3 prime order for each strand. There should be no gaps or missing values.

   For each gene in Illumina output, the data therefore consist of two halves. The first half holds the data from the forward strand, and the second half holds the data from the reverse strand.

   For each gene in ABI output, there are no such two halves.

   For each gene or each half, the nucleotides retained for analysis should be flanked by enough nonretained nucleotides.

   For example, if you want to use the 40 bp to the left and the 40 bp to the right as surrounding sequence, then there should be at least 40 bp on both sides of the retained nucleotides.
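
   The flanking requirement can be checked roughly before calling expData. The sketch below assumes that llen and rlen are the left and right surrounding lengths in bp; check_flanks is a hypothetical helper, not part of the package. It does not know where the two halves of an Illumina gene meet, so passing it is necessary but not sufficient.

```r
# check_flanks is a hypothetical helper, not part of the package.
# For every index, each run of retained positions (tag == 0) must be
# preceded by at least llen and followed by at least rlen discarded ones.
check_flanks <- function(oriData, llen, rlen) {
  ok_one <- function(tag) {
    runs <- rle(tag == 0)
    n <- length(runs$lengths)
    # Length of the neighbouring run on each side (0 at the ends).
    before <- c(0, runs$lengths[-n])
    after <- c(runs$lengths[-1], 0)
    kept <- runs$values
    all(before[kept] >= llen & after[kept] >= rlen)
  }
  vapply(split(oriData$tag, oriData$index), ok_one, logical(1))
}

# Toy data: gene 1 has 2 bp on each side, gene 2 only 1 bp on the left.
toy <- data.frame(
  index = rep(1:2, each = 7),
  tag   = c(-1, -1, 0, 0, 0, -1, -1,
            -1, 0, 0, 0, 0, -1, -1),
  seq   = "A",
  count = 0
)
check_flanks(toy, llen = 2, rlen = 2)
```

   The result is a logical vector named by index; a FALSE entry points to a gene whose retained positions are too close to the edge.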



