1 Introduction
Genetically (or biologically) informative sequences can be defined as those which are either close to a known genetically important sequence or far from sequences known to be noninformative. The first criterion seems more practical; however, it is limited since it tries to reproduce what is already known. The second principle is more fundamental and more convenient for mathematical formalization and statistical inference. When employing this principle, the problem is how to define the noninformative genetic sequence (we call it the genetic noise), i.e. the sequence which carries no genetically or biologically important information.
A model of the genetic noise is also crucial for statistical hypothesis testing, phylogenetic tree reconstruction, simulation of (neutral) evolution, and assessment of variability and uncertainty.
Genome regions whose evolution is not subject to natural selection pressure, and which hence evolve at a neutral mutation rate, can be viewed as the genetic noise. Such regions could be parts of non-coding regions of the genomes of primitive species.
A generic formulation of empirical findings is sometimes called a stylized fact. The definition of the genetic noise should be consistent with the stylized facts about non-coding DNA sequences as well as with a probabilistic model of their evolution. Thus, the general aim of our investigation is to specify and to test statistically the basic properties of non-coding DNA sequences implied by a model of DNA evolution (Markov property, homogeneity, long-range dependence, reverse-complement symmetry, CpG content, etc.). In this work we focus on symmetry/asymmetry properties of two complementary DNA strands.
Chargaff’s second parity rule. The simplest hypothesis of DNA strand symmetry (sometimes referred to as Chargaff’s second parity rule) states that proportions of nucleotides of the same base pair are approximately equal within single DNA strands (Rudner et al., 1968), i.e. %A ≈ %T and %C ≈ %G. Since the lagging strand is read in the reverse order, an extension of this first-order symmetry to higher orders is called reverse-complement symmetry, or intra-strand parity (ISP) (Powdel et al., 2009), or simply strand symmetry (Baisnée et al., 2002; Zhang and Huang, 2008). Although rather natural, this universal phenomenon of strand symmetry in the chromosomes needs explicit description and explanation. Actually, it may be the effect of a wide range of mechanisms operating at multiple orders and length scales (Baisnée et al., 2002).
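To make the first-order rule and its higher-order extension concrete, the following minimal Python sketch pairs each k-mer of a single strand with its reverse complement and lists both counts. It is only an illustration: the function names (`reverse_complement`, `kmer_counts`, `parity_table`) and the toy sequence are assumptions introduced here, not part of any cited method.

```python
from collections import Counter

# Complement map for the four canonical bases (other symbols are left unchanged).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count overlapping k-mers read along one strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def parity_table(seq, k):
    """Map each k-mer (lexicographically smaller of the pair) to
    (count of the k-mer, count of its reverse complement)."""
    counts = kmer_counts(seq, k)
    rows = {}
    for word in counts:
        key = min(word, reverse_complement(word))
        rows[key] = (counts[key], counts[reverse_complement(key)])
    return rows


# Hypothetical toy sequence, far too short for any biological conclusion.
toy = "ATGCGCATTAGCGCTAAT"
for word, (n_word, n_partner) in sorted(parity_table(toy, 1).items()):
    print(word, reverse_complement(word), n_word, n_partner)
```

With k = 1 the table compares the counts of A with T and of C with G, which is the second parity rule; with k > 1 it compares each oligonucleotide with its reverse complement, which is the higher-order symmetry described above.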
Thus far, the question of strand symmetry, its origins and biological significance remains controversial. On the one hand, results of empirical studies using various asymmetry measures and visualization tools show that for long DNA sequences (approximate) strand symmetry generally holds with rather rare exceptions. The fact that strand symmetry should hold at the equilibrium state has also been derived theoretically (Sueoka, 1995; Lobry, 1995). Baisnée et al. (2002) defined strand symmetry indices through the relative ${L_{1}}$ distance between the observed frequencies of respective reverse-complementary oligonucleotides and compared them with critical values calculated for completely random sequences. In Kong et al. (2009), various symmetry indices (reverse, complement and inverse symmetry indices, global as well as segmental) based on the ${L_{2}}$ distance have been calculated for 786 complete chromosomes. The authors have found that reverse-complement symmetry (inverse-complement plus reverse-symmetry in the authors’ terms) is prevalent in complex patterns in most chromosomes. Rosandić et al. (2016) considered 20 symbolic quadruplets of trinucleotides obtained via interstrand mirror symmetry mappings (direct, reverse complement, complement, and reverse) and demonstrated quadruplet symmetries in chromosomes of a wide range of organisms, from Escherichia coli to human genomes. Powdel et al. (2009) have noticed another manifestation of strand symmetry, intra-strand frequency distribution parity (ISFDP), which represents closeness of frequency distributions between the complementary mono/oligonucleotides. This general feature (with rare exceptions) was observed in chromosomes of bacteria, archaea and eukaryotes. It has also been noticed that the frequency of a genomic word is more similar to the frequency of its reversed complement than to the frequencies of other words of equivalent composition. This phenomenon is called exceptional symmetry. Afreixo et al. (2017) proposed a new measure to evaluate the exceptional symmetry effect based on the discrepancy between the frequency of a symmetric word pair and the frequencies of word pairs of equivalent composition. They identified words that show a high symmetry effect across the 31 species, and across the 9 animal species studied. Fractal-like symmetry structures are considered in Petoukhov et al. (2018). Sobottka and Hart (2011) proposed a model based on a hidden Markov process for approximating the distributions of primitive DNA sequences. The model provides an alternative interpretation of strand symmetry and describes new symmetries in bacterial genomes. Cristadoro et al. (2018) introduced flexible statistical measures of symmetry and used them to define an extended Chargaff symmetry. The definition actually coincides with the global strand symmetry of genomes defined and studied in Simons et al. (2005). Domain models introduced in Cristadoro et al. (2018) make it possible to explain simultaneously symmetries as well as non-random structures in genetic sequences and to unravel previously unknown symmetries, which are organized hierarchically through different scales.
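As an illustration of a distance-based symmetry index of the kind mentioned above, the sketch below computes one minus a relative $L_{1}$ distance between the counts of each k-mer and of its reverse complement. The normalization used here (the sum of both counts in the denominator) is an assumption made for this example; the exact definitions in the cited papers are not reproduced in this section, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Complement map for the four canonical bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rc(word):
    """Reverse complement of a DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def l1_symmetry_index(seq, k):
    """Return 1 - sum|n(w) - n(rc(w))| / sum(n(w) + n(rc(w))) over all k-mers w.

    The value is 1 when every k-mer is exactly as frequent as its reverse
    complement and 0 when no k-mer shares any count with its partner.
    The normalization is an assumption of this sketch.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    num = sum(abs(counts[w] - counts[rc(w)]) for w in words)
    den = sum(counts[w] + counts[rc(w)] for w in words)
    return 1.0 - num / den if den else float("nan")
```

An observed index only becomes informative when compared with a reference distribution, for example values obtained from suitably randomized sequences, which is the role played by the critical values in the studies cited above.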
On the other hand, statistical analyses of genomic sequences (Shporer et al., 2016; Tavares et al., 2018), especially those based on Markov-type models (Hart and Martínez, 2011; Hart et al., 2012), have demonstrated significant deviations from Chargaff’s second parity rule and its extensions. The statistical IS-Poisson model introduced in Shporer et al. (2016) assumes that frequencies of oligonucleotides (DNA k-mers) follow the Poisson distribution. The model leads to the conclusion that for k-mers with low k (even for nucleotides, $k=1$) violations of symmetry, although extremely small, are significant. In Tavares et al. (2018), both the distance distributions and the frequencies of symmetric words in human DNA have been compared. The results obtained suggest that some asymmetries in the human genome go far beyond Chargaff’s rules.
One of the explanations of strand asymmetry (skew), i.e. violation of symmetry, is mutation bias. When investigating asymmetries in mutation patterns, phylogenetic estimation based on maximum likelihood can be applied. Usually independent evolution models completely determined by nucleotide substitution rates are employed, see, e.g. Faith and Pollock (2003), Marin and Xia (2008). Note that mathematical models for evolutionary inference considered in Parks (2015) also assume independent evolution. However, Siepel and Haussler (2004) presented extensions of standard phylogenetic models with context-dependent substitution and showed that the new models improve goodness of fit substantially for both coding and non-coding data. Moreover, considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. We refer to Bérard and Guéguen (2012) for a more recent application of context-dependent substitution models in a phylogenetic context.
In this paper, DNA strand local symmetry introduced in Židanavičiūtė (2010) is tested for the longest non-coding (in both the leading and lagging strands) sequences of bacterial genomes taken from GenBank (https://www.ncbi.nlm.nih.gov/genbank/). A special regression-type probabilistic structure of the data is assumed to be valid. This structure is compatible with the probability distribution of random nucleotide sequences at a steady state of a context-dependent reversible Markov evolutionary process (Jensen, 2005), see also Arndt et al. (2003), Lunter and Hein (2004). The null hypothesis of strand local symmetry is rejected in the majority of bacterial genomes, suggesting that even neutral mutations are skewed with respect to leading and lagging strands.
The rest of the paper is organized as follows. In the next section, the definition of strand local symmetry is presented and a characterization of this property in terms of generalized logits is given. Results of statistical analysis are discussed in Section 3. We end with some concluding remarks.