Informatica

Local Symmetry of Non-Coding Genetic Sequences
Volume 30, Issue 3 (2019), pp. 553–571
Marijus Radavičius   Tomas Rekašius   Jurgita Židanavičiūtė  

https://doi.org/10.15388/Informatica.2019.218
Pub. online: 1 January 2019      Type: Research Article      Open Access

Received
1 November 2018
Accepted
1 May 2019
Published
1 January 2019

Abstract

The simplest hypothesis of DNA strand symmetry states that proportions of nucleotides of the same base pair are approximately equal within single DNA strands. Results of extensive empirical studies using asymmetry measures and various visualization tools show that for long DNA sequences (approximate) strand symmetry generally holds with rather rare exceptions. In the paper, a formal definition of DNA strand local symmetry is presented, characterized in terms of generalized logits and tested for the longest non-coding sequences of bacterial genomes. Validity of a special regression-type probabilistic structure of the data is supposed. This structure is compatible with the probability distribution of random nucleotide sequences at a steady state of a context-dependent reversible Markov evolutionary process. The null hypothesis of strand local symmetry is rejected in the majority of bacterial genomes, suggesting that even neutral mutations are skewed with respect to leading and lagging strands.
Due to symmetry, the nature is perfect.
Spices of asymmetry make it beautiful.

1 Introduction

Genetically (or biologically) informative sequences can be defined as those which are either close to a known genetically important sequence or are far from sequences known to be noninformative. The first criterion seems to be more practical, however it is limited since it tries to reproduce what is already known. The second principle is more fundamental and more convenient for mathematical formalization and statistical inference. When employing this principle, the problem is how to define the noninformative genetic sequence (we call it the genetic noise), i.e. the sequence which has no genetically or biologically important information.
A model of the genetic noise is also crucial for statistical hypotheses testing, the phylogenetic tree reconstruction, simulations of the (neutral) evolutions, and in assessing the variability and uncertainty.
Genome regions whose evolution is not subjected to natural selection pressure and which hence evolve with a neutral mutation rate can be viewed as the genetic noise. Those regions could be parts of non-coding regions of genomes of primitive species.
A generic formulation of empirical findings is sometimes called a stylized fact. The definition of the genetic noise should be consistent with the stylized facts about non-coding DNA sequences as well as with a probabilistic model of their evolution. Thus, the general aim of our investigation is to specify and to test statistically the basic properties of non-coding DNA sequences implied by a model of DNA evolution (Markov property, homogeneity, long-range dependence, reverse-complement symmetry, CpG content, etc.). In this work we focus on symmetry/asymmetry properties of two complementary DNA strands.
Chargaff’s second parity rule. The simplest hypothesis of DNA strand symmetry (sometimes referred to as Chargaff’s second parity rule) states that proportions of nucleotides of the same base pair are approximately equal within single DNA strands (Rudner et al., 1968), i.e. %A ≈ %T and %C ≈ %G. Since the lagging strand is read in the reverse order, an extension of this first-order symmetry to higher-orders is called reverse-complement symmetry, or intra-strand parity (ISP) (Powdel et al., 2009), or simply strand symmetry (Baisnée et al., 2002; Zhang and Huang, 2008). Although rather natural, this universal phenomenon of strand symmetry in the chromosomes needs explicit description and explanation. Actually, it may be the effect of a wide range of mechanisms operating at multiple orders and length scales (Baisnée et al., 2002).
Thus far the issue of strand symmetry, its origins and biological significance is controversial. On the one hand, results of empirical studies using various asymmetry measures and visualization tools show that for long DNA sequences (approximate) strand symmetry generally holds with rather rare exceptions. The fact that the strand symmetry should hold at the equilibrium state is also derived theoretically (Sueoka, 1995; Lobry, 1995). Baisnée et al. (2002) defined strand symmetry indices through the relative ${L_{1}}$ distance between the observed frequencies of respective reverse-complementary oligonucleotides and compared them with critical values calculated for completely random sequences. In Kong et al. (2009), various symmetry indices (reverse, complement and inverse symmetry indices, global as well as segmental) based on ${L_{2}}$ distance have been calculated for 786 complete chromosomes. The authors have found that reverse-complement symmetry (inverse-complement plus reverse-symmetry in the authors' terms) is prevalent in complex patterns in most chromosomes. Rosandić et al. (2016) considered 20 symbolic quadruplets of trinucleotides obtained via interstrand mirror symmetry mappings (direct, reverse complement, complement, and reverse) and demonstrated quadruplet symmetries in chromosomes of a wide range of organisms, from Escherichia coli to human genomes. Powdel et al. (2009) have noticed another strand symmetry manifestation, intra-strand frequency distribution parity (ISFDP), which represents closeness of frequency distributions between the complementary mono/oligonucleotides. This general feature (with rare exceptions) was observed in chromosomes of bacteria, archaea and eukaryotes. It has also been noticed that the frequency of a genomic word is more similar to the frequency of its reversed complement than to the frequencies of other words of equivalent composition. This phenomenon is called exceptional symmetry. Afreixo et al. (2017) proposed a new measure to evaluate the exceptional symmetry effect based on the discrepancy between the frequency of a symmetric word pair and the frequencies of word pairs of equivalent composition. They identified words that show a high symmetry effect across the 31 species, and across the 9 animal species studied. Fractal-like symmetry structures are considered in Petoukhov et al. (2018). Sobottka and Hart (2011) proposed a model based on a hidden Markov process for approximating the distributions of primitive DNA sequences. The model provides an alternative interpretation of strand symmetry and describes new symmetries in bacterial genomes. Cristadoro et al. (2018) introduced flexible statistical measures of symmetry and used them to define an extended Chargaff symmetry. The definition actually coincides with global strand symmetry of genomes defined and studied in Simons et al. (2005). Domain models introduced in Cristadoro et al. (2018) allow one to explain simultaneously symmetries as well as non-random structures in genetic sequences and to unravel previously unknown symmetries, which are organized hierarchically through different scales.
On the other hand, statistical analyses of the genomic sequences (Shporer et al., 2016; Tavares et al., 2018), especially those based on Markov-type models (Hart and Martínez, 2011; Hart et al., 2012), have demonstrated significant deviations from Chargaff’s second parity rule and its extensions. A statistical IS-Poisson model introduced in Shporer et al. (2016) assumes that frequencies of oligonucleotides (DNA k-mers) follow the Poisson distribution. The model allows one to conclude that for k-mers with low k (even for nucleotides, $k=1$) violations of symmetry, although extremely small, are significant. In Tavares et al. (2018), both the distance distributions and the frequencies of symmetric words in the human DNA have been compared. The results obtained suggest that some asymmetries in the human genome go far beyond Chargaff’s rules.
One of the explanations of strand asymmetry (skew), i.e. violation of symmetry, is mutation bias. When investigating asymmetries in mutation patterns, phylogenetic estimation based on maximum likelihood can be applied. Usually independent evolution models completely determined by nucleotide substitution rates are employed, see, e.g. Faith and Pollock (2003), Marin and Xia (2008). Note that mathematical models for evolutionary inference considered in Parks (2015) also assume independent evolution. However, Siepel and Haussler (2004) presented extensions of standard phylogenetic models with context-dependent substitution and showed that the new models improve goodness of fit substantially for both coding and non-coding data. Moreover, considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. We refer to Bérard and Guéguen (2012) for a more recent application of context-dependent substitution models in a phylogenetic context.
In this paper, DNA strand local symmetry introduced in Židanavičiūtė (2010) is tested for the longest non-coding (in both the leading and lagging strands) sequences of bacterial genomes taken from GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Validity of a special regression-type probabilistic structure of the data is supposed. This structure is compatible with the probability distribution of random nucleotide sequences at a steady state of a context-dependent reversible Markov evolutionary process (Jensen, 2005), see also Arndt et al. (2003), Lunter and Hein (2004). The null hypothesis of strand local symmetry is rejected in the majority of bacterial genomes, suggesting that even neutral mutations are skewed with respect to leading and lagging strands.
The rest of the paper is organized as follows. In the next section the definition of strand local symmetry is presented and characterization of this property in terms of generalized logits is given. Results of statistical analysis are discussed in Section 3. We end with some concluding remarks.

2 Local Symmetry

In this section we present the formal definition of local symmetry (Židanavičiūtė, 2010) and recall necessary notions and facts about discrete Markov random fields and loglinear modelling.

2.1 Complementary Transformation

Nucleotide sequences $x={x_{[n]}}$ are sequences of elements $({x_{i}},i\in [n])$ with values from the alphabet $\mathcal{A}:=\{\texttt{A},\texttt{C},\texttt{G},\texttt{T}\}$. Here $[n]:=[1,n]=\{1,\dots ,n\}$ is an interval of (positive) integers.
If $x=({x_{1}},\dots ,{x_{n}})$ is the leading strand of a DNA sequence, then the complementary one (the lagging strand read in the opposite direction) is denoted by ${x^{\ast }}=({x_{1}^{\ast }},\dots ,{x_{n}^{\ast }})$, where ${x_{i}^{\ast }}$ is the complementary nucleotide to ${x_{i}}$ in the ith base pair, and ${x_{\ast }}={x_{\ast [n]}}:=({x_{\ast 1}},\dots ,{x_{\ast n}})=({x_{n}^{\ast }},\dots ,{x_{1}^{\ast }})$. This determines the complementary transformation. For instance,
(1)
\[\begin{aligned}{}x& =({x_{1}},\dots ,{x_{n}}):\hspace{1em}\overrightarrow{\dots \texttt{CGGATTTAGCTA}\dots }\hspace{0.1667em},\end{aligned}\]
(2)
\[\begin{aligned}{}{x^{\ast }}& =({x_{1}^{\ast }},\dots ,{x_{n}^{\ast }}):\hspace{1em}\stackrel{\gets }{\dots \texttt{GCCTAAATCGAT}\dots }\hspace{0.1667em},\end{aligned}\]
(3)
\[\begin{aligned}{}{x_{\ast }}& =({x_{n}^{\ast }},\dots ,{x_{1}^{\ast }}):\hspace{1em}\overrightarrow{\dots \texttt{TAGCTAAATCCG}\dots }\hspace{0.1667em}.\end{aligned}\]
Chargaff and his colleagues (Rudner et al., 1968) have noticed that
\[\begin{aligned}{}& \big|\big\{t\in [n]:{x_{t}}=\texttt{A}\big\}\big|\approx \big|\big\{t\in [n]:{x_{t}}=\texttt{T}\big\}\big|,\\ {} & \big|\big\{t\in [n]:{x_{t}}=\texttt{C}\big\}\big|\approx \big|\big\{t\in [n]:{x_{t}}=\texttt{G}\big\}\big|,\end{aligned}\]
($|A|$ is the number of elements in a set A) which actually means that
\[ \big|\big\{t\in [n]:{x_{t}}=\nu \big\}\big|\approx \big|\big\{t\in [n]:{x_{t}^{\ast }}=\nu \big\}\big|,\hspace{1em}\forall \nu \in \mathcal{A}.\]
Thus, if x is treated as a random sequence, the last expression can be interpreted and generalized as follows: a probabilistic law generating x is invariant with respect to the complementary transformation $x\to {x_{\ast }}$.
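The transformations (1)–(3) and the first-order parity check can be illustrated with a minimal Python sketch. The function names `complement`, `reverse_complement` and `base_proportions` are hypothetical helpers introduced here for illustration only; they are not part of the paper.

```python
from collections import Counter

# Watson-Crick pairing: A<->T, C<->G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(x: str) -> str:
    """x^*: position-wise complement, kept in the original index order."""
    return "".join(COMPLEMENT[c] for c in x)


def reverse_complement(x: str) -> str:
    """x_*: the complement read in the opposite direction."""
    return complement(x)[::-1]


def base_proportions(x: str) -> dict:
    """Relative frequency of each nucleotide in x."""
    counts = Counter(x)
    return {b: counts.get(b, 0) / len(x) for b in "ACGT"}


x = "CGGATTTAGCTA"                 # the fragment used in (1)
print(complement(x))               # compare with (2)
print(reverse_complement(x))       # compare with (3)
print(base_proportions(x))
print(base_proportions(reverse_complement(x)))
```

By construction the proportion of a nucleotide in $x_{\ast}$ equals the proportion of its complement in x, so Chargaff's second parity rule amounts to x and $x_{\ast}$ having approximately the same nucleotide proportions.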

2.2 Basics of Markov Random Fields

Let us start with basic notation and notions. Set $\mathcal{N}=[n]$, fix some positive integer $m<n/2$ and define the m-interior ${\mathcal{N}_{m}^{\circ }}$, the m-boundary $\partial {\mathcal{N}_{m}}$ of $\mathcal{N}$, and a collection of neighbourhoods:
(4)
\[\begin{aligned}{}& {\mathcal{N}^{\circ }}={\mathcal{N}_{m}^{\circ }}:=[m+1,\dots ,n-m],\hspace{2em}\partial \mathcal{N}=\partial {\mathcal{N}_{m}}:=\mathcal{N}\setminus {\mathcal{N}_{m}^{\circ }},\\ {} & \mathcal{N}(\ell )={\mathcal{N}_{m}}(\ell ):=[\ell -m,\ell +m]\setminus \{\ell \},\hspace{1em}\ell \in {\mathcal{N}^{\circ }}.\end{aligned}\]
Given $x\in {\mathcal{A}^{n}}$ and a set of indices $J\subset \mathcal{N}$, let ${x_{J}}:=({x_{i}},i\in J)$ denote the corresponding subsequence of x treated as an element of ${\mathcal{A}^{|J|}}$.
Definition 1.
A random sequence $x\in {\mathcal{A}^{n}}$ is called an m-order Markov random field (MRF) with the state space $\mathcal{A}$ and the collection of neighbourhoods (4) iff $\forall a\in {\mathcal{A}^{n}}$ and $\forall \ell \in {\mathcal{N}_{m}^{\circ }}$
(5)
\[ \mathbf{P}\big\{{x_{\ell }}={a_{\ell }}\mid {x_{i}}={a_{i}},\hspace{0.1667em}i\in [n],\hspace{0.1667em}i\ne \ell \big\}=\mathbf{P}\big\{{x_{\ell }}={a_{\ell }}\mid {x_{{\mathcal{N}_{m}}(\ell )}}={a_{{\mathcal{N}_{m}}(\ell )}}\big\}.\]
A MRF x is called an m-order homogeneous MRF (m-MRF) if its m-order marginal conditional probabilities given in the right-hand side of (5) are independent of the site $\ell \in {\mathcal{N}^{\circ }}$.
Definition 2.
For a fixed reference value $r\in \mathcal{A}$ and given m-order marginal conditional probabilities
(6)
\[ p(v|u):=\mathbf{P}\{{x_{m+1}}=v\mid {x_{{\mathcal{N}_{m}}(m+1)}}=u\},\]
the respective generalized logit ${\Lambda _{v}}(u)={\Lambda _{v|r}}(u)$ of a state $v\in \mathcal{A}$ versus r, given the neighbouring values $u\in {\mathcal{A}^{2m}}$, is defined as
(7)
\[ {\Lambda _{v|r}}(u):=\log \bigg(\frac{p(v|u)}{p(r|u)}\bigg),\]
where we set $\log (0/0)=0$ and $\log (p/0)=\infty $ for $p>0$.
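As an illustration of Definition 2, the following Python sketch computes plug-in estimates of the generalized logits (7) from neighbourhood counts. It is only a sketch under stated assumptions: it pools all interior sites (i.e. it presumes homogeneity), the helper names `conditional_counts` and `generalized_logit` are hypothetical, and the value $-\infty$ returned for $\log (0/p)$, $p>0$, is a natural extension that the definition above does not state explicitly.

```python
import math
from collections import Counter, defaultdict


def conditional_counts(x: str, m: int):
    """For every neighbourhood u (m sites on each side), count the centre nucleotide v."""
    counts = defaultdict(Counter)
    for l in range(m, len(x) - m):               # 0-based interior sites
        u = x[l - m:l] + x[l + 1:l + m + 1]      # neighbourhood without the centre
        counts[u][x[l]] += 1
    return counts


def generalized_logit(counts, u: str, v: str, r: str = "A") -> float:
    """Plug-in estimate of Lambda_{v|r}(u) = log(p(v|u) / p(r|u))."""
    nv, nr = counts[u][v], counts[u][r]
    if nv == 0 and nr == 0:
        return 0.0            # convention log(0/0) = 0
    if nr == 0:
        return math.inf       # convention log(p/0) = +infinity for p > 0
    if nv == 0:
        return -math.inf      # assumed extension: log(0/p) = -infinity
    return math.log(nv / nr)  # count ratio equals the ratio of relative frequencies


counts = conditional_counts("CGGATTTAGCTA", m=1)
print(generalized_logit(counts, "GT", "C"))
```

Because $p(v|u)$ and $p(r|u)$ share the same denominator, the ratio of counts suffices; no normalisation is needed.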
Suppose that values of m-MRF x are fixed on the boundary $\partial \mathcal{N}$: ${x_{\partial \mathcal{N}}}=b$ a.s. for some $b\in {\mathcal{A}^{2m}}$. Denote
\[ {\mathcal{X}_{b}}:=\big\{w\in {\mathcal{A}^{n}}:\hspace{2.5pt}{w_{\partial \mathcal{N}}}=b\big\}.\]
From Hammersley–Clifford theorem (Besag, 1974), we obtain the following statement.
Proposition 1.
Suppose the distribution of m-MRF x is positive on ${\mathcal{X}_{b}}$, i.e. $\mathbf{P}\{x=w\}>0$ for all $w\in {\mathcal{X}_{b}}$. Then the distribution of x is uniquely determined by the family of generalized logits ${\Lambda _{v|r}}(u)$, $r,v\in \mathcal{A}$, $u\in {\mathcal{A}^{2m}}$, which for $w\in {\mathcal{A}^{2m+1}}$, take the following form
(8)
\[ {\Lambda _{{w_{m+1}}|r}}({w_{{\mathcal{N}_{m}}(m+1)}})={\sum \limits_{j=1}^{m+1}}\big[{\lambda _{m}}({w_{[j,m+j]}})-{\lambda _{m}}\big({w_{[j,m+j]}^{(r)}}\big)\big],\]
and in general depend on $M=(|\mathcal{A}|-1)|\mathcal{A}{|^{m}}$ free scalar parameters. Here ${\lambda _{m}}$: ${\mathcal{A}^{m+1}}\to \mathbf{R}$ is an arbitrary function and ${w^{(r)}}$ is obtained from w by substituting r for ${w_{m+1}}$.
The statement is well known; it is merely rewritten in the notation introduced above.

2.3 Local Symmetry: Definition and Characterization

Let us recall that DNA strand symmetry means that the probability distributions of oligonucleotides (sequences of adjacent nucleotides) of both complementary strands of DNA, read in the respective directions, are similar in some sense. With the definition of m-MRF in mind, the following formal definition of DNA strand symmetry can be given in terms of the complementary transformation $w\to {w_{\ast }},w\in {\mathcal{A}^{2m+1}},$ defined in Section 2.1.
Definition 3 (See Židanavičiūtė, 2010).
A random sequence ${x_{[n]}}$ is m-order locally symmetric ($m<n/2$) iff
(9)
\[ \mathbf{P}\{{x_{\ell }}=v\mid {x_{{\mathcal{N}_{m}}(\ell )}}=u\}=\mathbf{P}\big\{{x_{\ell }}={v^{\ast }}\hspace{0.1667em}\big|\hspace{0.1667em}{x_{{\mathcal{N}_{m}}(\ell )}}={u^{\ast }}\big\}\]
for all $\ell \in {\mathcal{N}^{\circ }},\hspace{2.5pt}v\in \mathcal{A},u\in {\mathcal{A}^{2m}}$.
Thus, for a locally symmetric sequence, the marginal conditional distributions given the m nearest neighbours (on each side) are invariant under the complementary transformation. Under the assumption that the DNA sequence x is m-MRF, the local strand symmetry can be expressed in terms of the conditional distributions $p(v|u)$ and/or the generalized logits ${\Lambda _{v}}(u)$.
Table 1
Nucleotide recoding rule.
|                  | Purine | (Bonds)    | Pyrimidine | s      |
|------------------|--------|------------|------------|--------|
| Weak (2 bonds)   | A      | $(=)$      | T          | $s=-1$ |
| Strong (3 bonds) | G      | $(\equiv )$ | C          | $s=+1$ |
| y                | $y=-1$ |            | $y=+1$     |        |
For characterization of local symmetry in terms of the generalized logits, it is convenient to change the initial alphabet $\mathcal{A}=\{\texttt{A},\texttt{C},\texttt{G},\texttt{T}\}$ of nucleotides v to ${\mathcal{A}_{1}^{2}}={\mathcal{A}_{1}}\times {\mathcal{A}_{1}}$, ${\mathcal{A}_{1}}:=\{-1,+1\}$, via the mapping $v\to z:=(s,y)$ by the rule indicated in Table 1. The components $s=s(v)\in {\mathcal{A}_{1}}$ and $y=y(v)\in {\mathcal{A}_{1}}$ of a nucleotide $v\in \mathcal{A}$ represent its bonding property (strong versus weak) and its hydrophobic property (pyrimidine, the smaller and less hydrophobic molecule, versus purine, the larger and more hydrophobic molecule), respectively.
Now, let $x=({x_{1}},\dots ,{x_{k}})\in {\mathcal{A}^{k}}$ be a nucleotide sequence in the leading strand of DNA and let ${x^{\ast }}$ be its complement read from the left to the right but taken in the common direction. Set
(10)
\[\begin{aligned}{}& z=z(x):=(\overrightarrow{s},\overrightarrow{y})\in {\mathcal{A}_{1}^{2k}},\hspace{1em}\hspace{2.5pt}{(\overrightarrow{s})_{i}}:=s({x_{i}}),\hspace{1em}\hspace{2.5pt}\hspace{2.5pt}{(\overrightarrow{y})_{i}}:=y({x_{i}}),\hspace{1em}i=1,\dots ,k,\end{aligned}\]
(11)
\[\begin{aligned}{}& {z_{\ast }}={z_{\ast }}(x):=({\overrightarrow{s}_{\ast }},{\overrightarrow{y}_{\ast }})=(\stackrel{\gets }{s},-\stackrel{\gets }{y}),\end{aligned}\]
(12)
\[\begin{aligned}{}& {(\stackrel{\gets }{s})_{i}}={(\overrightarrow{s})_{k-i+1}},\hspace{2.5pt}{(\stackrel{\gets }{y})_{i}}={(\overrightarrow{y})_{k-i+1}},\hspace{1em}i=1,\dots ,k.\end{aligned}\]
Then $z({x_{\ast }})={z_{\ast }}(x)$. To illustrate the notation, we apply it to the nucleotide sequence from (1) (to save space, here and below we omit the numeral 1 and write only the signs):
\[\begin{aligned}{}& x=({x_{1}},\dots ,{x_{n}}):\hspace{1em}\dots \texttt{CGGATTTAGCTA}\dots \hspace{0.1667em},\\ {} & s=({s_{1}},\dots ,{s_{n}}):\hspace{1em}\dots \texttt{+++-----++--}\dots \hspace{0.1667em},\\ {} & y=({y_{1}},\dots ,{y_{n}}):\hspace{1em}\dots \texttt{+---+++--++-}\dots \hspace{0.1667em},\\ {} & {x_{\ast }}=({x_{n}^{\ast }},\dots ,{x_{1}^{\ast }}):\hspace{1em}\dots \texttt{TAGCTAAATCCG}\dots \hspace{0.1667em},\\ {} & {\overrightarrow{s}_{\ast }}=({s_{n}},\dots ,{s_{1}}):\hspace{1em}\dots \texttt{--++-----+++}\dots \hspace{0.1667em},\\ {} & {\overrightarrow{y}_{\ast }}=-({y_{n}},\dots ,{y_{1}}):\hspace{1em}\dots \texttt{+--++---+++-}\dots \hspace{0.1667em}.\end{aligned}\]
In what follows we identify $p(v\mid u)$ with $p(z(v)\mid z(u))$ and ${\Lambda _{v|r}}(u),\hspace{0.1667em}r=\texttt{A},$ with ${\Lambda _{z(v)}}(z(u))$.
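The recoding of Table 1 and the identity $z({x_{\ast }})={z_{\ast }}(x)$ can be checked mechanically. The sketch below is illustrative only; the dictionaries `S` and `Y` and the helpers `z`, `z_star` and `signs` are hypothetical names chosen to mirror (10)–(12).

```python
# Table 1: s codes weak (-1) vs strong (+1); y codes purine (-1) vs pyrimidine (+1).
S = {"A": -1, "T": -1, "G": +1, "C": +1}
Y = {"A": -1, "G": -1, "T": +1, "C": +1}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(x: str) -> str:
    """x_*: complement of x read in the opposite direction."""
    return "".join(COMPLEMENT[c] for c in reversed(x))


def z(x: str):
    """z(x) = (s, y) as in (10)."""
    return [S[c] for c in x], [Y[c] for c in x]


def z_star(x: str):
    """z_*(x): reverse s, reverse and negate y, as in (11)-(12)."""
    s, y = z(x)
    return s[::-1], [-t for t in reversed(y)]


def signs(v) -> str:
    """Render a +/-1 vector as a string of '+' and '-'."""
    return "".join("+" if t > 0 else "-" for t in v)


x = "CGGATTTAGCTA"
s, y = z(x)
print(signs(s), signs(y))
print(signs(z_star(x)[0]), signs(z_star(x)[1]))
print(z(reverse_complement(x)) == z_star(x))
```

Complementation preserves the bonding class (A and T are both weak, C and G both strong) and swaps purine with pyrimidine, which is why only the y-component changes sign.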
Let us introduce functions that are symmetric (antisymmetric) with respect to the complementary transformation $z\to {z_{\ast }},\hspace{0.1667em}z\in {\mathcal{A}_{1}^{2k}},$ defined in (10)–(12).
Definition 4.
A function $\psi :{\mathcal{A}_{1}^{2k}}\to \mathbf{R}$ is called symmetric (antisymmetric) with respect to the complementary transformation $w\to {w_{\ast }}$ iff $\psi (w)=\psi ({w_{\ast }})$ (respectively, $\psi (w)=-\psi ({w_{\ast }})$) for all $w\in {\mathcal{A}_{1}^{2k}}$.
Proposition 2.
Let $p(\eta \mid w),\eta \in {\mathcal{A}_{1}^{2}}$, $w\in {\mathcal{A}_{1}^{4m}}$, denote the m-order conditional probabilities of a bivariate random sequence $z(x)$ obtained from the nucleotide sequence $x\in {\mathcal{A}^{2m+1}}$ via z-transform (10). The following statements are equivalent:
  • (a) the sequence x and the marginal conditional probabilities $p(\cdot \mid \cdot )$ are m-order locally symmetric;
  • (b) there exist a symmetric function $\psi :{\mathcal{A}_{1}^{4m}}\to \mathbf{R}$ and two antisymmetric functions ${\psi _{-}}:{\mathcal{A}_{1}^{4m}}\to \mathbf{R}$ and ${\psi _{+}}:{\mathcal{A}_{1}^{4m}}\to \mathbf{R}$ such that
    (13)
    \[\begin{aligned}{}& \log \bigg(\frac{p(-,+\mid w)}{p(-,-\mid w)}\bigg)={\psi _{-}}(w),\end{aligned}\]
    (14)
    \[\begin{aligned}{}& \log \bigg(\frac{p(+,+\mid w)}{p(+,-\mid w)}\bigg)={\psi _{+}}(w),\end{aligned}\]
    (15)
    \[\begin{aligned}{}& \log \bigg(\frac{p(+,+\mid w)\cdot p(+,-\mid w)}{p(-,+\mid w)\cdot p(-,-\mid w)}\bigg)=\psi (w),\hspace{1em}\forall w\in {\mathcal{A}_{1}^{4m}}.\end{aligned}\]
    Another form of (13)–(15) expressed in terms of the generalized logits ${\Lambda _{s,y}}(w)$:
    (16)
    \[\begin{aligned}{}{\Lambda _{-,+}}(w)& ={\psi _{-}}(w),\end{aligned}\]
    (17)
    \[\begin{aligned}{}{\Lambda _{+,-}}(w)& =\frac{1}{2}\big(\psi (w)-{\psi _{+}}(w)+{\psi _{-}}(w)\big),\end{aligned}\]
    (18)
    \[\begin{aligned}{}{\Lambda _{+,+}}(w)& =\frac{1}{2}\big(\psi (w)+{\psi _{+}}(w)+{\psi _{-}}(w)\big).\end{aligned}\]
Proof.
From the definition of generalized logits (7) and the recoding rule defined in Table 1 and (10), (11), we obtain, for all $w\in {\mathcal{A}_{1}^{4m}}$,
\[\begin{aligned}{}{\Lambda _{-,+}}(w)& =\log \bigg(\frac{p(-,+\mid w)}{p(-,-\mid w)}\bigg)={\psi _{-}}(w),\\ {} {\Lambda _{+,+}}(w)-{\Lambda _{+,-}}(w)& =\log \bigg(\frac{p(+,+\mid w)}{p(+,-\mid w)}\bigg)={\psi _{+}}(w),\\ {} {\Lambda _{+,+}}(w)+{\Lambda _{+,-}}(w)-{\Lambda _{-,+}}(w)& =\log \bigg(\frac{p(+,+\mid w)\cdot p(+,-\mid w)}{p(-,-\mid w)\cdot p(-,+\mid w)}\bigg)=\psi (w).\end{aligned}\]
Let us check that the functions ${\psi _{-}}(w),{\psi _{+}}(w)$ and $\psi (w)$ possess the respective properties. By the definition of the local symmetry
(19)
\[ p(s,y\mid w)=p\big(s,-y\hspace{0.1667em}\big|\hspace{0.1667em}{w^{\ast }}\big),\hspace{1em}\forall w\in {\mathcal{A}_{1}^{4m}}.\]
Consequently, for all $w\in {\mathcal{A}_{1}^{4m}}$,
(20)
\[\begin{aligned}{}{\psi _{-}}(w)& =\log \bigg(\frac{p(-,+\mid w)}{p(-,-\mid w)}\bigg)=\log \bigg(\frac{p(-,-\mid {w^{\ast }})}{p(-,+\mid {w^{\ast }})}\bigg)\end{aligned}\]
(21)
\[\begin{aligned}{}& =-\log \bigg(\frac{p(-,+\mid {w^{\ast }})}{p(-,-\mid {w^{\ast }})}\bigg)=-{\psi _{-}}({w^{\ast }}).\end{aligned}\]
Thus, ${\psi _{-}}(w)$ is antisymmetric. Analogously, for all $w\in {\mathcal{A}_{1}^{4m}}$,
(22)
\[\begin{aligned}{}{\psi _{+}}(w)& =\log \bigg(\frac{p(+,+\mid w)}{p(+,-\mid w)}\bigg)=\log \bigg(\frac{p(+,-\mid {w^{\ast }})}{p(+,+\mid {w^{\ast }})}\bigg)\end{aligned}\]
(23)
\[\begin{aligned}{}& =-\log \bigg(\frac{p(+,+\mid {w^{\ast }})}{p(+,-\mid {w^{\ast }})}\bigg)=-{\psi _{+}}({w^{\ast }})\end{aligned}\]
and
(24)
\[\begin{aligned}{}\psi (w)& :=\log \bigg(\frac{p(+,+\mid w)p(+,-\mid w)}{p(-,+\mid w)p(-,-\mid w)}\bigg)\end{aligned}\]
(25)
\[\begin{aligned}{}& =\log \bigg(\frac{p(+,-\mid {w^{\ast }})p(+,+\mid {w^{\ast }})}{p(-,-\mid {w^{\ast }})p(-,+\mid {w^{\ast }})}\bigg)=\psi ({w^{\ast }}).\end{aligned}\]
The proof is completed.  □
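The implication from (a) to (b) in Proposition 2 can also be checked numerically. The sketch below builds random conditional probabilities that satisfy the local symmetry constraint and then verifies the symmetry properties of ${\psi _{-}}$, ${\psi _{+}}$ and $\psi$. It assumes that the transformed context is $w_{\ast}=(\stackrel{\gets }{s},-\stackrel{\gets }{y})$ as in (11); all function names are hypothetical and the construction is only one possible way to generate such probabilities.

```python
import itertools
import math
import random


def star(w):
    """w_*: reverse the order and flip the sign of the y-part, as in (11)."""
    s, y = w
    return s[::-1], tuple(-t for t in y[::-1])


def symmetric_conditionals(m: int, rng: random.Random):
    """Random p(eta | w) satisfying p(s, y | w) = p(s, -y | w_*)."""
    cube = list(itertools.product((-1, 1), repeat=2 * m))
    states = list(itertools.product((-1, 1), repeat=2))      # (s, y) of the centre
    p = {}
    for w in itertools.product(cube, cube):                  # w = (s-part, y-part)
        if w in p:
            continue
        q = {eta: rng.random() + 0.1 for eta in states}
        if star(w) == w:                                     # self-conjugate context
            q = {(s, y): q[(s, y)] + q[(s, -y)] for (s, y) in states}
        total = sum(q.values())
        p[w] = {eta: val / total for eta, val in q.items()}
        p[star(w)] = {(s, -y): p[w][(s, y)] for (s, y) in states}
    return p


def psi_minus(p, w):
    return math.log(p[w][(-1, 1)] / p[w][(-1, -1)])          # (13)


def psi_plus(p, w):
    return math.log(p[w][(1, 1)] / p[w][(1, -1)])            # (14)


def psi(p, w):
    return math.log(p[w][(1, 1)] * p[w][(1, -1)]
                    / (p[w][(-1, 1)] * p[w][(-1, -1)]))      # (15)


def check_proposition2(p, tol=1e-9) -> bool:
    """True if psi_-, psi_+ are antisymmetric and psi is symmetric under w -> w_*."""
    return all(
        math.isclose(psi_minus(p, w), -psi_minus(p, star(w)), abs_tol=tol)
        and math.isclose(psi_plus(p, w), -psi_plus(p, star(w)), abs_tol=tol)
        and math.isclose(psi(p, w), psi(p, star(w)), abs_tol=tol)
        for w in p
    )


p = symmetric_conditionals(1, random.Random(0))
print(len(p), check_proposition2(p))
```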
When estimating the generalized logits ${\Lambda _{\tau }}(w)$ one needs some parametrization. Below convenient parametric representations for symmetric and antisymmetric functions are presented.
According to the recoding rule defined in Table 1 and (10)–(12), $z=(s,y),\hspace{0.1667em}s,y\in {\mathcal{A}_{1}^{2m}},$ and hence in the sequel we deal with functions $\psi (s,y)$, $\psi :{\mathcal{A}_{1}^{k}}\times {\mathcal{A}_{1}^{k}}\to \mathbf{R}$, $k:=2m$.
Let $J\subset K:=\{1,\dots ,k\}$. Define the conjugate set ${J_{\ast }}$ of the set J by
(26)
\[ {J_{\ast }}:=k+1-J=\{k+1-j:j\in J\}.\]
For a given sequence $s=({s_{1}},\dots ,{s_{k}})\in {\mathcal{A}_{1}^{k}}$, denote
(27)
\[ {s^{J}}:=\prod \limits_{i\in J}{s_{i}},\hspace{1em}{s^{\varnothing }}:=1.\]
Any function $\psi :{\mathcal{A}_{1}^{k}}\times {\mathcal{A}_{1}^{k}}\to \mathbf{R}$ has the unique representation
(28)
\[ \psi (s,y)=\sum \limits_{{J^{\prime }},J\subset K}{a_{{J^{\prime }}J}}\hspace{0.2778em}{s^{{J^{\prime }}}}{y^{J}},\hspace{1em}s,y\in {\mathcal{A}_{1}^{k}},\]
where summation is over all subsets of K (including the empty set ∅), ${a_{{J^{\prime }}J}}={a_{{J^{\prime }}J}}(\psi ),\hspace{0.2778em}{J^{\prime }},J\subset K,$ are free parameters determining the function ψ. In general, there are ${4^{k}}$ free parameters.
For a symmetric (antisymmetric) function ψ, we have
(29)
\[ \psi (\stackrel{\gets }{s},-\stackrel{\gets }{y})=\psi (s,y)\hspace{1em}\big(\mathrm{respectively},\psi (\stackrel{\gets }{s},-\stackrel{\gets }{y})=-\psi (s,y)\big).\]
Consequently, in the case of the symmetric ψ, for all $s,y\in {\mathcal{A}_{1}^{k}}$,
(30)
\[ \sum \limits_{{J^{\prime }},J\subset K}{a_{{J^{\prime }}J}}\hspace{0.2778em}{s^{{J^{\prime }}}}{y^{J}}=\sum \limits_{{J^{\prime }},J\subset K}{(-1)^{|J|}}{a_{{J^{\prime }}J}}\hspace{0.2778em}{s^{{J^{\prime }_{\ast }}}}{y^{{J_{\ast }}}}=\sum \limits_{{J^{\prime }},J\subset K}{(-1)^{|J|}}{a_{{J^{\prime }_{\ast }}{J_{\ast }}}}\hspace{0.2778em}{s^{{J^{\prime }}}}{y^{J}},\]
and hence
(31)
\[ {a_{{J^{\prime }}J}}={(-1)^{|J|}}{a_{{J^{\prime }_{\ast }}{J_{\ast }}}},\hspace{1em}{J^{\prime }},J\subset K.\]
If ${J_{\ast }}=J$ and ${J^{\prime }_{\ast }}={J^{\prime }}$ (i.e. both subsets ${J^{\prime }}$ and J are self-conjugate), the set J has an even number of elements and equations (31) become identities. Thus, there are no restrictions on the values of the parameter ${a_{{J^{\prime }}J}}$ in this case. Let ${k_{\ast }}={k_{\ast }}(k)$ denote the total number of self-conjugate subsets of K.
Let τ be some total order (enumeration of elements) in the class of pairs $({J^{\prime }},J)$ of subsets of the set K. Equations (31) imply that, for non-self-conjugate pairs $({J^{\prime }},J)$, $({J^{\prime }},J)\ne ({J^{\prime }_{\ast }},{J_{\ast }})$, values of the coefficients ${a_{{J^{\prime }},J}},\hspace{0.2778em}\tau ({J^{\prime }},J)<\tau ({J^{\prime }_{\ast }},{J_{\ast }}),$ uniquely determine values of the remaining coefficients ${a_{{J^{\prime }},J}},\hspace{0.2778em}\tau ({J^{\prime }},J)>\tau ({J^{\prime }_{\ast }},{J_{\ast }})$. Define
(32)
\[\begin{aligned}{}{\mathcal{K}_{2}}& :=\big\{({J^{\prime }},J):\tau ({J^{\prime }},J)<\tau ({J^{\prime }_{\ast }},{J_{\ast }})\big\},\end{aligned}\]
(33)
\[\begin{aligned}{}{\mathcal{K}_{20}}& :=\big\{({J^{\prime }},J):\tau ({J^{\prime }},J)=\tau ({J^{\prime }_{\ast }},{J_{\ast }})\big\}.\end{aligned}\]
From (28), (31), (32) and (33) we derive a general parametric form of a symmetric function ψ:
(34)
\[ {\psi _{S}}(s,y)=\sum \limits_{({J^{\prime }},J)\in {\mathcal{K}_{20}}}{a_{{J^{\prime }}J}}{s^{{J^{\prime }}}}{y^{J}}+\sum \limits_{({J^{\prime }},J)\in {\mathcal{K}_{2}}}{a_{{J^{\prime }}J}}\big({s^{{J^{\prime }}}}{y^{J}}+{(-1)^{|J|}}{s^{{J^{\prime }_{\ast }}}}{y^{{J_{\ast }}}}\big).\]
It has
(35)
\[ {k_{S}}={k_{S}}(k):={k_{\ast }^{2}}+({4^{k}}-{k_{\ast }^{2}})/2\]
free parameters.
The case of antisymmetric function differs from that of symmetric function only in additional minus sign in equations (31). For self-conjugate pairs $({J^{\prime }},J)$, these equations hold if and only if ${a_{{J^{\prime }},J}}=0$. Thus, the first summand in (34) and in (35) disappears giving the function
(36)
\[ {\psi _{A}}(s,y)=\sum \limits_{({J^{\prime }},J)\in {\mathcal{K}_{2}}}{a_{{J^{\prime }}J}}\big({s^{{J^{\prime }}}}{y^{J}}-{(-1)^{|J|}}{s^{{J^{\prime }_{\ast }}}}{y^{{J_{\ast }}}}\big)\]
with
(37)
\[ {k_{A}}={k_{A}}(k):=\big({4^{k}}-{k_{\ast }^{2}}\big)/2\]
free parameters.
For $m=1$, ${k_{\ast }}=2$, thus ${k_{A}}=({4^{2}}-{2^{2}})/2=6$ and ${k_{S}}={k_{\ast }^{2}}+{k_{A}}=10$. Then symmetric (34) and antisymmetric (36) functions in a general form are given by
\[\begin{aligned}{}{\psi _{S}}(s,y)& ={a_{\varnothing \varnothing }}+{a_{\varnothing \{12\}}}{y_{1}}{y_{2}}+{a_{\{12\}\varnothing }}{s_{1}}{s_{2}}+{a_{\{12\}\{12\}}}{s_{1}}{s_{2}}{y_{1}}{y_{2}}\\ {} & \hspace{1em}+{a_{\varnothing \{1\}}}({y_{1}}-{y_{2}})+{a_{\{1\}\varnothing }}({s_{1}}+{s_{2}})\\ {} & \hspace{1em}+{a_{\{1\}\{1\}}}({s_{1}}{y_{1}}-{s_{2}}{y_{2}})+{a_{\{1\}\{2\}}}({s_{1}}{y_{2}}-{s_{2}}{y_{1}})\\ {} & \hspace{1em}+{a_{\{1\}\{12\}}}({s_{1}}{y_{1}}{y_{2}}+{s_{2}}{y_{1}}{y_{2}})+{a_{\{12\}\{1\}}}({s_{1}}{s_{2}}{y_{1}}-{s_{1}}{s_{2}}{y_{2}}),\\ {} {\psi _{A}}(s,y)& ={a_{\varnothing \{1\}}}({y_{1}}+{y_{2}})+{a_{\{1\}\varnothing }}({s_{1}}-{s_{2}})\\ {} & \hspace{1em}+{a_{\{1\}\{1\}}}({s_{1}}{y_{1}}+{s_{2}}{y_{2}})+{a_{\{1\}\{2\}}}({s_{1}}{y_{2}}+{s_{2}}{y_{1}})\\ {} & \hspace{1em}+{a_{\{1\}\{12\}}}({s_{1}}{y_{1}}{y_{2}}-{s_{2}}{y_{1}}{y_{2}})+{a_{\{12\}\{1\}}}({s_{1}}{s_{2}}{y_{1}}+{s_{1}}{s_{2}}{y_{2}}),\end{aligned}\]
respectively.
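The parameter counts (35) and (37) and the symmetry property (29) of the basis functions in (34) and (36) can be verified by brute-force enumeration. The following Python sketch is a hedged illustration: it enumerates pairs $({J^{\prime }},J)$ in an arbitrary order (which plays the role of τ), and the helper names `subsets`, `conj`, `monomial`, `basis` and `has_symmetry` are hypothetical.

```python
import itertools


def subsets(k: int):
    """All subsets of K = {1, ..., k}, including the empty set."""
    idx = range(1, k + 1)
    return [frozenset(c) for r in range(k + 1) for c in itertools.combinations(idx, r)]


def conj(J, k: int):
    """Conjugate set J_* = k + 1 - J, see (26)."""
    return frozenset(k + 1 - j for j in J)


def monomial(Jp, J, s, y) -> int:
    """s^{J'} y^{J} for 1-based index sets, see (27)."""
    out = 1
    for i in Jp:
        out *= s[i - 1]
    for i in J:
        out *= y[i - 1]
    return out


def star(s, y):
    """(s, y) -> (reversed s, minus reversed y), the transformation in (29)."""
    return s[::-1], tuple(-t for t in y[::-1])


def basis(k: int, sign: int):
    """Basis functions of (34) for sign=+1 and of (36) for sign=-1."""
    seen, funcs = set(), []
    for Jp, J in itertools.product(subsets(k), repeat=2):
        pair, cpair = (Jp, J), (conj(Jp, k), conj(J, k))
        if pair in seen:
            continue
        seen.update({pair, cpair})
        if pair == cpair:
            if sign == +1:        # self-conjugate pairs contribute only to psi_S
                funcs.append(lambda s, y, Jp=Jp, J=J: monomial(Jp, J, s, y))
        else:
            e = sign * (-1) ** len(J)
            funcs.append(lambda s, y, Jp=Jp, J=J, e=e:
                         monomial(Jp, J, s, y)
                         + e * monomial(conj(Jp, k), conj(J, k), s, y))
    return funcs


def has_symmetry(f, k: int, sign: int) -> bool:
    """Check f(s_*, y_*) = sign * f(s, y) on the whole cube {-1, +1}^k x {-1, +1}^k."""
    cube = list(itertools.product((-1, 1), repeat=k))
    return all(f(*star(s, y)) == sign * f(s, y) for s in cube for y in cube)


k = 2                                                     # m = 1
k_star = sum(1 for J in subsets(k) if conj(J, k) == J)    # self-conjugate subsets
print(k_star, len(basis(k, +1)), len(basis(k, -1)))
```

For $k=2$ the enumeration gives the counts ${k_{\ast }}=2$, ${k_{S}}=10$ and ${k_{A}}=6$ stated above; the same functions can be called with $k=4$ to obtain the counts for $m=2$.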
Remark 1.
An ordered sequence of symbols $x=\overrightarrow{x}$ is said to be palindromic iff $\overrightarrow{x}=\stackrel{\gets }{x}$. We refer to the mapping $\overrightarrow{x}\to \stackrel{\gets }{x}$ as palindromic transformation. In particular, for a DNA sequence x, the sequence $(x,{x_{\ast }^{\ast }})$ (here ${x_{\ast }^{\ast }}={({x_{\ast }})^{\ast }}={({x^{\ast }})_{\ast }}$) is palindromic and for a palindromic DNA sequence x, we have $s(x)=s({x_{\ast }})$ and $y(x)=-y({x_{\ast }})$. Note that the mapping ${\mathcal{A}_{1}}\to \{+1,-1\}$ is a palindromic transform of the binary alphabet ${\mathcal{A}_{1}}$. Thus, the transform $y(x)\to y({x_{\ast }})$ is a superposition of two palindromic transforms: the transform of ordering of the sequence $y(x)$ elements and the transform of their alphabet.
Palindromic distributions are defined as those invariant under some palindromic operation. For instance, palindromic Bernoulli distributions (Marchetti and Wermuth, 2016) and palindromic Ising models (Marchetti and Wermuth, 2017) are invariant with respect to palindromic transforms of the alphabet. Formulas (34) and (36) are analogues of the characterization of palindromic Bernoulli distribution in terms of log-linear parameters of multivariate Bernoulli distribution given in Marchetti and Wermuth (2016).

3 Statistical Analysis

In this section, the first-order local symmetry of the longest non-coding sequences of bacterial genomes is tested by making use of its characterization in terms of generalized logits. A special regression-type probabilistic structure is imposed on the data.

3.1 Regression-Type Probabilistic Structure of the Data

Let us introduce the following data structure of the observed sequence $x\in {\mathcal{A}^{n}}$ with $n=({n_{m}}+1)\cdot (m+1)-1$, the quantity ${n_{m}}$ being an integer:
(38)
\[ \mathcal{D}:=\big\{({v_{\ell }},{z_{\ell }}),\ell \in S\big\},\hspace{1em}S={S_{n,m}}=\{1,2,\dots ,{n_{m}}\},\]
where ${v_{\ell }}:={x_{(m+1)\ell }}$ is a response variable and ${z_{\ell }}={x_{{\mathcal{N}_{m}}((m+1)\ell )}}\in {\mathcal{A}^{2m}}$ is a vector of explanatory variables, $\ell \in S$.
Assumption (Am):
  • 1. $\{{v_{\ell }},\hspace{2.5pt}\ell \in S\}$ are conditionally independent given $\{{z_{\ell }},\hspace{2.5pt}\ell \in S\}$,
  • 2. the conditional distribution of ${v_{\ell }}$ given the value of ${z_{\ell }}$ does not depend on the site $\ell \in S$.
Assumption (Am) ensures that usual conditions of the generalized logit model with the response variable $v\in \mathcal{A}$ and the vector $z\in {\mathcal{A}^{2m}}$ of explanatory variables are satisfied, see Agresti (1990), Stokes et al. (2001).
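The data structure (38) can be illustrated with a short sketch. It assumes that ${U_{m}}(i)$ denotes the m sites on each side of site i (its definition lies outside this section), and the function name `regression_structure` is hypothetical, introduced only for illustration.

```python
def regression_structure(x, m):
    """Split a sequence x into (response, context) pairs as in (38).

    Positions are 1-based as in the text: the response is v_l = x[(m+1)*l],
    and the context z_l is ASSUMED to be the m sites on each side of it.
    """
    n = len(x)
    if (n + 1) % (m + 1) != 0:
        raise ValueError("length must equal (n_m + 1)*(m + 1) - 1")
    n_m = (n + 1) // (m + 1) - 1
    data = []
    for l in range(1, n_m + 1):
        c = (m + 1) * l                         # 1-based index of the central site
        v = x[c - 1]                            # response variable v_l
        z = x[c - 1 - m:c - 1] + x[c:c + m]     # m left and m right neighbours
        data.append((v, z))
    return data


x = "ACGTACGTACG"   # toy sequence of length n = 11 = (3 + 1)*(2 + 1) - 1
pairs = regression_structure(x, m=2)
print(pairs)
```

For the toy sequence this yields three response sites (positions 3, 6 and 9), each paired with a context of length $2m=4$.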
Remark 2 (Compatible evolutionary models).
Suppose that a DNA sequence x is an outcome of a “long” homogeneous Markov evolution and hence has a stationary distribution. Assumption (Am) imposed on x is compatible with some common DNA evolutionary models. In particular, assumption (Am) with $m=2$ holds for the independent codon evolution (Goldman and Yang, 1994). Assumption (Am) is also fulfilled if x is generated by an m-MRF. Thus, it is valid in the case of a time-reversible, site-homogeneous and context-dependent Markov evolution model with m-order nearest neighbour interactions (see, e.g. Jensen, 2005). However, it is also satisfied for some non-homogeneous, say $(m+1)$-periodic, MRF of order m.
In general, the introduced regression-type data structure supplemented with a saturated generalized logit model for m-order conditional probabilities does not determine the distribution of x. However, if assumption (Am) holds for $S={\mathcal{N}^{\circ }}$ (to be precise, for all shifts $((m+1)S+\ell )\cap {\mathcal{N}^{\circ }}$ of the set of central nucleotides $(m+1)S$ by ℓ, $\ell =1,\dots ,m-1$, simultaneously), then, by the Hammersley–Clifford theorem (Proposition 1), x is an m-MRF, and the m-order generalized logits take the form of (8) and determine the distribution of x.

3.2 Testing of Local Symmetry

We analyse data on 1221 bacterial genomes taken from the GenBank database (https://www.ncbi.nlm.nih.gov/genbank/). To bypass the data sparsity problem, the longest DNA sequence that is non-coding on both strands is extracted from each genome. Assuming that the extracted sequences satisfy assumption (Am) with $m=2$, we test their first order local symmetry.
Fig. 1
The length distribution density of the longest non-coding sequences of bacteria genomes plotted in a logarithmic scale.
In Fig. 1, the length distribution density of the extracted sequences is plotted in a logarithmic scale. The sequence lengths range from 1891 to 42901 with median 6605 and mean 7721. About a half of the sequences have length between 6000 and 8000. Since we assume (Am) with $m=2$, the logit analysis is based on three-dimensional contingency tables (64 cells) of nonintersecting triplets in the DNA sequences. The average and median of cell counts in the tables are 40 and 34, respectively. The percentage of cells with fewer than 6 counts does not exceed 1%. Thus we can ignore the p-value approximation problems inherent in the statistical analysis of sparse contingency tables (Agresti, 1990).
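The reported cell counts can be roughly cross-checked from the sequence lengths alone. The sketch below is only a back-of-the-envelope calculation: it assumes one triplet per block of $m+1=3$ sites, as in (38), and evaluates the expected count at the median and mean lengths rather than averaging over the actual tables. The function name `mean_cell_count` is hypothetical.

```python
def mean_cell_count(n, m=2, table_dim=3, alphabet_size=4):
    """Average count per cell when a sequence of length n is cut into
    non-intersecting blocks of m + 1 sites and tabulated in a table with
    alphabet_size ** table_dim cells."""
    n_m = (n + 1) // (m + 1) - 1          # number of response sites, as in (38)
    return n_m / alphabet_size ** table_dim


at_median = mean_cell_count(6605)                    # 64-cell table, median length
at_mean = mean_cell_count(7721)                      # 64-cell table, mean length
five_dim = mean_cell_count(7721, table_dim=5)        # 1024-cell table
print(round(at_median), round(at_mean), round(five_dim, 2))
```

The first two values agree with the median (34) and average (40) cell counts quoted above. The third shows that a 5-dimensional table with 1024 cells, as considered under Further work, would have fewer than 3 observations per cell on average.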
A generalized logit model is fitted to the data, and the Wald criterion is applied to test whether the coefficients of the generalized logit model satisfy the conditions implied by the antisymmetric (36) and symmetric (34) components of generalized logits specified in Proposition 2.
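For reference, the generic form of the Wald statistic for a hypothesis of this kind is recalled here. The notation is introduced only for illustration and is not taken from the paper: ${\hat{\beta }_{0}}$ denotes the vector of the q coefficient estimates expected to vanish under the null hypothesis, and ${\hat{V}_{0}}$ denotes their estimated covariance matrix.
\[ W={\hat{\beta }_{0}^{\top }}{\hat{V}_{0}^{-1}}{\hat{\beta }_{0}}.\]
Under the null hypothesis and the usual regularity conditions, W is asymptotically chi-square distributed with q degrees of freedom. For a single coefficient ($q=1$), W reduces to the square of the ratio of the estimate to its standard error.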
In Figs. 2–4, values of the logarithmized Student statistic (the Student statistic S transformed by $S\to \mathrm{sgn}(S){\log _{2}}(1+|S|)$) for testing the significance of the coefficients of response functions ${\psi _{-}},{\psi _{+}}$ and ψ defined in (13)–(15), respectively, are presented. For better visibility of the logarithmized Student statistic distributions, we use the violin plot (Hintze and Nelson, 1998; Wickham, 2016), which combines a box plot with a kernel density plot that is rotated and placed on each side to show the distribution shape of the data. The first 6 coefficients represent the antisymmetric part of the response functions and the last 10 represent the symmetric part. According to Proposition 2, in the case of local symmetry, the first 2 response functions should be antisymmetric while the last one should be symmetric. Hence the last 10 coefficients of the first two response functions and the first 6 coefficients of the third one should be insignificant. In the figures, the approximate critical value obtained by the 3σ rule (i.e. for the significance level ≈0.0054) corresponds to y-coordinates ±2.
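The transformation used in Figs. 2–4 is easy to reproduce. The following minimal sketch (the function name `log_student` is hypothetical) shows that the 3σ threshold $|S|=3$ maps to the y-coordinates ±2, since ${\log _{2}}(1+3)=2$.

```python
import math


def log_student(s):
    """Signed logarithmic transform S -> sgn(S) * log2(1 + |S|)."""
    return math.copysign(math.log2(1.0 + abs(s)), s)


upper = log_student(3.0)     # 3-sigma threshold on the positive side
lower = log_student(-3.0)    # 3-sigma threshold on the negative side
print(upper, lower)
```

The transform is odd and monotone, so it compresses large statistics while preserving their sign and ordering.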
First response function (expected to be antisymmetric). The distributions of its coefficient estimates are represented in Fig. 2. The coefficient estimates of the antisymmetric part (white violins) have skewed distributions, especially the second, which is left-skewed and has a large positive bias, and the third, which is right-skewed and has a large negative bias. The distributions in the symmetric part (grey violins) are quite symmetric about zero. A large proportion of the non-coding DNA sequences (>40%) has a significant (at the approximate significance level of 0.005) 7th coefficient (7th parameter), which is expected to be zero in the case of local symmetry.
Fig. 2
Distribution of the logarithmized Student statistic of the 1st response function coefficients: the first 6 coefficients represent the antisymmetric part, the last 10 – the symmetric part (expected to be null).
Fig. 3
Distribution of the logarithmized Student statistic of the 2nd response function coefficients: the first 6 coefficients represent antisymmetric part, the last 10 – the symmetric part (expected to be null).
In what follows, only violations of local symmetry (grey violins) are discussed.
Second response function (expected to be antisymmetric). A major part (>70%) of the non-coding sequences has a significant 7th coefficient (23rd parameter), which is expected to be zero in the case of local symmetry. A large proportion of the sequences also exhibits significant deviations of the 8th coefficient (24th parameter) from 0.
Fig. 4
Distribution of the logarithmized Student statistic of the 3rd response function coefficients: the first 6 coefficients represent antisymmetric part (expected to be null), the last 10 – the symmetric part.
Third response function (expected to be symmetric). The second coefficient (34th parameter) expected to be zero in case of local symmetry shows a clear tendency to deviate significantly from 0.
In Fig. 5, the centres of 8 clusters obtained using the standard R function for k-means clustering (R Core Team, 2018) of the 48-dimensional vectors of estimated model parameters (i.e. the estimated coefficients of all three response functions) are drawn. The coordinates of each centre are joined by lines, thus representing 8 different patterns of their interrelationships. The centre of the 8th cluster represents DNA sequences which approximately satisfy the local symmetry hypothesis. The sequences of the third cluster are also rather close to symmetry. Clusters 8 and 3, however, apparently differ in the regions $[17,19]$ and $[19,41]$. All the clusters are similar in $[1,6]$. In the grey zones (regions $[7,16]$ and $[23,38]$), we have two triplets of similar clusters: $(1,2,6)$ and $(4,5,7)$. The 39th parameter for cluster 1 clearly differs from that of clusters 2 and 6, having the opposite sign. The same applies to clusters 4, 5 and 7, respectively. Clusters 2 and 6, as well as 5 and 7, exhibit some discrepancy in the values of parameter 41. Cluster 5 also has specific values in the region $[18,19]$.
Note that the deviations of the parameter estimates in the grey region, i.e. their deviations from the DNA local symmetry hypothesis, are quite symmetric, see also Figs. 2–4. This observation is consistent with the ISFDP property noticed in Powdel et al. (2009).
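The clustering step can be sketched in a few lines. The authors used the standard R function; the code below is not their implementation but a plain Lloyd-type k-means written in Python for illustration, applied to toy 2-dimensional points instead of the 48-dimensional parameter vectors. The function name `kmeans` and all data are hypothetical.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign points to the nearest centre, then
    recompute each centre as the mean of its members."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]   # random initial centres
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the closest centre (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            for p in points
        ]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        # Update step: centre = coordinate-wise mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # an empty cluster keeps its old centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels


toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(toy, k=2)
print(sorted(centres))
```

In the setting of Fig. 5, each point would be one genome's 48-dimensional vector and k would be 8. Plotting the coordinates of each centre against the parameter index gives one line per cluster.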
Fig. 5
Lines represent the patterns of 8 clusters obtained via k-means clustering from 48-dimensional data of the logarithmized Student statistics. The grey region indicates the model parameters vanishing under the null hypothesis of the local symmetry.

3.3 Concluding Remarks

Elements of DNA sequences x are treated as random variables taking values from the alphabet $\mathcal{A}:=\{\texttt{A},\texttt{C},\texttt{G},\texttt{T}\}$. A definition of the local symmetry of x of order m is given and is characterized in terms of generalized logits (Židanavičiūtė, 2010). To test the first order local symmetry of non-coding sequences of bacterial genomes, a special regression-type structure is imposed on the probability distribution of x (assumption (Am) with $m=2$). It defines a generalized logit model with 48 scalar parameters. In the case of the first order local symmetry, 26 of them should vanish (the last 10 coefficients of each of the first two response functions and the first 6 coefficients of the third), leaving 22 free parameters.
The generalized logit model was fitted to the longest non-coding sequences of 1221 bacterial genomes taken from GenBank, and the Wald test was applied to check the null hypothesis of the first order local symmetry.
Conclusions:
  • 1. Most of the non-coding sequences of bacterial genomes do not possess the first order local symmetry.
  • 2. The deviations from the local symmetry of the non-coding sequences are themselves fairly symmetric: the sample distributions of the estimates of the model parameters that should vanish in the case of local symmetry are very close to symmetric. Apparently this symmetry is related to the intra-strand frequency distribution parity noticed in Powdel et al. (2009).
  • 3. As a by-product of the statistical analysis of the local symmetry, we show that the distributions of adjacent nucleotides are not independent even for the non-coding sequences of bacterial genomes. Hence independent evolution models (see, e.g. Faith and Pollock, 2003; Marin and Xia, 2008) are not consistent with the data of bacterial genomes.

Further work

A natural next step is to study higher order asymmetry patterns. Under assumptions (Am) with $m=2$, for the statistical analysis of the second order local asymmetries the saturated generalized logit model can be applied. Then the analysis is based on 5-dimensional contingency tables (1024 cells). Hence for the data of the longest non-coding bacterial sequences, the average cell frequency in the contingency tables is less than 3, thus indicating their sparsity. A straightforward solution of the sparsity problem by joining all non-coding sequences of each genome seems to be inappropriate because of heterogeneity of DNA sequences (see, e.g. Cristadoro et al., 2018). Special statistical methods are needed to deal with both the sparsity and heterogeneity.

Acknowledgements

The authors are grateful to Nanny Wermuth for relevant references and stimulating discussions on palindromic graphical models and to the anonymous referee for constructive comments.

References

 
Agresti, A. (1990). Categorical Data Analysis. John Wiley & Sons, New York.
 
Arndt, P.F., Burge, Ch.B., Hwa, T. (2003). DNA sequence evolution with neighbor-dependent mutation. Journal of Computational Biology, 10(3–4), 313–322.
 
Afreixo, V., Rodriges, J.M.O.S., Bastos, C.A.C., Tavares, A.H.M.P., Silva, R.M. (2017). Exceptional symmetry by genomic word: a statistical analysis. Interdisciplinary Sciences Computational Life Sciences, 9, 14–23.
 
Baisnée, P.-F., Hampson, S., Baldi, P. (2002). Why are complementary DNA strands symmetric? Bioinformatics, 18(8), 1021–1033.
 
Bérard, J., Guéguen, L. (2012). Accurate estimation of substitution rates with neighbor-dependent models in a phylogenetic context. Systematic Biology, 61(3), 510–521.
 
Besag, J. (1974). Spatial interactions and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, 36, 192–236.
 
Cristadoro, G., Esposti, M.D., Altmann, E.G. (2018). The common origin of symmetry and structure in genetic sequences. Scientific Reports, 8, 158171644.
 
Faith, J.J., Pollock, D.D. (2003). Likelihood analysis of asymmetrical mutation bias gradients in vertebrate mitochondrial genome. Genetics, 165(2), 735–745.
 
Goldman, N., Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution, 11, 725–736.
 
Hart, A., Martínez, S. (2011). Statistical testing of Chargaff’s second parity rule in bacterial genome sequences. Stochastic Models, 27, 272–317.
 
Hart, A., Martínez, S., Olmos, F. (2012). A Gibbs approach to Chargaff’s second parity rule. Journal of Statistical Physics, 146, 408–422.
 
Hintze, J.L., Nelson, R.D. (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2), 181–184.
 
Jensen, J.L. (2005). Context dependent DNA evolutionary models. Research Reports, 458.
 
Kong, S.-G., Fan, W.-L., Chen, H.-D., Hsu, Z.-T., Zhou, N., Zheng, B., Lee, H.-C. (2009). Inverse symmetry in complete genomes and whole-genome inverse duplication. PLOS ONE, Nov. 09. https://doi.org/10.1371/journal.pone.0007553.
 
Lobry, J.R. (1995). Properties of a general model of DNA evolution under no-strand-bias conditions. Journal of Molecular Evolution, 40, 326–330. Journal of Molecular Evolution, 41, 680.
 
Lunter, G., Hein, J. (2004). A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics, 20(18), 216–223.
 
Marchetti, G.M., Wermuth, N. (2016). Palindromic Bernoulli distributions. Electronic Journal of Statistics, 10(2), 2435–2460. Also available as arXiv:1510.09072.
 
Marchetti, G.M., Wermuth, N. (2017). Explicit, identical maximum likelihood estimates: for some cyclic Gaussian and cyclic Ising models. Stat, 6(1).
 
Marin, A., Xia, X. (2008). GC skew in protein-coding genes between the leading and lagging strands in bacterial genomes: new substitution models incorporating strand bias. Journal of Theoretical Biology, 253, 508–513.
 
Parks, S.L. (2015). Mathematical Models and Statistics for Evolutionary Inference. PhD thesis, University of Cambridge, Cambridge.
 
Petoukhov, S., Petukhova, E., Svirin, V. (2018). New symmetries and fractal-like structures in the genetic coding system. In: Hu, Z., Petoukhov, S., Dychka, I., He, M. (Eds.), Advances in Computer Science for Engineering and Education, ICCSEEA 2018. Advances in Intelligent Systems and Computing, Vol. 754. Springer, Cham, pp. 588–600.
 
Powdel, B.R., Satapathy, S.S., Kumar, A., Jha, P.K., Buragohain, A.K., Borah, M., Ray, S.K. (2009). A study in entire chromosomes of violations of the intra-strand parity of complementary nucleotides Chargaff’s second parity rule. DNA Research, 16(6), 325–343.
 
R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.R-project.org/.
 
Rosandić, M., Vlahović, I., Glunčić, M., Paar, V. (2016). Trinucleotide’s quadruplet symmetries and natural symmetry law of DNA creation ensuing Chargaff’s second parity rule. Journal of Biomolecular Structure and Dynamics, 34(7), 1383–1394.
 
Rudner, R., Karkas, J.D., Chargaff, E. (1968). Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proceedings of the National Academy of Sciences of the USA, 60, 921–922.
 
Shporer, S., Chor, B., Rosset, S., Horn, D. (2016). Inversion symmetry of DNA k-mer counts: validity and deviations. BMC Genomics, 17(696), 1–13.
 
Siepel, A., Haussler, D. (2004). Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Molecular Biology and Evolution, 21(3), 468–488.
 
Simons, G., Yao, Y.-C., Morton, G. (2005). Global Markov models for eukaryote nucleotide data. Journal of Statistical Planning and Inference, 130, 251–275.
 
Sobottka, M., Hart, A.G. (2011). A model capturing novel strand symmetries in bacterial DNA. Biochemical and Biophysical Research Communications, 410, 823–828.
 
Stokes, M.E., Davis, C.S., Koch, G.S. (2001). Categorical Data Analysis Using the SAS System. SAS Institute, Cary, NC.
 
Sueoka, N. (1995). Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. Journal of Molecular Evolution, 40, 318–325.
 
Tavares, A.H., Raymaekers, J., Rousseeuw, P.J., Silva, R.M., Bastos, C.A.C., Pinho, A., Brito, P., Afreixo, V. (2018). Comparing reverse complementary genomic words based on their distance distributions and frequencies. Interdisciplinary Sciences Computational Life Sciences, 10(1), 1–11.
 
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York.
 
Zhang, S.-H., Huang, Y.-Z. (2008). Characteristics of oligonucleotide frequencies across genomes: Conservation versus variation, strand symmetry, and evolutionary implications. Nature Precedings. hdl:10101/npre.2008.2146.1.
 
Židanavičiūtė, J. (2010). Dependence Structure Analysis of Categorical Variables with Applications in Genetics. Doctoral thesis, Vilnius Gediminas Technical University, Vilnius, Lithuania. Retrieved from https://vb.vgtu.lt/object/elaba:2115290/2115290.pdf.

Biographies

Radavičius Marijus
marijus.radavicius@mii.vu.lt

M. Radavičius, Assoc. Prof. Dr., is a senior researcher at Institute of Data Science and Digital Technologies and a professor at Institute of Applied Mathematics, Vilnius University. He received a PhD degree (probability and statistics) in 1982 from the Steklov Institute of Mathematics of Russian Academy of Sciences (St. Petersburg Department). His major research interests include asymptotic statistics, nonparametric and adaptive estimation, dimension reduction and data sparsity, cluster analysis, applications of statistics in life sciences, medicine, linguistics and education.

Rekašius Tomas
tomas.rekasius@vgtu.lt

T. Rekašius, Assoc. Prof. Dr., is working at Department of Mathematical Statistics, Vilnius Gediminas Technical University. He received a PhD degree (mathematics) in 2007 from Vilnius Gediminas Technical University and Institute of Mathematics and Informatics, Vilnius. His major research interests include bioinformatics, applications of statistics in life sciences and medicine.

Židanavičiūtė Jurgita
jurgita.zidanaviciute@vgtu.lt

J. Židanavičiūtė, Dr., received a master’s degree in statistics in 2003 and a PhD degree in mathematics in 2010 from Vilnius Gediminas Technical University. She has been working at Vilnius Gediminas Technical University for 15 years. Her major research interests are applications of statistics in engineering, medicine and other fields.



Copyright
© 2019 Vilnius University
Open access article under the CC BY license.

Keywords
generalized logit, DNA strand symmetry, Markov random field, characterization, hypothesis testing

Table 1
Nucleotide recoding rule.
| | Purine | (Bonds) | Pyrimidine | s |
|---|---|---|---|---|
| Weak (2 bonds) | A | $(\hspace{2.5pt}=\hspace{2.5pt})$ | T | $s=-1$ |
| Strong (3 bonds) | G | $(\hspace{2.5pt}\equiv \hspace{2.5pt})$ | C | $s=+1$ |
| y | $y=-1$ | | $y=+1$ | |

INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952
  • Copyright © 2023 Vilnius University
