A Reversible Hiding Technique Using LSB Matching for Relational Databases

Data hiding technique is an important multimedia security technique and has been applied to many domains, for example, relational databases. The existing data hiding techniques for relational databases cannot restore raw data after hiding. The purpose of this paper is to propose the first reversible hiding technique for the relational database. In hiding phase, it hides confidential messages into a relational database by the LSB (Least-Significant-Bit) matching method for relational databases. In extraction and restoration phases, it gets the confidential messages through the LSB and LSB matching method for relational databases. Finally, the averaging method is used to restore the raw data. According to the experiments, our proposed technique meets data hiding requirements. It not only enables to recover the raw data, but also maintains a high hiding capacity. The complexity of our algorithms shows their efficiencies.


Introduction
The data hiding technique is a kind of multimedia security technique. It hides secret information into another unimportant original data, and the original data is still meaningful. Because the original data is not gibberish, the technique could defraud the adversary and send the secret information. Data hiding technology can be grouped into non-reversible and reversible data hiding schemes Chen and Guo, 2020;Chen Y. et al., 2020). The image of the former will be completely damaged and cannot be restored after hiding the secret information. However, the image of the latter, after retrieving the secret information, can still be recovered to the original image. The data hiding technique has been applied to many domains, such as videos, images for medicine (Lu et al., 2015a), digital sounds, etc. An interesting application of this technique is applied in the relational database.
IBM's first relational database product for business purposes was released in 1981, and the relational database has become an important tool for enterprises and the government agencies to save data today. Enterprises use it to save employee data, customer data, product data, etc. The government uses it to save financial data, tax data, judicature data, etc. Original relational databases only save data; afterwards, Data Warehouse and Data Mining techniques make us analyse the relational database and find out the relationships among data, and then provide analytical results to enterprises for decision-making (Ke and Wang, 2006). Therefore, data has become an important resource for enterprises. Because of the importance of data, Taiwan has implemented Personal Data Protection Act in 2012; and thus the question is how to protect data from being stolen by someone else. We discuss security issues in the relational databases in the next section.
The data saved in relational databases is digital data, and it is the same as videos and images, which can all be copied easily. After the Internet has appeared, digital data can be delivered easily to others, resulting that the problem of data theft is becoming more and more serious. Relational databases have an authority control mechanism; only legitimate users can access the data in the database, and thereby it can prevent data from being stolen by unauthorized users. However, if legitimate users steal data and sell it to B, and then B asserts the data belongs to him, how can we prove who owns the data? The ownership information in the relational database can prove to whom these data belong.
Furthermore, let us consider another case; a data owner needs data mining service, so he transmits the relational database to a data mining company (Kamran et al., 2013). When he transmits, the attacker may steal and tamper the relational database, so the recipient will receive the tampered relational database. When it happens, how can we prove the relational database is original, and the data is not tampered? Tamper proofing can help us.
A kind of technique, called watermarking relational databases, can achieve abovementioned purposes. Its concept comes from the data hiding technique. Its purpose is to hide invisible digital watermark into relational databases for ownership protection and tamper proofing (Agrawal and Kiernan, 2002;Iftikhar et al., 2015;Ke and Wang, 2006). This paper is organized as follows: Section 2 will briefly describe the development history of watermark relational databases. We carefully explain our proposed method in Section 3. The experimental results are given in Section 4. Finally, the conclusion summarizes the results obtained and proposes future work for this article.

Related Works
Using a digital watermark to prove copyright of digital media has been a well-known technique. The first ones who proposed a new idea of using a digital watermark to secure a database is Zane in 2000 (Khanna andZane, 2000), and then Rakesh Agrawal and Jerry Kiernan proposed the first scheme for Watermarking relational databases in 2002 (Agrawal and Kiernan, 2002). Their scheme uses a hash function to pick out the LSB (Least-Significant-Bit) of some tuples they want to mark, and then they embed the watermark by setting the selected LSB. If the hash function value of the concatenation of the private key and the primary key is even, the selected LSB will be set to 0; otherwise the selected LSB will be set to 1.
According to the scientific studies of Bhesaniya et al. (2014) Dwivedi et al. (2014 Mohanpurkar and Joshi (2011), the Watermarking relational databases can include two kinds of databases: data distortion watermarking relational databases and data distortionfree watermarking relational databases. Agrawal's and Kiernan's scheme is improved by embedding fingerprints instead of meaningless bits in Li et al. (2003). Mehta and Rao proposed to embed the digital watermark in two attributes, namely the LSB of the digital attribute and the SS of the date attribute (Mehta and Rao, 2011). Hanyurwimfura et al. proposed a digital watermark could be expressed by the horizontal shifting position of a non-numeric attribute of selected tuples (Hanyurwimfura et al., 2010). Melkundi et al. proposed embedding a digital watermark into a textual attribute and a numeric attribute, and finally used Levenshtein Distance to verify the extracted watermark and the original watermark (Melkundi and Chandankhede, 2015). In Kamran et al. (2013), a high robustness and distortion minimization scheme for data distortion watermarking relational databases was proposed. This scheme first divides the data set into several data partitions, and then uses thresholds and hash functions to select tuples for embedding the watermark. If the embedded watermark bit is 1, the selected tuple value is added to the percentage value; if the embedded watermark bit is 0, the selected tuple value is subtracted from the percentage value. Experimental results show that this scheme can not only resist six attacks, but also minimize data distortion.
In the data distortion watermarking relational database, Zhang et al. proposed a subdomain in 2006, called the reversible watermarking relational database (Zhang et al., 2006). They use histogram extension techniques to implement a reversible watermarking scheme. The idea of reversible watermarking relational databases was derived from the image because the data in the database will be distorted after embedding a digital watermark. However, for some data, for example, categorical data, it cannot tolerate any distortion; otherwise, it will become useless (Li et al., 2004); therefore, we need this technique to restore raw data in the database. Gupta and Pieprzyk proposed a reversible blind watermarking scheme that can resist secondary watermarking attacks (Gupta and Pieprzyk, 2009). This scheme will analyse the features and select the appropriate watermark features, then use the genetic algorithm (GA) to generate the watermark. Therefore, the watermark will evolve from a random binary string to the best watermark information string, and obtain the best fitness value (β). The best fitness value is used to embed the watermark bit and ensure that the data quality is not affected. Experimental results show that RRW can retrieve the original data and watermark after malicious attacks.
The next domain we want to introduce is data distortion-free watermarking relational databases. Its main concept is to use the function of the database to generate a watermark, and the watermark will not be directly embedded in the database. Therefore, this technology can maintain the integrity of the original data without causing data distortion. Li et al. proposed the first scheme in 2004 (Li et al., 2004). Their scheme generates a fragile Table 1 The overview of D.

Tuple index
watermark based on the group's data order (ascending order means watermark bit = 0; descending order means watermark bit = 1). The fragile watermark is used to detect any modification in a relational database. In 2014, Camara et al. proposed a fragile watermarking technology that generates watermarks based on data partitions in a relational database (Camara et al., 2014). Encrypt the watermark and record it in a certificate authority. A certification authority (CA) is a trusted party that can detect suspicious databases. When we want to validate the database, we first generate a watermark from the data partition and then compare the watermark with the original watermark retrieved from the CA. Our proposed scheme is inspired from Lu et al.'s data hiding technology (Lu et al., 2015a(Lu et al., , 2015b. They are based on the LSB matching method to devise a dual imagingbased reversible hiding technique. First, it copies the original image into the same images. By LSB matching method, it embeds confidential messages into two image pixel values, respectively. Lastly, through the LSB and LSB matching methods, it will get the confidential messages. Through the averaging method, it will restore the original image.

Basic Concept
In this section, we propose a new reversible data hiding technique using LSB matching method for relational databases. The main difference between the proposed method and Lu et al.'s method is that a confidential message is hidden in the digital attributes of a relational database, not a pixel in an image. After hiding confidential messages (for example, the digital watermark of the database owner) into a relational database, the data in the database will be distorted, but the proposed method can restore the original data. Because the research is not robust enough to resist malicious attacks, it is a reversible data hiding technique for relational databases. Moreover, Section 4 also proves the proposed technique is a linear time algorithm (the running time linearly increases as the size of a data set); therefore, our technique is suitable for relational databases with a large amount of data.
Database relation D is the original data set with primary key attributes P and n attributes, whose scheme is (P , A 0 , . . . , A n−1 ). Table 1 gives an overview of D. Therefore, this method can hide all numeric attributes in D. However, to simplify the description, only one numerical attribute A i is selected to explain the method. Figure 1 is the LSB matching method of relational database inspired by Mielikainen (2006). This steganography is designed to hide binary messages into tuple values in a relational database. Assume that the encrypted confidential message (CM) is a binary number, s1 is the first bit of the CM, s2 is the second bit of the CM, s3 is the third bit of the CM, and s4 is the fourth bit of the CM; i is the attribute index; j is the tuple index. First, calculate the LSB of the tuple j ( In the second layer, F (·) is used to check whether the value of F (·) is the same as s2. In layer 3, pretend tuple values are generated. Here, tuple j (A i ) and tuple j +1 (A i ) are the original tuple values; tuple j '(A i ) and tuple j +1 '(A i ) pretend to be tuple values after hiding s1 and s2. The situations are explained as follows: Through the LSB matching method of the relational database, tuple j '(A i ) and tuple j +1 '(A i ) are embedded in the secret messages s1 and s2, and tuple j '(A i2 ) and tuple j +1 '(A i2 ) are embedded in the secret messages s3 and s4. Table 2 shows when the secret messages are hidden in two pairs of tuple values at the same time through the LSB matching method of a relational database, and under what circumstances can the two pairs of tuple values be restored to the original tuple values. 0 means that the tuple value does not need to be modified; +1 means that the tuple value is increased by 1; −1 means that the tuple value is subtracted by 1. We use the symbol x to indicate that the recovered tuple Table 2 The rule table for relational databases.

Cases
The tuple value modification statuses Original tuple values restoration statuses Table 3 The extraordinary process of modification rule table.

Rules
Cases The final modified pretend tuple values value calculated by the averaging method is different from the original tuple value. To handle the case where the tuple value cannot be restored, we modify Table 2 and describe the new modification rules for relational databases in Table 3. Taking Case 6 as an example, the first pair of tuple j (A i ) is 0 and tuple j +1 (A i ) is +1. The second pair of tuple j (A i2 ) is 0 and tuple j +1 (A i2 ) is +1 of the tuple value modification statuses; In this case, the recovered tuple value calculated by the averaging method is different from the original tuple value, so we need to use Rule 2 in Table 3. Therefore, the new tuple value modification state is that the first pair of tuple j (A i ) is 0, the tuple j +1 (A i ) is +1, and the second pair of tuple j (A i2 ) is 0, the tuple j +1 (A i2 ) is −1. Therefore, the final modified pretend tuple values are: tuple j '(A i ) does not need to be changed,

Hiding Phase
The private key was originally used to encrypt secret messages. It is assumed that the length of the encrypted confidential message is m bits, and the attribute to be hidden is located in A i . First, copy A i into the same two attributes A i and A i2 . In Fig. 1, two tuple values are used: tuple j (A i , A i2 ) and tuple j +1 (A i , A i2 ) as sets to hide all sets to obtain D' (hidden data set after CM) The overall steps are as follows: First, use Table 2 to hide CM into D. If something happens during the hiding process, use Table 3 to modify the pretend tuple value. By the averaging method, the pretend tuple values can be restored: tuple j (A i ), tuple j (A i2 ), tuple j +1 (A i ), and tuple j +1 (A i2 ) can be recovered to the original tuple j (A i ) and tuple j +1 (A i ), respectively.
The complete explanation for hiding phases is as follows: 1) Copy A i into the same two attributes A i and A i2 .
2) Each 2 tuple values tuple j (A i , A i2 ) and tuple j +1 (A i , A i2 ) are used as a set, and for each set, 4 consecutive bits are obtained from the CM. Therefore, four consecutive bits are retrieved from the CM. s1 is first, s2 is second, s3 is third, and s4 is fourth bits. Next, through Fig. 1, s1 and s2 are hidden in tuple j (A i ) and tuple j +1 (A i ), respectively, and then s3 and s4 are hidden in tuple j (A i2 ) and tuple j +1 (A i2 ), respectively. For example, after hiding 4 continuous bits into tuples by Fig. 1, if the first pair tuple values are tuple j (A i ) = 0, tuple j +1 (A i ) = +1, and the second pair tuple values are tuple j (A i2 ) = 0, tuple j +1 (A i2 ) = +1; then this condition belongs to Case-6. However, according to the restoration state of the original tuple value, the original tuple j +1 (A i ) cannot be restored. Therefore, Table 3 is used to adjust the pretend tuple values. 3) Cases 3, 6, 7, 9-11, 16: In Table 2, there are 7 cases (Cases 3, 6, 7, 9-11, 16) that cannot recover the original tuple value by averaging. If this happens, use Table 3 to modify the pretend tuple value. As just mentioned, according to Table 3, Case 6 will set tuple j +1 (A i ) to +1 in the first pair tuple values, and tuple j +1 (A i2 ) in the second pair tuple values is also set to −1.

4) Steps 2) and 3) are repeated in order to hide all CM into the tuple values.
Algorithm 1 is a hiding function that implements the concepts of Fig. 1, Table 2 and  Table 3, and Fig. 2 is Algorithm 1 flow chart. Algorithm 2 is a function that retrieves an assigned bit of CM. As shown in Table 1, tuple j (A i ) and tuple j (A i2 ), and tuple j +1 (A i ) and tuple j +1 (A i2 ) are two pairs of tuple values in D. In lines 5-11, Algorithm 1 first retrieves tuples from D, and temporarily saves them into tempDBValue array (a two-dimensional array); hence lines 12-60 can hide confidential message into tempDBValue array, and then tempDBValue array is updated to D in line 61. Line 12 is that when a tuple number is even, lines 13-60 begin to hide CM through the LSB matching method for relational databases and Modification rule table. Every time, lines 13-19 get 4 continuous bits from CM by Get_CMbit function (Algorithm 2), and they are stored in s1, s2, s3, and s4, respectively. Lines 20-45 determine which situation tuple j (A i , A i2 ) and tuple j +1 (A i , A i2 ) belong to. Afterwards, if some cases happen, Table 3 is further used to adjust tempDBValue array in lines 46-60. Therefore, CM is hidden into tempDBValue array.

Extraction and Restoration Phases
CM and recovery raw data are extracted separately in this phase. Firstly, we extract CM from the first pair tuple values: tuple j (A i ), tuple j +1 (A i ), and the second pair tuple values: tuple j (A i2 ), tuple j +1 (A i2 ) through the LSB and LSB matching method for relational databases. s1 is obtained from tuple j (A i ) by the LSB equation. After that, s2 is gained by equation (1). Substitute x = tuple j (A i ) and y = tuple j +1 (A i ) into F (·), and the value for F (·) represents s2. In the same way, s3 and s4 are obtained from the second pair tuple values: tuple j (A i2 ), tuple j +1 (A i2 ). Lastly, CM is decrypted by the same private key.
During restoration phases, the average value of the two tuple values can be calculated through the averaging method to recover the raw data, namely, using equation (2) to compute (tuple j (A i ) + tuple j (A i2 ))/2 we can restore the original tuple j (A i ); by computing (tuple j +1 (A i )+tuple j +1 (A i2 ))/2 we can restore the original tuple j +1 (A i ). The following formula is equation (1) (Lu et al., 2015a): (1) The following formula for the averaging method is equation (2): (2) Algorithm 3 Extract Restore function. Algorithm 3 is an extract restore function that extracts CM from D' and restores the raw data, and Fig. 3 is Algorithm 3 flow chart. In lines 5-12, Algorithm 3 first retrieves tuples from D', and temporarily saves them into tempDBValue array (a two-dimensional array); hence, lines 14-18 can get CM, and then lines 19-20 restore raw data into tempDBValue array, and, finally, tempDBValue array is updated to D in line 21. Line 13 is that when tuple number is even, it calculates the CM through LSB and LSB matching method for relational databases in lines 14-18, and restores the raw data through averaging method in lines 19-20. Lastly, this function returns CM and D.

Example
We have already proved our algorithms are feasible on a computer with Intel(R) Core(TM)2 Quad CPU, 4 GB of RAM, and MySQL 5.6 is our database. In this section, we illustrate our experimental examples to let readers understand the proposed scheme.
Take Fig. 5 as an example, we illustrate extraction and restoration phases as follows: Extract CM and recovery raw data separately. In the first set, s1 = 1 is gotten from tuple j (A i

Capacity and Complexity Analysis of Our Method
We all know that the data-hiding technique has two important requirements, the integrity of raw data and data hiding capacity. With respect to the integrity of raw data, since our proposed technique is reversible, it can recover the raw data. Therefore, the following paragraphs analyse data hiding capacity and the complexity of our algorithms:

Data Hiding Capacity
The data hiding capacity of our technique depends on the database. The more tuples in D, the more hiding capacity of our technique. Assume there are n tuples in D, and then the maximum hiding capacity = 2n bits.
Our technique uses each 2 tuple values: tuple j (A i , A i2 ) and tuple j +1 (A i , A i2 ) as a set, and then hides a bit into each tuple value, i.e. hiding four bits into tuple j (A i ), tuple j (A i2 ), tuple j +1 (A i ), tuple j +1 (A i2 ), respectively. Therefore, the watermark bits we intend to hide must be multiple of four. Furthermore, if n is odd, the last tuple cannot be used to hide confidential messages.

The Complexity of Our Algorithms
According to Franco-Contreras et al. (2014), Xie et al. (2016), computation time is an important issue we should consider. Thus, we analyse the complexity of Algorithm 1 and Algorithm 3, and try to prove whether they are efficient or not. We assume there are n tuples in D, and the time of updating the value in the database is TDB (it is determined by the database and the amount of data, so we ignore it); moreover, assume situations A-D in Fig. 1 will happen averagely, and furthermore assume the probability of occurrence of the cases in Table 3 is very small, so we can ignore them. Therefore, the number of executions of every instruction is shown in comments of Algorithms 1 and 3. Because line 12 in Algorithm 1 and Algorithm 3 limits tuple number to even numbers, the numbers of executions of the instructions after line 12 will be reduced to n/2.
As mentioned above, time complexity of our algorithms in Algorithms 1 and 3 are both linear time algorithms. According to "Linear time is the best possible time complexity in situations where the algorithm has to sequentially read its entire input (Sudhan and Kalaiarasan, 2017)", our technology reads the entire database sequentially and meets the linear time requirements; therefore, our proposed technique is not only efficient in hiding phase but also efficient in extraction and restoration phases.

Discussion and Conclusion
In this study, we proposed a reversible hiding technique for relational databases by using the LSB matching method. In the experiments, we proved that our technique meets data hiding requirements, and it not only enables to recover the raw data, but also maintains a high hiding capacity. Additionally, we analyse the complexity of our algorithms to demonstrate our technique is efficient. Therefore, according to Xie et al. (2016), our algorithms can be applied to big data databases. To the best of our knowledge, it is the first reversible hiding technique for relational databases. Furthermore, we want to discuss two issues: 1) Application: In our opinion, our technique is most suitable for statistical databases.
Statistical databases (SDB) have an inference problem: an attacker may infer the correct data from known information and well-chosen queries (Tendick and Matloff, 1994); however, data perturbation can solve this problem. Data perturbation is a technique that adds noises into the sensitive data. Its method is to change data through mathematic formulas (such as adding and subtracting) (Adam and Worthmann, 1989;Taneja et al., 2014). After SDB passes above-mentioned preprocess, it transforms these data into a perturbed SDB (Adam and Worthmann, 1989;Taneja et al., 2014). The searchers only query the perturbed SDB, so they never get correct results. Therefore, it is hard for people to infer the correct data (Adam and Worthmann, 1989). Because our technique must copy data into the same two data in the relational database, an attacker may doubt whether this database has confidential messages or not. Therefore, attribute A i2 can be put into the perturbed SDB after hiding phases, i.e. regarding the hiding method as mathematic formulas and regarding A i2 as perturbed data. Because A i2 is in the perturbed SDB and A i is in the original SDB, it can prevent the original SDB from having an additional data, so the attacker does not doubt if the SDB has confidential messages. During extraction and restoration phases, confidential messages is extracted and raw data is recovered through attribute A i in the SDB and attribute A i2 in the perturbed SDB. 2) Future work: Because our research is not robust enough to resist malicious attacks, it is just a data hiding technique for relational databases. In the future, we will strengthen the robustness in order to make our technique become watermarking relational databases.