 Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 Frame-synchronous Blind Audio Watermarking for Tamper Proofing and Self-Recovery Hwai-Tsu Hu * , Ying-Hsiang Lu Department of Electronic Engineering, National I-Lan University, Yilan, Taiwan Received 25April 2019; received in revised form 28 May 2019; accepted 22 August 2019 DOI: https://doi.org/10.46604/aiti.2020.4138 Abstract This paper presents a lifting wavelet transform (LWT)-based blind audio watermarking scheme designed for tampering detection and self-recovery. Following 3-level LWT decomposition of a host audio, the coefficients in selected subbands are first partitioned into frames for watermarking. To suit different purposes of the watermarking applications, binary information is packed into two groups: frame-related data are embedded in the approximation subband using rational dither modulation; the source-channel coded bit sequence of the host audio is hidden inside the 2 nd and 3 rd -detail subbands using 2 N -ary adaptive quantization index modulation. The frame-related data consists of a synchronization code used for frame alignment and a composite message gathered from four adjacent frames for content authentication. To endow the proposed watermarking scheme with a self-recovering capability, we resort to hashing comparison to identify tampered frames and adopt a Reed–Solomon code to correct symbol errors. The experiment results indicate that the proposed watermarking scheme can accurately locate and recover the tampered regions of the audio signal. The incorporation of the frame synchronization mechanism enables the proposed scheme to resist against cropping and replacement attacks, all of which were unsolvable by previous watermarking schemes. Furthermore, as revealed by the perceptual evaluation of audio quality measures, the quality degradation caused by watermark embedding is merely minor. With all the aforementioned merits, the proposed scheme can find various applications for ownership protection and content authentication. Keywords: Blind audio watermarking, lifting wavelet transform, 2 N -ary adaptive quantization modulation, rational dither modulation, tamper proofing, self-recovery 1. Introduction In the age of cloud sharing and mobile access, digital resources (such as speech, image, audio and video files) on the Internet keep increasing dramatically in recent years. Ironically, owing to the availability of convenient computer software, tampering multimedia data is also rampant nowadays. Protection against intellectual property infringement thus becomes an important issue. Digital watermarking is considered a promising countermeasure to cope with this issue [1-2]. Digital watermarks can be embedded in noise-tolerant multimedia signals to fulfill the goals of content authentication, copyright protection, covert communication, etc. Based on the information required for extraction, watermarking schemes can be divided into non-blind and blind categories. Non-blind schemes require the original image and/or watermark for extraction, whereas blind schemes require neither. Depending on the application scenario, audio watermarks can also be classified as robust or fragile. Robust watermarking is meant to be resilient to modification attempts, whereas fragile watermarking makes the embedded information sensitive to any modifications [3]. Among the audio watermarking schemes developed for content authentication, the early purpose of the embedded watermark was focused on the detection and localization of tampered area [4-6]. * Corresponding author. E-mail address: hthu@niu.edu.tw Tel.: +886-3-9317343; Fax: +886-3-9369507 Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 19 A potential application of fragile watermarking in the field of audio processing is the self-recovery technique, which embeds a watermark into the audio itself to combat the tampering situations. The embedded watermark is often a compressed version of the original content generated via data compression and coding schemes. The amount of the watermark that survives the tampering can help the receiver to not only locate the tampering areas but also to recover the lost content with a certain quality. A few self-recovery schemes for image signals have been proposed so far, such as [7-10]. Speech signal self-recovery was also attempted in [11-13]; nonetheless, studies of the self-recovery schemes for audio signals were relatively limited. The method proposed in [14] divides the audio into 4 segments and embeds the feature parameters of every segment into the less significant bits (LSBs) of another randomly selected segment. For this method, self-recovery is feasible only if the LSBs are completely retrievable. By contrast, the method in [15] embeds the control bits for self-recovery in the integer Discrete Cosine Transform (intDCT) domain and then employs a compressive sensing technique to retrieve the tampered intDCT coefficients. Although this method is capable of recovering the audio signal tampered by content replacement attacks, it can only restore the attacked signals up to 0.6 % with acceptable quality. Furthermore, the size of the replaced segment must remain identical to enable tampering detection and signal recovery. In fact, it is more common to encounter a situation where the replaced segment holds a different size. The method in [16] attempts to solve such a size discrepancy problem using a synchronization strategy. However, the adopted synchronization strategy appears oversimplified. When the length of the received audio signal is shorter than that of the original, it simply adds a set of zeros at the end of the audio instead of aligning the signal back to the correct position. One common drawback of the foregoing audio watermarking schemes for self-recovery is that they all lack the countermeasures to cope with cropping and/or time-shifting attacks. A minor time mismatch can disrupt the watermark extraction for subsequent self-recovery. Motivated by the work done in [14-16], we propose an efficient blind audio watermarking scheme that is capable of achieving tamper proofing and self-recovery in the presence of arbitrary content replacement attacks. The remainder of this paper is organized as follows. Section 2 presents two watermarking schemes designed for attaining frame-synchronous blind audio watermarking in the lifting wavelet domain. Section 3 outlines the procedures used in watermark embedding and extraction. The framework for self-recovery is discussed in Section 4. Section 5 evaluates the proposed scheme in terms of imperceptibility, temper proofing, self-recovery, and processing time. In order to illustrate the advantages of the proposed scheme more clearly, Section 5 also provides a comparative evaluation between the proposed self-recovery scheme and the one in [16]. Finally, conclusions are given in Section 6. 2. LWT-based Watermarking Schemes Among the transforms used to perform audio watermarking, DWT appears to be the most popular due to its perfect reconstruction and good multi-resolution characteristics. In particular, many DWT-based schemes take advantage of quantization index modulation (QIM) [17] to achieve effective watermark embedding. To reduce computational and memory overhead, we adopted a lifting scheme to implement the DWT in this study. A lifting wavelet transform (LWT) comprises three steps: split, prediction, and update for signal decomposition, and another three steps: update, prediction, and merge are needed for signal reconstruction. The LWT saves computational time and enables frequency localization to overcome the weakness of the traditional wavelet. It is regarded as the second-generation wavelet transform [18]. Fig. 1 presents the procedural flow for watermark generation and embedding. As illustrated in the right branch of Fig. 1, we first apply a 3-level LWT to decompose a host audio signal into one approximation subband and three detail subbands, each corresponding to a specific frequency range. In particular, the Daubechies-8 basis [19] is used as a wavelet function in the process of LWT. Note that audio watermarking is preferably implemented in low-frequency subbands with relatively high intensity, as these subbands are more tolerable to signal alteration with less impairment in perceptual quality. Theoretically, for audio sampled at 44.1 kHz, the 3 rd -level approximation subband spans a frequency range from 0 to 2756 (= 3 22050 / 2 ) Hz, Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 20 which is suitable for robust watermarking. Hence, in our design the approximation subband is reserved for embedding the crucial information including synchronization code, frame index, and hash data derived from the channel-coded bit stream. The 2 nd and 3 rd -level detail subbands are used to hide fragile watermarks that is responsible for data authentication and signal recovery. Fig. 1 Watermark generation and embedding 2.1. Rational dither modulation Following the application of LWT to the audio signal, we employ the rational dither modulation [20, 21] to carry out binary embedding in the approximation subband. Let (3) ( ) a c n denote the th n coefficient in the -level approximation subband. By referring to the QIM [17], the embedding of a binary bit  ( ) 0,1aw n  into (3) ( ) a c n can be formulated as   (3) ( )(3) (3) ( ) ( ) ( ) 1 ˆ ( ) sgn ( ) ( ) 2 2 2 a na a a n a k c n w n c n c n w n                  (1) where  sgn  ,  ,    represent the sign, absolute, and floor functions, respectively. ( )n stands for the step size for quantizing coefficients. The subscript ‘a’ in the symbol ( )aw n implies the targeted approximation subband wherein the watermark bit is embedded and extracted. The use of the magnitude rather than amplitude in Eq. (1) aims at excluding the trouble of sign flipping. In accordance with the formulation in Eq. (1), the watermarking error, which is defined as the difference between (3) ˆ ( ) a c n and (3) ( ) a c n , can be assumed to have a uniform distribution over ( ) ( ) / 2, / 2 n n     with a variance of 2 ( ) / 12 n  . One of the key features of the RDM lies in the acquisition of ( )n , which is recursively derivable from previous coefficients through      1010 log 1/121/ 2 2 (3) 20 ( ) 1 1 ˆ ( ) 10 repF BL n a i c n i L             (2) 3 rd Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 21 where L stands for the length of the involved coefficients. Similar to the manner in [20, 21], the embedding strength is adaptively controlled at the maximum tolerable level of the human auditory system [22, 23]. The term   2010 repF B signifies a multiplicative factor for adjusting the embedding strength. ( ) rep F B is the auditory masking threshold in unit of decibel for the Bark scale repB . ( ) 0.275 15.025 rep rep F B B      (3) with  representing a clearance gap for imperceptibility. repB can be obtained from a representative frequency repf using the following empirical formula [24]:    1 1 213tan 0.00076 3.5 tan ( / 7500)rep rep repB f f     (4) Here we chose the center of the -level approximation subband as the representative frequency, i.e. 3 1 0.5 2 2 s rep f f    (5) where s f is the sampling frequency. Overall, Eq. (2) jointly takes into account of the psychoacoustic features (i.e., ( ) rep F B ), the quantization error distribution (i.e.,  1010log 1 / 12 ), and the root-mean-square of previously processed coefficients (i.e., (3) ˆ{ ( ) 1, , } a c n i i L  ). The watermark extraction in RDM requires the derivation of quantization step ( )n  based on the same formula presented in Eq. (2). Subsequent to the acquisition of the 3 rd -level approximation coefficient (3) ( ) a c n , the bit ( )aw n residing in (3) ( ) a c n can be determined by (3) ( ) ( ) ( ) mod 2 0.5 , 2 a a n c n w n             (6) where  mod ,x y denotes the modulo operation, which returns the remainder after the division of x by y . The tilde symbol atop a participating variable implicates the effect due to possible attacks. 2.2 N 2 -ary adaptive quantization index modulation Analogous to RDM, the N 2 -ary adaptive quantization index modulation (AQIM) modifies the coefficient magnitude according to a N 2 -ary number  ( ) 0,1, , 2 1Ndw n   .   ( ) ( ) ( ) ( ) ( ) 1 ˆ ( ) sgn ( ) max 0, ( ) 2 2 2 j dj j d k d d k dN N k c n w n c n c n w n                      (7) where max  denotes the maximum value drawn from a set of data. ( ) ( ) j d c n is the th n coefficient in the th j -level detail subband. The floor function within the above equation may, however, render a negative value that is illegitimate to the definition of a magnitude. In case a negative outcome occurs, we simply replace the negative value with zero. In contrast to the case in RDM, the quantization step k  is computed from the energy of all the coefficients in a frame indexed by an integer k . Given that the watermarking errors maintain a power ratio  in decibels, the relationship betwee k  and  can be mathematically expressed as follows: 3 rd Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 22       1 2 ( ) 0 10 1 2 ( ) ( ) 0 1 2 ( ) 0 2 1 ( ) 10 1 ˆE ( ) ( ) 1 ( ) / 12 f c f L j d if L j j d d ic L j d if k c i L c i c i L c i L                    (8) where f L is the frame length. cL denotes the number of the coefficients involved in the quantization. Basically, c fL L . By referring to Eqs. (3) and (4), the value of  can also be estimated as  10( ) 10 log 1/ 12repF B    (9) In [20, 25-27], it was demonstrated that the quantization step size can be adaptively retrieved from a watermarked audio as long as the energy level remains unchanged throughout the watermarking process. The modification on ( ) ( ) j d c n in Eq. (7) inevitably cause variations in energy, which makes the retrieved k  different from the one used in watermark embedding. Thus, the recovered watermark bits may become inaccurate. This conflict can be settled by first minimizing the overall energy variation of the first c L coefficients and then tuning the other coefficients in the range between 1 c L  and f L . More specifically, we first sort the coefficient magnitudes, termed ( ) ( ) ( ) j i d i l c l  , in descending order: 0 1 1 1 ˆ ˆ ˆ ˆ ˆ( ) ( ) ( ) ( ) ( ) ci i L l l l l l             (10) where i l , which is drawn from  0,1, , 1cL  , signifies the index associated the th i largest magnitude. When applying Eq. (7) to the th i l coefficient, the optimal solution 1( )il is ( ) 1 ˆ ˆ( ) ( ) ( ) j i i d i l l c l   (11) and the suboptimal 2 ( ) i l becomes    ( ) 2 ˆ ˆ ˆˆ( ) , if ( ) ( ) & ( ) ( ) ˆ ( ) , otherwise. j i k i d i i k i i k l l c l l l l               (12) In general, coefficients with large magnitudes contribute more variations in energy. To minimize the overall energy variation in a frame, we select between 1 ( ) ' i l s and 2 ( ) ' i l s for the coefficient magnitudes in the top o L ranks.         11 2 2 2 2 0, , 1 0 {1,2} ˆˆ arg min ( ) ( ) ( ) ( ) o i i o o i LL i n i i i i n i L i i L n n l l l l                 (13) subject to the constraint that the accumulated energy must be less than the overall energy, i.e.       1 11 1 1 2 2 2 2 2 ( ) ( ) ( ) 0 0 0 ˆ( ) ( ) ( ) ( ) ( ) f fc c c i o c L LL L L j j j n i i d i d d i i L i i L i l l c l c i c i                   (14) The search for  ˆin in this study is done by a brutal force approach. Thus, the required computation is exponentially proportional to o L . This study chooses o L as 8. Substituting ˆ ( )in i for the magnitudes in  ˆ ( ) 0 1i ol i L    , i.e. Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 23 ˆ ˆ ˆ( ) ( ) ( ) ii i n i l l l      , yields the least energy variation achievable by the o L coefficients. Once the magnitude for the th i l coefficient is determined, the corresponding detail coefficients can be modified as follows:    ( ) ( ) ˆ ( ) ˆ ( ) sgn ( ) ; 0,1, , 1; 0,1, , 1 ( ) ij j d i d i c i c i l c l c l i L l L l           (15) where  represents an infinitesimal number added to the denominator to avoid dividing by zero. The violation of the constraint (14) implies that the energy collected from the first c L coefficients exceeds the total amount. It will be impossible to compensate for the excessive portion by regulating the energy over the remaining coefficients, i.e.,  ( ) 1( ) , , , 1jd c c fc i i L L L  . In case the inequality (14) cannot maintain after adjusting the first oL coefficients, we then proceed with the next o L coefficients in the top ranks, i.e.,  ( )ˆ ( ) , , 2 1jd i o oc l i L L  , and rerun the adjustment process. Previously altered detail coefficients in the sorted sequence shall remain intact. The adjustment process continues until the constraint shown in Eq. (14) is satisfied. Finally, to ensure a perfect match with the original energy level, we use the remaining ( )f cL L coefficients to absorb the energy discrepancy       1/ 2 1 1 2 2 ( ) ( ) ( ) ( ) 0 0 1 2 ( ) ˆ( ) ( ) ˆ ( ) ( ) ; , , 1. ( ) f c f c L L j j d d j j i i d d c fL j d i L c i c i c k c k k L L c i                           (16) After completing the watermark embedding, we reconstruct the audio by taking the inverse LWT with respect to all subband coefficients. To retrieve the embedded watermark bits from the watermarked audio, we follow the same steps used in the embedding process. The quantization step size k  can be obtained using Eqs. (8). The th i 2 -ary N number, termed ( ) d w i , is determined based on the QIM rule: ( ) ( ) ( ) mod 2 0.5 , 2 j d N N d k c i w i             (17) 3. Self-recovery Framework One of the main features of the proposed watermarking scheme is the self-recovery capability. In order to achieve tamper proofing and self-recovery concurrently, we incorporate the source-channel coding and hashing techniques into the proposed watermarking scheme. The basic idea is to use the frame-partitioned source-channel coded data as the watermark in the embedding phase, and examine the watermark for tamper detection and self-recovery in the extraction phase. In this study, we adopt a MPEG-1 audio layer III codec (termed MP3 for short) to perform a lossy data-compression of the host audio. In consideration of the limited watermarking capacity, the audio signal is encoded at a very low bitrate of 16 kilobits per second (kbps). This is actually achieved by down-sampling the audio by a factor of 4 and then applying the MP3 codec to convert the audio to a bit stream of 64 kbps. The bit stream is further divided into frames of size 2448 (=21538), which can be regarded as 2 message words, each containing 153 bytes. In this study, Reed-Solomon (RS) codes on the Galois fields 8 (2 )GF [28] are employed to recover the information destroyed by tampering attempts. For each message word, we use a  255, 153 RS code to form an augmented word of length 255. This arrangement enables the RS code to correct 51   255 153 / 2  errors in a row of 255 symbols. In other words, the tolerable tampering rate of the applied RS code is 20% (i.e., 51/255). As a result, the number of bits that were supposed to Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 24 embed in each frame is expanded from 2448 to 4080 (=22558). Given that the sampling rate of the audio is 44100 Hz, this amount of binary bits shall be embedded into a frame with its length no less than 441002448/16000. Hence we choose the frame length as 6656 samples and embed 4080 bits into the 3 rd and 2 nd -level detail subbands after performing 3-level LWT decomposition. As shown in Fig. 2, the 3 rd and 2 nd -level detail subbands respectively consist of 832 and 1664 coefficients in each frame. We tactically embedded 816 octal numbers (3 bits per coefficient) into the 3 rd -level detail subband and 1632 binary bits (1 bit per coefficient) into 2 nd -level detail subband using the 2 -ary N AQIM discussed in Section 2, thus rendering a total of 4160 bits to accommodate the need of channel-coded data for audio recovery. The extra 80 bits (= 4160 4080 ) are reserved for the need of file headers. Also note that we have applied a distinct 2 -ary N AQIM to each detail subband. The use of 8-ary AQIM in the 3 rd -level detail subband stems from the consideration that this subband usually contains relatively higher intensity than that found in the 2 nd -level detail subband. According to the formula for 2 -ary N AQIM given in Eq. (8), a high energy level also leads to a large quantization step that is supposedly more capable of resisting against malicious attacks. 4. Procedures for Watermark Embedding and Extraction 4.1. Watermark embedding The procedure for watermarks embedding is detailed in the following. Step. 1: Apply the 64 kbps MP3 codec to a down-sampled audio signal and convert the output file into a bit stream. Compose the bit sequence as an array of message words with a size of 153 bytes (or equivalently counted as 153 8-bit symbols). Step. 2: Append the parity symbols to each message word after applying a  255, 153 RS encoder. Step. 3: The symbols are scrambled among words via the use of . 1key . This operation allows the RS code to detect and correct symbol errors in the corrupted word. Step. 4: Divide the message array into groups, each holding 4 consecutive words. For each group,  Record the frame index as a 16-bit integer and encode this integer using a  31, 16 BCH encoder [29].  Record the total number of frames as a 16-bit integer and encode this integer using a  31, 16 BCH encoder.  Use . 2key to randomly permute the symbol sequence in each message word.  Apply the MD5 hash algorithm [30] to the first 153 symbols in each word and draw 16 hash bits from each hashed output to form a composite hash representation of 64 bits.  Pack the frame-related information as a bit sequence of length 128, as shown in Fig. 2. Step. 5: Perform 3-level LWT on the host audio Step. 6: Partition the coefficients in each subband into frames. For a frame length of 6656 audio samples, there are 832 and 1664 coefficients contained in the 3rd and 2nd-level subbands, respectively. Step. 7: Distribute the channel-coded symbols obtained in Steps. 3 and 4 to the audio frames. For each frame,  Embed the synchronization code and frame-related information alternately into the 3rd-level approximation subband.  Embed 2448 bits into the 3rd-level detail subband using 8-ary AQIM.  Embed 1632 bits into the 2nd-level detail subband using binary AQIM. Step. 8: Take inverse LWT to obtain the watermarked audio. Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 25 Fig. 2 Arrangement of watermark bits 4.2. Watermark extraction Fig. 3 outlines the procedure for watermark extraction, tampering detection, and self-recovery. The required steps are outlined as follows: Step. 1: Conduct 3-level LWT. Step. 2: Extract the embedded bits from the approximation subband using RDM discussed in Section 2. Step. 3: Apply a matched filter to the extracted bit sequence. The synchronization code in reverse order serves as the filter coefficients. Given that  ( ) 0,1n  denotes the synchronization code of length syncl , feeding the extracted ( )aw n into the matched filter results in, Fig. 3 Block diagram of watermark extraction for tampering detection and audio recovery    1 0 ( ) 2 ( 1 ) 1 2 ( ) 1 syncl sync a i M n l i w n i         (18) Ideally, a salient peak of value sync l occurs whenever ( ) a w n perfectly matches with the synchronous code. 255 bytes 255 bytes 255 bytes 255 bytes : : : 255 bytes 255 bytes 255 bytes 255 bytes Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 26 Step. 4: For each frame,  Retrieve the frame-related data and the hash bits from two consecutive frames; acquire the number of total frames and frame index using a  31, 16 BCH decoder.  Extract 2448 bits from the 3rd-level detail subband using 8-ary AQIM.  Extract 1632 bits from the 2nd-level detail subband using binary AQIM.  Rearrange these (2448+1632) bits as two words, each comprising 255 8-bit symbol (1 byte per symbol).  Place these two words in the corresponding index entry. Step. 5: Use . 2key to restore the symbol sequence in each message word. Step. 6: Use . 1key to restore the original permutation of the symbol array. Step. 7: Pass the message word to the RS decoder to obtain the source-coded audio symbols. Step. 8: Generate the hash bits from each word using the MD5 hash algorithm and compare these bits with those recorded in the approximation subband. If the hash bits are identical, the symbol sequence is assigned to the location indicated by the frame index. Otherwise, the frame is labeled as tampered at the receiver. Step. 9: Use a MP3 decoder to decompress the audio signal from the extracted watermark bits and up-sample the output by a factor of 4. Step. 10: If the audio frame has been tampered, then we substitute the up-sampled audio signal for the tampered audio content. Since the RS decoding process is capable of removing 51 (=  255 153 / 2 ) errors in a row of 255 symbols, tampering is recoverable as long as the tampering rate is below 0.2 (=51/255); otherwise, the RS decoder and recovery process fail. 5. Performance Evaluation The test materials in the following experiments comprised twenty-four 30-second music clips collected from a variety of compact discs, including vocal arrangements and ensembles of musical instruments. The music clips can be classified into four categories: classical (3), pop (7), rock (7), soundtracks (7). All audio signals were sampled at 44.1 kHz with 16-bit resolution. The parameters used in the proposed watermarking scheme were set as follows: 2  , 0 8L  , 128 sync l  ; rep f  1378.1, 4134.4, and 8268.8 Hz for the 3 rd -level approximation subband, 3 rd -level detail subband, and 2 nd -level detail subband, respectively; The back-tracing length L used in the RDM was set to 416. 816cL  and 832fL  were chosen for the 8-ary AQIM in the 3 rd -level detail subband, while 1632 c L  and 1664 f L  were for the binary AQIM in the 2 nd -level detail subband. 5.1 Imperceptibility test The quality of the watermarked audio signal was evaluated using the SNR defined in Eq. (19) along with the perceptual evaluation of audio quality (PEAQ) metric [31].   2 10 2 ( ) 10 log ˆ( ) ( ) n n s n SNR s n s n               (19) Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 27 where ( )s n and ˆ( )s n denote the original and watermarked audio signals, respectively. The PEAQ simulates the subjective evaluation of human subjects. It renders an objective difference grade (ODG) between -4 and 0, signifying a perceptual impression from “very annoying” to “imperceptible”. In this study, the PEAQ metric for the imperceptibility test was an implementation released by the TSP Lab at McGill University [32]. Table 1 summarizes the experiment results with respect to the test materials. For each audio signal of 30 seconds long, there are over 198 frames of size 6656 can be embedded and at least 99 of them contain the synchronization code. Embedding the synchronization codes and frame-related data into the 3 rd -level approximation subband rendered an average SNR of 28.34 dB, which led to an average ODG score around -0.30. The subsequent embedding of the channel coded symbols into the 3 rd and 2 nd -level detail subbands brought the SNR to 27.36 dB and caused the ODG to slightly drop to -0.44. Such a result suggests that the proposed watermarking schemes merely have minor influence on perceptual quality. Moreover, for all audio files in the test, the embedded synchronization codes were perfectly detected using the matched filter. Fig. 4 shows one such example, wherein the peaks with a height of 128 repeatedly appear for every 1664 approximation coefficients. Table 1 Quality measures of the watermarked audio after applying RDM and 𝟐𝑵-ary AQIM to the audio signals Quality measure RDM (in 3 rd -level app. Subband) N 2 -ary AQIM (in 3 rd & 2 nd -level detail subbands) RDM+ N 2 -ary AQIM SNR [dB] Mean 28.34 35.59 27.36 Standard deviation 0.25 3.92 0.46 ODG Mean -0.30 -0.22 -0.44 Standard deviation 0.38 0.26 0.43 (a) Audio siganl (b) Output of the matched fliter Fig. 4 Matched filtering with respect to the watermark bits obtained by the RDM 5.2 Tamper detection and recovery A representative audio signal was employed to demonstrate the competence of the proposed scheme for tampering detection and localization. We conducted three types of attacks (namely, deletion, substitution, and insertion) on the audio signal with the self-recovering watermark embedded. The deletion attack cropped the leading 25000 samples of the watermarked audio signal. The substitution attack replaced the watermarked audio signal with zero over the range between 325001 and 375000. For the insertion attack, we appended 50000 samples of random noise at the end of the watermarked audio signal. Both the substitution and insertion attacks represent possible attempts on counterfeiting the audio signal. Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 28 (a) Audio with fragile watermarks embeded (b) Tampered audio (c) Output of the matched fliter Fig. 5 Illustration of three types of tampering attack (a) Audio with fragile watermarks embeded (b) Tampered audio (c) Recoverd audio Fig. 6 Illustration of tampering detection and audio signal recovery Fig.5 (a) and (b) respectively present the original and tampered watermarked audio signals. The watermark bits hidden in the 3 rd -level approximation subband are extracted, bipolarized (i.e.,    0,1 1,1  ), and finally fed into a matched filter. Fig. Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 29 5 (c) depicts the output of the matched filter. A sharp peak with its magnitude greater than a predefined threshold (e.g., 80) can serve as an indicator to demarcate the frame boundary. The tampered signal is then processed using the watermark extraction and self-recovery procedures discussed in Sections 2-4. More specifically, subsequent to frame synchronization, the hash bits are used to verify the veracity of individual frame content. As shown in Fig. 6(b), a nonzero level (delineated as a bold solid red line) signifies intact frames and a zero level specifies the occurrence of tampering. All the data extracted from the 2 nd and 3 rd -level detail subbands are then employed to reconstruct an MP3 decompressed version of the audio signal. Eventually, the lost contents in the tampered frames are replaced by the reconstructed ones, which are drawn in red in Fig. 6(c). This typical example demonstrates that our scheme not only accurately locates the tampered audio frames but also possesses a self-recovering capability. 5.3 Processing Time The proposed self-recovery scheme comprises five basic modules to carry out watermark embedding. The first module involves source-channel encoding and hashing technique jointly used to constitute the bit sequence for audio recovery. As the watermarking is accomplished in three low-to-middle frequency subbands, we need a 3-level LWT and another inverse LWT to decompose and recompose the audio signal. These two transformations are extra computational burdens for the watermarking performed in the LWT domain. The operations situated in between the LWT and ILWT contain two sorts of watermark embedding, namely, the synchronization code sequence in the 3 rd level approximation subband and the channel-coded bit stream in both the 2 nd and 3 rd detail subbands. As for the process of watermark extraction, we only need a 3-level LWT to decompose the watermarked audio. The detection of the synchronization code enables the alignment of frame boundary, which facilitates the watermark retrieval in the 2 nd and 3 rd detail subbands. Possible errors in the watermark bits shall be amended with the assistance of the RS channel coder. Eventually, the veracity and integrity of the received audio can be authenticated using hashing comparison. Table 2 Processing time required for each program module in watermark embedding and extraction processes Program Modules in Watermark Embedding Processing Time [sec] Mean Standard deviation Conduct (1) Data encoding & hashing (2) Bit arrangement 6.377 0.193 Perform LWT 0.947 0.009 Embed sync_code using RDM 0.780 0.019 Embed channel-coded data using AQIM 2.852 0.075 Perform ILWT 0.954 0.008 Overall 11.910 0.258 Program Modules in Watermark Extraction Processing Time [sec] Mean Standard deviation Perform LWT 0.957 0.013 Align frames via the detection of sync_code 0.027 0.007 Extract watermark (coded data) 0.029 0.004 Perform channel-decoding and hashing comparison 1.267 0.037 Restore the tampered signal if necessary - - Overall 2.281 + 0.048 + We implemented the proposed watermarking algorithm in a Matlab environment operating with a 4 GHz Intel(R) Core(TM) i7-4790K CPU and 32 GB RAM. Table 2 lists the average computation time for the twenty-four 30-second audio signals in the test set. In general, it takes 11.91 seconds to complete the watermark embedding for an audio file of 30 second long. Among the five modules in the whole process, the data encoding and hashing consume about 53.54% of the computational time. The actual embedding in LWT subbands requires 5.533 seconds in total. Compared to the lengthy computation required in the embedding process, the time spent on watermark extraction is greatly reduced while extracting watermark bits from the 2 nd and 3 rd detail subbands using AQIM. Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 30 5.3 Comparative Evaluation In order to illustrate the advantages of our proposed scheme more clearly, we make a comparison between ours and the scheme proposed by Gomez-Ricardez and Garcia-Hernandez in [16]. The scheme in [16] is chosen for comparison based on the following two similarities. First, just like the manner we have done in this study, it employs a channel coder to protect the watermark. Second, this scheme is also claimed to be robust against the content replacement attack if the affecting portion is less than 20% of the whole audio. Table 3 summarizes the comparison. The method in [16] is indeed capable of restoring the substituted segment when the size of substitution remains unchanged. Restoring the audio segment destroyed by the insertion attack is also possible if the tampered area is accurately located and the whole audio is properly trimmed and aligned. However, dealing with the deletion attack is problematic. For example, deleting a small section of the audio in the middle, then shifting the remaining part ahead, and finally padding zeros at the end can easily cripple the watermark extraction for the scheme in [16]. The cause is ascribable to the fact that the deletion misplaces a large portion of the watermarked audio and thus devastates the channel code information. For the same sake, the scheme in [16] cannot survive the cropping or time-shifting attacks, which are known to disrupt the frame synchronization for correct watermark extraction. By contrast, with the incorporation of the self-synchronization feature discussed in Section 2, the proposed scheme can withstand all the aforementioned attacks (i.e., insertion, deletion, substitution, cropping, and time-shifting). Table 3 Comparison results Attack types Resistance The proposed Scheme in [16] Cropping / Time shifting Yes No Deletion Yes No Substitution Yes Partially feasible Insertion Yes Partially feasible LSB erasure Yes No Another advantage of the proposed scheme is that it is quite capable of resisting against minor attacks such as LSB erasure. Table 4 presents the extraction results when 1 and 2 LSBs are deliberately obliterated. The results indicate that even in the case of 2 LSBs erasure the proposed scheme can perfectly extract 19 out of 24 embedded watermarks from the 2 nd detail subband. Moreover, because the maximum BER (i.e., 1.172%) is less than 20% 1/8, the Reed-Solomon code capable of correcting 20% erroneous 8-bit symbols is sufficient to recover the original watermark bits. Table 4 Comparison results # of erasure bits Erase 2 LSBs Erase 1 LSB Embedding location 2 nd detail subband 3 rd detail subband 2 nd detail subband 3 rd detail subband # of watermarks without errors 19 21 22 23 # of watermarks with errors 5 3 2 1 Largest BER among the watermarks with errors 1.172% 0.081% 0.223% 0.001% 6. Conclusion In this paper, we have proposed a novel watermarking scheme to not only authenticate the veracity and integrity of the received audio but enable the recovery of tampered contents via the exploitation of source-channel coding. After the application of 3-level LWT to the audio signal, the proposed scheme performed two types of watermarking processes in a frame-synchronous manner. A compressed version of the original signal protected with the RS code was embedded into the 3 rd and 2 nd -level detail subbands using 2 -ary N AQIM, while the frame-related data and hash bits were embedded into the 3 rd -level approximation subband using RDM. The experiment results indicated that the watermark embedding resulted in an average SNR of 27.36 dB and an average ODG score around -0.44 for a test set of twenty-four audio clips, suggesting that the Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 31 watermarked audio is nearly perceptually indistinguishable from the original one. In the phase of watermarking extraction, the RDM proved to be effective in tracking synchronization codes, thus facilitating the frame alignment and watermark extraction. The 2 -ary N AQIM also demonstrated its competence in performing multi-bit data hiding in the LWT domain. More importantly, the ability of tracing frame boundaries empowered the proposed scheme to combat with the cropping and replacement attacks that no previous self-recovery watermarking schemes could easily handle. As there is plenty of room for hiding extra information in the 3 rd -level approximation subband, our future work will be focused on adding other robust watermarks to reinforce copyright protection. Conflicts of Interest The authors declare no conflict of interest. Acknowledgment This research work was supported by the Ministry of Science and Technology (MOST), Taiwan, under grant 107-2221-E-197-021. References [1] N. Cvejic and T. Seppänen, Digital audio watermarking techniques and technologies: applications and benchmarks. Hershey: Information Science Reference, IGI Global, 2008. [2] X. He, Watermarking in audio: key techniques and technologies, Youngstown, N.Y.: Cambria Press, 2008. [3] M. Steinebach and J. Dittmann, “Watermarking-Based Digital Audio Data Authentication,” EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 10, pp. 1001-1015, 2003. [4] M. Q. Fan, P. P. Liu, H. X. Wang, and H. J. Li, “A semi-fragile watermarking scheme for authenticating audio signal based on dual-tree complex wavelet transform and discrete cosine transform,” International Journal of Computer Mathematics, vol. 90, no. 12, pp. 2588-2602, 2013. [5] Ghobadi, A. Boroujerdizadeh, A. H. Yaribakht, and R. Karimi, “Blind audio watermarking for tamper detection based on LSB,” Proc. 2013 15th International Conference on Advanced Communications Technology (ICACT), IEEE Press, January 2013, pp. 1077-1082. [6] N. N. Hurrah, S. A. Parah, N. A. Loan, J. A. Sheikh, M. Elhoseny, and K. Muhammad, “Dual watermarking framework for privacy protection and content authentication of multimedia,” Future Generation Computer Systems, vol. 94, pp. 654-673, 2019. [7] H. He, F. Chen, H. Tai, T. Kalker, and J. Zhang, “Performance analysis of a block-neighborhood-based self-recovery fragile watermarking scheme,” IEEE Transactions on Information Forensics and Security, vol. 7, no.1, pp. 185-196, 2011. [8] Q. Han, L. Han, E. Wang, and J. Yang, “Dual Watermarking for Image Tamper Detection and Self-Recovery,” 2013 9th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, October 2013, pp. 33-36. [9] X. Zhang, Z. Qian, Y. Ren, and G. Feng, “Watermarking with flexible self-recovery quality based on compressive sensing and compositive reconstruction,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 4, pp. 1223-1232, 2011. [10] W. L. Tai and Z. J. Liao, “Image self-recovery with watermark self-embedding,” Signal Processing: Image Communication, vol. 65, pp. 11-25, July 2018. [11] S. Sarreshtedari, M. A. Akhaee, and A. Abbasfar, “A watermarking method for digital speech self-recovery,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no.11, pp. 1917-1925, 2015. [12] W. Lu, Z. Chen, L. Li, X. Cao, J. Wei, N. Xiong, et al., “Watermarking Based on Compressive Sensing for Digital Speech Detection and Recovery (†),” Sensors, vol. 18, no. 7, pp. 2390, 2018. [13] S. Li, Z. Song, W. Lu, D. Sun, and J. Wei, “Parameterization of LSB in Self-Recovery Speech Watermarking Framework in Big Data Mining,” Security and Communication Networks, 2017. [14] F. Chen, H. He, and H. Wang, “A fragile watermarking scheme for audio detection and recovery,” 2008 Congress on Image and Signal Processing, vol. 5, pp. 135-138, 2008. Advances in Technology Innovation, vol. 5, no. 1, 2020, pp. 18-32 32 [15] Menendez-Ortiz, C. Feregrino-Uribe, J. J. Garcia-Hernandez, and Z. J. Guzman-Zavaleta, “Self-recovery scheme for audio restoration after a content replacement attack,” Multimedia Tools and Applications, vol. 76, no. 12, pp. 14197-14224, June 2017. [16] J. J. Gomez-Ricardez and J. J. Garcia-Hernandez, “An audio self-recovery scheme that is robust to discordant size content replacement attack,” 2018 IEEE 61st International Midwest Symposium on Circuits and Systems (MWSCAS), 2018, pp. 825-828. [17] Chen and G. W. Wornell, “Quantization index modulation: A class of provably good methods for digital watermarking and information embedding,” IEEE Trans. Information Theory, vol. 47, no. 4, pp. 1423-1443, 2001. [18] W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” Applied and computational harmonic analysis, vol. 3, no. 2, pp. 186-200, 1996. [19] Daubechies, Ten lectures on wavelets. Philadelphia, 1992. [20] H. T. Hu and L. Y. Hsu, “A DWT-based rational dither modulation scheme for effective blind audio watermarking,” Circuits, Systems, and Signal Processing, vol. 35, no. 2, pp. 553-572, 2016. [21] H. T. Hu and L. Y. Hsu, “Supplementary schemes to enhance the performance of DWT-RDM-based blind audio watermarking,” Circuits, Systems, and Signal Processing, vol. 36, no. 5, pp. 1890-1911, 2017. [22] X. He and M. S. Scordilis, “An enhanced psychoacoustic model based on the discrete wavelet packet transform,” Journal of the Franklin Institute, vol. 343, no. 7, pp. 738-755, 2006. [23] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proc. IEEE, vol. 88, no. 4, pp. 451-515, 2000. [24] H. Traunmüller, “Analytical expressions for the tonotopic sensory scale,” The Journal of the Acoustical Society of America, vol. 88, no. 1, pp. 97-100, 1990. [25] H. T. Hu, L. Y. Hsu, and H. H. Chou, “Variable-dimensional vector modulation for perceptual-based DWT blind audio watermarking with adjustable payload capacity,” Digital Signal Processing, vol. 31, pp. 115-123, 2014. [26] H. T. Hu and L. Y. Hsu, “Robust, transparent and high-capacity audio watermarking in DCT domain,” Signal Processing, vol. 109, pp. 226-235, 2015. [27] H. Hu and T. Lee, “High-Performance Self-Synchronous Blind Audio Watermarking in a Unified FFT Framework,” IEEE Access, vol. 7, pp. 19063-19076, 2019. [28] S. Lin and D. J. Costello, Error Control Coding, Second Edition: Prentice-Hall, Inc., 2004. [29] G. Forney, Jr., “On decoding BCH codes,” IEEE Trans. Information Theory, vol. 11, no. 4, pp. 549-557, 1965. [30] B. den Boer and A. Bosselaers, “Collisions for the compression function of MD5,” Workshop on the Theory and Application of Cyptographic, Berlin, Heidelberg, pp. 293-304, 1994. [31] ITU-R Recommendation BS.1387, “Method for objective measurements of perceived audio quality,” December 1998. [32] P. Kabal, “An examination and interpretation of ITU-R BS.1387: Perceptual evaluation of audio quality,” TSP Lab Technical Report, Dept. Electrical & Computer Engineering, McGill University, 2002. Copyright© by the authors. Licensee TAETI, Taiwan. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC) license (https://creativecommons.org/licenses/by-nc/4.0/).