

R.M. Pascual and R.C.L. Guevara


SCIENCE DILIMAN  (JANUARY-JUNE 2017) 29:1, 5-36

Experiments and Pilot Study Evaluating
the Performance of Reading Miscue Detector
and Automated Reading Tutor for Filipino:
A Children’s Speech Technology
for Improving Literacy

Ronald M. Pascual*
Far Eastern University Manila

Rowena Cristina L. Guevara
University of the Philippines Diliman

ABSTRACT

The latest advances in speech processing technology have allowed the development of automated reading tutors (ARTs) for improving children's literacy. An ART is a computer-assisted learning system based on oral reading fluency (ORF) instruction and automated speech recognition (ASR) technology. However, the design of an ART system is language-specific, and thus requires developing a system specifically for the Filipino language. In a previous work, the authors presented the development of the children's Filipino speech corpus (CFSC) for the purpose of designing an ART in Filipino. In this paper, the authors present the evaluation of the ART in Filipino, which integrates a reference verification (RV)- and word duration analysis-based reading miscue detector (RMD), a user interface, and a feedback and instruction set. The authors also present the performance evaluation of the RMD in offline tests, and the effectiveness of the ART as shown by the results of the intervention program, a month-long pilot study that involved the use of the ART by a small group of students. Offline test results show that the RMD's performance (i.e., false alarm (FA) rate ≈ 3% and miscue detection error (MDerr) rate ≈ 5%) is on par with that of state-of-the-art RMDs reported in the literature. The results of the ART intervention experiment showed that the students, on average, improved in their words correct per minute (WCPM) rate by 4.66 times, in their ORF-16 scores by 6.0 times, and in their reading comprehension exam scores by 4.4 times, after using the ART.

Key words: Reading miscue detector, automated reading tutor, reference verification, word duration analysis, Filipino speech

_______________
*Corresponding Author

ISSN 0115-7809 Print / ISSN 2012-0818 Online


A Children’s Speech Technology for Improving Literacy


INTRODUCTION

The latest advances in speech processing technology, coupled with the current problems that the country's primary education system is facing, such as the poor reading performance of students and the shortage of teachers, inspired the authors to focus on the development of an automated reading tutor (ART) for improving Filipino children's literacy. An ART is a computer-assisted learning system based on oral reading fluency (ORF) instruction and automated speech recognition (ASR) technology. The main task that an ART performs is the automatic detection of reading miscues or disfluencies in input speech. Through the reading miscue detector (RMD), the ART is capable of "listening" to the reader and spotting reading errors so that it may offer help (e.g., by modeling the correct pronunciation of a text passage) whenever necessary. Figure 1 presents an overview of the ART system and its basic components. It must be noted that, unlike a conventional automatic speech recognizer, the RMD knows in advance the desired or target speech pattern that should be uttered by the learner or user. The objective is to identify possible deviations (miscues, errors, or disfluencies) from the target speech pattern.

The designs of ART and RMD systems, however, are language-specific. For instance, Project LISTEN's reading tutor (Mostow et al. 1994), the Colorado Literacy Tutor (Hagen et al. 2003), and the system presented by Black et al. (2011) were all designed for the English language. The RMD systems presented by Liu et al. (2008) and by Duchateau et al. (2006) were designed for the Mandarin Chinese and Dutch languages, respectively. Recently, Rahman et al. (2014) made an effort to develop an ASR system that can be used for ARTs for Malay-speaking children, while Rayner et al. (2014) developed a rule- and grammar-based computer-assisted language learning (CALL) system for German-speaking children.

Figure 1. Overview of the automated reading tutor (ART) system and its basic components.

In this study, the authors focused on the development of a system for Filipino, the national language of the Philippines and a language used in the Philippine basic education system. The features and orthography of Filipino are distinct from those of other languages, and thus there is an apparent need to develop a system specifically designed for the language. For instance, in terms of speech rhythm, Filipino is generally classified as a syllable-timed language (Guevara et al. 2010). In syllable-timed languages, unlike in stress-timed languages such as English, every syllable is perceived as taking up roughly the same amount of time, except for small variations due to prosody. Moreover, the authors decided to develop a system specifically designed for children in the early grade levels because, according to educators and reading experts, intervention programs such as reading tutorials done at early grades are most effective and are likely to ensure reading success at later grades (Wasik and Slavin 1993; National Reading Panel 2000).

However, a difficulty in developing systems for children is the unavailability of an appropriate children's speech corpus that can be used for a particular application in a particular language (Gerosa et al. 2009; Russell 2010). It was noted by Russell (2010) and suggested by Gerosa et al. (2009) that, although an ASR system trained on adults' speech can employ an adaptation technique that improves its performance in processing children's speech, it is unlikely that its performance will exceed that of a counterpart system trained on children's speech. In this study, the authors also present the development of a children's Filipino speech corpus, or the CFSC (Pascual and Guevara 2012a), that was used for the design and implementation of the RMD system for Filipino.

Most of the RMD systems reported in the literature have generally used either of two baseline systems: (1) a conventional ASR; or (2) the reference verification (RV) method. The RMD systems presented by Mostow et al. (1994), Liu et al. (2008), and Duchateau et al. (2006) all employed the first type of baseline system (i.e., a conventional ASR) for decoding what the reader has said. In the conventional ASR baseline system type, the recognition results are compared with the "reference" (i.e., the target or expected sounds in the text to be read) to check whether there are any deviations (i.e., reading miscues or disfluencies). Moreover, a suitable language model (LM) based on the reading text is typically used together with the ASR baseline system in order to improve the recognition performance.



By contrast, other RMD systems, such as those by Black et al. (2011) and by Bolaños et al. (2009), employ the second type of baseline system (i.e., the RV method) and do not use an LM. Under the RV framework, the reader's speech data and the reference are usually force-aligned while the likelihood of the sounds is calculated. The system then classifies whether an input speech sound, in comparison with the reference, is accepted or rejected. Thus, the RV method may be regarded as a speech classification method rather than a speech recognition method. The RV method may similarly be seen as a form of speech verification or pronunciation verification. An obvious advantage of the RV method over the conventional ASR baseline system is that it avoids the problems caused by using a complex or dynamic LM (Bolaños et al. 2009). The RV method, however, suffers from a relatively lower miscue detection rate due to its observed tendency to ignore single-phone or single-syllable deletions, insertions, or substitutions, or even repetitions and self-corrections. Thus, for practical RMD systems, an additional process is usually integrated into the baseline RV method in order to achieve better performance.

In this paper, the authors present the performance evaluation of the RMD system for the automatic detection of reading miscues in children's Filipino speech. The RMD system is the core of the ART system that the authors have developed for Filipino (Pascual and Guevara 2012b). The architecture of the RMD system consists of two levels: (1) a phone-level RV method on the first level; and (2) a word-level alignment and word duration analysis (WDA) method on the second level. An interesting related study by Duong et al. (2011) discusses the use of word duration-based template models for automatically assessing children's oral reading prosody in English. The focus of this study is the use of WDA for the automatic detection of reading miscues and disfluencies for application in an ART system for Filipino.

Over the past few years, a number of field or pilot studies regarding the implementation and evaluation of ARTs in various languages and countries have been published. For instance, Mills-Tettey et al. (2009) conducted a field study in selected schools in Ghana and Zambia to investigate the viability and effectiveness of Project LISTEN's reading tutor in improving the reading skills of children in English as a second language (L2). Using the same ART system, Mostow et al. (2013) conducted a 7-month study that involved 178 students, and claimed that the use of the ART resulted in improvements expected from guided oral reading, such as higher gains in fluency and reading comprehension. Duchateau et al. (2009) presented the evaluation of an ART for fluency instruction in Dutch as a first language (L1), and claimed that the ART works satisfactorily for children, even for those with reading disabilities, in a real school environment. Tsau (2012) conducted a field study with students in central Taiwan, and reported that their ART system named "My English Tutor" successfully enhanced the oral reading fluency (ORF) of learners of English as a foreign language (EFL). Similarly, Reeder et al. (2015) employed an ART system for English in their study conducted in a public elementary school in Vancouver, Canada, and concluded that the ART system successfully contributed to the reading development of young learners of English as an additional language (EAL).

In this paper, the authors present the evaluation of an ART for Filipino that integrates an RMD, a user interface, and a feedback and instruction set. The next sections present a discussion of the results of the ART intervention program, a month-long pilot study that involved the use of the ART by a small group of students. The program aimed to evaluate the effectiveness of the ART in improving the reading skills of the learners.

DESIGN AND EVALUATION METHOD FOR THE READING MISCUE
DETECTOR FOR FILIPINO

Children's Filipino Speech Corpus and Models

Some studies in the past few years, such as those by Kazemzadeh et al. (2005), Batliner et al. (2005), Cleuren et al. (2008), and Gao et al. (2012), have focused on the development of children's speech corpora in languages such as English, Dutch, Italian, German, Swedish, and Mandarin. The absence of a speech corpus that can be used for the development of an RMD and ART for Filipino motivated the authors to develop a medium-scale, gender- and age-balanced CFSC (Pascual and Guevara 2012). Nearly all of the speakers in the CFSC are native speakers of Filipino.

The CFSC consists of two parts: (1) a part containing good reading pronunciations; and (2) a part containing examples of actual reading miscues and disfluencies. The CFSC provides the following data sets: a training data set for the generation of speech models, a set of reference speech features (such as word durations) extracted from good pronunciations, an offline test set for the evaluation of the RMD system, and a data set for the analysis of actual reading miscues found in children's Filipino speech.

The first part of the children's speech corpus (i.e., the good pronunciations) contains about five hours of continuous read speech collected from a total of 37 Grades 2 to 5 students (ages ranging from 7 to 12 years). Of the 37 students, 17 are girls and 20 are boys. The second part of the children's speech corpus (i.e., the part containing reading miscues) contains about three hours of continuous read speech collected from a total of 20 Grades 1 to 3 students (ages ranging from about 6 to 9 years). Of these students, 11 are girls and 9 are boys.

Nearly the entire CFSC contains orthographic transcriptions for the speech data. To make the CFSC more useful for the RMD design, the authors transcribed a part of the CFSC at the phoneme level. Note that the smallest sound unit in which a reading miscue may occur is a single phone or syllable, thus suggesting the need for phone-level transcriptions. The phone-level transcription process, which is one of the most expensive parts of the design, was executed in a semi-automated manner. That is, the speech data were initially machine-transcribed through phoneme forced alignment using the hidden Markov model (HMM)-based speech modeling toolkit HTK (Young et al. 2006), and the machine transcriptions were then manually re-aligned where needed. A bootstrap data set (about 20 minutes of hand-transcribed speech) was used to facilitate the automated transcription method. The phoneme set used for this study includes a total of 35 phones and diphones, as listed in Table 1.

Phone Class         Phones / Diphones

Stop                /p/, /b/, /t/, /d/, /k/, /g/, /q/ (glottal stop)
Fricative           /f/, /v/, /s/, /z/, /sh/
Affricate           /j/
Nasal               /m/, /n/, /ng/
Lateral Liquid      /l/
Retroflex Liquid    /r/
Glide               /w/, /y/
Vowels              /a/, /e/, /i/, /o/, /u/
Diphones            /ha/, /he/, /hi/, /ho/, /hu/, /at/, /aw/, /ay/, /oy/
Pause / Silence     /pau/

Table 1. Phoneme set used for children's Filipino speech corpus (CFSC) transcriptions

A total of nine Filipino text passages, six of which are short stories while the other three are expository-type texts, were used for the CFSC recordings. Adarna House, a publisher of Filipino short stories for children in the Philippines, provided the six short stories, while the three expository-type texts were adopted from various school textbooks. All the text passages are age-appropriate and were suggested by research collaborators from the College of Education at the University of the Philippines Diliman.



Design of the Reference Verification-
and Word Duration Analysis-based Reading
Miscue Detector for Filipino

For an RMD, the desired or target speech pattern (reference) that should be uttered by the reader is known in advance, and the objective is to identify possible deviations (miscues or errors) from the reference. As mentioned in the Introduction, the RV method for detecting reading miscues aligns the input speech with the reference, and decides through a similarity measure whether or not the input speech sounds are the same as the reference sounds.

There are generally two possible approaches to estimating the likelihood or similarity of the expected sounds in the reference to the sounds found in the input speech. The first approach is time-domain template matching, while the second approach is speech model (HMM) comparison using a parametric representation of speech features, such as the Mel frequency cepstral coefficients (MFCCs). While template matching-based ASR offers simplicity, it also suffers from several difficulties and limitations, such as the inability to generalize features from many speakers and example utterances, low computational efficiency for larger numbers of test/reference patterns, and the inability to incorporate statistical features from a given data set. The advantages of HMM-based ASR over template matching-based systems led the authors to select the HMM-based approach for designing the RV-based RMD in Filipino. In particular, the HMM-based approach allowed the authors to use practically all the information available in the training data and to design an RMD that is speaker-independent. Furthermore, the HMM-based approach also permitted the implementation of a computationally efficient and practically realizable RMD.

In this study, the authors' initial approach was to design a baseline system that uses the phone-level RV method. To do this, the authors first generated the hidden Markov models (HMMs) for all the phones in the phoneme set used in this study by training the system with 1.5 hours of phone-level transcribed children's speech in the CFSC. The HMMs consist of three states, with 39-dimensional MFCC feature vectors comprising 13 static coefficients, 13 delta coefficients, and 13 acceleration coefficients. An overview of the Markov model is illustrated in Figure 2. It is worth noting that the 3-state HMM prototype shown in Figure 2 appears to have five states. This is only due to the speech modeling toolkit convention. The first and the last states are non-emitting states, and thus are part of the network of HMMs but do not describe any of the input data. Only the middle three states emit observation vectors.
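To make the 13 + 13 + 13 feature layout concrete, the following is a minimal sketch of how delta and acceleration coefficients can be appended to static coefficients. It assumes the common regression-based delta formula with a window of two frames and replicated edge frames; the window size, the function names, and the toy input are illustrative assumptions and are not taken from the authors' configuration.

```python
from typing import List

Frame = List[float]


def deltas(frames: List[Frame], window: int = 2) -> List[Frame]:
    """Regression-based delta coefficients; edge frames are replicated.

    The window size of 2 is an assumption made for this sketch.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def stack_features(static: List[Frame]) -> List[Frame]:
    """Concatenate static, delta, and acceleration coefficients per frame."""
    d1 = deltas(static)   # delta coefficients
    d2 = deltas(d1)       # acceleration coefficients (delta of delta)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input (not real speech): 10 frames of 13 static coefficients,
# each coefficient rising by 0.5 per frame.
static = [[0.5 * t + c for c in range(13)] for t in range(10)]
features = stack_features(static)
print(len(features), len(features[0]))  # 10 39
```

In this toy case the delta of every coefficient at an interior frame equals the per-frame slope (0.5) and the acceleration is zero, which is a quick sanity check on the formula.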



Given the reference text passage and the input speech data, the phone-level RV method then proceeds as follows:

1. Form the reference (or expected) phone symbol sequence (including symbols for expected short pauses/silences for proper phrasing) from the reference text passage.

2. Perform an HMM Viterbi forced alignment between the reference phones and the phones found in the input speech data. (This process produces the log likelihood scores for each phone in the reference.)

3. Using the output likelihood scores ρ_v for each phone v from step 2, perform a threshold-based classification to decide whether or not a reading miscue has occurred. That is,

$$
md_{RV} =
\begin{cases}
1 \ (\text{miscue present}), & \min_{\text{all } v} [\rho_v] < P \\
0 \ (\text{miscue absent}), & \min_{\text{all } v} [\rho_v] \geq P
\end{cases}
\qquad (1),
$$

where P = log likelihood score threshold.

Figure 2. Three-state left-to-right hidden Markov model (HMM) used in this study. States 1 and 5 are non-emitting. Observation vectors consist of 39 Mel frequency cepstral coefficients (MFCCs). The a_ij's are the transition probabilities.



Figure 3 graphically illustrates the RV method through a specific example. In this example, the reference text is a four-word Filipino phrase /Maraming prutas at gulay/ (Many fruits and vegetables). The RV process is initiated by phonetically spelling out the reference, thus producing the phone sequence: /m/, /a/, /r/, /a/, /m/, /i/, /ng/, /p/, /r/, /u/, /t/, /a/, /s/, /q/, /at/, /g/, /u/, /l/, and /ay/. The reference phone sequence is then stored as a text file. During the execution of the RV process, the reference phones are all initially assigned to a starting position and have equal durations. The Viterbi forced alignment proceeds by taking the reference phones one by one and finding the best alignment (i.e., the alignment that produces the maximum likelihood score) with the input speech data. The numbers alongside the phone symbols in the last tier shown in Figure 3 are the log likelihood scores.
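The phonetic spelling step of this example can be sketched as a simple dictionary lookup. The toy lexicon below covers only the four words of the example phrase, and the way the 19 phones are divided among the words (in particular, attaching the glottal stop /q/ to "at") is an assumption made for illustration; the function name and data structure are likewise hypothetical.

```python
from typing import Dict, List

# Toy pronunciation dictionary for the example phrase only.
# The per-word grouping of phones is an assumption for this sketch.
TOY_LEXICON: Dict[str, List[str]] = {
    "maraming": ["m", "a", "r", "a", "m", "i", "ng"],
    "prutas": ["p", "r", "u", "t", "a", "s"],
    "at": ["q", "at"],
    "gulay": ["g", "u", "l", "ay"],
}


def reference_phones(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Spell out a reference text as a flat phone sequence."""
    phones: List[str] = []
    for word in text.lower().split():
        if word not in lexicon:
            raise KeyError(f"word not in pronunciation dictionary: {word}")
        phones.extend(lexicon[word])
    return phones


sequence = reference_phones("Maraming prutas at gulay", TOY_LEXICON)
print(" ".join(f"/{p}/" for p in sequence))
```

The resulting sequence has 19 phone symbols, matching the sequence listed in the example. A full system would also insert pause/silence symbols for expected phrasing, which this sketch omits.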

Intuitively, one may decide that there has been a reading miscue (i.e., a deviation from the expected or reference phones) if a phone likelihood score falls below the lower threshold. While this condition generally applies, it is worth noting that, due to the nature of the Viterbi forced alignment and likelihood scoring, there is also a need to set another threshold (i.e., an upper threshold) for the purpose of detecting a miscue. The upper threshold is necessary for the detection of cases wherein the likelihood scores become too high due to certain types of miscues or disfluencies, such as vowel elongation or pause/silence prolongation.

Figure 3. Illustration of the application of the reference verification (RV) process used in this study to a specific utterance of a sentence in the Filipino reading text.
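The two-threshold decision described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and the example scores are invented, and treating the upper threshold as "flag a miscue if any phone score exceeds it" is one plausible reading of the text rather than the authors' confirmed rule. The default values (-650 and 550) are the lower and upper thresholds reported later in this paper.

```python
from typing import Sequence


def md_rv(phone_scores: Sequence[float],
          lower: float = -650.0,
          upper: float = 550.0) -> int:
    """Return 1 (miscue present) or 0 (miscue absent) for one utterance.

    phone_scores: log likelihood score of each reference phone after
    forced alignment. The upper-threshold rule is an assumption.
    """
    if not phone_scores:
        raise ValueError("at least one phone score is required")
    too_low = min(phone_scores) < lower    # a phone matched the reference poorly
    too_high = max(phone_scores) > upper   # e.g., elongation or prolonged pause
    return 1 if (too_low or too_high) else 0


# Invented scores, for illustration only.
print(md_rv([-300.0, -420.0, -510.0]))  # 0
print(md_rv([-300.0, -700.0, -510.0]))  # 1
print(md_rv([-300.0, 600.0, -510.0]))   # 1
```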



As mentioned in the previous section, the phone-level RV method alone is expected to give a relatively low miscue detection rate at an acceptable false alarm rate due to its poor ability to detect single-phone or single-syllable errors, as well as disfluencies like brief hesitation pauses, self-corrections, and repetitions. To address this problem, the authors' approach was to integrate a second-level (i.e., word-level) process that uses a duration-based prosodic feature (i.e., word durations) into the baseline phone-level RV process in order to obtain a better detection of reading miscues. The idea was based on an initial observation that, in many cases of actual reading miscues and disfluencies found in the CFSC, the effective word durations deviated highly from the expected word durations, which are based on good pronunciation examples. Figure 4 illustrates such a case of a reading miscue found in a five-word sentence in the CFSC. It shows the speech waveforms of a good pronunciation example (upper plot) and of an utterance containing a reading miscue (lower plot) for the same sentence. The machine-detected word boundaries, indicated by the rectangle edges, are shown beneath the respective plots. The orthographic transcriptions of the speech data are also shown below each plot. Note that the duration of word number 3 in the lower plot is significantly longer than that in the upper plot. The observed word duration deviation from that of the good pronunciation is due to the reading miscue (a substitution of the word "ito'y" for "itong", followed by a self-correction) found in the lower plot.

Figure 4. Illustration of the machine-detected word duration deviation (word number
3) from the reference (upper plot) due to a reading miscue found in the speech data
shown in the lower plot.



In order to implement the WDA for the purpose of reading miscue detection, the authors first extracted the average sentence-normalized word durations from the good pronunciation examples in the CFSC. The reference word durations were stored in a look-up table. After this initialization process, the main WDA process proceeds as follows:

1. Given a certain sentence, form the reference word sequence (i.e., a static word decoding network where word-ends are connected only to the fan-in nodes of the alternative pronunciations of the next word appearing in the reference). For the purpose of this step, the authors used a pronunciation dictionary of 650 unique words, including alternative pronunciations and a silence/pause model, found in the reading text passages.

2. Perform a word-level Viterbi forced alignment between the reference and the input speech data. (This process produces the machine-detected word boundaries for each word in the reference sentence.)

3. From the machine-detected word boundaries in step 2, calculate the sentence-normalized durations for each word in the sentence. (Normalization with respect to local sentence duration is necessary to compensate for different reading rates.)

4. Calculate the relative deviations (from the normalized reference word durations) of the normalized word durations found in step 3.

5. Using the relative word-duration deviations δ_v for each word v in step 4, perform a decision process (of whether or not there was a reading miscue) as follows:

$$
md_{WDA} =
\begin{cases}
1 \ (\text{miscue present}), & \max_{\text{all } v} [\delta_v] > \Delta \\
0 \ (\text{miscue absent}), & \max_{\text{all } v} [\delta_v] \leq \Delta
\end{cases}
\qquad (2),
$$

where Δ = word duration deviation threshold.
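Steps 3 to 5 of the WDA process can be sketched as below. Several details are assumptions made for illustration: the sentence duration is taken as the sum of the word durations, the relative deviation is computed as an absolute difference divided by the reference value, and all function names, boundaries, and reference durations are invented. The default threshold of 0.70 corresponds to the Δ = 70% value reported later in this paper.

```python
from typing import List, Sequence, Tuple


def normalized_durations(boundaries: Sequence[Tuple[float, float]]) -> List[float]:
    """Step 3: word durations divided by the total duration of the sentence."""
    durations = [end - start for start, end in boundaries]
    total = sum(durations)
    return [d / total for d in durations]


def relative_deviations(observed: Sequence[float],
                        reference: Sequence[float]) -> List[float]:
    """Step 4: relative deviation of each word from its reference duration.

    Using the absolute value is an assumption of this sketch.
    """
    return [abs(o - r) / r for o, r in zip(observed, reference)]


def md_wda(deviations: Sequence[float], delta: float = 0.70) -> int:
    """Step 5: 1 (miscue present) if the largest deviation exceeds delta."""
    return 1 if max(deviations) > delta else 0


# Invented reference: five words with equal normalized durations.
reference = [0.2, 0.2, 0.2, 0.2, 0.2]

# Invented word boundaries in seconds; word 3 is prolonged in the second case.
good = [(0.0, 0.4), (0.4, 0.8), (0.8, 1.2), (1.2, 1.6), (1.6, 2.0)]
miscue = [(0.0, 0.4), (0.4, 0.8), (0.8, 2.4), (2.4, 2.8), (2.8, 3.2)]

print(md_wda(relative_deviations(normalized_durations(good), reference)))    # 0
print(md_wda(relative_deviations(normalized_durations(miscue), reference)))  # 1
```

In the second case the prolonged third word takes up half of the sentence instead of one fifth, a relative deviation of about 150%, which exceeds the threshold.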





 


 
 





Figure 5 shows the combined phone-level RV and WDA method, which is simply referred to here as the "RV-plus-WDA" method, for the final design of the RMD for Filipino. For the RV-plus-WDA method, the miscue detection is now given by the following classification scheme:

$$
md =
\begin{cases}
1 \ (\text{miscue present}), & [md_{RV} + md_{WDA}] > 0 \\
0 \ (\text{miscue absent}), & [md_{RV} + md_{WDA}] = 0
\end{cases}
\qquad (3),
$$

where md_RV and md_WDA are given respectively in Equations (1) and (2).

Figure 5. Overview of the combined methods, reference verification (RV) and word duration analysis (WDA), for the reading miscue detector (RMD) design.

[Flowchart: the input speech signal passes through feature extraction (MFCC) and preprocessing, then HMM-based phone-level reference alignment and likelihood scoring, which uses the reference phone sequence and the acoustic model (HMM definitions). If the likelihood scores fail the threshold test, a reading miscue/disfluency is detected. Otherwise, word duration and phrasing analysis (via word-level reference alignment), which uses the word pronunciation dictionary and the word/pause duration reference from "good readers", checks whether the word/pause durations are within bounds. If they are not, a reading miscue/disfluency is detected; if they are, no reading miscue/disfluency is detected.]
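The combined decision can be sketched in a few lines. The reading assumed here is that a miscue is flagged whenever either level flags one, which is how the flow in Figure 5 is understood in this sketch; the function name, argument layout, and example values are hypothetical, and only the lower likelihood threshold is modeled.

```python
from typing import Sequence


def rv_plus_wda(phone_scores: Sequence[float],
                word_deviations: Sequence[float],
                p: float = -650.0,
                delta: float = 0.70) -> int:
    """Two-level decision: 1 if either level flags a miscue, else 0.

    p and delta default to the threshold values reported in this paper;
    combining the levels by addition is an assumption of this sketch.
    """
    md_rv = 1 if min(phone_scores) < p else 0           # first level (RV)
    md_wda = 1 if max(word_deviations) > delta else 0   # second level (WDA)
    return 1 if (md_rv + md_wda) > 0 else 0


# Invented inputs covering the four possible combinations.
print(rv_plus_wda([-300.0, -400.0], [0.10, 0.20]))  # 0: both levels pass
print(rv_plus_wda([-300.0, -900.0], [0.10, 0.20]))  # 1: RV flags
print(rv_plus_wda([-300.0, -400.0], [0.10, 1.50]))  # 1: WDA flags
print(rv_plus_wda([-300.0, -900.0], [0.10, 1.50]))  # 1: both flag
```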



Reading Miscue Detector Performance Evaluation Method

In order to evaluate the performance of the RMD systems presented in the previous section, the authors performed two sets of offline tests that employed various threshold values. The first set of tests evaluated the performance of the phone-level RV-based RMD, while the second set of tests evaluated the performance of the two-level RV-plus-WDA-based RMD.

For both sets of tests, the authors used an offline test set containing 100 sentences, or a total of 1,030 words, that were randomly selected from the CFSC. In selecting the test files, the authors sought a fairly balanced representation in terms of gender and age. Among the 100 sentences in the test set, 50 contain at least one reading miscue or disfluency, while the other 50 contain good pronunciations. Analysis of the test set showed that there are seven types of reading miscues found in children's Filipino speech: (1) partial word; (2) hesitation pause; (3) insertion; (4) repetition; (5) substitution; (6) deletion; and (7) elongation. Table 2 summarizes the relative frequencies of occurrence of these reading miscues in the test set.

Table 2. Reading miscue occurrences in the test set

Type of Miscue/Disfluency    Relative Frequency

Hesitation Pause             30.1%
Partial Word                 20.4%
Insertion                    18.4%
Repetition                   13.6%
Substitution                  7.8%
Deletion                      6.8%
Elongation                    2.9%

The performance of the RMD systems was evaluated using two measures commonly used in the literature: the false alarm (FA) rate and the reading miscue detection error (MDerr) rate (Duchateau et al. 2006; Liu et al. 2008; Black et al. 2011).

The FA rate is defined as the number of words erroneously detected as read incorrectly divided by the total number of correct pronunciations. That is,

                 FA = FP / (TN + FP) (4),

where FP = number of false positives (i.e., false detections of a miscue), and TN = number of true negatives (i.e., correct detections of the absence of a miscue).

The MDerr rate, also referred to as the misdetection (MD) rate, is defined as the number of miscues that were not detected divided by the total number of miscues. That is,

                 MD = FN / (TP + FN) (5),

where FN = number of false negatives (i.e., undetected miscues), and TP = number of true positives (i.e., correctly detected miscues).
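Equations (4) and (5) translate directly into code. The sketch below counts the four outcomes from paired ground-truth and detector labels and then computes both rates; the function names and the example labels are invented for illustration and do not reproduce the authors' test data.

```python
from typing import Sequence, Tuple


def confusion_counts(truth: Sequence[int],
                     detected: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN), where 1 = miscue and 0 = correct reading."""
    tp = sum(1 for t, d in zip(truth, detected) if t == 1 and d == 1)
    fp = sum(1 for t, d in zip(truth, detected) if t == 0 and d == 1)
    tn = sum(1 for t, d in zip(truth, detected) if t == 0 and d == 0)
    fn = sum(1 for t, d in zip(truth, detected) if t == 1 and d == 0)
    return tp, fp, tn, fn


def fa_rate(fp: int, tn: int) -> float:
    """Equation (4): FA = FP / (TN + FP)."""
    return fp / (tn + fp)


def md_rate(fn: int, tp: int) -> float:
    """Equation (5): MD = FN / (TP + FN)."""
    return fn / (tp + fn)


# Invented labels: 8 correctly read items and 4 miscues.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
detected = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(truth, detected)
print(fa_rate(fp, tn))  # 0.125
print(md_rate(fn, tp))  # 0.25
```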

READING MISCUE DETECTOR PERFORMANCE:
TEST RESULTS AND DISCUSSION

To commence with the first set of offline tests, which evaluate the performance of the phone-level RV-based RMD, the authors performed offline tests using the test set described in the previous section for various upper threshold values. For these tests, an initial fixed lower threshold value of -1000 (log likelihood score) was employed, based on the observation that this setting did not introduce any false alarms. Figure 6 shows the results of the test runs in terms of FA and MDerr rates for various upper threshold values. Figure 6 shows that false alarms vanished at an upper threshold value of around 550. Since the upper threshold generally imposes a less strict condition than the lower threshold, the authors decided to employ this value as a fixed upper threshold value for all the succeeding tests and system operations. Thus, for the rest of this article, the term "threshold" for the log likelihood score pertains to the lower threshold.

Figure 6. False alarm (FA) and miscue detection error (MDerr) rates as functions of the upper threshold values for the reference verification (RV)-based reading miscue detector (RMD).

In addition, note that the plots in Figure 6 contain fluctuations due to the finite amount of data in the offline test set. For the error rate curves in the succeeding figures, the authors smoothed out the fluctuations in order to predict the general behavior of the system as the size of the test set increases. That is, the FA and MDerr rate curves were modeled based on the widely used assumption in the literature that the probability distribution function (PDF) of the reading miscue characteristics in the test data follows a normal (Gaussian) distribution. Thus, the authors fit the error rate curves to an approximation of a Gaussian cumulative distribution function (CDF) in the least-squares sense.
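As an illustration of fitting an error rate curve to a Gaussian CDF in the least-squares sense, the sketch below uses a plain grid search over the mean and standard deviation. The grid search, the function names, and the synthetic data are assumptions chosen to keep the example dependency-free; the authors' actual fitting procedure and approximation of the CDF are not specified here. A rate that decreases with the threshold could be modeled in the same way using one minus the CDF.

```python
import math
from typing import Iterable, Sequence, Tuple


def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_gaussian_cdf(xs: Sequence[float],
                     rates: Sequence[float],
                     mus: Iterable[float],
                     sigmas: Iterable[float]) -> Tuple[float, float]:
    """Return the (mu, sigma) on the grid with the smallest squared error."""
    sigma_grid = list(sigmas)
    best = None
    for mu in mus:
        for sigma in sigma_grid:
            sse = sum((gaussian_cdf(x, mu, sigma) - r) ** 2
                      for x, r in zip(xs, rates))
            if best is None or sse < best[0]:
                best = (sse, mu, sigma)
    return best[1], best[2]


# Synthetic, noise-free "error rate" curve (not measured data), generated
# from a known mean of -700 and standard deviation of 100.
thresholds = list(range(-1000, -350, 50))
observed = [gaussian_cdf(t, -700, 100) for t in thresholds]

mu_hat, sigma_hat = fit_gaussian_cdf(
    thresholds, observed, range(-1000, -390, 10), range(20, 310, 10))
print(mu_hat, sigma_hat)  # -700 100
```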

After fixing the upper threshold value for the RMD, the authors performed another set of offline tests for various lower threshold values. Figure 7 shows the resulting performance of the phone-level RV-based RMD in terms of FA (dashed curve) and MDerr (solid curve) rates for various phone likelihood score threshold (i.e., lower threshold) values. The trends in Figure 7 show a generally increasing FA rate and decreasing MDerr rate as the threshold is increased. Since RMDs are never perfect, it has been customary for ART systems to be biased towards having lower FA rates at the expense of having higher MDerr rates. This is done in order to avoid frustrating the reader with too many unnecessary interventions (Mostow and Aist 1999). Typically, an FA rate equal to or higher than 10% for reading tutors is not a good performance compared to state-of-the-art systems, which usually have lower FA rates. Taking Figure 7 into consideration, for instance, the probable best case is to adjust the threshold such that the FA rate is approximately 5%, while the MDerr rate is approximately 22.5%. However, an MDerr rate of 22.5% may generally be seen as still an unsatisfactory performance by an RMD. Thus, the succeeding parts of this section present how much improvement in the RMD's performance can be achieved by incorporating a second-level process for reading miscue detection.

Figure 7. False alarm (FA) and miscue detection error (MDerr) rates of the phone-level reference verification (RV)-based reading miscue detector (RMD).
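The trade-off described above, in which the FA rate is kept low at the expense of a higher MDerr rate, can be expressed as a simple selection rule over a threshold sweep: among the operating points whose FA rate does not exceed a cap, take the one with the lowest MDerr rate. The function name, the 5% cap as a default, and every number in the sweep below are invented for illustration and are not read from Figure 7.

```python
from typing import List, Optional, Tuple

# (threshold, FA rate, MDerr rate)
OperatingPoint = Tuple[float, float, float]


def pick_threshold(points: List[OperatingPoint],
                   max_fa: float = 0.05) -> Optional[OperatingPoint]:
    """Lowest-MDerr operating point whose FA rate is at most max_fa."""
    allowed = [p for p in points if p[1] <= max_fa]
    if not allowed:
        return None
    return min(allowed, key=lambda p: p[2])


# Invented sweep: FA rises and MDerr falls as the threshold increases.
sweep = [
    (-900.0, 0.00, 0.48),
    (-800.0, 0.01, 0.39),
    (-700.0, 0.03, 0.31),
    (-600.0, 0.05, 0.24),
    (-500.0, 0.09, 0.17),
    (-400.0, 0.15, 0.11),
]

print(pick_threshold(sweep))               # (-600.0, 0.05, 0.24)
print(pick_threshold(sweep, max_fa=0.02))  # (-800.0, 0.01, 0.39)
```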

Investigation of the misdetection cases revealed that the baseline method is generally unable to detect the following miscues and disfluencies: (1) single-syllable or single-phone deletions, insertions, or substitutions that have durations of 150-250 milliseconds; (2) brief hesitation pauses shorter than 500 milliseconds; and (3) some restarts or self-corrections.

The second set of offline test results, shown in Figure 8, presents the performance
of the two-level RV-plus-WDA-based RMD in terms of FA (dashed) and MDerr (solid)
rates for various word duration deviation threshold values. Unlike those in
Figure 7, the trends in Figure 8 show a generally decreasing FA rate and increasing
MDerr rate as the word duration deviation threshold is increased. This is because
higher word duration deviation threshold values result in a less strict miscue
detector, while the opposite is true for higher likelihood score threshold values.
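
The threshold sweep behind curves of this kind can be sketched as follows. The definitions used here are assumptions for illustration: the FA rate is taken as the fraction of correctly read words that the detector flags, and the MDerr rate as the fraction of actual miscues that it fails to flag. The sample deviations and labels are hypothetical.

```python
def error_rates(is_miscue, flagged):
    """Return (FA rate, MDerr rate) for parallel lists of truth labels and detector flags."""
    correct_total = sum(1 for m in is_miscue if not m)
    miscue_total = sum(1 for m in is_miscue if m)
    false_alarms = sum(1 for m, f in zip(is_miscue, flagged) if not m and f)
    missed = sum(1 for m, f in zip(is_miscue, flagged) if m and not f)
    fa_rate = false_alarms / correct_total if correct_total else 0.0
    mderr_rate = missed / miscue_total if miscue_total else 0.0
    return fa_rate, mderr_rate

# Hypothetical word duration deviations and ground-truth miscue labels.
deviations = [0.10, 0.30, 0.60, 0.90, 1.36, 0.20, 0.80, 0.55]
is_miscue = [False, False, True, True, True, False, False, True]

sweep = []
for delta in (0.50, 0.75, 1.00):
    # A word is flagged when its deviation exceeds the threshold.
    flagged = [d > delta for d in deviations]
    sweep.append(error_rates(is_miscue, flagged))
```

In this toy data, raising the deviation threshold lowers the FA rate and raises the MDerr rate, which is the direction of the trends described for Figure 8.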

Figure 8. False alarm (FA) and miscue detection error (MDerr) rates for the
two-level reference verification and word duration analysis (RV-plus-WDA)-based
reading miscue detector (RMD).


As discussed in the previous section, the two-level RV-plus-WDA method combines
two different methods that use two different thresholds. The results shown in
Figure 8 were obtained by varying the word duration deviation threshold while the
likelihood score threshold remained fixed. The authors also tried different
combinations of the two thresholds. These experiments showed that the best results
(i.e., lowest overall FA and MDerr rates) were obtained by making the first detector
level (i.e., the phone-level RV method) less strict while allowing the second
detector level (i.e., the WDA method) to catch the cases missed by the first.
Specifically, the authors fixed the phone likelihood score threshold such that the
phone-level RV method alone has an FA rate of about 2% at an MDerr rate of roughly
30%.

The final threshold values used for the first and second RMD levels are,
respectively, P = -650 (log likelihood score) and Δ = 70% (word duration
deviation). This combination appears to give the lowest overall FA and MDerr rates
while keeping a good FA-to-MDerr rate ratio: approximately FA rate = 3% and MDerr
rate = 5%. These rates may also be deduced from Table 3, which summarizes the FA
and MDerr rates for various combinations of the two threshold values. Table 3 may
also serve as a guide for predicting how the RMD system will perform in case
another threshold combination is desired.
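
A minimal sketch of the two-level decision described above is given below, using the reported thresholds P = -650 and Δ = 70%. The function name `two_level_rmd`, the per-word list of phone scores, and the rule that a single low phone score is enough to flag a word are assumptions for illustration, not the authors' implementation.

```python
P_THRESHOLD = -650.0    # phone-level log likelihood score threshold (from the text)
DELTA_THRESHOLD = 0.70  # word duration deviation threshold, 70% (from the text)

def two_level_rmd(phone_scores, duration_deviation,
                  p=P_THRESHOLD, delta=DELTA_THRESHOLD):
    """Return which level flags the word as a miscue: "RV", "WDA", or None.

    phone_scores: log likelihood scores of the word's force-aligned phones.
    duration_deviation: relative word duration deviation (1.36 means 136%).
    """
    # Level 1: reference verification flags any phone scoring below P.
    if any(score < p for score in phone_scores):
        return "RV"
    # Level 2: word duration analysis catches what level 1 missed.
    if duration_deviation > delta:
        return "WDA"
    return None

# Hypothetical words: scores and deviations are made up for the example.
caught_by_wda = two_level_rmd([-400.0, -500.0, -620.0], 1.36)
caught_by_rv = two_level_rmd([-400.0, -700.0], 0.10)
accepted = two_level_rmd([-400.0, -450.0], 0.20)
```

Keeping the first level lenient and letting the second level run only on words that pass it mirrors the ordering described in the text.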

Table 3. False alarm (FA) and miscue detection error (MDerr) rates
for various combinations of two thresholds, P and Δ

                  Δ = 50%              Δ = 75%              Δ = 100%
              FA rate  MDerr rate  FA rate  MDerr rate  FA rate  MDerr rate
P = -1000       20%       4%         2%       8%         0%       12%
P = -650        20%       2%         2%       6%         0%       10%
P = -570        20%       0%         6%       2%         4%        8%
P = -500        26%       0%        12%       0%        10%        6%

Note. P = Phone-Level Log Likelihood Score Threshold; Δ = Word Duration
Deviation Threshold

In order to obtain a better comparison (independent of threshold) of the system
performances, the receiver operating characteristic (ROC) graphs shown in Figure 9
were generated by plotting the FA rates versus the MDerr rates. Figure 9 shows
the ROC graphs of the phone-level RV-based RMD (dashed curve) and the two-level
RV-plus-WDA-based RMD (solid curve). Compared with the phone-level RV method
alone, the combined RV-plus-WDA method significantly improved the RMD's
performance. Specifically, Figure 9 shows that, at an FA rate of about 3%, the RV
method alone obtained an MDerr rate of about 25%, while the RV-plus-WDA method
obtained an MDerr rate of about 5%. Thus, at this FA rate, the combined
RV-plus-WDA method provided an absolute MDerr rate improvement of 20 percentage
points over the RV method alone.

Figure 9. Receiver operating characteristic (ROC) graphs or error rate trade-off
curves for the phone-level reference verification (RV) method (dashed), and for the
two-level reference verification and word duration analysis (RV-plus-WDA)
method (solid) for the reading miscue detector (RMD) design.

Table 4 summarizes and compares the performances of the phone-level RV and the
RV-plus-WDA methods at selected operating points. Operating points 1 and 2 are
where both methods obtained the same FA rates, namely 3% and 10%, respectively.
Operating point 3 is the equal error rate (EER) point, where the FA and MDerr rates
of a given method are equal.

Table 4. False alarm (FA) and miscue detection error (MDerr) rates
for the phone-level reference verification (RV),
and for the two-level reference verification
and word duration analysis (RV-plus-WDA) methods
at selected operating points

RMD Method      Error Rate  Operating Point 1  Operating Point 2  Operating Point 3
Phone-level RV  FA          3%                 10%                18%
                MDerr       25%                20.5%              18%
RV-plus-WDA     FA          3%                 10%                4.25%
                MDerr       5%                 3%                 4.25%
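
The operating points in Table 4 can be read off an ROC curve programmatically. The sketch below locates the equal error rate from a list of (FA, MDerr) points and computes the absolute MDerr improvement at the shared 3% FA rate. The points are the Table 4 values in percent; the linear interpolation between tabulated points and the function name `equal_error_rate` are assumptions for illustration.

```python
def equal_error_rate(points):
    """Return the rate at which FA equals MDerr, or None if the curve never crosses.

    points: (fa, mderr) pairs sorted by increasing FA rate.
    """
    if points and points[0][0] == points[0][1]:
        return points[0][0]
    for (fa0, md0), (fa1, md1) in zip(points, points[1:]):
        d0, d1 = fa0 - md0, fa1 - md1
        if d0 < 0 <= d1:
            # Linearly interpolate to the point where FA - MDerr reaches zero.
            t = -d0 / (d1 - d0)
            return fa0 + t * (fa1 - fa0)
    return None

# Table 4 operating points in percent, ordered by FA rate.
rv_only = [(3.0, 25.0), (10.0, 20.5), (18.0, 18.0)]
rv_plus_wda = [(3.0, 5.0), (4.25, 4.25), (10.0, 3.0)]

eer_rv = equal_error_rate(rv_only)
eer_rv_plus_wda = equal_error_rate(rv_plus_wda)

# Absolute MDerr improvement at the common 3% FA rate (percentage points).
improvement_at_fa3 = rv_only[0][1] - rv_plus_wda[0][1]
```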


The previous discussions have made clear that the RV-plus-WDA method performs
better than the phone-level RV method. About 70% of the reading miscues that were
missed by the phone-level RV method were successfully detected by the RV-plus-WDA
method. One reason that could explain this result is the inability of the
phone-level RV method to detect deviations from the expected sounds when the
deviations last only for a short period of time. In particular, the phone-level RV
method was observed to have difficulties in detecting syllable insertions,
deletions, or substitutions, and word restarts or immediate self-corrections. Upon
further investigation, the authors found that about 57% of the misdetection cases
are syllable or phone insertions, substitutions, and deletions. Moreover, the
durations of the undetected insertions and substitutions mostly range from about
150 to 250 milliseconds. Brief hesitation pauses, ranging from about 250 to 400
milliseconds, constitute about 28% of the misdetection cases. Self-corrections,
restarts, and repetitions together make up about 14% of the misdetection cases.
With these reading miscues, the phone alignment method aligns a reference phone to
within a beam of a few phones in sequence found in the input speech. In the process
of seeking alignment, the system tends to either skip some inserted phones or
ignore missing phones in the input speech.

Figure 10 shows an example of a reading miscue that was undetected by the
phone-level RV method. As the first plot in Figure 10 shows, the reference phones
that were force-aligned with the input speech are for the Filipino phrase
/Tinatawag din itong/. The transcription of the actual input speech shown in the
first plot, however, is /Tina(ta)wag din (itoy-) itong/, which contains a syllable
deletion (i.e., syllable /ta/ in /Tinatawag/) and a self-correction for a miscue
(i.e., /itoy/). Examination of the log likelihood scores, shown at the bottom of the
first plot, reveals that the miscues were undetected (i.e., the likelihood scores
were all above the threshold). In particular, the reference phones for the word
/itong/ were force-aligned within the span of the uttered phrase /(itoy-) itong/
without generating a score below the threshold. The second plot of Figure 10 shows
the result of the word-level forced-alignment process for implementing the WDA,
wherein the reference text consists of the three words /TINATAWAG/, /DIN/, and
/ITONG/. Note that the second plot shows the same input speech as the first plot.
The word boundaries shown in the second plot are system-generated and are used by
the system to calculate the word durations. The reference word /ITONG/ was
force-aligned within the span of the uttered phrase /(itoy-) itong/, which contains
a miscue. Consequently, the detected duration of the word /ITONG/ became
significantly higher than normal. The normal range for word duration is based on
measurements made from the good pronunciations in the CFSC. The third plot in
Figure 10 shows the result of the word-level alignment process for an input speech
corresponding to a good pronunciation. A significant difference may be observed when
the detected relative word duration for the word /ITONG/ in the second plot is
compared to that in the third plot. In fact, the system was able to detect that the
word /ITONG/ from the input speech shown in the second plot has a 136% deviation
from the normal. Since a 136% deviation is above the threshold set for the system,
the reading miscue is detected by the WDA method.
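
The duration check in this example can be sketched as below. The durations are hypothetical values chosen to reproduce a 136% deviation, and the use of the mean of good pronunciations as the "normal" duration and of an absolute relative deviation are assumptions; the text only states that the normal range comes from good pronunciations in the CFSC.

```python
from statistics import mean

def duration_deviation(observed, good_durations):
    """Relative deviation of an observed word duration from the normal duration."""
    normal = mean(good_durations)
    return abs(observed - normal) / normal

# Hypothetical durations in seconds for the word /ITONG/.
good_itong = [0.24, 0.25, 0.26]  # durations from good pronunciations (assumed)
observed_itong = 0.59            # duration spanning /(itoy-) itong/ (assumed)

deviation = duration_deviation(observed_itong, good_itong)
flagged_by_wda = deviation > 0.70  # 70% word duration deviation threshold
```

With these assumed values the deviation comes out at about 1.36, that is, 136%, which exceeds the 70% threshold and so flags the word.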

The effectiveness of the WDA method in detecting reading miscues in Filipino

may further be explained by two main reasons: (1) the suitability of the WDA

method to the nature of reading miscues in children's Filipino read speech;  and,

(2) the suitability of the WDA method to the nature of the Filipino language.

Figure 10. A specific example of a case wherein a reading miscue, undetected by
the phone-level reference verification (RV) method, was detected by the word
duration analysis (WDA) method.

Table 2 in the previous section listed the different types of reading miscues and
disfluencies found in children's Filipino read speech taken from the CFSC: partial
word; hesitation pause; insertion; repetition; substitution; deletion; and elongation.

An examination of the nature of these reading miscues would show that all of them,

except substitution, are time-dependent. That is, these reading miscues would affect

the effective word durations as measured by the system. Moreover, the authors

have observed from the test set that hesitation pauses, partial words (mostly

followed by a short pause or a restart), and repetitions are the miscue types that

usually cause the largest word duration deviations. Since the aforementioned miscue

types constitute the majority of the miscues found in the test set, the WDA method

therefore generally becomes an effective way of detecting the reading miscues.

Since Filipino is a syllable-timed language, the insertion or deletion of syllables or

words, as well as pauses, definitely affects the effective word durations, as measured

by the RMD system. The WDA method for the RMD may therefore be shown to be

especially effective for syllable-timed languages, such as Filipino.

DESIGN OF THE AUTOMATED READING TUTOR FOR FILIPINO

The ART for Filipino presented in this paper has the following major components:

(1) the RMD; (2) the oral/visual feedback and instruction set; and, (3) the graphical

user interface.

The RMD for Filipino, which is the core of the ART, employs a two-level RV-plus-

WDA architecture as presented in the previous section. The performance measures

for the RMD are the FA rate and the MDerr rate. The best result of the offline

performance evaluation tests shows that the RMD's operating point may be

calibrated, such that the FA rate is approximately 3% while the MDerr rate is

approximately 5%. The FA and MDerr rates obtained for the RMD for Filipino indicate
that it is on par with the state-of-the-art RMDs (Duchateau et al. 2006; Liu et al.
2008; Black et al. 2011) reported in the literature. For comparison,

Table 5 summarizes the performance of the two-level RV-plus-WDA-based RMD

presented in this paper, together with those from other systems reported in the

literature.

The oral/visual feedback and instruction set allows active interaction between the

machine and the child (user or learner). The ORF instruction set used in this study
consists of pre-recorded speech of a human tutor, who is a reading expert and an
education sector research collaborator from the College of Education at the
University of the Philippines Diliman. Any desired sentence within the instruction
set can be played back automatically by the ART system whenever there is a need
to model the correct pronunciations of the words in the text passages.


As briefly discussed in the previous section, nine Filipino text passages were used

for both the speech database collection and the ART system development. The six

short stories were provided through a non-disclosure agreement by Adarna

Publishing House, a leading publisher of Filipino short stories for children. The

three expository texts were adapted from various grade school textbooks used in

the Philippines. Each of the text passages adapted was provided with a corresponding

reading age recommendation by its publisher. Moreover, the text passages have

been selected through the suggestions of research collaborators. The nine text

passages all together have a total of 2,169 words (about 650 of which are unique)

and 290 sentences.

The ART also provides audible comments and visual animations as positive feedback

or "praise" (Mostow and Aist 1999) in response to a perceived good reading

performance of the learner. According to suggestions in the literature, giving positive

feedback is a powerful motivation and it demonstrates that the ART system is a

perceptive and responsive audience for the learner's efforts. In the ART presented

in this study, the authors employed the audible comments: "Mahusay!" (Excellent!),
"Magaling!" (Good!), and "Kahanga-hanga!" (Admirable!). The positive comments are
played back in alternation whenever no reading miscues are detected in the input
speech. Aside from the audible comments, the ART also displays an animated icon
whenever these comments are being played back.
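
The praise behavior described above can be sketched in a few lines. The class name `PraiseFeedback`, the fixed cycling order, and the use of a miscue count as input are assumptions for illustration; the text only states that the three comments alternate when no miscue is detected.

```python
from itertools import cycle

PRAISE_COMMENTS = ["Mahusay!", "Magaling!", "Kahanga-hanga!"]

class PraiseFeedback:
    """Hand out praise comments in rotation when a sentence is read without miscues."""

    def __init__(self, comments=PRAISE_COMMENTS):
        self._comments = cycle(comments)

    def respond(self, miscues_detected):
        """Return the next praise comment, or None if any miscue was detected."""
        if miscues_detected > 0:
            return None
        return next(self._comments)

feedback = PraiseFeedback()
# Hypothetical miscue counts for five consecutive sentences.
responses = [feedback.respond(n) for n in (0, 0, 2, 0, 0)]
```

In a full tutor, a returned comment would trigger both the audio playback and the animated icon; that wiring is outside this sketch.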

Figure 11 shows the graphical user interface of the ART in Filipino. The graphical

user interface of the ART allows the reader to select a story and navigate through

the sentences within the selected story. An important design consideration for the

interface is its simplicity because the intended user may be as young as a Grade 1

student (typically aged 5 to 7).

Table 5. Summary of specifications
of various state-of-the-art reading miscue detectors (RMD)

State-of-the-art RMDs    False Alarm (FA) rate  Miscue Detection Error (MDerr) rate  Language
Black et al. (2011)      8.1%                   11.5%                                English
Liu et al. (2008)        5.82%                  9.07%                                Chinese Mandarin
Duchateau et al. (2006)  2.1% - 8.4%            23.1% - 16.9%                        Dutch
Two-level RV-plus-WDA    3%                     5%                                   Filipino


AUTOMATED READING TUTOR (ART) INTERVENTION PROGRAM:
A PILOT STUDY

Design of the ART Intervention Program

The ART intervention program or the pilot study, a pioneering experiment in

computer-assisted ORF instruction in Filipino, is basically a reading tutorial program

that makes use of the ART presented in the previous section. The within-subjects

experiment involved a group of six grade-2 students from the University of the

Philippines Integrated School (UPIS), and consists of two separate one-month periods.

During the first period of the experiment, the experiment group depended only on

regular classroom instruction for improving their reading skills. During the second

period, the ART was used by the experiment group, in addition to the regular

classroom instruction. The ART intervention program allowed the group to use the

ART for about 45 minutes per day, three days per week. In order to evaluate the

effectiveness of the ART in providing reading skill improvement to the students,

three sets of oral reading fluency assessments (ORFA) were administered to the

experiment group. Figure 12 graphically summarizes the pilot study experiment

and its timeline. As shown in Figure 12, the first ORFA was given prior to

the start of period 1, the second ORFA at the end of period 1, and the third ORFA at

the end of period 2.

Figure 11. The graphical user interface of the automated reading tutor (ART) for
Filipino.

All three ORF assessments were administered by an expert, a Filipino teacher who
also has research experience in automated Filipino essay evaluation. As suggested
by research collaborators, the ORFA employed both familiar and unfamiliar text
passages in order to obtain a more complete observation of the effects of the use
of the ART on the students' learning for both text structures. The ORFA also
included a comprehension exam that has objective-type and essay-type components.
In this study, the authors employed the following three ORFA measures commonly used
in the literature: (1) number of words correct per minute (WCPM); (2) 16-point
multi-dimensional ORF score; and, (3) reading comprehension scores. Group gain score
analysis, also known as difference score analysis (Smolkowski 2013), was selected
because it provides a simple and precise, yet unbiased and reliable, way of
interpreting the true change (Ragosa 1983). For instance, the WCPM gain scores
tell educators clearly and meaningfully whether the experiment group improved,
stayed the same, or deteriorated in their reading skills, and by precisely how much
(Smolkowski 2013).
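
For readers unfamiliar with the measure, WCPM and a gain score can be computed as in the sketch below. The function names and the sample counts are hypothetical; the formula is the usual one of words read correctly divided by reading time in minutes.

```python
def wcpm(words_attempted, errors, seconds):
    """Words correct per minute for one timed oral reading."""
    if seconds <= 0:
        raise ValueError("reading time must be positive")
    return (words_attempted - errors) * 60.0 / seconds

def gain_score(pre, post):
    """Difference (gain) score between two assessments."""
    return post - pre

# Hypothetical readings: 80 words with 5 errors, then 100 words with 10 errors,
# each in one minute.
first_assessment = wcpm(80, 5, 60)
second_assessment = wcpm(100, 10, 60)
gain = gain_score(first_assessment, second_assessment)
```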

Results of the ART Intervention Program
and the Oral Reading Fluency (ORF) Assessments

The primary result of the ART intervention program presented in this study is
expressed in terms of WCPM, the most widely used measure for ORFA (Rasinski
2004). Figure 13 summarizes the average or group WCPM improvements for all
the students in the experiment group. An important pattern in Figure 13 is that the
group WCPM slope for the second period is abruptly steeper than the slope for the
first period. Thus, the improvement in group WCPM was markedly higher in period 2
than in period 1. This suggests that, on the average, the reading tutor had a
positive effect of accelerating the improvement of the students' ORF.

Figure 12. Pilot study experiment and timeline.


Figure 13. Group average number of words correct per minute (WCPM) for the three
oral reading fluency assessments (ORFA).

In Table 6, the much higher overall average WCPM gain for period 2 compared to that
for period 1 shows that the experiment group improved their ORF considerably after
using the ART. The normal growth of the group in reading fluency over one month
(due to the usual classroom instruction) is shown in Table 6 to be only 3.53 words
per minute. After using the ART for a month, however, the improvement of the group
jumped to 16.46 words per minute, which is 12.93 words per minute higher than the
normal.

To get a better idea of the magnitude of improvement in the reading fluency of
the group due to the use of the ART, we may calculate the WCPM gain ratio (i.e.,
the group gain for period 2 divided by the group gain for period 1). The overall
WCPM gain ratio was calculated to be 4.66. This means that the fluency improvement
rate in period 2 (i.e., when the ART was used) was about 4.66 times, or 466% of,
the normal learning rate. In other words, after the ART was used for a month, the
reading fluency of the group was advanced by an amount of time roughly equivalent
to three and a half months.

Table 6. Words correct per minute (WCPM) group gains
for the two experiment periods

                           Group Gain (WCPM)
                           Familiar Text  Unfamiliar Text  Overall Average
Period 1                   2.01           5.05             3.53
(Without using the ART)                                    (SD=3.88)
Period 2                   18.75          14.17            16.46
(Using the ART)                                            (SD=8.56)

Note. SD = Standard Deviation
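
The arithmetic behind these figures can be checked directly from the Table 6 values. In the sketch below, the overall gain is taken as the mean of the familiar-text and unfamiliar-text gains, which is an assumption that happens to reproduce the reported overall values; the variable names are illustrative.

```python
# Group WCPM gains from Table 6.
gains = {
    "period_1": {"familiar": 2.01, "unfamiliar": 5.05},    # without the ART
    "period_2": {"familiar": 18.75, "unfamiliar": 14.17},  # using the ART
}

def overall_gain(period):
    """Mean of the familiar-text and unfamiliar-text gains for one period."""
    values = list(gains[period].values())
    return sum(values) / len(values)

normal_gain = overall_gain("period_1")  # about 3.53 WCPM
art_gain = overall_gain("period_2")     # about 16.46 WCPM

gain_ratio = art_gain / normal_gain     # about 4.66
extra_gain = art_gain - normal_gain     # about 12.93 WCPM
months_ahead = gain_ratio - 1.0         # about 3.66 months of normal growth
```

The last line makes the "three and a half months" reading explicit: one month with the ART yields about 4.66 months of normal growth, which is roughly 3.66 months more than a month of regular instruction alone.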

In order to show that the larger WCPM group gain in period 2 is indeed attributable
to the use of the ART, the authors calculated the correlations between the score
gains in period 1, the score gains in period 2, and the score gains in the whole
experiment period. The score gains in period 2 and the score gains in the whole
two-month experiment period (i.e., periods 1 and 2 combined) show a strong and
significant relationship (i.e., a correlation coefficient of 91.17% at p < 0.00005).
Therefore, the observed overall improvement of the students in their reading fluency
can be attributed to the use of the ART.
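
A correlation of this kind can be computed as in the sketch below. The per-student gains are hypothetical, since the individual scores are not reported here, and the whole-period gain is assumed to be the sum of the two period gains. The Pearson coefficient is used as the correlation measure, which is also an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical WCPM gains for six students.
gain_p1 = [3, 4, 2, 5, 3, 4]
gain_p2 = [10, 12, 15, 18, 20, 24]
gain_total = [a + b for a, b in zip(gain_p1, gain_p2)]

r_p2_total = pearson_r(gain_p2, gain_total)
```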

To further illustrate that it is less likely that the reading fluency improvement of

the students in the experiment group is due to their normal learning rate trends or

to random chance, the authors also referred to the ORF norms used for English

(Hasbrouck 2006). In doing this, the authors emphasize that their purpose is not to

directly compare the WCPM levels observed from their experiment to the normal

WCPM levels in English. The normal WCPM levels for English are expected to be

different from normal WCPM levels in Filipino due to the differences in features

and orthography between the two languages. Thus, the authors only referred to the

ORF norms in English for the purpose of generally comparing their experimentally

observed reading fluency improvement trend to the expected normal reading

fluency trend in English. The aforementioned WCPM trend comparison for Grade 2

level is given in Figure 14. A simple graphical analysis of the WCPM trends shown
in Figure 14 suggests that the reading fluency improvement observed from
the experiment group during period 2 is high relative to the English ORF norm, and
considerably higher than the reading fluency improvement observed during period 1.
Moreover, the expected English ORF improvement throughout the entire school
year is fairly linear. By contrast, the reading fluency improvement trend observed
from the experiment group for the entire two-month experiment period deviated
strongly from the linear trend. Thus, simple trend analysis shows that the
ORF improvement of the students in the experiment group during period 2 is
unusually high, and this may be attributed to the treatment made during the period
(i.e., the use of the ART).

Figure 14. Comparison of group average number of words correct per minute (WCPM)
trends between US norm (English) and experiment observations (Filipino).

The second set of ORFA results is based on the measure

known as the ORF-16 score obtained through the use of a 16-point multi-dimensional

ORF rubric proposed by Rasinski (2004). The four dimensions considered in the

ORF-16 rubric are: (1) expression and volume; (2) phrasing; (3) smoothness; and,

(4) pace.
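
A score on this rubric can be tallied as sketched below, assuming each of the four dimensions is rated from 1 to 4, which is what a 16-point total over four dimensions implies. The identifiers and the sample ratings are hypothetical.

```python
DIMENSIONS = ("expression_and_volume", "phrasing", "smoothness", "pace")

def orf16_score(ratings):
    """Sum the four dimension ratings (each assumed to be 1-4) into a 4-16 score."""
    if set(ratings) != set(DIMENSIONS):
        raise ValueError("ratings must cover exactly the four ORF-16 dimensions")
    for name, value in ratings.items():
        if value not in (1, 2, 3, 4):
            raise ValueError(f"rating for {name} must be an integer from 1 to 4")
    return sum(ratings.values())

# Hypothetical ratings for one student.
sample = {"expression_and_volume": 3, "phrasing": 2, "smoothness": 3, "pace": 2}
sample_score = orf16_score(sample)
```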

The trend in the ORF-16 group scores shown in Figure 15 also seems to agree with
that of the WCPM group scores presented earlier in this section. Figure 15 shows a
sudden increase in the reading fluency of the students in the experiment group
during period 2, the period when the ART was used by the group. Thus, as was shown
earlier in this section for WCPM, the sudden improvement in the students' reading
fluency may be attributed to the use of the ART. The overall ORF-16 group gain
ratio, which is calculated to be 6.0, means that the observed reading fluency
improvement of the students in the experiment group during the time that they were
using the ART is about six times that during the time that they were not using the
ART.

Figure 15. Sixteen-point multi-dimensional oral reading fluency (ORF-16) group
scores for the three oral reading fluency assessments (ORFAs).


In order to see how the reading fluency development has affected the
comprehension of the students in the experiment group, the third set of ORFA
results was based on the comprehension exam scores. The plots in Figure 16 show
that, on the average, the students' comprehension also abruptly improved during
period 2, the time when they were using the ART. The overall comprehension exam
score gain ratio, computed to be 4.43, indicates that the students' comprehension
score gain while using the ART was more than four times their gain without it.

Figure 16. Comprehension exam group scores for the three oral reading fluency
assessments (ORFAs).

CONCLUSION AND FUTURE DIRECTIONS

In this paper, the authors presented a two-level RMD for the design of an ART for
Filipino that uses phone-level RV and WDA methods. The results of offline tests
showed that the RMD's performance (i.e., false alarm rate ≈ 3% and misdetection
rate ≈ 5%) is on par with those of state-of-the-art RMDs (Duchateau et al. 2006;
Liu et al. 2008; Black et al. 2011) reported in the literature. The advantages of
the RV-plus-WDA RMD are design simplicity (i.e., it did not require building a
complex language model or using an adaptation technique) and low training cost
(i.e., it only required 1.5 hours of training data).

The authors have discussed the design of the ART prototype that integrates the
two-level RV-plus-WDA RMD, the user interface, and the feedback and instruction
sets. The authors suggest the following design considerations: (1) user interface
simplicity on account of very young users; (2) minimal interventions to avoid
children's frustration; and, (3) the use of positive feedback or "praise" that
most children seem to appreciate. Moreover, it is suggested that all model
pronunciations in the instruction set contain the "correct" or acceptable
prosodic features because students have been observed to adopt these features.

The authors have presented in this paper an experimental procedure for evaluating
the effectiveness of an ART for Filipino that involves an ART intervention program
and a set of ORFAs for a small group of students. The results of the ART
intervention experiment showed the ART's effectiveness in improving the students'
ORF in terms of WCPM, ORF-16 scores, and comprehension scores. Specifically, the
results of the ORFAs showed that the students' average WCPM gain while using the
ART was 4.66 times their gain during the period when they were not using the ART.
Correlation and trend analyses indicate that the improvement of the students in
their reading fluency was attributable to the use of the ART. Similarly, the
students' ORF-16 score gain while using the ART was 6.0 times their gain during the
period when they were not using the system. The results of the experiment have also
shown that the students' average gain in reading comprehension scores while using
the ART was 4.4 times their gain when they were not using the ART.

With all the positive results that the authors obtained from the study, the ART in

Filipino seems to be a promising and important Filipino speech technology to

further develop and implement for the primary education system in the Philippines.

Future directions for this study include the development of an automated oral

reading assessment system for children's Filipino speech, adaptation of the RMD

system for nonnative speakers of Filipino (or those who speak Filipino as their

second or third language), and adaptation of the system design methods for other

Philippine languages and other related applications.

ACKNOWLEDGMENTS

The authors would like to acknowledge the UP Digital Signal Processing Laboratory,
UP Integrated School, UP College of Education, UP Department of Linguistics, and
Adarna House for lending their support to the study. This research was funded by
the CHEDSEGS grant from the Commission on Higher Education of the Philippines,
in collaboration with the Office of the Vice-Chancellor for Research and
Development of the University of the Philippines Diliman.


REFERENCES

Batliner A, Blomberg M, D'Arcy S, Elenius D, Giuliani D, Gerosa M, Hacker C, Russell M,
Steidl S, Wong M. 2005. The PF_STAR children's speech corpus. In: Proceedings of
INTERSPEECH; Lisbon, Portugal. p. 2761-2764.

Black M, Tepperman J, Narayanan S. 2011. Automatic prediction of children's reading
ability for high-level literacy assessment. IEEE Trans. on Audio, Speech and Language
Processing. 19(4):1015-1028.

Bolaños D, Ward W, Cole R. 2009. A reference verification framework and its application
to a children's speech reading tracker. In: Proceedings of 2nd Workshop on Child,
Computer and Interaction; NY, USA: ACM. p. 22.

Cleuren L, Duchateau J, Ghesquiere P, Van hamme H. 2008. Children's oral reading
corpus (CHOREC): Description and assessment of annotator agreement. In: Proceedings
of 6th International Conference on Language Resources and Evaluation; Morocco.

Duchateau J, Wigham M, Demuynck K, Van hamme H. 2006. A flexible recogniser
architecture in a reading tutor for children. In: Proceedings of ITRW on Speech
Recognition and Intrinsic Variation; Toulouse, France.

Duchateau J, Kong Y, Cleuren L, Latacz L, Roelens J, Samir A, Demuynck K, Ghesquiere P,
Verhelst W, Van hamme H. 2009. Developing a reading tutor: Design and evaluation of
dedicated speech recognition and synthesis modules. Speech Communication.
51(10):985-994.

Duong M, Mostow J, Sitaram S. 2011. Two methods for assessing oral reading prosody.
ACM Trans. on Speech and Language Processing. 7(11):14.

Gao J, Li A, Xiong Z. 2012. Mandarin multimedia child speech corpus: Cass_Child. In:
Proceedings of 2012 International Conference on Speech Database and Assessments
(Oriental COCOSDA); IEEE Xplore.

Gerosa M, Giuliani D, Narayanan S, Potamianos A. 2009. A review of ASR technologies
for children's speech. In: Proceedings of 2nd Workshop on Child, Computer and
Interaction; Cambridge, MA, USA: ACM. p. 7.

Guevara RC, Garcia I, Santos T, Nolasco R. 2010. A computational approach to Filipino
speech rhythm. In: Proceedings of 1st Philippine Conference-Workshop on Mother
Tongue-Based Multilingual Education; Cagayan de Oro City, Philippines.

Hagen A, Pellom B, Cole R. 2003. Children's speech recognition with application to
interactive books and tutors. In: Proceedings of IEEE Workshop on Automatic Speech
Recognition and Understanding; St. Thomas, Virgin Islands: IEEE Xplore. p. 186-191.

Hasbrouck J. 2006. Oral reading fluency norms: A valuable assessment tool for reading
teachers. The Reading Teacher. 59(7):636-644.


Kazemzadeh A, You H, Iseli M, Jones B, Cui X, Heritage M, Price P, Anderson E, Narayanan
S, Alwan A. 2005. TBALL data collection: The making of a young children's speech
corpus. In: Proceedings of Interspeech; Lisbon, Portugal. p. 1581-1584.

Liu C, Pan F, Ge F, Dong B, Zhao Q, Yan Y. 2008. Application of LVCSR to the detection of
Chinese Mandarin reading miscues. In: Proceedings of 4th International Conference
on Natural Computation; Jinan, China: IEEE Xplore. p. 447-451.

Mills-Tettey G, Mostow J, Dias MB, Sweet T, Belousov S, Dias MF. 2009. Improving child
literacy in Africa: Experiments with an Automated Reading Tutor. In: Proceedings of
3rd International Conference on Information and Communication Technologies and
Development; Doha, Qatar: IEEE Xplore. p. 129-138.

Mostow J, Roth S, Hauptmann A , Kane M. 1994. A prototype reading coach that listens.
In: Proceedings of 12th National Conference on Ar tif icial Intelligence; Seattle, WA:
ACM. p. 785-792.

Mostow J, Aist G. 1999. Giving help and praise in a reading tutor with imperfect
listening - Because automated speech recognition means never being able to say
you're certain. The Computer Assisted Language Instruction Consortium (CALICO)
Journal. 16(3):407-424.

Mostow J, Nelson-Taylor J, Beck J. 2013. Computer guided oral reading versus
independent practice: Comparison of sustained silent reading to an automated reading
tutor that listens. Journal of Educational Computing Research. 49(2):249-276.

[NRP] National Reading Panel (US). 2000. Teaching children to read. Panel Report
issued for the National Institute of Child Health and Human Development, U.S.
Department of Health and Human Services. NIH Pub. No. 00-4769.

Pascual R, Guevara RC. 2012a. Developing a children's Filipino speech corpus for
application in automatic detection of reading miscues and disfluencies. In: Proceedings
of IEEE TENCON 2012: IEEE Asia Pacific Region International Conference; 2012; Cebu
City, Philippines: IEEE Xplore.

Pascual R, Guevara RC. 2012b. Developing an automated reading tutor in Filipino for
primary students. In: Proceedings of 2nd Philippine Conference-Workshop on Mother
Tongue-Based Multilingual Education; Iloilo City, Philippines.

Ragosa D. 1983. Demonstrating the reliability of the difference score in the
measurement of change. Journal of Educational Measurement. 20(4):335-343.

Rahman F, Mohamed N, Mustafa M, Salim S. 2014. Automatic speech recognition system
for Malay speaking children. In: Proceedings of the 2014 Third ICT-ISPC; Nakhon
Pathom, Thailand: IEEE Xplore. p. 79-82.

Rasinski T. 2004. Assessing reading fluency. Hawaii: Pacific Resources for Education
and Learning. p. 1-25.


Rayner M, Tsourakis N, Baur C, Bouillon P, Gerlach J. 2014. CALL-SLT: A spoken CALL
system based on grammar and speech recognition. Linguistic Issues in Language
Technology. 10(2):1-23.

Reeder K, Shapiro J, Wakefield J, D'Silva R. 2015. Speech recognition software contributes
to reading development for young learners of English. International Journal of
Computer-Assisted Language Learning and Teaching. 5(3):60-74.

Russell M. 2010. Speech technologies for children. New Orleans: IEEE Signal Processing
Society - STLC Newsletter.

Smolkowski K. [Internet]. 2013. Gain Score Analysis. Oregon: Oregon Research Institute;
[cited 2016 Dec]. Available from http://homes.ori.org/keiths/Tips/Stats_GainScores.html.

Tsau S. 2012. The effects of an automatic speech analysis system on enhancing EFL
learners’ oral reading fluency. Procedia-Social and Behavioral Sciences. 64(2012):141-150.

Wasik B, Slavin R. 1993. Preventing early reading failure with one-to-one tutoring: A
review of five programs. Reading Research Quarterly. 28(2):178-200.

Young S, Evermann G, Gales M, Woodland P. 2006. The HTK Book [Internet]. UK:
Cambridge University Engineering Department; [cited 2010 Nov 19]. Available from
http://htk.eng.cam.ac.uk.

_____________

Dr. Ronald M. Pascual <ronaldmpascual@gmail.com> is an Associate Professor
and Assistant Director of the Electronics and Electrical Engineering Department of
FEU Institute of Technology, Manila. He received his Ph.D. in Electrical and
Electronics Engineering from the University of the Philippines Diliman as a CHED
scholar, his M.S. in Electronics and Communications Engineering from De La Salle
University, Manila as a DOST scholar, and his B.S. in Electronics and Communications
Engineering from Pamantasan ng Lungsod ng Maynila. His research interests include
speech signal processing and speech technology development.

Dr. Rowena Cristina L. Guevara is the Undersecretary for Research and Development
of the Department of Science and Technology (DOST) and a Professor of the Digital
Signal Processing Laboratory of the University of the Philippines Diliman. She is
a former Executive Director of the DOST-Philippine Council for Industry, Energy, and
Emerging Technology Research and Development, and a former Dean of the
College of Engineering of the University of the Philippines Diliman. She received
her Ph.D. in Electrical Engineering from the University of Michigan, Ann Arbor as a
DOST scholar, and her M.S. and B.S. in Electrical Engineering from the University of
the Philippines Diliman. Her research interests include speech signal processing,
and audio and communications signal processing.