

Creating and using digital audio files under the 
Windows operating environment 
 

Larry Nelson 
Curtin University of Technology 

 
Macintosh and Amiga users have had good sound playback facilities for 
years; recent models of the Mac have recording capabilities as well. While 
sound boards for MS DOS PCs have been around for quite some time, 
audio formatting standards have been nonexistent, and memory restrictions 
have placed severe limits on the amount of sound which can be recorded. 
The advent of Windows has changed things. This paper suggests that PCs 
with Windows are capable of serious audio work, and mentions factors to 
be aware of before setting forth on sound ventures with a PC. 

 
This paper discusses experiences gained in the process of creating a series 
of language learning software modules for the Department of 
Employment, Education and Training. 
 
The DEET-funded 'Talk Project' at Curtin University of Technology was 
instigated in order to determine the feasibility of developing cost-effective 
audio-based software for beginning students of Burmese, Indonesian, and 
Spanish[1]. The software was to be applied on personal computers capable 
of running the Microsoft Windows operating environment. It was to 
operate on what is now a garden variety PC, one which had a VGA colour 
monitor, sufficient random access memory (RAM) to run Windows, and a 
small hard disk. The only extra accessory permitted was to be a sound 
card compatible with Windows. In late 1992 such a machine could be 
purchased for a tax free price of around two thousand dollars (Australian). 
 
Approximately a dozen interactive lesson modules had been produced at 
the time of writing. Most of them offered two screens for a user to interact 
with. One screen would have two lists of words or phrases, one list in
English, the other in one of the three languages of concern to the project. A
user could hide one of the lists, and test her or his vocabulary and 
pronunciation skills by clicking on individual words or phrases, hearing 
them pronounced by both male and female native speakers. On clicking a 


2 Australian Journal of Educational Technology, 1993, 9(1) 

'Record' button, the user could speak the word or phrase into a 
microphone, and then playback his or her voice for comparison with the 
voices of native speakers. 
 
The second lesson screen provided basically the same skills practice, but 
used a set of game modes to challenge the user at time intervals of 
selectable duration. For example, the 'Faces' lesson randomly selects the 
name of a head part, such as ear, mouth, nose, etc., plays the voice of a 
native speaker pronouncing the name of the part, and gives the user so 
many seconds to point to the 'right answer'. 
 
The 'Talk' lessons were highly similar to those Nelson (1992) developed on 
an Amiga computer, and parallel to language exercises which Rehn (1992) 
has experimented with on Macintosh hardware. 
 
Both the Amiga and Macintosh families of personal computers have had 
in-built audio playback facilities since their inception, at least on most 
models. In recent years Apple has added built-in recording capabilities to 
Macs. The audio 'picture' on DOS-based PCs, however, did not have a 
clear focus until the release of Windows, which brought much needed 
format standards to such machines. 
 
The appearance of a plethora of 'multimedia' articles and books has 
resulted in an outbreak of new Windows capable sound boards for the PC. 
We have tried several of these, and have found that not all were created 
equal, as it were, despite the standards promulgated by Windows and the 
so called 'multimedia PC'. 
 
We detail some of our discoveries below. Before getting into results, it 
must be mentioned that the systems used were purchased at different 
times, and in different places: Sound Blaster Pro was obtained in January, 
1992 in Western Australia; the Pro AudioSpectrum was purchased in July, 
1992 in the United States; and the Windows Sound System was delivered 
from a Perth based supplier in January, 1993. It is likely that the latest
versions of these systems differ from the ones used in the present study;
some of the limitations discussed below may well have disappeared by the
time readers take up this article. Nonetheless, the discussion which follows
should convey enough general points to make readers aware of the issues
to heed in the audio digitising process.
 
Memory resources required by digital audio 
 
As an example of matters which arise when creating digital voice clips, we 
relate here some of the procedures followed in setting up the Talk lesson 
known as 'alphabet'. 
 

Nelson 3 

Input to the process: a tape recording of a native Indonesian speaker 
reciting the alphabet. The Indonesian alphabet is the same as that used by 
English, having 26 letters. It took the native speaker a total of 47 seconds to 
recite the letters at a predetermined pace. The native speaker's voice was 
captured using off the shelf tape recording components, such as those 
found in Radio Shack and Dick Smith retail outlets. 
 
The initial objective was to make a complete digital copy of the entire 
recording, and then use a waveform editor to pick out the individual 
letters of the alphabet, saving them as separate audio clips. Since the audio 
clips were to be used under the Windows operating environment, they 
needed to be saved in the 'WAV' format. This is a standard Microsoft 
Resource Interchange File Format (Microsoft Corporation, 1991; also see 
Sheldon, 1992); under Windows, the format supports 8 and 16 bit 
resolution at three sampling frequencies: 11.025 kHz, 22.05 kHz, and 44.1 
kHz. 
 
Many readers may be aware of the considerable memory resources needed 
to digitise audio using present technology. A monaural digitised version 
of the 47 second source tape would require 518 kilobytes of storage if 
recorded with a resolution of 8 bits at 11.025 kHz, 1.04 megabytes at 22.05 
kHz, and 2.07 megabytes at 44.1 kHz. Increasing the resolution to 16 bits 
would double each of these figures; using stereo would double them 
again. 
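The storage figures above follow directly from the recording parameters. The short Python sketch below reproduces the arithmetic; the function name pcm_bytes is an illustrative assumption and not part of any software used in the project, and the small header which a WAV file carries is ignored.

```python
def pcm_bytes(seconds, sample_rate_hz, bits=8, channels=1):
    """Size in bytes of uncompressed digital audio (file header excluded)."""
    bytes_per_sample = bits // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# The 47 second source tape, monaural, 8 bit, at the three Windows rates.
for rate in (11025, 22050, 44100):
    print(rate, "Hz:", pcm_bytes(47, rate), "bytes")
```

With a kilobyte taken as 1000 bytes, the three results correspond to the 518 kilobyte, 1.04 megabyte, and 2.07 megabyte figures quoted above.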
 
Three audio systems were used to digitise the source tape, two of them
veterans and one new. The seasoned systems were Sound Blaster Pro from
Creative Labs, and Pro AudioSpectrum from Media Vision. The new system
was the Windows Sound System from Microsoft. The Windows Sound
System is identical to the Business Audio resources which come as 
standard equipment in recent versions of Compaq DeskPro computers. 
(Certain Compaq computers are now audio-ready, coming equipped for 
immediate sound recording and playback without the need for any 
additional hardware or software.) 
 
After the source tape was digitised, a variety of waveform editors were 
applied to the digital copy in order to extract the 26 distinct letters of the 
alphabet. 
 
Digitising from tape or microphone 
 
It is not necessary to use a tape as the audio source for digitising; one can 
use a microphone directly. In fact, some systems, such as the MicroKey 
AudioPort from Video Associates Labs, have only a microphone input 
jack. Level attenuators can be used with such systems to permit tape input, 
and are readily available at electronics hobby stores. 
 

Direct use of a microphone for frequent audio digitising is, in the end, not 
yet as convenient and parsimonious as is using a tape recorder, at least not 
for what might be termed 'production' work. 
 
The most significant problem encountered in using a microphone as input, 
instead of a tape, relates to computer memory requirements. Tapes come 
in standard, convenient lengths of 15, 30, and 45 minutes. However, 
corresponding pre-set lengths for digitised recordings do not exist. In fact, 
recordings which exceed 15 seconds (or thereabouts) do not always come 
through well on a personal computer. Depending on the resolution and 
sampling frequency selected, 15 seconds of recording can exceed a 
computer's RAM capacity, and require that the recording be buffered out 
to a hard disk. 
 
Some digitising software, such as the professional Wave for Windows
suite of programs from Turtle Beach Systems, uses disk buffering as the
standard procedure; no attempt is made to detect and use any extended
memory which might be present on the computer. Hard disk access is
always slower than RAM, and a system such as Wave for Windows can
lose data when higher sampling frequencies are used.
 
A problem encountered with regularity during the project related to the 
fact that much of the software available was descended from the days 
when PCs had at most 640 kilobytes of working RAM. Some software, 
such as that which came with the Sound Blaster Pro system involved in 
the study, predetermines how much RAM is available for digitising, and
automatically cuts out when this limit is exceeded. This would not be a 
bad approach at all, were it not for the fact that the software used was 
unable to see beyond the old 640 kilobyte DOS working RAM limit. Sound 
Blaster's settings allow the recording to be buffered out to disk if there is 
insufficient RAM, but once this option is selected one can run the risk of 
data loss if the hard disk is not fast enough to keep up with the sampling 
frequency and recording resolution selected. 
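The risk of data loss when buffering to disk can be pictured with a very simple model, sketched below in Python. It assumes a fixed RAM buffer which fills at the audio data rate and is emptied at a constant disk write rate; real disks and drivers behave less evenly, and the function name and the figures in the example are hypothetical rather than measurements from the study.

```python
def seconds_until_overrun(buffer_bytes, sample_rate_hz, disk_bytes_per_s,
                          bits=8, channels=1):
    """Time until a RAM buffer fills when the disk drains it too slowly.

    Returns None when the disk keeps pace with the incoming audio.
    """
    incoming = sample_rate_hz * (bits // 8) * channels  # bytes per second
    surplus = incoming - disk_bytes_per_s
    if surplus <= 0:
        return None  # the disk is fast enough; no data should be lost
    return buffer_bytes / surplus


# Hypothetical figures: a 64,000 byte buffer and a disk writing 20,000 bytes/s.
for rate in (11025, 22050, 44100):
    print(rate, "Hz:", seconds_until_overrun(64000, rate, 20000))
```

Under this model a disk which copes comfortably at 11.025 kHz can begin dropping data within seconds at the higher sampling frequencies.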
 
The end result of these limitations is that digitising directly from a 
microphone has a few unknowns attached to it. The recording can be 
abruptly terminated by the software, or data loss can occur when using a 
hard disk incapable of capturing all the data being sent to it. 
 
True, these limitations are also present when one uses a tape as input. But, 
in a production setting, tape recording a speaker is less likely to be 
disturbed than is direct digitising with a computer. Better to do the 
digitising off line, when the speakers have finished talking to a tape 
recorder. If the digitising from tape process is interrupted by the 
computer, the tape can be restarted with more ease than most speakers, 
and at less expense to the production budget. 
 

A final factor to mention here is that of archiving. Digital audio disk files 
are easier to use than analogue tape files, a fact which derives from the 
sequential access nature of tape files, and the relative imprecision 
associated with indexing a tape file's start position. Digital files are 
eminently easier to access and archive, but, and it is a big but indeed, they 
can require massive magnetic disk resources. A 15 minute monaural audio
tape would require about 20 megabytes of disk storage if digitised at a
sampling rate of 22.05 kHz with 8 bit resolution, and even then one would
end up with voice quality fidelity, not music quality.
 
Selecting options for digitising 
 
The most professional of audio digitisers have settings options such as 
those in Sound Blaster Pro, as shown below: 
 

The Sound Blaster Pro card used in the study did not have a resolution
option; those systems which permit audio capture at both 8 and 16 bit
resolution will have one more option to set. 16 bit resolution provides
better audio fidelity, but for voice recordings, be they male or female,
8 bit resolution was entirely adequate in the present study, and it saves
memory.
 
One thing to note in the settings box above is the presence of an 8 kHz 
sampling frequency. This option, if selected, will result in a file which 
Windows cannot use. Sound Blaster Pro will be happy with it, but not 
Windows, which insists on seeing 11.025, 22.05, or 44.1 kHz. 
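A file's sampling frequency can be checked before it is handed to Windows. The sketch below uses the wave module from Python's standard library; the helper names and the test files it writes are assumptions made for illustration, and the list of acceptable rates is simply the one reported in this article.

```python
import os
import tempfile
import wave

WINDOWS_RATES = (11025, 22050, 44100)  # rates this article reports Windows accepting


def has_supported_rate(path):
    """Return True if the WAV file's sampling rate is one of WINDOWS_RATES."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() in WINDOWS_RATES


def write_silence(path, sample_rate_hz, seconds=1):
    """Write a mono 8 bit WAV file of silence (helper for the demonstration)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(1)  # 1 byte per sample = 8 bit resolution
        wav.setframerate(sample_rate_hz)
        wav.writeframes(bytes([128]) * (sample_rate_hz * seconds))


folder = tempfile.mkdtemp()
for rate in (8000, 11025):
    name = os.path.join(folder, "test_%d.wav" % rate)
    write_silence(name, rate)
    print(rate, "Hz accepted:", has_supported_rate(name))
```

A check of this kind would flag an 8 kHz recording before any lesson software tried to play it.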
 
The 'Max rec time: 26 Sec' reading shown above indicates that, with the settings as
selected, a maximum of 26 seconds of digitising can take place before the 
computer runs out of RAM. Had the Sound Blaster software been able to 
peek beyond conventional DOS memory, it would have seen that a few 
megabytes of extended RAM were unoccupied, and the 26 second limit 
would have increased substantially. 
 

We made some of our recordings at a sampling frequency of 11.025 kHz, 
and others at 22.05 kHz. Selecting the higher frequency in the settings 
above would knock the maximum record time down to a paltry 13 
seconds. Under such conditions, digitising our 47 second source tape 
would take four steps, instead of the two required using the lower 
sampling frequency. 
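The relationship between buffer size, maximum record time, and the number of digitising steps can be sketched in Python as follows. The buffer size is an assumption: it is inferred from the 26 second limit on the supposition that this figure applied to 11.025 kHz, 8 bit, monaural recording, and the function names are illustrative only.

```python
import math


def max_record_seconds(buffer_bytes, sample_rate_hz, bits=8, channels=1):
    """Longest recording which fits in a buffer of the given size."""
    return buffer_bytes / (sample_rate_hz * (bits // 8) * channels)


def passes_needed(source_seconds, buffer_bytes, sample_rate_hz):
    """Number of separate digitising steps needed to cover the source tape."""
    limit = max_record_seconds(buffer_bytes, sample_rate_hz)
    return math.ceil(source_seconds / limit)


BUFFER_BYTES = 26 * 11025  # assumed: 26 s at 11.025 kHz, 8 bit, monaural

for rate in (11025, 22050):
    print(rate, "Hz:",
          max_record_seconds(BUFFER_BYTES, rate), "s per step,",
          passes_needed(47, BUFFER_BYTES, rate), "steps for the 47 s tape")
```

Doubling the sampling frequency halves the time available per step, which is why the 47 second tape needs four steps rather than two.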
 
If a Windows capable system were used for the digitising, it should be able 
to automatically use any free extended memory available. The Windows 
Sound System worked in this manner; we used it on a Compaq 386 
machine with four megabytes of RAM, and also on a Compaq 486 having 
an EISA bus, and eight megabytes of RAM. On both machines the 47 
second source recording was digitised in a single step. 
 
Recording options are set in this system by using the dialogue box shown 
below: 
 

On both machines, Windows Sound System was content to accept 
whatever options we chose, and readily indicated the number of kilobytes 
of memory which would be required to digitise each second of audio. 
However, the system provided no indication of how much total time 
would be open for recording. Tests showed that the total time available 
depended not only on RAM, but also on the amount of free hard disk 
space at hand. As a practical operating guideline, a computer with four 
megabytes of RAM will permit about ten minutes of recording at 11.025 
kHz, 8 bit resolution, and half that if sampling is increased to 22.05 kHz. 
These figures will be reduced if hard disk space is tight. We found that the 
small Pocket Recorder program from Media Vision could be used to 
provide a good index of maximum record time, when and if this figure 
was needed. 
 

The Windows Sound System has an option for compressing audio clips, 
which, according to the manual, would be equivalent to recording with a 4 
bit resolution. We found this compression scheme to be useable with voice 
recordings, but at an easily detectable degradation of fidelity. 
 
It is the case that many manufacturers of audio boards provide some 
facilities for compression. All of these are meant to make it possible to 
squeeze more sound into limited computer memory and disk space. They 
do this, but audio fidelity is not the only sacrifice made if these schemes
are applied; one also loses portability. At the present time, a 4 bit recording
made with Windows Sound System, for example, will only be playable on 
a computer equipped with the same sound hardware. It will not play back 
under Windows running a Sound Blaster or Media Vision sound driver. 
 
If portability is a concern, users must stick to uncompressed audio files 
recorded at 11.025, 22.05, or 44.1 kHz. A resolution of eight bits is also best 
as the number of 16 bit sound cards is still limited. 
 

Dissecting the digitised recording 
 
The screen shot above was captured from Quick Recorder, a Windows 
Sound System waveform processor and editor. 
 

The oscilloscope display in the main part of the screen has three obvious 
voice patterns, or breath groups. Each corresponds to a single letter of the 
alphabet, pronounced by an Indonesian speaker. 
 
The smaller display above the main one shows the entire waveform, 
containing 26 small blips. Within this display bar, a frame indicates where 
the three letters in the main area are located with respect to the overall 
waveform. In this case the frame is around the letters G, H, and I. 
 
We needed to make single letter voice clips given this waveform as input. 
The waveform above could be considered to be a long sentence with 26 
words; we wanted to be able to highlight each 'word' and save it to disk as 
a separate file. 
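In the project each 'word' was highlighted by hand in a waveform editor. As an illustration of the same idea, the Python sketch below locates the loud stretches of an 8 bit monaural recording automatically by looking for quiet gaps between them. It is a different technique from the manual one described here; the function name, threshold, and gap length are assumptions which would need tuning for real speech.

```python
def find_clips(samples, sample_rate_hz, threshold=10, min_gap_s=0.3):
    """Return (start, end) sample indices of the loud stretches in 8 bit audio.

    samples: bytes of unsigned 8 bit mono audio, in which 128 is silence.
    A stretch ends once the signal has stayed within `threshold` of
    silence for at least `min_gap_s` seconds.
    """
    min_gap = int(min_gap_s * sample_rate_hz)
    clips = []
    start = None      # index where the current clip began
    last_loud = None  # index of the most recent loud sample
    for i, value in enumerate(samples):
        if abs(value - 128) > threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap:
            clips.append((start, last_loud + 1))
            start = None
    if start is not None:  # recording ended while a clip was still open
        clips.append((start, last_loud + 1))
    return clips


# Synthetic example at 1000 samples per second: two 'words' between silences.
silence = bytes([128]) * 500
word = bytes([200]) * 100
print(find_clips(silence + word + silence + word + silence, 1000))
```

Each pair of indices marks a slice of the recording which could then be written to its own file.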
 
Not all waveform editors will do this. Portions of a wave can be 
highlighted, and options applied, but these options tend to perform 
various actions on the selection[2], and 'Save as ...' is not usually one of 
them. The Creative Voice Editor Ver. 2.08, which arrived with our Sound
Blaster Pro package, was the only editor we tested which had such a
capability.
 
Some waveform editors, such as the Sound Recorder which comes as a 
standard utility with Windows, and the Pocket Recorder from Media 
Vision, can chop a waveform into two sections, and save just one of the 
sections, but this is not at all the same as being able to save what might be 
a small highlighted section in the middle of the waveform. 
 
Other editors, such as that provided under the 'expanded view' option of 
the Quick Recorder program in the Windows Sound System, and that 
found with the Turtle Beach Wave for Windows software, allow any 
contiguous highlighted section of a waveform to be copied or cut to a 
clipboard, from where it can be pasted into another wave editing window. 
When in a new window it can be saved. This two step process is not as 
convenient as that found in the Sound Blaster Pro's voice editor, but it 
does have an important advantage: the properties of the highlighted 
section, such as the sampling frequency, can be changed before the section 
is pasted. We used this at times, taking 'words' from a 22 kHz waveform 
and halving the sampling rate before saving. This was done whenever we 
felt 11 kHz sampling provided playback quality adequate for our 
application's requirements, or when one of the lessons had more than 
twenty voice clips and disk space was at a premium. There were occasions 
when we trialed 11 kHz clips in the field, and later had to go back to the 
original 22 kHz format after it became apparent that fidelity was not 
satisfactory. 
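How the editors halved the sampling rate is not known to us. One simple way of doing it, sketched below in Python under that caveat, is to replace each pair of adjacent samples with their average. The function name is hypothetical, the input is assumed to be unsigned 8 bit monaural audio, and a proper resampler would apply a better filter than plain averaging.

```python
def halve_rate(samples):
    """Halve the sampling rate of unsigned 8 bit mono audio.

    Each output sample is the average of two adjacent input samples;
    a trailing odd sample is dropped.
    """
    return bytes((samples[i] + samples[i + 1]) // 2
                 for i in range(0, len(samples) - 1, 2))


original = bytes([100, 102, 200, 200, 7])
reduced = halve_rate(original)
print(len(original), "samples in,", len(reduced), "samples out")
```

The result occupies half the disk space, which is the saving sought when a lesson carried many voice clips.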
 
 
Factors to consider in selecting a sound card 
 
There are indeed a variety of sound cards on the market now, and most of 
them claim compatibility with Microsoft's Windows operating 
environment. 
 
Perhaps the two most important questions to consider when selecting a
card are these: Is DOS compatibility important? Will the card be
used for recording as well as playback?
 
Two of the latest systems, the Microsoft Windows Sound System and 
Compaq's Business Audio, will generally not play audio from programs 
running under DOS. 
 
If the card is to be used exclusively under Windows, a system specifically
designed for Windows functioning, such as the two mentioned in the
preceding paragraph, presents substantial advantages. These derive from
two principal factors. The first is that systems not originally designed for
Windows operation must often be run under DOS in order to gain access
to all of their features and, in the case of the Sound Blaster Pro and
MicroKey AudioPort, produce files which must be converted to the WAV
format by another program before Windows will use them.
 
The other factor results from the ability of most of the Windows specific 
systems to work with extended memory. This presents very real benefits, 
making it possible to record and edit audio files which might be several 
minutes long instead of just a few seconds[3]. But a caveat: we found one 
Windows system, Turtle Beach's Wave for Windows, which did not use 
extended memory efficiently. 
 
Our tests of two veterans from the DOS era, the Sound Blaster Pro and 
Media Vision Pro AudioSpectrum 16, revealed some important limitations 
when applied in our Talk project. Both systems suffer from an inability to 
use extended memory. The Windows driver for the Sound Blaster Pro 
would not permit recording above 11 kHz; when the card runs under DOS 
this limitation disappears. The Media Vision system had mediocre 
documentation, and under Windows would not automatically mute its 
audio output when poked into record mode. This made the card unusable 
for Talk lessons; if a Talk user clicks on the 'Record' button on a machine 
running a Pro AudioSpectrum 16 card, the feedback can be deafening. 
 
Some of these limitations may have vanished by the time this paper is 
read. The lack of automatic record muting has been fixed in Media Vision's 
new Audioport sound device, and one would think it will also soon be 
fixed on the rest of the Media Vision range, if it has not been already. 
 
Using a microphone for voice recording produced quite variable results, 
depending on the microphone and sound card used. Of the systems we
tested, the Microsoft Windows Sound System and Compaq's Business
Audio produced the best quality audio. These systems come with their 
own microphone; the Microsoft system also comes with a pair of 
headphones. 
 
Controlling volume levels with some cards can be inconvenient. The 
Sound Blaster Pro, the MicroKey AudioPort, and the Media Vision 
Audioport have knobs on them much like a small radio, and permit 
playback levels to be controlled easily. They use automatic gain circuitry 
when recording, which usually worked well enough. 
 
Many other systems, however, use software to control playback and 
record levels. These are, generally, not convenient to use: we solved the 
playback part of the problem by using headphones and small amplifiers 
with volume knobs. It is also helpful to be able to control bass and treble 
levels on playback; the process of recording and playing back digital 
tracks is subject to interference from the very computer on which they are 
running - a treble control, in particular, can be invaluable. 
 
The Microsoft Windows Sound System, and Compaq's Business Audio, 
include a utility for managing digital audio files, the 'Sound Finder'. We 
found it to be very useful as it permitted us to thoroughly label the 
hundreds of short voice clips used in our project, and to edit or play them 
back with ease. 
 
Concluding comments 
 
Our Windows audio work followed on from substantial prior experience 
with Amiga computers, as well as some experimentation with Macintosh 
audio. The author's original impression on creating, editing, and using 
audio under Windows was not exactly one of great enthusiasm. The 
software tools seemed limited, and the quality of the audio was initially 
felt to be inferior to that which could be achieved on other boxes. 
 
Compaq's Business Audio system and, a bit later, the Microsoft Windows 
Sound System, did much to turn the picture around. These systems are 
carefully tuned to Windows, and use a microphone very closely coupled 
to the capabilities of their recording circuitry. 
 
Digital audio on any current microcomputer platform requires 
considerable resources. Windows has added two crucial factors to the 
audio equation on Intel based machines: standard procedures for 
accessing extended RAM, and file format standards. Provided one is
using a machine with sufficient RAM and hard disk space, our conclusion
is that the current state of the art is adequate for producing reasonable
quality digital audio files under Windows, at least for monaural voice
recordings.
 

Notes 
 
1. I gratefully acknowledge the assistance of several colleagues in
completing the work described here, particularly that of Piet Herman
Abik, Nur Hadi Amiyanto, and Brian Lawton.

2. These actions can be numerous, depending on the system selected. 
They commonly include volume and pitch controls, fading, mixing, 
echo effects, frequency filtering, muting, and automatic trimming of 
quiet spots.  

3. A machine with a 486 processor is recommended when editing sound 
clips of more than half a minute duration; a 386 will handle them, but 
for frequent work the speed of the 486 is welcome. 

 
References 
 
Microsoft Corporation (1991). Microsoft Windows Multimedia Programmer's
Reference. Redmond, Washington: Microsoft Press.

Nelson, L. R. (1992). Developing interactive digitised audio courseware on
Amiga, Macintosh, and PC platforms: A comparison of common
support facilities available. In Proceedings of the International Interactive
Multimedia Symposium. Perth: Promaco Conventions Pty Ltd.
http://www.ascilite.org.au/aset-archives/confs/iims/1992/nelson.html

Rehn, G. (1992). Two-way interactive sound on a stand-alone Macintosh
platform. Australian Journal of Educational Technology, 8, 51-64.
http://www.ascilite.org.au/ajet/ajet8/rehn.html

Sheldon, T. (1992). Windows 3.1, the Complete Reference. Sydney: Osborne
McGraw-Hill.

 
Author: Larry Nelson is a senior lecturer at Curtin University's Faculty of 
Education, where he lectures in research procedures and computer 
applications in education. The work described here was undertaken as part 
of a DEET 'ILOTES' (Innovative Languages Other than English) Project. 
 
Please cite as: Nelson, L. (1993). Creating and using digital audio files 
under the Windows operating environment. Australian Journal of Educational 
Technology, 9(1), 1-11. http://www.ascilite.org.au/ajet/ajet9/nelson.html