INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
Online ISSN 1841-9844, ISSN-L 1841-9836, Volume: 16, Issue: 6, Month: December, Year: 2021
Article Number: 4394, https://doi.org/10.15837/ijccc.2021.6.4394

CCC Publications 

Effect of Sample Sizes in Fingerprinting Database for Wi-Fi System

A.H.A. Sa’ahiry, A.H. Ismail, L.M. Kamaruddin, M.S.M. Hashim
M.S.M. Azmi, M.J.A. Safar, M. Toyoura

Ahmad Hakimi Ahmad Sa’ahiry*, Abdul Halim Ismail, Muhammad Juhairi Aziz Safar
Faculty of Electrical Engineering Technology
Universiti Malaysia Perlis, Malaysia
02600 Arau Perlis, Malaysia
*Corresponding author: a.hakimi@studentmail.unimap.edu.my
ihalim@unimap.edu.my, juhairi@unimap.edu.my

Latifah Munirah Kamaruddin
Faculty of Electronic Engineering Technology
Universiti Malaysia Perlis, Malaysia
02600 Arau Perlis, Malaysia
latifahmunirah@unimap.edu.my

Mohd Sani Mohamad Hashim, Muhamad Safwan Muhamad Azmi
Faculty of Mechanical Engineering Technology
Universiti Malaysia Perlis, Malaysia
02600 Arau Perlis, Malaysia
sanihashim@unimap.edu.my, safwanazmi@unimap.edu.my

Masahiro Toyoura
Department of Computer Science and Engineering
University of Yamanashi, Japan
4-3-11 Takeda, Kofu Yamanashi, 400-8511, Japan
mtoyoura@yamanashi.ac.jp

Abstract
Indoor positioning systems have become essential as a substitute for the Global Positioning
System (GPS) indoors. GPS, which is one of the Global Navigation Satellite Systems (GNSS), cannot
provide accurate positioning indoors because of multipath effects and shadow fading. Fingerprinting
with Wi-Fi technology is a promising approach to this problem. However, the fingerprinting method
has several open problems. Fingerprinting databases are collected with different sample sizes, and
previous researchers have not indicated any standard for the sample size to be used. This paper
examines in depth the effect of sample size in a fingerprinting database for Wi-Fi technology. The
statistical properties of different sample sizes are analyzed. Furthermore, two methods, K-Nearest
Neighbor (KNN) and Deep Neural Network (DNN), are used to examine the effect of sample size in
terms of accuracy and distance error. The discussion in this paper contributes to better sample
size selection depending on the method chosen by the user. The results show that sample size is an
important factor in developing an indoor positioning system, as it affects the result of the
location estimation.

Keywords: indoor positioning system, sample size, positioning accuracy, big data, fingerprinting,
deep learning.



1 Introduction
The Global Positioning System (GPS), and Global Navigation Satellite Systems (GNSS) in general,
have been adopted as the primary positioning technology because of the highly accurate location
information they provide on a global scale. However, this technology fails in certain environments,
such as indoors or in urban canyons. These GNSS failures are primarily caused by the low received
power of the satellite signals, which results from signal degradation as illustrated in Figure 1,
and by poor satellite visibility in urban and indoor areas. As a result, non-GNSS navigation
technologies are critical in these regions [11].

Figure 1: Signal degradation of GPS and GNSS (Ma [10])

Numerous studies have been conducted over the last few years to address this issue using a
variety of technologies, including infrared [2], ultra-wideband [22], Bluetooth [3], inertial
sensors [20], magnetic fields [8], and fusion of these technologies [20]. Wi-Fi is one of the most
reliable technologies available. It is widely deployed within buildings because the world relies on
Wi-Fi to connect to the internet for home networking, the Internet of Things, and teaching. By
utilizing existing Wi-Fi, no pre-deployment effort or additional infrastructure is required. As a
result, the labor costs associated with installing new hardware to implement the system are
reduced. Several methods and techniques for using Wi-Fi in indoor positioning have been discussed
in the literature.

Previous researchers have used several methods to predict the user location, such as time of
arrival (TOA) [19], angle of arrival (AOA) [19], and fingerprinting [5]. Among these, the
fingerprinting method is the most precise and provides the best accuracy [9]. Fingerprinting works
by recording a unique signal signature at each reference point to build a database. It has two
phases, offline and online, as illustrated in Figure 2. The offline phase is the calibration phase,
in which the database is collected, while the online phase predicts the location of the user. For
the offline data collection, previous researchers have not established a standard for choosing the
sample size [7]. Additionally, the fingerprint database must be collected by an expert surveyor,
because collecting it requires professional training. Otherwise, the database will not be created
effectively, which causes problems in later location estimation. Relying on an expert surveyor
raises two further problems: hiring one is expensive, and collecting the database is time
consuming [1]. Both should be avoided.

Hence, crowdsourced fingerprinting databases were introduced. Crowdsourcing hands the surveying
labor over to the public, so that anyone can contribute signal measurements to the fingerprinting
database. However, the problem with crowdsourcing is that the sample size obtained from
contributors differs between the reference points of the database [18]. The sample size is the
number of signal strength readings collected from the source. For example, Wi-Fi signal strength
samples can be collected by recording the signal strength over time. In a crowdsourced database,
the contributors do not know how the fingerprinting method works. Each contributor therefore
provides a different sample size, which creates inaccuracy in predicting the user location. In this
paper, the effect of different sample sizes is discussed to gain a better understanding, and the
authors try to find the optimal sample size for a Wi-Fi fingerprint database.

Figure 2: Fingerprinting workflows

2 Related Work
In previous research, it appears that researchers frequently use a variety of sample sizes to
evaluate their work without explaining the rationale for the sample size selection. In [12], 40
observations were made for a set of 155 reference points that served as training data to eliminate
the randomness of human behavior. In [21], the authors collected one sample per second for five
minutes (a total of 300 samples) in order to investigate wireless channel changes over time.
Although this is acceptable, the authors give no reference for the sample size and do not explain
why 300 samples were collected.

Similarly, the authors of [17] used different sample sizes to analyze and compare various
filtering strategies for real-world indoor 802.11 positioning systems. They determined the radio
distribution at 250 uniformly spaced grid points over a 15 x 35 meter area. The authors of [4]
proposed a technique called dynamic hybrid projection (DHP) for enhanced 802.11 localization. They
collected 802.11 RSS data at 27 different reference locations on different days and with four
different user orientations during their experiments. From this sample, they selected 15 locations
with a step of 1.5–2 meters to use as training data. The sample sizes were not declared, and they
differed widely between researchers.

For crowdsourced data, the authors of [15] developed a human-computer interface for indicating
location over intervals of varying duration, a client-server protocol for pre-fetching signature
data for use in localization, and a location-estimation algorithm incorporating highly variable
signature data. They describe an experimental deployment of their method in a nine-story building
with more than 1,400 distinct spaces served by more than 200 wireless access points. The sample
size differs between locations, which is a problem when locating the user in the online phase.

In summary, our review of the literature indicates that authors calibrate, train, test, and evaluate
indoor positioning systems using a variety of sample sizes and patterns. Previous authors do not
specifically justify the number of samples collected. The hypothesis of this paper is that the
sample size is an important criterion for obtaining an accurate result. Hence, before
making a prediction, the authors of this paper intend to investigate the effect of sample size on the
accuracy and statistical properties of fingerprint database data.

The authors of [9] provide the following performance benchmarks for indoor wireless location
systems: accuracy, precision, complexity, scalability, robustness, and cost. This paper uses two of
these performance metrics, accuracy and precision. Both metrics are tested using two methods,
K-Nearest Neighbor (KNN) and Deep Neural Network (DNN). More advanced KNN variants exist; for
example, [6] used Weighted K-Nearest Neighbor (WKNN). In this work, KNN is chosen as the
conventional method, to keep the evaluation consistent with most previous research. DNN is the more
advanced method, a machine learning approach based on neural networks. There are many architectures
for designing a deep learning model; in [14], for example, the authors used a complex deep learning
architecture that combines multiple machine learning algorithms. In this study, a basic
architecture is used, as in [13], because the aim is not to obtain the best possible location
prediction. A basic DNN is sufficient to study the effect of different sample sizes on the accuracy
of the location prediction. Hence, these two methods are used to verify the effect of the sample
size in terms of the two metrics.

The remainder of this paper is organized as follows. The first section addresses the general
problem with GPS and GNSS. The second section reviews related work by previous researchers. The
third section explains the data collection: how the data is collected, what hardware is used, and
how the setup is configured. In the fourth section, the data distribution is analyzed by intensity
to get a clear view of the data in the fingerprinting database. The authors then study the
characteristics of the signal and its statistics in terms of mean, mode, and standard deviation.
Box plots are also drawn to show how the median stabilizes across sample sizes. For the performance
metrics, the sample sizes are evaluated in terms of accuracy and precision using the KNN and DNN
methods. The fifth section presents the conclusion and a discussion of what has been discovered in
the experiments of the previous sections.

3 Methodology
This section explains the experimental setup used to collect the fingerprinting database. The
experimental area, the hardware used, and its configuration are described. The setup was designed
with previous researchers' work in mind, to obtain an optimized setup and avoid configuration
errors.

3.1 Overview

The flow used to analyze the effect of sample size in the fingerprinting database is presented in
Figure 3. First, the data is collected using a single access point. The data is then presented from
two perspectives, two-dimensional and three-dimensional. Next, the data is analyzed critically with
respect to sample size through its signal characteristics and a statistical approach. Afterward,
the data for the different sample sizes is evaluated using two methods: a non-parametric technique,
K-Nearest Neighbor (KNN), and an artificial intelligence method, Deep Neural Network (DNN).

Figure 3: Works Overview

3.2 Data Collection

The data was collected in the Solid Mechanic and Acoustic lab at Universiti Malaysia Perlis, shown
in Figure 4. The test bed consists of 42 reference points, 6 along the X axis and 7 along the Y
axis. The data was taken in a semi-controlled environment: all configurations stay the same, and
the only change is the location of the mobile device at each reference point.



Figure 4: Experimental Test Bed

To reproduce a real environment as closely as possible, people walking through the fingerprinting
area were allowed during collection. This gives realistic conditions for discussing the effect of
sample size. The data collection procedure is modelled on the approach described in [16].

The equipment used is a TP-Link (TD-W8961N) router with a nominal frequency of 2.4 GHz and an
Android mobile phone, the Mi A2 Lite. The TP-Link router serves as the access point, and the Mi A2
Lite is the device used to collect the fingerprinting database. A mobile phone was chosen because,
in a crowdsourced setting, most users who contribute to the fingerprinting database will be using a
mobile phone. Data was collected for 20 minutes at each reference point to obtain 1000 samples per
reference point. The samples at each reference point were then divided into 8 different sample
sizes: 10, 20, 30, 40, 50, 100, 500, and 1000. Sample sizes of 10 to 50 are considered small, 100
and 500 medium, and 1000 large. These sizes are examined in the later analysis to identify a
suitable sample size, and thereby to test the effect of the sample size in a fingerprinting
database for a Wi-Fi system.
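The division into sample sizes can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the paper does not state how the subsets were drawn, so the sketch assumes each subset is simply the first n readings in time order. The function name `split_by_sample_size` and the synthetic readings are hypothetical.

```python
# Sample sizes used in the paper: small (10-50), medium (100, 500), large (1000).
SAMPLE_SIZES = (10, 20, 30, 40, 50, 100, 500, 1000)


def split_by_sample_size(readings_dbm, sizes=SAMPLE_SIZES):
    """Return {n: first n readings} for each requested sample size.

    Assumption (not stated in the paper): a subset of size n is the
    first n readings recorded at the reference point.
    """
    if len(readings_dbm) < max(sizes):
        raise ValueError("not enough readings for the largest sample size")
    return {n: list(readings_dbm[:n]) for n in sizes}


# Hypothetical RSS readings (dBm) for one reference point, for illustration only.
readings = [-45 - (i % 6) for i in range(1000)]
subsets = split_by_sample_size(readings)
print({n: len(values) for n, values in subsets.items()})
```

Taking prefixes keeps the smaller subsets nested inside the larger ones, which makes the statistics of different sample sizes directly comparable, but it also means that early outliers weigh more heavily on the small subsets.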

Figure 5: Experimental Area

The reference points of the grid used for the fingerprint database are illustrated in Figure 5. To
make sure the sample size is the only variable, all other variables are held constant, and only the
sample size of the Wi-Fi Received Signal Strength (RSS) differs. The fingerprinting database has 42
reference points, spaced 1 meter apart. The grid spans 6 meters along the vertical axis and 5
meters along the horizontal axis. One access point, labelled AP1, was used in the data collection.

In this work, the access point is a single router with 2 fixed antennas operating at a nominal
frequency of 2.4 GHz, the TP-Link (TD-W8961N). The reference points are indexed along the X and Y
axes as indicated in Figure 5. The database is represented as $R_{xy}$, where $R$ is the Wi-Fi
received signal strength (RSS) at one of the 42 reference points, and $x$ and $y$ are the X-axis
and Y-axis coordinates of that reference point in the experimental test bed. The reference points
are labelled in this way so that the database is easier to organize when estimating the user
location.

3.3 K-Nearest Neighbor (KNN)

The sample sizes are tested using two methods for predicting the user location, KNN and DNN. KNN
is a conventional method that most researchers have used to locate the user by finding the nearest
neighbors in the database. DNN is the more advanced method. It is based on artificial
intelligence: a model is first trained on the data, and the trained model is then used for
prediction.

KNN is a supervised machine learning algorithm. It does not need a dedicated training phase,
because it classifies a sample by its distance to its neighbors. It also does not require
Gaussian-distributed data to give an accurate result. The first step in predicting the user
location is to label the training data. In this test, the labelled data consists of the RSS
readings from access point 1. The sample size is the manipulated variable in the analysis. The next
step is to choose K, the number of nearest neighbors that the algorithm uses to classify a sample.
Because K is important in the KNN algorithm, three values are used here: 1, 3, and 5. Varying K
allows a better analysis of the effect of sample size in the fingerprinting database. Using the
Euclidean distance in the equation below, the algorithm finds the nearest neighbors of the user's
measurement. The measurement is then assigned to the group to which its nearest neighbors belong.

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},$$

where $d$ is the distance, $x_1$ and $y_1$ are the X and Y coordinates of one reference point in
the database, and $x_2$ and $y_2$ are the X and Y coordinates of another reference point.
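The KNN steps described above can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes that, with a single access point, neighbors are ranked by the absolute difference in RSS (the one-dimensional case of the Euclidean distance) and that the label of each training sample is its reference point coordinate. The function name `knn_predict` and the RSS values are hypothetical.

```python
from collections import Counter


def knn_predict(train, query_rss, k=3):
    """Classify one RSS reading by majority vote among its k nearest neighbors.

    train: list of (rss_dbm, label) pairs, where label is a reference point (x, y).
    Assumption: distance is |rss - query_rss|, since only one access point is used.
    """
    nearest = sorted(train, key=lambda item: abs(item[0] - query_rss))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training readings for two reference points.
train = [
    (-40.0, (1, 7)), (-41.0, (1, 7)), (-42.0, (1, 7)),
    (-50.0, (5, 6)), (-51.0, (5, 6)), (-52.0, (5, 6)),
]

for k in (1, 3, 5):  # the three K values used in the paper
    print(k, knn_predict(train, -41.5, k))
```

With K = 5 in this toy data, three of the five nearest readings belong to (1, 7), so the majority vote still returns (1, 7) even though two neighbors come from the other reference point.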

3.4 Deep Neural Network (DNN)

Figure 6: Deep Neural Network (DNN) Model

The DNN is used as a classifier in the same way as KNN, as a supervised machine learning method.
The DNN predicts the location by fitting a model to the training data. Location estimation with a
DNN has two phases: first the model is trained on the data, and once a sound model has been
obtained, it is used to predict the location.

The first step in using a DNN classifier is to label the data. The data is labelled using the RSS
of Access Point 1 (AP1) as the feature. The model then needs to be trained. For training, the
hyperparameters of the model are chosen heuristically. Because the variable under study in this
experiment is the sample size, the hyperparameters are kept fixed across all sample sizes. Two
hidden layers are used, with 250 nodes in the first layer and 20 nodes in the second layer. A
dropout of 0.2 is used to avoid overfitting. Both hidden layers use the sigmoid activation
function, as it can handle negative values. For the output layer, the SoftMax activation function
is used, because it converts the output nodes into class probabilities, and the node with the
highest value gives the predicted user location. The model is illustrated in Figure 6.

Finally, the Adam optimizer was chosen for the model. Training uses backpropagation, in which the
algorithm learns from the output and optimizes the model by adjusting its weights to reduce the
output error. The model is validated using the accuracy metric described in subsection 3.5.

Deep learning learns from the data that the user feeds into it. The objective of classification
with the DNN is to select the output with the highest probability under the optimized model. The
highest probability is obtained through the following set of formulas. First, each input $x_i$ is
multiplied by its weight $w_i$, and the products are summed to give the input $u$ to a node in the
next (hidden) layer.

$$u = \sum_{i=1}^{n} x_i w_i$$

where

$$x_i = R_{xy}$$

and $R_{xy}$ is the Wi-Fi received signal strength (RSS) from one of the access points, in dBm.

In the hidden layer, $u$ is then passed through a non-linear activation function, the sigmoid, to
give an output value between 0 and 1,

$$S(u) = \frac{1}{1 + e^{-u}}$$

where $u$ is the input and $S$ is the output.
After this process has been completed for both hidden layers, the result of the second hidden
layer is passed to the SoftMax equation, which gives a continuous probability for each class,

$$F(\vec{S})_i = \frac{e^{S_i}}{\sum_{j=1}^{k} e^{S_j}}$$

where $F$ is the output, $\vec{S}$ is the input vector obtained from the preceding sigmoid
activation, $k$ is the number of classes in the output (42 in this case), $e^{S_i}$ is the
standard exponential function applied to the $i$-th element of the input vector, and the
denominator sums $e^{S_j}$ over all $k$ elements so that the outputs sum to 1.
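The forward pass defined by these equations can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' model: the weights are random and untrained, bias terms are omitted to match the weighted-sum equation above, dropout is left out because it only applies during training, and the helper names (`weighted_sums`, `make_weights`, `forward`) are hypothetical. The layer sizes (1 input, 250 and 20 hidden nodes, 42 output classes) follow the text.

```python
import math
import random


def sigmoid(u):
    """S(u) = 1 / (1 + e^(-u)), output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-u))


def softmax(s):
    """F(S)_i = e^(S_i) / sum_j e^(S_j); the max is subtracted for numerical stability."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sums(inputs, weights):
    """u = sum_i x_i * w_i for every node; weights[j][i] links input i to node j."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weights]


def make_weights(n_in, n_out, rng):
    # Random, untrained weights: placeholders for what training would learn.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def forward(rss_dbm, layers):
    """Propagate one RSS reading through the sigmoid hidden layers and the SoftMax output."""
    activations = [rss_dbm]
    for weights in layers[:-1]:
        activations = [sigmoid(u) for u in weighted_sums(activations, weights)]
    return softmax(weighted_sums(activations, layers[-1]))


rng = random.Random(0)
layers = [make_weights(1, 250, rng), make_weights(250, 20, rng), make_weights(20, 42, rng)]
probs = forward(-47.0, layers)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(sum(probs), 6))
```

The predicted class is the index of the largest SoftMax output, which corresponds to one of the 42 reference points. With untrained weights the prediction itself is meaningless; the sketch only shows how the three equations chain together.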

3.5 Validation Accuracy and Distance Error

Accuracy is determined for both the KNN and DNN approaches. Of the full data set, 20% is held out
to evaluate the performance of the different sample sizes. The data is evaluated by comparing the
estimated position with the true reference point. The resulting ratio is multiplied by 100 to
express it as a percentage, which is easier for the reader to interpret. Distance error is the
second evaluation measure; it is the Euclidean distance between the predicted reference point and
the true reference point. The distance errors are plotted as a cumulative distribution function
(CDF) graph, and their distribution is examined through the average and maximum errors.

To obtain the accuracy, the ratio of the number of correct predictions to the total number of
predictions is used,

$$A = \frac{CP}{TP} \times 100$$

where $A$ is the accuracy, $CP$ is the number of correct predictions, and $TP$ is the total number
of predictions. The accuracy is reported as a percentage in Table 2.
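Both metrics can be written out in a few lines of Python. The sketch below is illustrative only: the function names and the example predictions are hypothetical, and the 1 meter grid spacing from the data collection is assumed when converting grid coordinates to meters.

```python
import math


def accuracy_percent(true_points, predicted_points):
    """A = CP / TP * 100, where CP counts exact reference point matches."""
    correct = sum(t == p for t, p in zip(true_points, predicted_points))
    return correct / len(true_points) * 100


def distance_errors(true_points, predicted_points, spacing_m=1.0):
    """Euclidean distance in meters between each true and predicted reference point."""
    return [spacing_m * math.dist(t, p) for t, p in zip(true_points, predicted_points)]


# Hypothetical test predictions on the reference grid, for illustration only.
true_points = [(0, 0), (1, 2), (3, 4), (5, 6)]
predicted_points = [(0, 0), (1, 3), (0, 0), (5, 6)]

acc = accuracy_percent(true_points, predicted_points)
errors = distance_errors(true_points, predicted_points)
mean_error = sum(errors) / len(errors)
max_error = max(errors)
print(acc, mean_error, max_error)
```

The two metrics can disagree: a prediction one grid cell away counts as wrong for accuracy but contributes only 1 meter to the distance error, whereas a prediction across the room counts the same for accuracy but raises the maximum error sharply.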



4 Result and Discussion
In this section, the fingerprinting database is analyzed and shown in two forms, a heatmap (2D)
and a distribution map (3D). The RSS signal characteristics of different sample sizes are shown and
discussed. Then, the statistics (mean, mode, and standard deviation) are examined to determine the
effect of different sample sizes. Box plots are drawn for the different sample sizes at 4 reference
points to show how the median stabilizes, since the median is important in the fingerprinting
method.

4.1 RSS Intensity Heatmap

Figure 7: RSS Intensity Heatmap

To give a clear view of the measured signal strength intensity, a heatmap was created, as
illustrated in Figure 7. In the heatmap, darker areas indicate stronger signal strength, as at
coordinate (1,7), which is the strongest position. As the device moves further away from the access
point (router), the signal strength becomes weaker, as at coordinate (5,6). The heatmap gives a
clear presentation of the intensity of the RSS data for a rough analysis of the fingerprinting
database.

Figure 8: RSS Intensity Distribution

The RSS intensity distribution is illustrated in Figure 8 to give a clear presentation of the
signal strength in the collected fingerprinting database. The dark blue and yellow colors indicate
the strongest signal, in the range of -40 dBm to -30 dBm, at the points nearest to the access
point. The majority of the distribution lies between -50 dBm and -45 dBm. The weakest signal
strength is between -55 dBm and -50 dBm, at position (6,1), the furthest from the access point.
The signal strength fluctuates with location. Some locations have the lowest signal strength even
in a middle position, such as coordinate (6,3). This is expected, as RSS is not linearly
proportional to distance.

4.2 RSS characteristic

The received signal strength (RSS) from the router is recorded to obtain the properties of the
signal. A variety of sample sizes is used to examine their effect. Figure 9 shows the RSS
characteristics for the different sample sizes. The RSS from Wi-Fi does not produce a smooth line;
this is expected, as the Wi-Fi signal is not stable over time.

In Table 1, one of the reference points, at coordinate (0,1), is used to analyze the statistical
properties of the Wi-Fi signal strength. The mean, mode, and standard deviation are compared for
the 8 sample sizes. The mean starts at -48.20 dBm with 10 samples and moves to -46.75 dBm with 20
samples. This is due to outliers at the start of the collected samples. As the sample size
increases, the mean begins to stabilize. The means at 500 and 1000 samples differ by only 0.28
dBm. This shows that 500 samples is a promising threshold for the sample size of a fingerprinting
database. The mode, the most frequently observed value, is -45 dBm for all sample sizes.

The standard deviation is at or below 2 dBm for the first three sample sizes of 10, 20, and 30. At
40 samples, the standard deviation starts to increase, meaning that the dispersion is getting
larger. At 500 samples, the standard deviation decreases significantly, and it decreases further at
1000 samples. This shows that an acceptable dispersion is reached at 500 samples, and the best is
at 1000 samples, with a standard deviation of 1.03 dBm (Table 1).

Table 1: Statistics for different sample sizes

Number of samples   Mean (dBm)   Mode (dBm)   Standard deviation (dBm)
10                  -48.20       -45          1.87
20                  -46.75       -45          2.00
30                  -46.17       -45          1.82
40                  -46.50       -45          2.10
50                  -47.32       -45          2.51
100                 -49.30       -45          2.54
500                 -50.25       -45          1.34
1000                -50.53       -45          1.03
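Statistics of the kind reported in Table 1 can be computed with Python's standard `statistics` module. The sketch below is illustrative: the readings are synthetic, the function name `summarize` is hypothetical, each subset is assumed to be the first n readings, and the sample standard deviation (`statistics.stdev`) is an assumption, since the paper does not say whether the sample or population form was used.

```python
import statistics

SAMPLE_SIZES = (10, 20, 30, 40, 50, 100, 500, 1000)


def summarize(rss_dbm, sizes=SAMPLE_SIZES):
    """Return (n, mean, mode, standard deviation) for the first n readings of each size."""
    rows = []
    for n in sizes:
        if n > len(rss_dbm):
            continue  # skip sizes larger than the available data
        subset = rss_dbm[:n]
        rows.append((n,
                     statistics.mean(subset),
                     statistics.mode(subset),    # most frequent reading
                     statistics.stdev(subset)))  # sample standard deviation (assumed)
    return rows


# Synthetic readings (dBm) repeating a fixed pattern, for illustration only.
rss = [-50, -45, -45, -48, -45, -46, -45, -47, -45, -49] * 100

for n, mean, mode, std in summarize(rss):
    print(f"{n:>5}  {mean:7.2f}  {mode:4d}  {std:5.2f}")
```

Because the mode is the most frequent reading rather than an average, it can stay fixed (as the -45 dBm in Table 1 does) while the mean drifts with the sample size.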

Figure 9 shows the RSS characteristics of one reference point, at coordinate (0,1), for sample
sizes of 10, 20, 30, 40, 50, 100, 500, and 1000. At the smaller sample sizes in Figures 9a, 9b,
and 9c, the range is between -50 dBm and -45 dBm. The range widens by 1 dBm at the larger sample
sizes in Figures 9d, 9e, and 9f. The largest sample sizes, 500 and 1000 in Figures 9g and 9h, give
a range between -52 dBm and -45 dBm. In Figure 9a, the range cannot be seen clearly because there
are only 10 samples. At the larger sample sizes in Figures 9g and 9h, the signal shows outliers at
the beginning. This is useful information: it suggests that the earliest part of the signal may not
be suitable to use, as it can contain outliers.



(a) RSS signal characteristic with 10 sample sizes (b) RSS signal characteristic with 20 sample sizes

(c) RSS signal characteristic with 30 sample sizes (d) RSS signal characteristic with 40 sample sizes

(e) RSS signal characteristic with 50 sample sizes (f) RSS signal characteristic with 100 sample sizes

(g) RSS signal characteristic with 500 sample sizes (h) RSS signal characteristic with 1000 sample sizes

Figure 9: RSS at one reference point (0,1) with various sample sizes.



(a) Box plot at coordinate (0,1) (b) Box plot at coordinate (2,1)

(c) Box plot at coordinate (7,2) (d) Box plot at coordinate (7,6)

Figure 10: Box plots of different sample sizes for 4 reference points.

The median for each sample size is shown in Figure 10. To avoid outlier interference, box plots
were drawn for 4 reference points: coordinate (0,1) in Figure 10a, coordinate (2,1) in Figure 10b,
coordinate (7,2) in Figure 10c, and coordinate (7,6) in Figure 10d. In Figure 10a, the median
starts to stabilize when the sample size reaches 500. The reference point at coordinate (2,1) in
Figure 10b behaves the same as in Figure 10a, reaching stability at 500 samples. In Figure 10c at
coordinate (7,2), the median stabilizes at 50 samples. In Figure 10d at coordinate (7,6), the
median is already stable at 20 samples. However, for stable accuracy, a sample size of 50 is
considered a good threshold, as the median shows no difference at 100 samples and above.

4.3 Accuracy prediction using Deep Neural Network (DNN) and K-Nearest Neighbors (KNN)

The accuracy of DNN and KNN is shown in Table 2 as a percentage. For DNN, the accuracy is 7% at 10
samples and increases to 14% when the sample size is increased to 20. The accuracy of the DNN
method keeps increasing as the sample size increases. The increase is close to 4% per step across
the range of sample sizes. The highest accuracy, 41%, is reached at 1000 samples, approaching 50%.
This indicates that with the DNN method, increasing the sample size increases the accuracy and
thus provides a better location estimate.

At K = 1, the KNN accuracy increases from 21% to 35% as the sample size grows from 10 to 30.
However, the accuracy then fluctuates: it drops back to 27% and then rises again to 39%. The
maximum accuracy is 39% at 500 samples. At K = 3, the accuracy is unstable; it starts at 17% with
10 samples and fluctuates within a small range until it reaches its maximum of 30% at 1000
samples. For K = 5, the result is similar to K = 3, with the accuracy fluctuating within a small
range. The highest accuracy is 37% at 500 samples. This shows that the KNN method does not rely on
the sample size alone, as it also depends on the value of K that the user applies. For example,
with K = 5 the algorithm takes the 5 nearest neighbors and assigns the location held by the
majority of them. Hence, KNN does not depend solely on the sample size, unlike the DNN method,
whose accuracy increases as the sample size increases.

Table 2: Accuracy of the two algorithms

Sample size   DNN (%)   KNN (%)
                        K = 1   K = 3   K = 5
10            7         21      17      17
20            14        28      24      13
30            22        35      22      14
40            24        27      22      15
50            26        28      23      23
100           30        34      23      21
500           39        39      28      37
1000          41        38      30      24

4.4 Distance Error

In this subsection, the distance error graphs are discussed. The main objective is to determine
the relationship between the sample size and the distance error of the fingerprinting database.
The graphs show the maximum and the average distance error for the different sample sizes and for
the different algorithms used.

The distance error is not computed with the same equation as the accuracy metric. To compare the
distance error between the sample sizes, the cumulative distribution function (CDF) was plotted.
The distance error is obtained by comparing the predicted location with the true location, using
the Euclidean distance as in subsection 3.3. The CDF graphs of the distance error for DNN and KNN
are plotted in Figure 11 and Figure 12.
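An empirical CDF of distance errors can be built by sorting the errors and assigning each one its cumulative fraction. The Python sketch below illustrates that idea with hypothetical error values; the function names are not from the paper, and the authors' plotting procedure is not described in the text.

```python
def empirical_cdf(errors):
    """Return (error, cumulative fraction) pairs, sorted by error."""
    ordered = sorted(errors)
    n = len(ordered)
    return [(e, (i + 1) / n) for i, e in enumerate(ordered)]


def fraction_within(errors, limit_m):
    """Fraction of predictions whose distance error is at most limit_m meters."""
    return sum(e <= limit_m for e in errors) / len(errors)


# Hypothetical distance errors in meters, for illustration only.
errors_m = [0.0, 1.0, 1.0, 2.0, 5.0]

cdf = empirical_cdf(errors_m)
for error, fraction in cdf:
    print(f"{error:4.1f} m  {fraction:.1f}")
print(fraction_within(errors_m, 1.0))
```

Read this way, a curve that rises faster indicates that a larger share of predictions falls within a small distance error, and the point where the curve reaches 1 gives the maximum error.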

Figure 11: DNN Distance Error

4.5 Deep Neural Network (DNN) Distance Error

In Figure 11, the sample size of 500 gives the best result, a distance error of approximately 1.5
meters. This is followed by sample sizes of 30, then 10, 40, 1000, 50, and 20. This ordering
indicates that the distance error has no clear relationship with the sample size. In the accuracy
test, the DNN method is affected by the sample size: as the sample size increases, the accuracy
also increases. For the distance error, however, the results show no relationship between the
distance error and the sample size.

The maximum distance error gives a similar result, with small differences in the ordering. It
still does not show a linear relationship between sample size and distance error. At 500 samples,
the maximum distance error is 4.2 meters, which is the best result. The others range from around 5
meters to 8 meters. The worst are the sample sizes of 20, 30, 40, and 100, each with a maximum
distance error of 7.2 meters. For 1000 samples, the maximum distance error is 5.8 meters. Hence,
the distance error of the DNN method does not show a linear relationship with the sample size.

4.6 K-Nearest Neighbor (KNN) Distance Error

This subsection uses K-Nearest Neighbor to estimate the location. The maximum and average distance
errors are calculated from the distance errors of the predictions. The aim is the same as for the
DNN distance error. The only difference is the equation used for the distance error.

Figure 12: KNN Distance Error

For the KNN prediction method, shown in Figure 12, the distance error across sample sizes
behaves like its validation accuracy: it is unpredictable with respect to the sample
size. On average, the best sample size is 30, which gives a distance error of 1 meter. It is
followed by 10, 50, 1000, 500, 20, 100 and 40 samples. This sequence shows that the distance error is
not linearly proportional to the sample size. The maximum distance error ranges
from 4 meters to 6 meters. The highest maximum distance error occurs with 20 and 40 samples, at
5.2 meters. The lowest maximum distance error occurs with 50 samples, at
4 meters. The distance error of the KNN method is unpredictable, as it shows no correlation
between the sample size and the distance error.
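
One way to quantify the ordering reported above is a rank correlation. The sketch below computes the Spearman rank correlation between the sample sizes in ascending order and the best-to-worst ordering of average KNN distance error given in the text. This check is an illustrative assumption and is not part of the original analysis; the helper name spearman_rho is hypothetical, and the formula used is valid only when there are no ties.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    rank_a = {item: r for r, item in enumerate(order_a, start=1)}
    rank_b = {item: r for r, item in enumerate(order_b, start=1)}
    d_squared = sum((rank_a[x] - rank_b[x]) ** 2 for x in order_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Sample sizes from smallest to largest.
sizes_ascending = [10, 20, 30, 40, 50, 100, 500, 1000]
# Ordering of average KNN distance error, best to worst, as reported in the text.
knn_best_to_worst = [30, 10, 50, 1000, 500, 20, 100, 40]

rho = spearman_rho(sizes_ascending, knn_best_to_worst)
print(round(rho, 3))
```

For this ordering the coefficient is about 0.26, far from +1 or -1, which is consistent with the observation that the KNN distance error does not follow the sample size in any consistent direction.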

5 Conclusion
This paper discussed the effect of sample sizes in the fingerprinting database for a Wi-Fi system.
Through intensive data analysis, the different sample sizes for the fingerprinting database were
examined. First, the data were collected in a 5 x 6-meter area. Then, the intensity and distribution
of the RSS were presented, followed by a statistical analysis of the data. Lastly, two methods,
the conventional KNN method and the more advanced DNN method, were applied to examine the
effect of the sample size in terms of accuracy and distance error.

From the characteristics of the signal at different sample sizes, outliers could be the major
problem: an outlier at the beginning of the sample could degrade the accuracy severely, as
the standard deviation increases. If the database has only 10 samples, an outlier could
affect the accuracy of future predictions. In the statistical analysis, the mean, mode, and standard
deviation were discussed. The mean was affected by outliers: with 10 samples the mean is
higher than with 20 samples, owing to the outlier in the smaller sample. The
standard deviation is 1.87 with 10 samples and increases up to 100 samples. It then
drops at 500 samples and decreases to 1.04 at 1000 samples. This
shows that from 500 samples the dispersion of the signal is reduced, and that more samples give less
dispersion, or error, in the signal. Afterward, box plots were used to compare the sample sizes
in terms of the median. Across the 4 reference points analyzed, different locations stabilize
differently in terms of the median. However, the median is most stable once the sample size reaches
500, as seen at all 4 reference points discussed.
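
The effect of a single outlier on a small sample can be illustrated with a toy calculation. The sketch below uses synthetic RSS values (a constant -60 dBm with one -40 dBm outlier), which are assumptions chosen for illustration and not the measured data; the helper name outlier_effect is hypothetical.

```python
import statistics

def outlier_effect(n, base=-60.0, outlier=-40.0):
    """Return (mean shift, sample std dev) when one of n RSS readings is an outlier."""
    clean = [base] * n
    with_outlier = [outlier] + [base] * (n - 1)
    mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
    return mean_shift, statistics.stdev(with_outlier)

# Compare a small and a large sample, each containing the same single outlier.
shift_10, sd_10 = outlier_effect(10)
shift_500, sd_500 = outlier_effect(500)
print(shift_10, sd_10)
print(shift_500, sd_500)
```

With these toy values, one outlier shifts the mean by 2 dB for 10 samples but only by 0.04 dB for 500 samples, and the standard deviation is likewise much larger for the small sample. Real RSS data are noisier, so the sketch shows only the direction of the effect.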

For the performance metrics, accuracy and distance error were tested across the different sample
sizes using two methods, KNN and DNN. For DNN, the accuracy increases gradually with
the sample size. For KNN, however, the accuracy is not linearly proportional to the sample
size: increasing the sample size does not yield better accuracy. For the distance error,
both DNN and KNN give unpredictable results, and neither method shows a
relationship between the sample size and the distance error. Nevertheless, the sample size still affects
the distance error, as shown in Figure 11 and Figure 12 in the results section. In short, the sample
size affects the accuracy and distance error of the fingerprinting system, but it does
not show a linear relationship with either, except in the DNN accuracy test.

In future work, multiple devices will be examined in crowdsourced data, since the users who contribute
to a crowdsourced fingerprinting database will have different hardware. This creates a bigger
problem because of the signal diversity across devices. The experiment will be analyzed statistically
to improve the prediction of user location in Wi-Fi fingerprinting systems.
This will build on previous researchers' work and contribute to indoor positioning systems.

Funding

The authors would like to acknowledge the support from the Fundamental Research Grant Scheme
(FRGS) under grant number FRGS/1/2018/ICT05/UNIMAP/02/4 from the Ministry of Education
Malaysia.

Author contributions

The authors contributed equally to this work.

Conflict of interest

The authors declare no conflict of interest.

References
[1] Bolliger, P. (2008). Redpin - adaptive, zero-configuration indoor localization through user col-

laboration, Proceedings of the First ACM International Workshop on Mobile Entity Localization
and Tracking in GPS-Less Environments - MELT ’08, 55, 2008.

[2] Chunhan Lee, Yushin Chang, Gunhong Park, Jaeheon Ryu, Seung-Gweon Jeong, Seokhyun Park,
Jae Whe Park, Hee Chang Lee, Keum-shik Hong, & Man Hyung Lee. (2004). Indoor positioning
system based on incident angles of infrared emitters, 30th Annual Conference of IEEE Industrial
Electronics Society, 3, 2218–2222, 2004.


https://doi.org/10.15837/ijccc.2021.6.4394 15

[3] Dinh, T.-M. T., Duong, N.-S., & Sandrasegaran, K. (2020). Smartphone-Based Indoor Positioning
Using BLE iBeacon and Reliable Lightweight Fingerprint Map. IEEE Sensors Journal, 20(17),
10283–10294, 2020.

[4] Fang, S.-H., & Wang, C.-H. (2011). A Dynamic Hybrid Projection Approach for Improved Wi-Fi
Location Fingerprinting, IEEE Transactions on Vehicular Technology, 60(3), 1037–1044, 2011.

[5] Ismail, Abdul Halim, Mizushiri, Y., Tasaki, R., Kitagawa, H., Miyoshi, T., & Terashima, K.
(2017). A Novel Automated Construction Method of Signal Fingerprint Database for Mobile
Robot Wireless Positioning System, International Journal of Automation Technology, 11(3),
459–471, 2017.

[6] Ismail, A. H., & Terashima, K. (2018). Prediction of WiFi signal using kalman filter for
fingerprinting-based mobile robot wireless positioning system, Journal of Telecommunication,
Electronic and Computer Engineering, 10(1–15), 17–21, 2018.

[7] Kanaris, L., Kokkinis, A., Fortino, G., Liotta, A., & Stavrou, S. (2016). Sample Size Deter-
mination Algorithm for fingerprint-based indoor localization systems, Computer Networks, 101,
169–177, 2016.

[8] Kim, B., & Kong, S.-H. (2016). A Novel Indoor Positioning Technique Using Magnetic Fingerprint
Difference, IEEE Transactions on Instrumentation and Measurement, 65(9), 2035–2045, 2016.

[9] Liu, H., Darabi, H., Banerjee, P., & Liu, J. (2007). Survey of Wireless Indoor Positioning Tech-
niques and Systems, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications
and Reviews), 37(6), 1067–1080, 2007.

[10] Ma, C., Jee, G.-I., Macgougan, G., Lachapelle, G., Bloebaum, S., Cox, G., Garin, L., & Shewfelt,
J. (2001). GPS Signal Degradation Modeling, Proceedings of International Technical Meeting of
the Satellite Division of the Institute of Navigation, 1–12, 2001.

[11] Raquet, J., & Martin, R. K. (2008). WiFi-based indoor positioning, IEEE Communications Mag-
azine, 53(3), 150–157, 2008

[12] Roos, T., Myllymäki, P., Tirri, H., Misikangas, P., & Sievänen, J. (2002). A Probabilistic Ap-
proach to WLAN User Location Estimation, International Journal of Wireless Information Net-
works, 9(3), 155–164, 2002.

[13] Sa’ahiry, A. H. A., Ismail, A. H., Kamarudin, L. M., Zakaria, A., & Nishizaki, H. (2021). An
Experimental Study of Deep Learning Approach for Indoor Positioning System Using WI-FI
System Proceedings of SympoSIMM 2020, 113–124, 2021.

[14] Tarekegn, G. B., Juang, R. T., Lin, H. P., Adege, A. B., & Munaye, Y. Y. (2021). DFOPS: Deep-
Learning-Based Fingerprinting Outdoor Positioning Scheme in Hybrid Networks, IEEE Internet
of Things Journal, 8(5), 3717–3729, 2021.

[15] Teller, S., Ryan, R., Battat, J., Charrow, B., Ledlie, J., Curtis, D., & Hicks, J. (2008). Organic
Indoor Location Discovery, Mit-Csail-Tr-2008-075.

[16] Thewan, T., Ismail, A. H., Panya, M., & Terashima, K. (2016). Assessment of WiFi RSS using
design of experiment for mobile robot wireless positioning system, FUSION 2016 - 19th Interna-
tional Conference on Information Fusion, Proceedings, July, 855–860, 2016.

[17] Wang, H., Szabo, A., Bamberger, J., Brunn, D., & Hanebeck, U. D. (2008). Performance com-
parison of nonlinear filters for indoor WLAN positioning, Proceedings of the 11th International
Conference on Information Fusion, FUSION 2008, May 2014.

[18] Yang, S., Dessai, P., Verma, M., & Gerla, M. (2013). FreeLoc: Calibration-free crowdsourced
indoor localization, 2013 Proceedings IEEE INFOCOM 2481–2489, 2013.


https://doi.org/10.15837/ijccc.2021.6.4394 16

[19] Yang, C., & Shao, H. (2015). Non-GNSS radio frequency navigation, IEEE International Confer-
ence on Acoustics, Speech and Signal Processing, 5308–5311, 2015.

[20] Ye, F., Chen, R., Guo, G., Peng, X., Liu, Z., & Huang, L. (2019). A Low-Cost Single-Anchor Solu-
tion for Indoor Positioning Using BLE and Inertial Sensor Data, IEEE Access, 7, 162439–162453,
2019.

[21] Youssef, M., & Agrawala, A. (2005). The Horus WLAN location determination system, Proceed-
ings of the 3rd International Conference on Mobile Systems, Applications, and Services - MobiSys
’05, 205, 2005.

[22] Zheng-dong, L., Xing-jie, C., Xiu-ling, L., Juan, C., Yan, H., & Hai-mei, X. (2020). Design of
Ultra-Wideband Localization System Based on Optimized Time Difference of Arrival Algorithm,
IEEJ Transactions on Electrical and Electronic Engineering, 15(8), 1176–1182, 2020.

Copyright ©2021 by the authors. Licensee Agora University, Oradea, Romania.
This is an open access article distributed under the terms and conditions of the Creative Commons
Attribution-NonCommercial 4.0 International License.
Journal’s webpage: http://univagora.ro/jour/index.php/ijccc/

This journal is a member of, and subscribes to the principles of,
the Committee on Publication Ethics (COPE).

https://publicationethics.org/members/international-journal-computers-communications-and-control

Cite this paper as:

Sa’ahiry, A. H. A., Ismail, A. H., Kamaruddin, L. M., Hashim, M. S. M., Azmi, M. S. M., Safar,
M. J. A., & Toyoura, M. (2021). Effect of Sample Sizes in Fingerprinting Database for Wi-Fi System,
International Journal of Computers Communications & Control, 16(6), 4394, 2021.

https://doi.org/10.15837/ijccc.2021.6.4394

