INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
ISSN 1841-9836, e-ISSN 1841-9844, 14(3), 311-328, June 2019.

Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis

M. Chen

Mo Chen*
Business College of Beijing Union University
A3, Yanjingdongli, Chaoyang District, Beijing, 100025, P.R. China
*Corresponding author: mo.chen@buu.edu.cn

Abstract: In the development background of today’s big data era, the research
direction of Web hierarchical topic detection and evolution characterized by the semi-
structured or unstructured data has caught wide attention for academicians. This
paper proposes an idea of Web hierarchical topic detection and evolution based on
behaviour tracking analysis taking the network big data as the research object, and
expounds main implementation methods, which include the instance analysis of the
usage mode, the instance analysis of the seed, the set analysis of similar instance
supporting the topics, the set analysis of similar instance supporting the events, the
evolution analysis of the event, and expounds the algorithm of Web hierarchical topic
detection and evolution based on behaviour tracking analysis. The process of ex-
perimental analysis is organized as follows, first of all, the experiment analyses the
quality of topic detection, the accuracy rate with the number of instance concerned
and the seed threshold variation trend, the accuracy rate with the number of instance
concerned and the probability threshold variation trend, secondly, the experiment
analyses the quality of topic evolution, the accuracy rate with the variation trend of
parameter adjustment, the accuracy rate with the number of instance concerned and
the similar threshold variation trend, finally, the experiment analyses the time con-
suming to solve main research problem under different method, the qualitative result
of topic detection and evolution under different data set. The results of experimental
analysis show the idea is feasible, verifiable and superior, which plays a major role
in reconfiguring Web hierarchical topic corpus and providing an intelligent big data
warehouse for the network information evolution application.
Keywords: Web hierarchical topic, topic detection, event evolution, behaviour track-
ing analysis.

1 Introduction

In the development background of Web text mining technology and big data era, so far,
the field of intelligent technology has also developed into a more challenging stage [7, 13, 28],
the network has become one of services, which can transmit most popular information for users.
According to deep survey, the number of network data has gone through EB level in different
domain [14, 18, 19, 27]. Academicians should cogitate how to analyse intricate big data, never-
theless, it is an important and key application direction for researching Web hierarchical topic
detection and evolution based on behaviour tracking analysis.

In the network big data, the number of Internet news is showing explosive growth as a kind
of flow resource with on-going events, which has shown 5V features of volume, variety, value,
velocity and veracity [3,23,26]. Based on above characteristics, the Internet news should reflect
high currency and reliability [5], on this basis, the topic of Internet news should be quickly
detected, and its evolution path should be tracked in real time. However, how to research Web
hierarchical topic detection and evolution based on behaviour tracking analysis, it has become
an urgent problem to build a Web hierarchical topic corpus and provide a real-time big data
source for the network information evolution application.

Copyright ©2019 CC BY-NC


312 M. Chen

Through researching the literature related to topic detection and evolution technology, this
paper proposes an idea of analysing the process for Web hierarchical topic detection and evolu-
tion, and expounds main implementation methods and algorithms following with interest Web
hierarchical topic detection and evolution from behaviour tracking analysis. This process does
important contribution for researching a method of analysing the detection and evolution for
Web hierarchical topic, the results of experimental analysis show that the implement of this idea
is feasible, verifiable and superior.

2 Related works

In recent years, some scholars have done certain research about the technology of Web topic
detection and evolution. A statistical model is proposed [2], in this model, it can combine context
with related topics by jointly modelling the topic word with the hash tag and the time stamp, in
order to detect and track interpretable topics over time along with their distribution of the hash
tag, in this technical context, the experiment demonstrates that this model effectively reveals the
process of topic detection and evolution by using the real dataset, this model is different from the
traditional topic mining model, it shows serious improvement due to this fact that the distribution
of the metadata containing in user content generated can be analysed, so the whole research result
does main contribution in the area of topic detection and evolution in the context of the statistical
analysis. A topic detection and evolution method is proposed by analysing the semantic word
shift, the topic trend, and the evolving dynamic using the data set [4], in this method, it can
merge and split local topics in different time periods, in order to track the process of knowledge
transfer among topics, in this technical context, the experimental results show that the process of
topic detection and evolution usually follows pattern from adjusting status to mature status, and
sometimes with readjusting status, this method is different from the statistical analysis, it shows
serious improvement due to this fact that the word migration via topic channels has been defined,
and three migration types of non-migration, dual-migration, and multi-migration are better to
understand topic detection and evolution, so the whole research result does main contribution in
the area of topic detection and evolution in the application direction of information retrieval. A
topic detection and evolution method is proposed [30], which is called the citation-content-latent
Dirichlet allocation method, in this method, it can account for the document citation relation and
the content of document itself via a probabilistic generative model, this model can deal with the
citation and text information, and its parameters are estimated by a collapsed Gibbs sampling
algorithm, in addition, a topic detection and evolution algorithm is designed, which can run in
two steps of the topic segmentation and the topic dependency relation calculation, this model and
algorithm have been tested by using the online dataset, in this technical context, the experimental
results demonstrate that the implementation of the model and algorithm can more effectively
detect important topics and reflect topic evolution process comparing with the topic tracking in
the knowledge transfer context, so the whole research result does main contribution in the area
of topic detection and evolution for designing the model and algorithm. A topic detection and
evolution framework is proposed based on the probabilistic topic model [31], in this framework,
firstly, the notations, the terminology, and the basic topic mining model is introduced, secondly,
three technologies of the topic detection and evolution are applied, which are the discrete time
topic detection and evolution, the continuous time topic detection and evolution, and the online
topic detection and evolution, thirdly, the application of this framework is discussed, in this
technical context, the comparative experiments are completed for different technologies of the
topic mining, this framework shows serious improvement than single probabilistic model and does
main contribution in the area of the topic detection and evolution performance evaluation. A
topic detection and evolution method is proposed based on the analysis of the content similarity


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 313

or dissimilarity using the textual material [12], in this method, the graph-theoretical technology
is applied, in order to deal with the network relationship among the content-similar topics, in
this technical context, the explanatory experiment more effectively illustrates usefulness of the
approach using the online news articles in different situations, so the whole research result does
main contribution in the area of topic detection and evolution in the context of the network
analysis.

Based on above analysis of the literature about the technology of Web topic detection
and evolution, through comparing the difference among technologies and analysing the main
contributions in the research area, if want to research the process of Web topic detection and
evolution, in addition to using Web mining technology based on the structure and content, but
also using Web mining technology based on the behaviour tracking. Therefore, in recent years,
some scholars have also done certain research about the technology of Web behaviour tracking
analysis.

A collaborative tagging model is proposed [24], in this model, the technology of the usage
behavior tracking analysis is applied to the information extraction direction for web user query in
the ontology environment due to different structure data, which is based on the idea of the block
acquiring page segmentation, in order to retrieve the tag-based information, in this technical
context, the comparative experiments are completed regarding the average precision rate, the
time cost, and the storage space rate with existing information retrieval model, it shows that
the application of this model can assure research more effectiveness, so the whole research result
does main contribution in the area of the behavior tracking analysis. A tracking method of the
usage behavior is proposed [17], in this method, a relational graph is established by mining the
temporal and causal information among aggregated HTTP request, and an algorithm is designed
and implemented for primary request identification, which is a critical task of web usage mining,
in order to demonstrate higher value and effectiveness, in this technical context, the experimental
result shows that it is a more useful method of analysing a large-scale dataset for the real-
world Web access log, so the whole research result does main contribution in the direction of
mining value for information available. A web usage mining method is proposed [1], this method
can be applied to solve the problem of developing more accurate and efficient recommendation
systems, the traditional data protection mechanism focuses on the access control and the secure
transmission, which provide only security against malicious the third parties, but not the service
provider, so this method can mine efficiently and intelligently data, in order to guide and track the
users’ usage behavior, and does main contribution in the application direction of the modern E-
business. A web usage mining method is proposed [11], in this method, the state-of-the-art session
identification technology is used in terms of limitation, feature, and methodology, which provide
a structured overview of the research development, in this technical context, the comparative
experiments critically review existing session identification technology, the whole research result
does main contribution in highlighting the limitation and related challenge, identifying the area
where further improvement is required, in order to complement the performance of existing
technology. A web usage mining method is proposed [21], in this method, an algorithm is also
designed for the data cleaning and filtering using the web log dataset, this algorithm mainly
complete the process of preprocessing and clustering the usage behavior, which consists of the
use cases for the data cleaning and filtering, the user and session identification, in this technical
context, the experiments are carried out to obtain the aggregate clustering results, through this
process, two datasets of web usage log are collected and processed, so the whole research result
does main contribution in the area of web usage behavior analysis.

Based on above analysis of the literature about the technology of Web behavior tracking
analysis, through comparing the difference among technologies and analysing the main contri-
butions in the research area, the scholars have studied two research directions including the


314 M. Chen

technology of Web topic detection and evolution, but do not fully take into account the hier-
archical series induced by the usage behavior, if do not fully consider this point, then ignore
the process to track the usage behavior from the angle of topic detection and evolution. So
this paper mainly takes web usage behaviour record of news big data as the research source,
utilizes the method of analysing web usage behaviour tracking to perfect news big data corpus
for topic detection and evolution research, and proposes an idea of analysing the process for Web
hierarchical topic detection and evolution to solve difficult problems existing in current research
status.

3 Problem definition and notation

Under the background of Web big data development, users can retrieve and browse Web
news from different dimension, granularity and frequency, which has been analysed and evaluated
[15,20,22]. In this process, the time sequence trajectory of users’ behavior can be recorded. These
data not only record the characteristic of Web news that users use, but also contain the topics
reflected in Web news, and the events that are generated based on these topics. Therefore, the
knowledge hidden in Web news big data can be mined, the topics that users are concerned about
can be detected, a series of events under the topics can be tracked, and the evolution process of
events can be combed out based on the process of analysing Web news usage characteristic.

Based on the analysis of Web news structure, contents and semantic feature, every Web
news that users concern can be regarded as an instance node in every authoritative Web news
network from the perspective of global usage. The social events supported by set of some related
nodes can be considered as a topic, and a series of events will be created under each topic. In
this way, when users are concerned about a series of topics reflected in a social event, they can
not only browse many Web news instances supporting topics, but also browse a series of events
that are generated by this topic. When users are concerned about an event, it can also browse
more than one Web news instance content that supports this event. From the perspective of
local usage, when users retrieve Web news instances, besides inputting the keywords related to
social events reported by Web news, they can also input the keywords with five tuple semantic
description. Therefore, during the analysis process of Web news topic detection and evolution,
the social events can be mined utilizing structure, contents and semantic feature, a series of
events caused by the topics can be mined focusing on mining Web news instances supporting
the events, the logical hierarchical relationship can be mined between mining objects, which will
construct a multi-level structural corpus, it can represent visual topics and events of Web news
with a high degree of quality.

Based on the analysis of Web news utility feature, users can retrieve Web news in the search
process with a high degree of currency, which is called the time accuracy reporting Web news
core events. Users can retrieve Web news with a high degree of truthfulness, which is called the
releasing source reliability of Web news. Users can retrieve Web news of high currency with a high
degree of truthfulness, therefore, during the analysis process of topic detection and evolution, it
should weigh the factors of Web news currency and authenticity, and provide Web news utility
instances supporting the topics and events from the perspective of users’ utility characteristic
for Web news.

Based on the behaviour tracking analysis of web usage characteristic for user, S−U can link
the search keywords and the URL instances synchronously, in the S −U relation, S represents
the set of the keywords, U represents the set of the URL instances. As shown in the formula 1,
fq(s,u) can indicate the clicking frequency of instances, fqi(u) can indicate the clicking frequency
of homologous URL, fq(u) can indicate the clicking frequency of instances in a certain period.
As shown in the formula 2, rti(u) can indicate the clicking rate of homologous URL.


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 315

fq(u) =
n∑
i=1

fqi(u) (1)

rti(u) =
fqi(u)

fq(u)
(2)

NewsSet can be defined using the {ns1, ...,nsi−1,nsi,nsi+1, ...,nsk}, the range of array
is from one to k. nsi.url defines the url instance address, nsi.title defines the instance title,
nsi.pubtime defines the instance releasing time, nsi.pubsource defines the instance releasing
source, nsi.content defines the instance contents, nsi.keyword defines the instance keywords.
UserBehavior can be defined using the {ub1, ...,ubi−1,ubi,ubi+1, ...,ubn}, the range of array
is from one to n. ubi.username defines the user name, ubi.searchword defines the keywords,
ubi.url defines the URL clicked, ubi.systemtime defines the system time.

Based on above definition and notation, the issue should be solved to mine topics contained
in the instances, mine instances set supporting related topics. From research angle of mining Web
news instance set supporting topics concerned by users, the issue should be solved to mine events
that are happening under these topics, analyse the instance set supporting these events, so as to
reflect the evolution process of Web news topics continuously. This result can be denoted using
the TopicURL, it can be represented using the {tu1, ..., tui−1, tui, tui+1, ..., tum}, the range of
array is from one to m, tui can be represented using the < Topic,Topicurl,Event,Eventurl >,
tui.Topic can express the topic description detected, tui.Topicurl can indicate the seed instances
URL set supporting related topics, tui.Event can express the description of events under topics
mined, tui.Eventurl can indicate Web news instance of events supporting topics detected.

4 The analysis of Web hierarchical topic detection and evolution

In view of the problem definition and notation, this paper proposes an idea of analysing the
process for Web hierarchical topic detection and evolution shown in figure 1. This framework
is used for completing the analytical process for Web hierarchical topic detection and evolution,
which include the instance analysis of the usage mode, the instance analysis of the seed, the set
analysis of similar instance supporting the topics, the set analysis of similar instance supporting
the events, and the evolution analysis of the event. The algorithms are designed for completing
the functions and methods of this framework, which include the algorithm of analysing the
process for Web hierarchical topic detection, the algorithm of analysing the process for Web
hierarchical topic evolution.

4.1 The algorithm of analysing the process for Web hierarchical topic detec-
tion

The algorithm of analysing the process for Web hierarchical topic detection is implemented
by designing two methods of the usage mode analysis and the topic series construction, the
inputting content of this algorithm is the result of semantic five tuple description analysis and
utility evaluation for Web news instances, and the user usage behaviour record set, the outputting
content of this algorithm is the similar and sequential set of Web news instances that can support
corresponding topics.

According to web usage behaviour record, the explosive and attention mode of Web news
instances are analysed, in order to infer the click mode of Web news instances, the degree
distribution and similarity mode of Web news instances are also analysed, in order to infer the
retrieval mode of Web news instances. In accordance with these results of analysing web usage


316 M. Chen

Figure 1: The framework of analysing the process for Web hierarchical topic detection and
evolution

mode, the seed set of Web news instances can be mined, the similar set of Web news seed
instances can also be mined. Referring to the utility feature that have been analysed [6], the
semantic five tuple can describe the topics under the time series.

For the process of analysing the explosion mode, the execution process can be viewed a
sensor for the social event. The process of analysing the mode quotes the entropy characteristic,
and the analytical result is a speculation about the sharpness of the click rate change. For
the process of analysing the attention mode, the execution process makes up for the problem
existing in the burst mode, that is, when the research instances are followed by web users, the
measurement standard will be an absolute phenomenon. Based on the analysis of the usage
mode, the click mode of Web news instances can be inferred shown in the formula 3.

ClickMode(u) = (1 − (−
n∑
i=1

rti(u) × lognrti(u))) ×

×
log(fq(u)) −Minui∈U (log(fq(ui)))

Maxui∈U (log(fq(ui))) −Minui∈U (log(fq(ui)))
(3)

In the formula 3, n indicates the number of the granular unit that the instances are followed.
For the sudden event, it can be set in day. For the normal occurrence event, it can be set in
week or month. If the fluctuation is not large for the click rate of Web news instances, then the
attention mode is smaller. If Web news instances have obvious fluctuation, then the attention
mode will be large. According to the click process of Web news instances, it has the power law
distribution characteristic. Therefore, the click frequency of Web news instances has carried on
logarithm transformation.

For the process of analysing the degree distribution mode, the degree of Web news instances
presents the power law distribution, so its logarithmic transformation can be executed. For
the process of analysing the similarity mode, it not only makes up for the problem existing in
the degree distribution mode, which ignores the degree origin of Web news instances through
retrieval keyword, but also solves the problem of the sparse record in the click behaviour of web
user. Based on the analysis of the usage mode, the retrieval mode of Web news instances can be
speculated shown in the formula 4.


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 317

SearchMode(u) =
2

n(n + 1)
×

log(d(u)) −Minui∈U (log(d(ui)))
Maxui∈U (log(d(ui))) −Minui∈U (log(d(ui)))

×

×
n∑
i≤j

∑∞
k∈dataitem(sik(u)sjk(u))√∑∞

k∈dataitem(sik(u))
2
√∑∞

k∈dataitem(sjk(u))
2

(4)

Based on the formula 3 and 4, the set of Web news instances can be mined by using the formula 5,
SeedURL(u) should be larger than or equal to the seed threshold. As shown in the equation 6, <
ti,pi,oi,cei,rei > (nsi.uc) represents the semantic five tuple description for the seed instance. For
the utility evaluation result, the sort sequence of the utility can be executed. In the subsequent
experiments, the optimal value of the seed threshold will be analysed.

SeedURL(u) = ClickMode(u) ×SearchMode(u) (5)

TimeSeries(ns) = (< t1,p1,o1,ce1,re1 > (ns1.uc), ...,

< ti,pi,oi,cei,rei > (nsi.uc), ...,< tn,pn,on,cen,ren > (nsn.uc)) (6)

Algorithm 1 Method 1 Analysing the Usage Mode
1: Input: UserBehavior, Threshold;
2: Output: TopicURL;
3: LET UserRecord ← UserBehavior;
4: LET GroupUserRecord ← GroupByURL(u),TopicURL ← φ;
5: For each gur[i](0 ≤ i ≤ gur.size() − 1) Do
6: SeedURL ← Calculate clickmode and searchmode;
7: If SeedURL ≥ Threshold Then
8: tu.add(SeedURL);
9: End If

10: End For

For the construction of the sequence topic, the execution process uses the probability for
its first transfer, in order to judge whether it is similar to the seed instance supporting topics
by using Web news instances and taking Web news seed instance as the research center. If su
indicates the seed instance, then the variable tu expresses that whether the instance can support
the topic of su, the variable ts expresses that whether the search keyword can support the topic
of su. If the seed instance is able to support the topic of su, then tu = 1, conversely, tu = 0, if
the search keyword is able to support the topic of su, then ts = 1, conversely, ts = 0. In initial
status, tsu is one, P(tsu = 1) is one, the probability is zero. As shown in the formula 7 and 8,
P(ts = 1) is able to calculate su probability, P(tu = 1) is also able to be recalculated, when
P(tu = 1) is larger than or equal to the probability threshold, the Web news instances are able
to be found, so as to mine the similar instance set. In the subsequent experiments, the optimal
value of the probability threshold will be analysed.

P(ts = 1) =

∞∑
u:(s,u)∈E

fq(s,u)∑∞
(s,ui)∈E fq(s,ui)

×P(tu = 1) (7)


318 M. Chen

P(tu = 1) =

∞∑
s:(s,u)∈E

fq(s,u)∑∞
(si,u)∈E fq(si,u)

×P(ts = 1) (8)

Algorithm 2 Method 2 Constructing the Topic Series
1: TopicURL, UserBehavior, Threshold;
2: TopicURL;
3: For each tu[i](0 ≤ i ≤ tu.size() − 1) Do
4: swset1 ← ExistSet(tu[i].getSet(”Topicurl”),ub);
5: While swset1 is not null Do
6: While each swset1 is exist Do
7: p(ts) ← CalculateResult(swset1.getElement(j).position,tu[i].getSet(”Topicurl”),ub);
8: If p(ts) ≥ Threshold Then
9: swset2.addSet(swset1.getElement(j));

10: End If
11: ub ← (swset1.getElement(j).position,p(ts));
12: End While
13: While each swset2 is exist Do
14: wnuset1 ← ExistSet(swset2.getElement(j).position,ub);
15: If(wnuset1 ← EqualSet(wnuset1,wnuset2)) is not null Then
16: While each wnuset1 is exist Do
17: p(tu) ← CalculateResult(wnuset1.getElement(k).position,
18: swset2.getElement(j).position,ub);
19: If p(tu) ≥ Threshold Then
20: wnuset2.addSet(wnu1.getElement(k));
21: End If
22: ub ← (wnuset1.getElement(k).position,p(tu));
23: End While
24: End If
25: End While
26: swset1 ← ExistSet(wnuset2,swset2,ub);
27: End While
28: tu[i] ← wnuset2;
29: Describe Topic tu[i];
30: End For

4.2 The algorithm of analysing the process for Web hierarchical topic evolu-
tion

The algorithm of analysing the process for Web hierarchical topic evolution is implemented
by designing two methods of the event series construction and the event evolution analysis, the
inputting content of this algorithm is the result of analysing semantic five tuple for Web news
instances, the user usage behavior record set and Web news topic set, the outputting content of
this algorithm is the topic set that has been excavated under the events and the evolution result
of analysing the events belonging to the topics.

According to the set of Web news instances that can support the topics mined, the user usage
behavior record is used to analyse the time sequence of Web news instances that are followed,
the similarity degree of the core events reported among Web news instances is calculated by


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 319

using the result of analysing semantic five tuple description, and the evolution state of the events
supported by the similar Web news instances and the topics belonging to the events can also be
analysed.

The calculation result of the time series similarity can show the topics mined among Web
news instances, the user can pay much attention to Web news instances that describe the same
events, which have occurred within a certain time period, and the time sequence concerned is
similar. On a certain granularity, the vector of the time sequence can be expressed as shown in
the formula 9 representing the attention rate of Web news instances. In this formula, rti(uj)
represents the attention rate of the instance uj in i component for a granularity, n indicates
the number of the granular units that the instances are continuously followed. For the sudden
events, the granularity can be set in days, for the normality events, the granularity can be set in
weeks or months.

TimeSeries(tui) = ({rt11(tu11), ...,rt1j(tu1j), ...,rt1m(tu1m)}, ...,{rti1(tui1),
...,rtij(tuij), ...,rtim(tuim)}, ...,{rtn1(tun1), ...,rtnj(tunj), ...,rtnm(tunm)}) (9)

According to the semantic five tuple description of Web news instances supporting the
topics, the core events can be extracted, the similarity degree can be calculated among the
core events reported by Web news instance. As shown in the formula 10, FS(fsi,fsj) can be
calculated with a threshold, which is greater than or equal to the similarity threshold. The
semantic five tuple is used to describe the core events reported by the aggregated Web news
instance set, the Web news instances that represent the node of each event are arranged in
ascending order according to the occurrence time of the core events. If the core events occur
in the same time, then Web news instances are arranged in descending order according to its
utility characteristic. The Web news instances that are clustered together can be arranged in
ascending order according to the occurrence time of the core events, if the core events occur
in the same time, then Web news instances are arranged in descending order according to its
utility characteristic. < ti,pi,oi,cei,rei > (< Ti,Ei > .uc) expresses the seed topic and the
event description, < tij,pij,oij,ceij,reij > (< Ti.uc >,< Eij.uc >) expresses the seed topic of
the evolution event description. In the subsequent experiments, the optimal range of analysing
the parameters and the similarity threshold value will be analysed.

FS(fsi,fsj) = α×
∑n

m=1(tsim, tsjm)√∑n
m=1(tsim)

2
√∑n

m=1(tsjm)
2

+

+β ×
∑n

m=1(ceim,cejm)√∑n
m=1(ceim)

2
√∑n

m=1(cejm)
2

(10)

TopicURL(T,E) = ({< t11,T1,E11 >,...,< t1j,T1,E1j >,...,< t1m,T1,E1m >}, ...,
{< ti1,Ti,Ei1 >,...,< tij,Ti,Eij >,...,< tim,Ti,Eim >}, ...,{< tn1,Tn,En1 >,...,

< tnj,Tn,Enj >,...,< tnm,Tn,Enm >}) (11)

5 The experimental analysis and result

In the process of completing the experiments based on designing the algorithms, the experi-
mental environment of the software and hardware is used as follows. Java language is used for the


320 M. Chen

Algorithm 3 Method 3 Constructing the Event Series
1: Input: TopicURL, UserBehavior, NewsSet, Threshold, Parameters;
2: Output: TopicURL;
3: For each t[i](0 ≤ i ≤ t.size() − 1) Do
4: Generate timeseriesvector;
5: Fs ← Calculate similarity of timeseries and coreevent;
6: If Fs ≥ Threshold Then
7: Adjust event under topic t[i];
8: End If
9: Describe event under topic t[i];

10: End For

programming design to implement the algorithms, MyEclipse platform of the software research
and development is used for the framework implementation, SQL Server of the database manage-
ment system is used for web big data storage and process. The processor is Intel 2.40GHz, the
memory is 32GB [?,8,9,16,29]. The experiments mainly use the standard data set for the social
event of German A320 airliner crash, the data source is from massive Web news analysis corpus
for the real data, the experimental analysis and result can verify feasibility and effectiveness of
the research idea.

5.1 The qualitative analysis of the topic detection

As shown in the figure 2, the accuracy rate represents the quality of the topic detection
by using three web usage behaviour mining processes. Firstly, the red column represents the
accuracy rate of analysing the instance clicking mode, this quality is not high, although it has
improvement, but the maximum is able to only arrive on about 0.64. Secondly, the blue column
represents the accuracy rate of analysing the instance searching mode, this quality is not also
high, although it has also improvement in several monitoring points, but the maximum is able to
also only arrive on 0.63. Thirdly, the green column represents the accuracy rate of the algorithm
designed in this paper, this quality is significantly improved, and the maximum is able to arrive
on about 76

Figure 2: The qualitative analysis of the topic detection


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 321

5.2 The analysis of the accuracy rate with the number of the instances con-
cerned and the seed threshold variation trend

As shown in the figure 3, the accuracy rate of the topic detection is indicated through the
X and Y axis adjustment. The accuracy rate represents the quality of the topic detection, if the
seed threshold is defined, then it is able to increase in a stable trend, because the number of
the instances followed is less, the relationship among big data is simpler. If the number of the
instances is increasing, then the relationship of the link exists, so the accuracy rate of the topic
detection is increasing in a stable trend. If the number of the instances is defined, firstly, then
the quality of the topic detection expresses an increasing trend, secondly, then it will decrease
with the increasing threshold, because the threshold is less, the topics of inaccurate accuracy are
able to be found. If the threshold can arrive at a stable range, then the topics of approximate
accurate are able to be found. If the threshold can arrive at a value, then the accurate topics
cannot be found. This experiment expresses when the number of the instances is one hundred
and sixty, and the seed threshold is zero point seven five, the quality of the topic detection can
get the highest about 0.78.

Figure 3: The analysis of the accuracy rate with the number of the instances concerned and the
seed threshold variation trend

5.3 The analysis of the accuracy rate with the number of the instances con-
cerned and the probability threshold variation trend

As shown in the figure 4, the accuracy rate represents the quality of the instances mined
that support the topics by adjusting the probability threshold of the X axis and the number of
the instances of the Y axis. The accuracy rate indicates the quality of the instances mined, if
the threshold is defined, then the quality of the instances mined is able to increase in a stable
trend, because the number of the instances followed is less, the relationship among big data is
simpler. If the number of the instances concerned is increasing, then the relationship of the link
exists, so the accuracy rate of the instances mined is increasing in a stable trend. If the number


322 M. Chen

of the instances is defined, firstly, then the quality of the instances mined expresses an increasing
trend, secondly, then it will decrease with the increasing threshold, because the threshold is less,
the instances of inaccurate accuracy can be mined. If the threshold can arrive at a stable range,
then the instances of approximate accurate can be mined. If the threshold can arrive at a value,
then the accurate instances cannot be mined. This experiment expresses when the number of the
instances followed is one hundred and forty, and the probability threshold is zero point seven,
the quality of the instances mined can get the highest about 0.76.

Figure 4: The analysis of the accuracy rate with the number of the instances concerned and the
probability threshold variation trend

5.4 The qualitative analysis of the topic evolution

As shown in the figure 5, the accuracy rate represents the quality of the topic evolution
by using three web usage behaviour mining processes. Firstly, the red solid line represents the
accuracy rate of analysing the time series similarity, which shows that the accuracy rate is not
high with the increase in the number of Web news instances concerned, although it has risen, but
the highest can only arrive on about 0.64. Secondly, the blue solid line represents the accuracy
rate of analysing the core event similarity, which shows that the accuracy rate is not also high
with the increase in the number of Web news instances concerned comparing with analysing
the time series similarity, and the accuracy rate has also a little decreasing slightly trend, the
highest can only arrive on about 0.64. Thirdly, the green solid line represents the accuracy
rate of the algorithm designed in this paper, which shows that the accuracy rate has greatly
improved because of integrating the time series similarity based on the semantic evaluation and
the similarity analysis for the core events. Although the accuracy rate is similar comparing
with other two methods under the circumstance of less Web news instances concerned, but
the accuracy rate has gradually widening the gap comparing with other two methods with the
increase in the number of Web news instances concerned, the highest can arrive on about 0.75.
So this experiment expresses that the quality of analysing the topic evolution is higher than other
two methods by using the algorithm designed in this paper.


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 323

Figure 5: The qualitative analysis of the topic evolution

5.5 The accuracy rate of analysing the topic evolution with the variation
trend for the parameter adjustment

As shown in the figure 6, the accuracy rate indicates the quality of analysing the topic
evolution for Web news topics according to the parameter adjustment of the sequence event
construction process. The red dashed line represents the accuracy rate, in which the Alpha
parameter values are different aiming at the formula 10. From its trend, when the Alpha value is
adjusted from 0.6 to 0.65, and the Beta value is adjusted from 0.35 to 0.4, the quality of analysing
the topic evolution is more high and stable, and the accuracy rate is close to 0.70. In general,
the parameter adjustment can make the quality of analysing the topic evolution more stable to
the maximum for Web news topics, which accords with the experimental effect expected, and
can determine the optimal range of the parameter.

Figure 6: The accuracy rate of analysing the topic evolution with the variation trend for the
parameter adjustment

5.6 The accuracy rate analysis with the number of the instances concerned
and the similar threshold variation trend

As shown in the figure 7, the accuracy rate represents the quality of analysing the topic
evolution by adjusting the similarity threshold of the X axis and the number of the instances of
the Y axis. If the threshold is defined, then the quality of analysing the topic evolution is able to
increase in a stable trend, because the number of the instances followed is less, the relationship
among big data is simpler in the analytical process of the time series and core event similarity.


324 M. Chen

If the number of the instances followed is gradually increasing, then the relationship among
big data adds also the semantic feature for analysing the process of the topic evolution, so its
accuracy rate can increase. If the number of the instances is defined, firstly, then the quality
of analysing the topic evolution can increase, secondly, then it will decrease with the increasing
threshold, because the threshold is less, the inaccurate or approximate accurate analysis of the
topic evolution mays be completed. If the threshold can increase to a stable range, then the
approximate accurate result of analysing the topic evolution can be excavated. If the threshold
can increase to a value, then the accurate result of analysing the topic evolution cannot be
excavated. This experiment expresses when the number of the instances followed is one hundred
and eighty, and the similarity threshold is zero point seven, the quality of analysing the topic
evolution can get the highest about 0.76.

Figure 7: The accuracy rate analysis with the number of the instances concerned and the similar
threshold variation trend

5.7 The time consuming analysis for solving main research problem under
different methods

As shown in the figure 8, in view of German A320 airliner crash social event, the X axis rep-
resents massive Web news time released through the authoritative Web news network platform,
the red solid line represents the time consuming for analysing Web hierarchical topic evolution
under the method based on the content and description, the blue solid line represents the time
consuming for analysing Web hierarchical topic evolution under the method based on the be-
haviour tracking. According to the change trend of two solid lines, the number of Web news
instances has increased sharply in several time intervals with the progress of the event develop-
ment, therefore, the time consuming has also increased sharply, in other time intervals, the time
consuming has relatively stable trend. While the time consuming is lower for the blue solid line,
because the analytical process of Web hierarchical topic detection and evolution uses the be-
haviour tracking method based on the semantic five tuple description and the utility evaluation.
This experiment expresses that the non-deterministic problem can be solved efficiently using the


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 325

method, which is proposed by this paper.

Figure 8: The time consuming analysis for solving main research problem under different methods

5.8 The qualitative analysis of Web hierarchical topic detection and evolution
result under different data sets

As shown in the figure 9, the qualitative analysis of Web hierarchical topic detection and
evolution result uses also other four standard data sets in addition to the data set of German
A320 airliner crash social event, which include Shanghai Bund trample, Taiwan revival airliner
falling river, Nepal 8.1 earthquake and Orient Star cruise overturn social event. According to
the change trend of five columns, there is little difference in the qualitative analysis of Web
hierarchical topic detection and evolution result at the start, development and end stage of
five social events. This experiment shows that the analytical process of Web hierarchical topic
detection and evolution is stable under different social events. Moreover, the qualitative analysis
of Web hierarchical topic detection and evolution result has only little effect in different stages
of the social events with the increase of the number of Web news instances.

Figure 9: The qualitative analysis of Web hierarchical topic detection and evolution result under
different data sets

6 Conclusion

This paper completes the research on an idea of analysing the process for Web hierarchical
topic detection and evolution in view of the behaviour tracking technology taking the network


326 M. Chen

big data of Web news as the processing object, this result of designing and implement is more
valuable for the scholars in the research field. In the research process, this paper proposes the
analytical algorithm of Web hierarchical topic detection and evolution, in which this paper has
also proposed the methods of analysing the usage mode, the topic series construction, the event
series construction and the evolution analysis for the events, so as to solve the problem existing
in current research status. The results of experimental analysis show that the idea is feasible,
verifiable and superior, which plays a major role in reconfiguring Web hierarchical topic corpus,
improves understanding efficiency of the network big data, enhances the website availability,
constructs and improves the website service function, improves the efficiency of the business
operational and website clicking rate, provides an intelligent big data warehouse for the network
information evolution application.

Funding

This paper is supported by National Natural Science Foundation of China under Grant
Nos.71572015, Support Project of High-Level Teachers in Beijing Municipal Universities in
the Period of 13th Five-Year Plan under Grant Nos.CIT&TCD201704072, Premium Funding
Project for Academic Human Resources Development in Beijing Union University under Grant
Nos.BPHR2018AS01.

Bibliography

[1] Ahila, S.S.; Shunmuganathan, K.L. (2016). Role of Agent Technology in Web Usage Mining:
Homomorphic Encryption Based Recommendation for E–commerce Applications, Wireless
Personal Communications, 87(2), 499-512, 2016.

[2] Alam, M.H.; Ryu, W.J.; Lee, S. (2017). Hashtag-Based Topic Evolution in Social Media,
World Wide Web-Internet and Web Information Systems, 20(6), 1527-1549, 2017.

[3] Aujla, G.S.; Kumar, N.; Zomaya, A.Y. (2018). Optimal Decision Making for Big Data Pro-
cessing at Edge-Cloud Environment: An SDN Perspective, IEEE Transactions on Industrial
Informatics, 14(2), 778–782, 2018.

[4] Chen, B.T.; Tsutsui, S.; Ding, Y.; Ma, F.C. (2017). Understanding the Topic Evolution in
a Scientific Domain: an Exploratory Study for the Field of Information Retrieval, Journal
of Informetrics, 11(4), 1175-1189, 2017.

[5] Chen, M.; Yang, X.P. (2016). Research on Model of Network Information Extraction
Based on Improved Topic-Focused Web Crawler Key Technology, Tehnicki vjesnik/Tech-
nical Gazette, 23(4), 49–54, 2016.

[6] Chen, M.; Yang, X.P.; Sun, M.; Zhao, Y. (2014). Research on Model of Network Information
Currency Evaluation Based on Web Semantic Extraction Method, International Journal of
Future Generation Communication and Networking, 7(2), 103-116, 2014.

[7] Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.W.; Lin, J.Y. (2019). Experimental Explorations on
Short Text Topic Mining Between LDA and NMF Based Schemes, Knowledge-Based Sys-
tems, 163, 1–3, 2019.

[8] Dai, Y.; Wu, W.; Zhou, H.B.; Zhang, J.; Ma, F.Y. (2018). Numerical Simulation and
Optimization of Oil Jet Lubrication for Rotorcraft Meshing Gears, International Journal of
Simulation Modelling, 17(2), 318-326, 2018.


Research on Key Technology of Web Hierarchical Topic Detection
and Evolution Based on Behaviour Tracking Analysis 327

[9] Dai, Y.; Zhu, X.; Zhou, H.; Mao, Z.; Wu, W. (2018). Trajectory Tracking Control for
Seafloor Tracked Vehicle by Adaptive Neural-Fuzzy Inference System Algorithm, Interna-
tional Journal of Computers Communications & Control, 13(4), 465-476, 2018.

[10] Du, J.; Sun, Y.; Ren, H. (2018). The Relationship of Delivery Frequency with the Cost and
Resource Operational Efficiency: A Case Study of Jingdong Logistics, Mathematics and
Computer Science, 3(6), 129-140, 2018.

[11] Fatima, B.; Ramzan, H.; Asghar, S. (2016). Session Identification Techniques Used in Web
Usage Mining a Systematic Mapping of Scholarly Literature, Online Information Review,
40(7), 1033-1053, 2016.

[12] Gaul, W.G.; Vincent, D. (2017). Evaluation of the Evolution of Relationships between Topics
over Time, Advances in Data Analysis and Classification, 11(1), 159-178, 2017.

[13] Jimenez-Marquez, J.L.; Gonzalez-Carrasco, I.; Lopez-Cuadrado, J.L.; Ruiz-Mezcua, B.
(2019). Towards a Big Data Framework for Analysing Social Media Content, International
Journal of Information Management, 44, 1–3, 2019.

[14] Kaseb, M.R.; Khafagy, M.H.; Ali, I.A.; Saad, E.M. (2019). An Improved Technique for
Increasing Availability in Big Data Replication, Future Generation Computer Systems-The
International Journal of Escience, 91, 493–497, 2019.

[15] Kausel, E.E. (2018). Big Data at Work: The Data Science Revolution and Organizational
Psychology, Personnel Psychology, 71(1), 135-136, 2018.

[16] Kho, N.D. (2018). The State of Big Data, Econtent, 41(1), 11-12, 2018.

[17] Liu, J.; Fang, C.; Ansari, N. (2016). Request Dependency Graph: a Model for Web Usage
Mining in Large-Scale Web of Things, IEEE Internet of Things Journal, 3(4), 598-608, 2016.

[18] Makkie, M.; Huang, H.; Zhao, Y.; Vasilakos, A.V.; Liu, T.M. (2019). Fast and Scalable
Distributed Deep Convolutional Autoencoder for fMRI Big Data Analytics, Neurocomputing,
325, 20–22, 2019.

[19] Osman, A.M.S. (2019). A Novel Big Data Analytics Framework for Smart Cities, Future
Generation Computer Systems-The International Journal of Escience, 91, 620–623, 2019.

[20] O’Halloran, K.L.; Tan, S.; Duc-Son, P. (2018). A Digital Mixed Methods Research Design:
Integrating Multimodal Analysis with Data Mining and Information Visualization for Big
Data Analytics, Journal of Mixed Methods Research, 12(1), 11-15, 2018.

[21] Pandian, P.S.; Srinivasan, S. (2016). A Unified Model for Preprocessing and Clustering
Technique for Web Usage Mining, Journal of Multiple-Valued Logic and Soft Computing,
26(3), 205-220, 2016.

[22] Sagi, T.; Gal, A. (2018). Non-Binary Evaluation Measures for Big Data Integration, VLDB
Journal, 27(1), 105-110, 2018.

[23] Tran, Q.T.; Nguyen, S.D.; Seo, T.I. (2019). Algorithm for Estimating Online Bearing Fault
Upon the Ability to Extract Meaningful Information From Big Data of Intelligent Structures,
IEEE Transactions on Industrial Electronics, 66(5), 3804–3806, 2019.

[24] Uma, R.; Muneeswaran, K. (2017). OMIR: Ontology-Based Multimedia Information Re-
trieval System for Web Usage Mining, Cybernetics and Systems, 48(4), 393-414, 2017.


328 M. Chen

[25] Wu, P.J.; Lin, K.C. (2018); Unstructured Big Data Analytics for Retrieving E-Commerce
Logistics Knowledge, Telematics and Informatics, 35(1), 237-241, 2018.

[26] Yao, L.; Ge, Z.Q. (2019). Scalable Semisupervised GMM for Big Data Quality Prediction in
Multimode Processes, IEEE Transactions on Industrial Electronics, 66(5), 3681–3684, 2019.

[27] Zhang, D. (2017). High-Speed Train Control System Big Data Analysis Based on Fuzzy RDF
Model and Uncertain Reasoning, International Journal of Computers Communications &
Control, 12(4), 11-15, 2017.

[28] Zhang, D.; Sui, J.; Gong, Y. (2017). Large Scale Software Test Data Generation Based on
Collective Constraint and Weighted Combination Method, Tehnicki Vjesnik, 24(4), 1041-
1050, 2017.

[29] Zhang, D.; Jin, D.; Gong, Y. (2015). Research of Alarm Correlations Based on Static Defect
Detection, Tehnicki vjesnik, 22(2), 311-318, 2015.

[30] Zhou, H.K.; Yu, H.M.; Hu, R. (2017). Topic Discovery and Evolution in Scientific Liter-
ature Based on Content and Citations, Frontiers of Information Technology & Electronic
Engineering, 18(10), 1511-1524, 2017.

[31] Zhou, H.K.; Yu, H.M.; Hu, R. (2017). Topic Evolution Based on the Probabilistic Topic
Model: a Review, Frontiers of Computer Science, 11(5), 786-802, 2017.