INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
Online ISSN 1841-9844, ISSN-L 1841-9836, Volume: 18, Issue: 3, Month: June, Year: 2023
Article Number: 5045, https://doi.org/10.15837/ijccc.2023.3.5045

CCC Publications 

WOMDI-Apriori Data Mining Algorithm for Clustered Indicators
Analysis of Specialty Groups in Higher Vocational Colleges

Fei Gao, Jing Yang*, Yang Yang, Xiaojing Yuan

Fei Gao
Non-governmental Higher Education Institute of China, Zhejiang Shuren College
Hangzhou 310015, China
gaofei@zjsru.edu.cn

Jing Yang*
Hebei Women and Children Activity Center
Shijiazhuang 050081, China
*Corresponding author: pljinger@126.com
*ORCID: 0000-0003-3216-9444.

Yang Yang
School of Transportation Science and Engineering, Beihang University
Beijing 100191, China
yangphd@buaa.edu.cn

Xiaojing Yuan
School of Traffic and Transportation, Beijing Jiaotong University
Beijing 100044, China
xjyuan@bjtu.edu.cn

Abstract

The cluster effect of specialty groups plays an important role in the development of Higher
Vocational Colleges. The purpose of this research is to scientifically explore the interaction mech-
anism of specialty groups clustering indexes in higher vocational colleges, quantitatively analyze
the correlation of these indexes, and explore reasonable measures to promote the specialty groups
clustering effect in higher vocational colleges. Firstly, data denoising and field screening were car-
ried out on the original data, and then the variables were clustered and divided into LHS (Left
Hand Side) and RHS (Right Hand Side). Then, an improved multi-dimensional interactive Apri-
ori association rule mining algorithm considering index weights and orientation constraints was
proposed. The improved Apriori algorithm and the traditional Apriori algorithm were applied to
mine the structured data sets. The results show that the improved WOMDI-Apriori algorithm
in this study improves the accuracy by 79.96% compared with the traditional Apriori algorithm.
The results indicate that, when the indicators of brand, key and characteristic majors at or above
the provincial level, proportion of full-time teachers with double qualifications, and the number of
internship students accepted by cooperative enterprises are at a low level, the number of projects
and satisfaction proportion of employers with graduates would be negatively affected; The major


https://doi.org/10.15837/ijccc.2023.3.5045 2

category of equipment manufacturing is subjected to various factors coupling, which may lead to
different graduates’ counterpart employment rate; for association rules where the successor of the
mining results is dominated by negative results, measures should be taken to avoid or reduce the
possibility of their occurrence as much as possible. For association rules in which the successors
of the mining results are dominated by positive results, measures should be taken to facilitate the
occurrence of these frequent item sets whenever possible. The framework proposed in this research
can provide theoretical guidance for analyzing operating characteristics and promoting the positive
effects of specialty groups in higher vocational colleges.

Keywords: WOMDI-Apriori data mining algorithm, clustered indicators analysis, specialty
groups, higher vocational colleges.

1 Introduction
"Double high plan" is another major project after the demonstration, backbone, high-quality higher

vocational college construction plan in China, in which high-level college and high-level specialty (spe-
cialty group) is essentially a symbiotic relationship. In the limited resources of the fierce competition
environment, higher vocational colleges, will inevitably focus on strengthening the construction of ad-
vantageous specialty groups, choose the cluster development mode, and the growth of specialty groups
is indeed conducive to breaking through bottlenecks such as scattered resources, indistinctive features
and insufficient linkage to improve the adaptability of talents, carry out technological innovation and
focus on regional needs.

Specialty groups are a collection of majors or directions with common foundations, complemen-
tary advantages and resource sharing, and are committed to changing the division between majors
and promoting cross-border cooperation in knowledge flow. Although not all countries have the ter-
minology of specialty group, the research on interdisciplinary education can still provide important
reference. Camilleri et al. pointed out that European professional higher education, as an applied
education, requires to break the curriculum boundary and run through professional experience to
promote knowledge integration [1]. Norton et al. proposed that American community colleges en-
trust multidisciplinary courses to achieve education content integration[2]. Scholars have carried out
relevant research on the connotation definition, existing problems and development paths of specialty
groups. Around the definition of specialty groups, the different interpretations contain the similarity
theory, the joint force theory and the common theory, which are the most representative. In recent
years, various views have shown a trend of convergence, such as Gu Yong’an [3]. Specialty groups
also face a series of obstacles in the process of formation, operation and exertion of influence. Zeng
Xianwen and Zhang Shu analyzed the role of specialty groups with the help of human capital value
measurement models [4]. Zeng Xianwen and Yan Meng introduced concepts such as class density,
group intensity, and concentration of specialty groups to quantitatively analyze specialty groups [5].
Around the existing problems, scholars have elaborated on the optimization path from different angles,
such as Zong Cheng, he has pointed out that recombine talent training models, innovate the formation
of teacher teams, and jointly build a sharing training base [6]. Zhao Mengcheng believes that it is
necessary to establish specialty group faculty, inter-specialty team, curriculum sharing mechanism,
and information sharing platform to achieve micro-organizational changes [7].

With the development of computer technology, the acquisition of big data has become possible
[8]. The research and solution of problems in the social science field also increasingly rely on the
use of big data methods. For example, the genetic algorithm is used to optimize production schedul-
ing problem[9].The intelligent mechanism is established to realize the coordination and subsequent
response of emergency resources [10]. Text data mining is used to analyze the relationship between in-
novation and development in economic field[11]. Machine learning and data mining tools are often able
to solve some problems that cannot be efficiently discovered by traditional means. For example, The
economic lot-size problem is explained through machine learning to optimize resource allocation[12].
With the help of machine learning and educational data mining, students’ performance can be pre-
dicted based on video learning systems[13]. Data mining technology is a data analysis method that
extracts implicit and potentially valuable laws for decision making from a large amount of data, and its
process is user-oriented and knowledge discovery-oriented data analysis process, while association rule


https://doi.org/10.15837/ijccc.2023.3.5045 3

analysis is an important branch of data mining technology. In the work of traffic accident causation
analysis and risk identification, association rule analysis is a common research method for data mining
of this problem, and the Apriori association rule mining algorithm is one of the main methods of as-
sociation rule analysis [14]; the association rule mining algorithm was first proposed by Agrawal et al.
when they analyzed the problem of market shopping baskets, and the algorithm can accurately and
effectively mining the correlation between two or more factors, but it cannot quantify the importance
of a single factor in the association rule, and the computational effort will increase exponentially when
there are more data items, the computational efficiency decreases, and it is easy to ignore rare data
[15]. In the work of association rule mining for professional clustering effect, the factors are essen-
tially treated with equal weights, which cannot reflect the degree of influence of different indicators
on professional clustering effect and easily ignore the factors that need to be focused on, and at the
same time, it is impossible to filter out the factors with weak influence on the results, which causes
a large number of useless operations and affects the efficiency of model calculation [16]. As a result,
many researchers began to improve the traditional Apriori algorithm. An intelligent method is used
for improving the Apriori algorithm in order to extract frequent itemsets[17]. The improved Apriori
algorithm can reduce the time complexity of association rule mining[18].

At present, the academic community has carried out useful exploration in the field of specialty
groups, but it has yet to be enriched. On the one hand, the research theme still lacks in-depth
analysis of the elements interrelationship and action mechanism within the specialty group, and the
internal law of the development of the specialty group needs to be further excavated. On the one
hand, the research method is mainly qualitative research, and in general it lacks the strong support of
first-hand data and systematic empirical analysis. This study will use WOMDI-Apriori data mining
algorithms to analyze cluster indicators for 232 specialty groups of higher vocational colleges, establish
a corresponding analysis framework, and explore basic problems such as the essential attributes and
operating mechanisms of specialty groups, so as to provide theoretical support for related research,
and provide new ideas and perspectives for creatively understanding and solving problems, so as to
contribute to the deepening of related research and the development of practice.

2 Data pre-processing
The original sample data set contains a total of 242 base data, each covering 41 field variables,

constituting a 242*41 matrix. Before model construction and data mining analysis, the sample struc-
ture design needs to be implemented, and the first task is data pre-processing, including data cleaning
(denoising), field screening, variable coding and numeralization, and the final available matrix output.
The sample structure design process is shown in Figure 1.

Figure 1: Sample structure design flow


https://doi.org/10.15837/ijccc.2023.3.5045 4

2.1 Data denoising

Data Mining refers to the process of extracting hidden, deep and potentially valuable information
from large, fuzzy and noisy big data through purification and denoising, algorithm design and other
means. Data mining contains many meanings: (1) the data must be real and valid; (2) it contains the
information resources required for data mining; (3) the information mined has the value of utilization;
(4) it is not necessary to mine the general knowledge in the massive data, but more to mine specific
laws. Data pre-processing is the first step in the process of data mining work, and it is also the most
crucial one. Data pre-processing usually accounts for nearly 60% of the entire data mining workload,
which is extremely time-consuming. It follows that when data mining is carried out, it is necessary to
ensure the reliability and validity of the data at the very beginning, and effectively doing the work of
data pre-processing can further improve the quality of the data in the database, effectively compensate
for the incompleteness of the research data set, and provide more reliable data information for data
mining work.

The most obvious problems of the original data in the record information of specialty groups
are incompleteness, inconsistency and noisiness, which can directly cause the waste of computational
resources and even lead to the bias of computational results. The main problematic characteristics of
the data set are as follows: (1) worthless field variables, (2) incomplete missing data, (3) noisy data, (4)
inconsistency and compatible transformation problems, and (5) redundant data and similar mergeable
data. The general process of dealing with such data problems in this study includes: purification and
cleaning, compatible conversion, and integration and subsumption. Specific examples are as follows:

(1) Merging of redundant data: the value of the "category of backbone major" field of the data
of serial number code 226 in the original data is "manufacturing", the "category of backbone major"
field of code 155 in the original data takes the value of "equipment", considering that they are actually
"equipment manufacturing", therefore, both are modified. The data such as these are processed in a
unified manner.

(2) Abnormal data deletion: The 10 data items indicating employment rate of "9625%" and "0%"
are deleted.

(3) Data correction: the indicator value of "graduates’ counterpart employment rate" is changed
from "823" to "82.3"; the field " The value of "Number of full-time teachers (sum)" in a data entry is
"150.0018", which is changed to "150"; the field "Number of hours taught by part-time faculty as a
percentage of total professional hours in one academic year (%)" was changed to "84.2";. The value of
"%" in a data entry is "447", which is replaced by "44.7"; the value of "Funding for horizontal projects
(million yuan)" in a data entry is "(+)". The value of "(+)17" in one data is changed to "17"; the value
of 3 data in the field "Brand, key and characteristic majors at or above provincial level" is empty and
is filled with the value of "0" is filled in.

(4) No value field variable rejection: All values of the "Rank" field are empty, which is a worthless
field, so it is rejected here.

2.2 Field Filtering

According to the characteristics of the association rule mining algorithm, the fields with too much
dispersion (biased data fields) should be deleted in the design of this data sample structure, and the
facing industries of the specialty groups should be the main object of this analysis. The fields that
are not considered include "serial number", "school name", "name of specialty group", "city where the
school is located", "name of included majors", "name of involved colleges", "name of included major
categories", "ranking ".

2.3 Variable coding

(1) Clustering analysis of variable values
In the original data of this study, all fields take values in discrete variable form. The association

rule mining algorithm for continuous variables generates memory overflow errors in the computation
process, so the input data set needs to be in discrete variable form. However, too discrete variables
(e.g., all integers with field values from 1 to 100) can cause the results to have too little support,


https://doi.org/10.15837/ijccc.2023.3.5045 5

and valuable association rules can be easily ignored and missed, so the discrete variables need to be
clustered to obtain more focused and reliable association rule mining results

(2) Division of LHS (Left Hand Side) and RHS (Right Hand Side)
According to the characteristics of association rule mining, in the data mining process, there

are causative and resultant terms, i.e., it is necessary to define the precedence term (LHS) and the
successor term (RHS).

Here, the filtered and de-noised data were divided into causal dimensions, and the fields related to
"status quo" and "input" were used as LHS, and the causal dimension fields were analyzed by profes-
sionalism: industry oriented, including major quantity, number of colleges involved, including major
categories, number of majors sharing cooperative enterprises, number of majors sharing employers,
number of majors sharing courses, number of majors sharing on campus and off campus training
bases, number of majors sharing full-time teachers, brand key and characteristic majors at or above
the provincial level, number of majors sharing off-campus part-time teachers, number of full-time stu-
dents in specialty group , number of full-time teachers, proportion of full-time teachers with double
qualifications number of hours taught by part-time faculty as a percentage of total professional hours
for one academic year, number of specialty group training bases , average equipment value of students
on campus training bases, frequency of use of on-campus training bases for one academic year, total
number of cooperative enterprises , total number of courses jointly developed by cooperative enter-
prises , number of cooperative enterprises support part-time teachers , number of internship students
accepted by cooperative enterprises , total value of equipment donated by the cooperative enterprise.

The fields of "consequence" and "output" related factors are used as RHS, and the fields of the
result dimension through professional analysis include: the initial employment rate of graduates of the
specialty group, graduates’ counterpart employment rate, satisfaction proportion of employers with
graduates , the number of graduates accepted by off-campus internship training bases , number of
students accepted by the cooperative enterprise, total number of employees trained for the enterprise,
number of provincial or above teaching achievement awards, number of provincial or above scientific
research achievement awards, horizontal project funds, number of invention patent , number of industry
standard, number of provincial or above awards for college student, number of provincial or above
scientific research projects.

2.4 Sample structure design

Based on the above analysis, the clustering calculation is performed by the LOOK UP function
embedded in the database, based on the "if" loop and copying the values of the clustered variables
and returning them to the source file, processing all the fields according to the above process work
and constructing the final input value matrix. The results of dimensional division and variable value
clustering are shown in Table 1 below.

Table 1: Sample structured design data set

Dimension variable Value clustering Value
quantity

LHS Category of backbone
major

Finance and trade, electronics and informa-
tion, equipment manufacturing, medicine and
health, civil engineering, culture and art, ed-
ucation and sports, tourism, public manage-
ment and services, agriculture, forestry, fish-
eries and animal husbandry, light industry and
textiles, food, drugs and grains, transporta-
tion, energy, power and materials, and other
categories

16

Including major quan-
tity

3,4,5,6 4


https://doi.org/10.15837/ijccc.2023.3.5045 6

Including major cate-
gories

1,2,3,4 4

Brand, key and char-
acteristic majors at or
above the provincial
level

0,1,2,3,4 5

Number of majors shar-
ing courses

0,1,2,3,4,5,6,7,8,10,11 11

Number of majors shar-
ing on campus and off
campus training bases

1,2,3,4,5,6,7,8,9,10,11,14,17,18,33 15

Number of majors shar-
ing full-time teachers

< 10, 11 ∼ 50, > 50 3

Number of full-time stu-
dents in specialty group

< 1000, 1000 ∼ 2000, 2000 ∼ 3000, > 3000 4

Proportion of full-time
teachers with double
qualifications

< 50, 50 ∼ 70, 70 ∼ 90, 90 ∼ 100 4

Average equipment
value of students on
campus training bases

< 1, 1 ∼ 10, 10 ∼ 100, > 100 4

Number of cooperative
enterprises support
part-time teachers

< 10, 10 ∼ 50, 50 ∼ 100, > 100 4

Number of internship
students accepted by
cooperative enterprises

< 100, 100 ∼ 500, > 500 3

Total value of equip-
ment donated by the co-
operative enterprise

< 50, 50 ∼ 100, 100 ∼ 500, 500 ∼ 1000, > 1000 5

RHS Graduates’ counterpart
employment rate

< 50, 50 ∼ 70, 70 ∼ 90, 90 ∼ 100 4

Satisfaction proportion
of employers with grad-
uates

50 ∼ 90, 90 ∼ 100 3

Number of students ac-
cepted by the coopera-
tive enterprise

< 100, 100 ∼ 300, > 300 3

Number of provincial or
above teaching achieve-
ment awards

0,1,2,3,4,5 6

Number of provincial or
above scientific research
achievement awards

0,1,2 3

Horizontal project
funds

< 10, 10 ∼ 100, > 100 3

Number of invention
patent

0 ∼ 5, 5 ∼ 10, > 10 3

Number of industry
standard

0,1,2,3,4,5,7,10 8

Number of provincial or
above awards for college
students

< 50, 50 ∼ 100, > 100 3


https://doi.org/10.15837/ijccc.2023.3.5045 7

Number of provincial or
above scientific research
projects

< 5, 5 ∼ 10, > 10 3

Total number of em-
ployees trained for the
enterprise

< 100, 100 ∼ 1000, 1000 ∼ 10000, > 10000 4

3 Algorithm design and modeling

3.1 Association rules mining

Association rule mining is one of the main technologies of data mining, and it is also the most
common form of mining patterns in unsupervised learning systems. Association rule mining is to mine
valuable knowledge from a large amount of data to describe the relationship between data items [19].

A complete association rule can be expressed as the implication form of "x => y". X is the leading
term, also known as the cause layer, y is the subsequent term, also known as the result layer. The
association rule "x => y" is the basic condition that can be seen and established in the study, which
meets the requirements of the preset support, confidence and lift. Support, confidence and lift are
three important parameters that characterize the association rule, among which:

(1) Support: the number of transactions of itemset x contained in dataset D is called the support
number of itemset x, expressed as σx. The support rate (also known as support) of item set X is
recorded as: support (x), that is, probability P (X):

Support(x) =
σx

{D}
× 100% (1)

Where {d} is the number of transactions in dataset D. if support (x) is not less than the preset
minimum support threshold (min_support), X is called a frequent itemset, otherwise x is an infrequent
itemset.

The support of item set (D) is the support rate of association rule x => y, which is essentially the
proportion of transactions in D containing (X ∪ Y ), that is, the frequency of probability p(X ∪ Y ),
as support (x => y):

Support(X => Y ) =
|X

⋃
Y |

|D|
= Support(X

⋃
Y ) = P(X

⋃
Y ) (2)

(2) Confidence: two association rules defined on I and D, such as x => y, whose confidence means
that "the transaction meeting condition x also meets condition Y ". The confidence of association rules
with x => y is the conditional probability p(y | x) of itemset y on the premise of including itemset x,
which is recorded as: confidence (x => y).

Confidence(X => Y) = P (Y | X) =
P (X, Y )

P (X)
=

Sup(X => Y )
Sup(X)

=
Sup(X ∩ Y )

Sup(X)
(3)

Where X ⊆ I, Y ⊆ I, X ∩ Y = ∅.
(3) Lift: the lifting degree is used to characterize the correlation degree of the leading item and

the subsequent item. In order to avoid the interference of pseudo strong association rules and prevent
invalid association rules from appearing in the final result, the lifting degree index is hereby introduced,
and this index is also used as the judgment condition of effective association rules:

Lift(X => Y ) =
P (Y | X)

P (Y )
=

Conf (X => Y )
Sup(Y )

=
Sup(X => Y )
Sup(X)Sup(Y )

(4)

The greater the degree of promotion, the higher the degree of correlation between itemset X and
itemset y. Generally, we believe that only association rules with lift greater than 1 are effective strong
association rules, and X and y are positively correlated at this time; If the lifting degree lift < 1, it


https://doi.org/10.15837/ijccc.2023.3.5045 8

means that X and y have no correlation degree or are mutually exclusive item sets. Such non effective
association rules will not be considered in the results.

(4) Minimum support, minimum confidence threshold and strong association rules
In the modeling process, users can specify the minimum support (recorded as min_support) and the

minimum confidence (recorded as min_confidence). The former describes the minimum importance
that association rules must meet, and the latter specifies the minimum reliability that association rules
must meet, and min_support ∈ (0, 1], min_confidence ∈ (0, 1].

Data set D meets the minimum support threshold and the minimum trust threshold on item set
I, and the association rules with the lift greater than 1 are called valuable strong association rules.

3.2 Weighting model

In order to eliminate the weight deviation of subjective weighting method and objective weighting
method in the process of weight assignment as much as possible, a combined weighting method based
on the sum of deviation squares is adopted [20], and the subjective weight obtained by IAHP method
and the objective weight obtained by rough set model are integrated and optimized to calculate the
actual consideration weight of the corresponding index:

Combine the weight vector ω′ obtained by the subjective weighting method and the weight vector
ω∗ obtained by the objective weighting method to obtain the final reasonable weight vector ω̄. As-
suming that w is the optimal weight vector, the deviation between the subjective weight vector w and
the objective weight vector w should be minimized [21]. Establish optimization model:


min θ

n∑
j=1

(
ω′j − ω̄j

)2
+ (1 − θ)

n∑
j=1

(
ω∗j − ω̄j

)2
s.t




ω̄ ≥ 0
n∑

j=1
ω̄j = 1

(5)

Where θ denotes the trust degree in the result of subjective weighting; 1 − θ denotes the trust degree in
the result of objective weighting; ω̄j represents the weight of the j th index attribute. There is an opti-
mal solution ω̄ =

[
ωL, ωU

]
in formula (5) of the optimization model, where ωL =

(
ωL1 , ω

L
2 , ω

L
3 . . . , ω

L
n

)
is the lower bound of the interval number of model solutions, ωU =

(
ωU1 , ω

U
2 , ω

U
3 . . . , ω

U
n

)
is the upper

bound of the interval number of model solutions, and ω̄ represents the interval weight vector of the
final combination weighting. The steps of the established subjective and objective joint weighting
model are as follows:

Step 1: Determine the set of objects to be evaluated and the corresponding index set. Let X be
a collection of objects denoted as X = {x1, x2 . . . , xn} , A = {a1, a2, . . . , an} as the index set, a(x) is
the value of object X on attribute A, meanwhile, the index value can be discrete value or continuous
value.

Step 2: Interval number feature vector method is used to determine the subjective weight. Accord-
ing to the index comparison scale, interval number judgment matrix B =

[
BL, BU

]
is determined,

and the weight vector ω′ =
[
ω

′L, ω
′U

]
of each index aj (j ≤ m) is calculated based on IAHP method.

Step 3: Apply rough set theory to get objective attribute weight. Attribute set C = {c1, c2, . . . Cn}
is the set of evaluation indicators determined in Step1, and the domain U = {u1, u2, u3 . . . , un} is
a set of events with various possible causes at the corresponding time. The value of each traffic
crash recorded on the sub-index is regarded as a piece of information of ut, ut = {c1t, c2t . . . cnt}, and
discretize it to establish a discretized twodimensional information table. Then, the weight vector is
calculated according to Formula (5).

Step 4: Obtain the optimal combination of weights. Calculate the final weight vector ω̄ =
[
ωL, ωU

]
according to Formula (5). The weight value is assigned to the field variables of traffic crash analysis,
and the relative support, relative confidence, and relative lift of each association rule are determined.


https://doi.org/10.15837/ijccc.2023.3.5045 9

3.3 Construction of the WOMDI-Apriori algorithm

The traditional Accident Tree analysis and FP-Tree algorithm and other methods are convenient
for operators to grasp the overall characteristics of the problem, but they can’t realize the relevance
thinking between the multi-attributes of indicators in each dimension, nor can they effectively quantify
and analyze the causes of the negative results. The association rule mining algorithm was first proposed
by Agrawal et al. when analyzing the market basket problem. The algorithm can accurately and
effectively mine the correlation between two or more factors, but it can’t quantify the importance of
a single factor in the association rule. In addition, when there are many data items, the amount of
calculation will multiply, the computational efficiency will be reduced, and it is easy to ignore rare data
[19]. At the same time, the traditional algorithm can’t filter out the factors that have a weak influence
on the results, leading to a large number of useless operations, affecting the calculation efficiency of
the model. Most importantly, because the association rule mining algorithm was first developed for
market basket analysis, some scholars did not improve and optimize the algorithm in an orderly way
when applying the algorithm to analyze the association rules of other problems. If the traditional
Apriori association rule mining algorithm is directly applied to study the problems in the fields of
economics or education, a large number of invalid or even incorrect disordered association rules are
output in the result [20].

This research improves and optimizes the algorithm from three perspectives: 1) the algorithm
is constrained by the form of orderly and directional rule Association, so that the traditional Apri-
ori association rule mining algorithm for the application scope of the shopping basket field can be
compatible with the problem of clustered indicators analysis of special groups in higher vocational
colleges; 2) Through the subjective and objective weighting model, the index weights of all field vari-
ables are calculated, and based on the weight optimization results, the concepts of "relative support",
"relative confidence" and "relative improvement" are proposed; 3) Breaking the traditional mining
output method of "cause leading item => consequence subsequent item", introducing the idea of
multi-dimensional interactive association, not only considering the association relationship of "cause
leading item => consequence subsequent item", but also exploring the association rules between di-
mensions from the perspective of the autocorrelation (leading item) of the dimension part of the cause
layer and the internal autocorrelation (subsequent item) of the result dimension.

Figure 2 shows the steps of the multi-dimensional interactive improved Apriori algorithm consid-
ering directional constraints proposed in this paper.

Figure 2: Professional clustering effect association rule mining process


https://doi.org/10.15837/ijccc.2023.3.5045 10

4 Association rule mining analysis

4.1 Parameter calibration

First, the input of data of each dimension is carried out; then, the initial threshold is set, and the
minimum support threshold min_sup=0.30 and the minimum confidence threshold min_conf=0.35 are
set after continuous manual debugging considering the main features of the specialty group clustering
effect in higher vocational colleges; meanwhile, the min_lift threshold of the lift is set to 1, so that
the frequent itemsets with no positive association between the preceding and following items can be
eliminated.

4.2 Improving the accuracy improvement calibration of the algorithm

The original Apriori association rule mining algorithm and the improved WOMDI-Apriori associ-
ation rule mining algorithm are applied to the input dataset by calling the "arules" function package
in R language [22], and the mining results are summarized in Figure 3.

The above calculation results can be identified: the original Apriori association rule mining al-
gorithm without loaded orientation constraints mines a total of 7379 items, while applying the
WOMDI-Apriori model proposed in this paper, under the same threshold setting (min_sup=0.30,
min_conf=0.35), only the mining results output only 1472 eligible association rules. In other words, if
the algorithm is not optimized and improved, at least 5907 invalid association rules will be generated,
and the improved WOMDI-Apriori model improves the computational accuracy by 79.96% over the
traditional Aprori model under the conditions of the base data in this paper. It can be seen that the
direct application of the traditional Apriori algorithm to such problems will cause some confusion to
the results, so the optimization and improvement of the Apriori algorithm in this paper is necessary for
such research problems of association rule analysis of professional clusters in higher education schools.

Figure 3: Comparison of original Apriori algorithm and improved Apriori algorithm

Figure 3 can show the percentage of the total frequent item set in the sparse matrix of the mining
results. Through the above mining summary information can also be found, after the algorithm
improvement and optimization, the rules that contain more items in the mining results are: 204
association rules for those containing 2 items, 498 association rules for those containing 3 items, 504
association rules for those containing 4 items, 230 association rules for those containing 5 items, and 36
association rules for those containing 6 items The median is 4, which means that 50% of the frequent
itemsets contain no more than 4 items, and the mean value of 3.59 means that the average number of
items contained in all frequent itemsets is 3.59.

5 Results and discussion

5.1 Valuable association rule extraction

The parameter scatter plot of the mining results is plotted by the "plot function" in arulesViz,
a visual analysis toolkit for association rules in R language, as shown in Figure 4. The horizontal


https://doi.org/10.15837/ijccc.2023.3.5045 11

coordinate in the figure represents the relative support (R-Sup), the vertical coordinate represents the
relative confidence (R-Conf), and the color shade is the relative lift (R-Lift) size.

Figure 4: Scatter plot of association rule mining results

The scatter plot of association rules of the mining results in Figure 4 shows that the relative
support of some association rules ranges from 0.30 to 0.65, and another part ranges from 0.75 to
0.95, indicating that high-frequency rules and low-frequency rules coexist among all threshold-eligible
rules; from the confidence index, the relative confidence of most association rules is between 0.3 and
1, but the region below 0.9 relative confidence has a lighter color, which means that the relative
lift of this part of association rules is not enough, and it is possible that the association rules with
insufficient degree of association between item sets or even invalid ones; meanwhile, for the lift index,
by observing the colors in the scatter plot, about half of the association rules have a relative lift less
than 1, indicating that a large proportion of the rules fail to satisfy the constraint of a lift greater
than 1, i.e., they are invalid association rules.

On the basis of filtering out the valid association rules (R-lift>1) and focus locking on the regions
with high confidence and high support (blue circles in Figure 4), the valuable strong association rules
are further extracted.

5.2 Analysis of high support association rule extraction results

In order to extract the valuable strong association rules more precisely, the "inspect" and "sort"
functions in R Studio are called, and the 1472 valid association rules obtained by applying WOMDI-
Apriori association rule mining algorithm are conditionally sorted according to the relative support
(R-Sup) from the largest to the smallest, and the rules with relative lift (R-Lift) less than 1 are
eliminated by extension, and the extracted output results are shown in Table 2.

The high support association rule characterized by the relative support ranking corresponds to the
higher frequency of frequent item sets, and by analyzing the results in Table 2, the following pattern
is summarized:

(1) The relative confidence of the three association rules with the highest ranking of high support
is also at a high level, and their relative confidence is greater than 9; indicating that these association
rules are all strongly correlated rules. The relative lift is greater than 1 and less than 1.1, which means
that these association rules are positively correlated, but the correlation is at a low level.

(2) The lower level indicators including brand, key and characteristic majors at or above the
provincial level, proportion of full-time teachers with double qualifications, and number of internship
students accepted by cooperative enterprises may basically cause the loss of horizontal and vertical
projects, and satisfaction proportion of employers with graduates may be affected negatively as well;
On the contrary, the specialty group of medicine and the health in the higher vocational colleges with
three brand, key and characteristic majors at or above the provincial level , often have a high number


https://doi.org/10.15837/ijccc.2023.3.5045 12

Table 2: Extraction of high relative support results
rules R-support R-confidence R-lift

{Brand, key and characteristic majors at or above
the provincial level= 0, Proportion of full-time

teachers with double qualifications=<50, Number of
internship students accepted by cooperative

enterprises =<100}=> { Satisfaction proportion of
employers with graduates = 50∼90, Horizontal

project funds =<10, Number of provincial or above
scientific research projects =<5}

0.941 0.998 1.01

{ Category of backbone major=Medicine and health,
Brand, key and characteristic majors at or above the
provincial level= 3, Number of full-time students in

specialty group =>3000, Cooperative enterprises
support part-time teachers Total =>100,
Cooperative enterprises accept internship

studentsNumber =>500}=> { Satisfaction
proportion of employers with graduates = 90∼100,

Number of scientific research projects at or above the
provincial level= 5∼10}

0.933 0.905 1.01

{Including major categories= 2, Total value of
equipment donated by the cooperative enterprise =
100∼500}=> {Scientific research achievement award

at or above the provincial level=1}

0.892 0.973 1.02

of students, cooperative enterprises support part-time teachers, and cooperative enterprises accept
internship students. Higher vocational colleges with such conditions are at a high level of satisfaction
proportion of employers with graduates , and there are also a considerable number of scientific research
projects.

(3) In most situations, if the specialty groups with two major categories, and total value of equip-
ment donated by the cooperative enterprise is between 1 ∼ 5 million RMB, they can often get 1
provincial or above scientific research achievement award.

5.3 Analysis of high-confidence association rule extraction results

In order to extract the valuable strong association rules more precisely, the "inspect" and "sort"
functions in R Studio are called, and the 1472 valid association rules obtained by applying WOMDI-
Apriori association rule mining algorithm are conditionally sorted according to the relative confidence
(R-Conf) from the largest to the smallest, while those rules with relative lift (R-Lift) less than 1 are
eliminated by extension, and the extracted output results are shown in Table 3.

The conditional probability of occurrence of high confidence association rules characterizing fre-
quent item sets according to the relative confidence ranking is higher, and the following pattern is
summarized by analyzing the results in Table 3.

(1) By observing Table 3, we can find that the association rules with the highest three relative
confidence levels do not have high relative support, indicating that the item sets with high conditional
probabilities are not necessarily frequent item sets. However, their R-lifts are all greater than 2,
suggesting that these high-confidence association rules are strongly correlated frequent item sets.

(2) Also in major category of equipment manufacturing, when the number of full-time students in
specialty group is greater than 3000, counterpart employment rate of graduates is instead at a lower
level of 50%-70%; on the contrary, when in the case of number of full-time students in specialty group
is less than 1000, graduates’ counterpart employment rate is higher. Possible explanations are that
other conditions have changed, affecting the prior distribution of conditional probabilities, such as
brand, key and characteristic majors at or above the provincial level, proportion of full-time teachers


https://doi.org/10.15837/ijccc.2023.3.5045 13

Table 3: Extraction of high relative confidence results
rules R-support R-confidence R-lift

{ Category of backbone major=Equipment
manufacturing, Brand, key and characteristic majors

at or above the provincial level=3, Number of
full-time students in specialty group =>3000,
Proportion of full-time teachers with double

qualifications=<50}=>{ Satisfaction proportion of
employers with graduates =50∼90, Graduates’

counterpart employmentrate =50∼70}

0.351 0.998 2.49

{Category of backbone major=Equipment
manufacturing, Including major quantity=5, Number

of full-time students in specialty group =<1000,
Total value of equipment donated by the cooperative

enterprise =>1000}=>{ Graduates’ counterpart
employmentrate=90∼100, Number of provincial or

above awards for college student =>100}

0.359 0.996 2.5

{Number of majors sharing on campus and off
campus training bases =9, Average equipment value
of students on campus training base =10∼100, Total

value of equipment donated by the cooperative
enterprise =100∼500}=>{Teaching achievement
award at or above provincial level=4, Invention

patent=5∼10 }

0.361 0.996 2.49

with double qualifications and differences in conditions such as including major quantity.
(3) Medium level input of major sharing on campus and off campus training bases, equipment in

the campus training base and equipment donated by the cooperative enterprise can create a medium
level of output of teaching achievement awards at or above provincial level and invention patents.

5.4 Recommendations based on association rule mining results

For association rules in Tables 2 and 3 where the successor of the mining results is dominated by
negative results, measures should be taken to avoid or reduce the possibility of their occurrence as
much as possible. For association rules in which the successors of the mining results in Tables 2 and
3 are dominated by positive results, measures should be taken to facilitate the occurrence of these
frequent item sets whenever possible. Based on the above results, the specific measures are as follows:

(1) The high support association rules indicates that such frequent item sets have a high frequency
of occurrence, and if the successor item is a negative outcome, the occurrence of the preceding item
should be avoided as much as possible to reduce the frequency of negative outcomes; for example,
the first and third association rules in Table 2 should take measures to prevent the occurrence of the
preceding item to avoid higher education institutions from obtaining lower employers satisfaction ,
and fewer project and research awards.

(2) The high support association rules indicates that such frequent item sets have a high frequency
of occurrence, and if the successor item is a positive outcome, then the frequency of the preceding
item should be increased as much as possible to enhance the frequency of the positive outcome; for
example, the second association rule in Table 2 should take measures to promote the occurrence of
the preceding item so that higher education institutions can obtain higher employers satisfaction and
more research projects.

(3) The high confidence association rules indicates that such frequent item sets have a high fre-
quency of occurrence. If the successor is a negative result, once the preceding term occurs, extra
attention should be paid at this time; for example, the first association rule in Table 3, once the com-
bination of the frequent set of items of the preceding term matches the situation in the table, extra


https://doi.org/10.15837/ijccc.2023.3.5045 14

attention should be paid to prevent the lower cunterpart employment rate of graduates and employers’
satisfaction proportion with graduates.

(4) The high confidence association rules characterize such frequent item sets with high probability
of occurrence. If the successor is a positive outcome, it should contribute to the frequent item set
coupling condition of the prior as much as possible, e.g., the second association rule in Table 3 should
set the coupling condition of the prior as much as possible in order to facilitate higher graduates’
counterpart employment rate and more provincial or above awards for college student.

6 Conclusions
The main findings of this research are obtained as follows:
(1) The improved WOMDI-Apriori algorithm proposed in this study has 79.96% higher accuracy

than the traditional Apriori algorithm, which will cause some confusion to the results if the traditional
association rule mining algorithm is applied directly. Therefore, it is necessary to improve the Apriori
algorithm for the problem of association rule analysis of clustering effect of specialty groups in higher
vocational colleges.

(2) When brand, key and characteristic majors at or above the provincial level, proportion of
full-time teachers with double qualifications, and cooperative enterprises accept internship students’
number are at low levels, the number of projects obtained by higher vocational colleges and the
satisfaction of employers with graduates are negatively affected. For the major category of equipment
manufacturing, the number of full-time students in specialty group of higher vocational colleges does
not fully determine the graduates’ counterpart employment rate, and the level of this indicator is also
related to the coupling effect of other factors.

(3) High support association rules characterize the high frequency of such frequent item sets. If the
successor items in the specialty group cluster effect mining results are negative results, the occurrence
of the leading items should be avoided as much as possible to reduce the frequency of negative results;
if the successor items in the specialty group cluster effect mining results are positive results, the
occurrence frequency of the leading items should be increased as much as possible to improve the
frequency of positive results. High confidence association rules characterize such frequent item sets
with a high probability of occurrence. If the successor in the specialty group cluster effect mining
result is a negative result, once the preceding term occurs, extra attention should be paid at this time;
if the successor in the specialty group cluster effect mining result is a positive result, the frequent term
set coupling condition of the preceding term should be contributed as much as possible.

Further research directions:
(1) In analyzing the clustering effect of specialty groups in higher vocational colleges, the LHS and

RHS divisions for each field variable in the original data may be unreasonable and need to be further
optimized and adjusted in the future.

(2) In the future, the fields need to be clustered and divided into multiple dimensional field clusters,
so that each field may appear in the mining results as a causative factor or a result. For the analysis
of the cluster taking values of each field index, further refinement is needed in the future based on the
position of statistical quartiles, medians, and means.

Acknowledgments

This work was supported by Humanities and Social Science Planning Fund Project of Chinese
Ministry of Education “Research on the Generation Mechanism and Measurement Model of Specialty
Groups Agglomeration Effect in Higher Vocational Colleges under the Background of ‘Double high
plan’" (21YJA880013).

Author contributions

The authors contributed equally to this work.


https://doi.org/10.15837/ijccc.2023.3.5045 15

Conflict of interest

The authors declare no conflict of interest.

References
[1] Camilleri, A., Delplace, S., Frankowicz, M. et al.(2014) . Professional Higher Education in Europe

Characteristics, Practice Examples and National Differences, Brussels: European Association of
Institutions in Higher Education. 2014.

[2] Grubb , W., Badway, N., Bell, D.,Kraskouskas, E.(1996). Community College Innovations in
Workforce Preparation: Curriculum Integration and Tech-Prep, Washington, DC: Office of Voca-
tional and Adult Education.1996.

[3] Gu, Y.A.(2016). Applied Undergraduate Specialty Cluster: An Important Breakthrough in the
Transformation and Development of Local Universities, China Higher Education, 22, 35–38, 2016.

[4] Zeng, X.W., Zhang, S.(2010). On the Construction of Specialty Group in Higher Vocational
Colleges — a Qualitative Discussion. Contemporary Education Science, 13, 15–18, 2010.

[5] Zeng, X.W., Yan, M.(2010). On the Construction of Specialty Group in Higher Vocational Col-
leges - based of Quantitative Analysis, Chinese Vocational and Technical Education, 18, 33–36,
2010.

[6] Zong, C. (2020). Vocational Colleges and Universities: How to Build and How to Evaluate,
Journal of Vocational Education, 7, 40–45, 2020.

[7] Zhao, M.C. (2020). On the Nature of Major Clusters Construction of Higher Vocational Colleges
and its Organizational Reform Ways on Micro Level, Research in Educational Development, 9,
63–70, 2020.

[8] Yang, Y., He, K., Wang, Y.P. et al.(2022). Identification of Dynamic Traffic Crash Risk for
Cross-area Freeways Based on Statistical and Machine Learning Methods, Physica A: Statistical
Mechanics and Its Applications, 595, 127083-,2022.

[9] Xu, W., Sun, H.Y., Awaga, A.L., Yan, Y.,Cui, Y.J. (2022). Optimization Approaches for Solving
Production Scheduling Problem: A Brief Overview and a Case Study for Hybrid Flow Shop Using
Genetic Algorithms, Advances in Production Engineering & Management, 17(1), 45–56, 2022.

[10] Sun, H.Y., Xu, W., Yu, Y.Y., Cai , G.Y.(2022). An Intelligent Mechanism for COVID-19 Emer-
gency Resource Coordination and Follow-Up Response, Computational Intelligence and Neuro-
science, 1–10.2022.

[11] Cicea, C., Lefteris, T., Marinescu, C., Popa, S, c., Albu, Fc.(2021). Applying Text Mining Tech-
nique on Innovation-Development Relationship: A Joint Research Agenda, Economic Computa-
tion And Economic Cybernetics Studies And Research, 55(1), 5–22, 2021.

[12] Sousa, Junior W.T. de, Montevechi, J.A.B., Miranda, R. de C., Rocha, F., Vilela, F.F.(2019).
Economic Lot-Size Using Machine Learning, Parallelism, Metaheuristic and Simulation, Interna-
tional Journal of Simulation Modelling, 18(2), 205–216, 2019.

[13] Teoh, C.W., Ho, S.B., Dollmat, K.S. et al. (2022). Predicting Student Performance from Video-
Based Learning System: a Case Study, Informatics and Service Science, 9(3), 64–7, 2022.

[14] Singh, P.K., Othman, E., Ahmed, R.et al.(2021). Optimized Recommendations by User Profiling
Using Apriori Algorithm, Applied Soft Computing, C, 107272, 2021.

[15] Redhu, S., Hegde, R.M. (2020). Optimal Relay Node Selection in Time-varying IoT Networks
Using Apriori Contact Pattern Information, Ad hoc networks, 98(Mar.):102065.1-102065.9.2020.


https://doi.org/10.15837/ijccc.2023.3.5045 16

[16] Yang, Y., Wang, K., Yuan, Z., Liu, D. (2022). Predicting Freeway Traffic Crash Severity Us-
ing XGBoost-Bayesian Network Model with Consideration of Features Interaction, Journal of
Advanced Transportation, 4257865.2022.

[17] Karimtabar, N., Fard, M.J.S.(2022). Finding Frequent Items: Novel Method For Improving Apri-
ori Algorithm, Computer Science-AGH, 23(2), 161–177, 2022.

[18] Pan, T. (2021). An Improved Apriori Algorithm for Association Mining Between Physical Fitness
Indices of College Students, International Journal of Emerging Technologies in Learning, 16(9):
235–246, 2021.

[19] Yang,Y., Tian,N., Wang,Y., Yuan, Z. (2022). A Parallel FP-Growth Mining Algorithm with Load
Balancing Constraints for Traffic Crash Data, International Journal of Computers Communica-
tions & Control, 17(4): 4806.2022.

[20] Yang, Y., Yuan, Z., Meng, R. (2022). Exploring Traffic Crash Occurrence Mechanism toward
Cross-Area Freeways via an Improved Data Mining Approach, Journal of Transportation Engi-
neering Part A Systems, 148(9): 04022052.2022.

[21] Yang, Y., Yuan, Z., Chen, J., Guo, M. (2017). Assessment of Osculating Value Method Based on
Entropy Weight to Transportation Energy Conservation and Emission Reduction, Environmental
Engineering & Management Journal, 16(10), 2413–2424, 2017.

[22] Narváez-Bandera, I., Suárez-Gómez, D., Isaza, C E. et al. (2022). Multiple Criteria Optimization
(MCO): A Gene Selection Deterministic Tool in RStudio, PLOS ONE, 17. 2022.

Copyright ©2023 by the authors. Licensee Agora University, Oradea, Romania.
This is an open access article distributed under the terms and conditions of the Creative Commons
Attribution-NonCommercial 4.0 International License.
Journal’s webpage: http://univagora.ro/jour/index.php/ijccc/

This journal is a member of, and subscribes to the principles of,
the Committee on Publication Ethics (COPE).

https://publicationethics.org/members/international-journal-computers-communications-and-control

Cite this paper as:

Gao, F.; Yang, J.; Yang, Y.; Yuan X.J. (2023). WOMDI-Apriori Data Mining Algorithm for
Clustered Indicators Analysis of Specialty Groups in Higher Vocational Colleges, International Journal
of Computers Communications & Control, 18(3), 5045, 2023.

https://doi.org/10.15837/ijccc.2023.3.5045


	Introduction
	Data pre-processing
	Data denoising
	Field Filtering
	Variable coding
	Sample structure design

	Algorithm design and modeling
	Association rules mining
	Weighting model
	Construction of the WOMDI-Apriori algorithm 

	Association rule mining analysis
	Parameter calibration
	Improving the accuracy improvement calibration of the algorithm

	Results and discussion
	Valuable association rule extraction
	Analysis of high support association rule extraction results
	Analysis of high-confidence association rule extraction results
	Recommendations based on association rule mining results

	Conclusions