D:\TeX_522\main.DVI Mathematical Problems of Computer Science 52, 21{29, 2019. U D C 5 1 9 .8 I nfor mation - T heor etic M ethods for Anomaly Detection Ma r ia m E . H a r o u t u n ia n a n d Tig r a n S . B a d a s ya n Institute for Informatics and Automation Problems of NAS RA e-mail: armar@ipia.sci.am; tigranbadasyan@gmail.com Abstract Maintaining the security of digital systems with a huge amount of data is one of the main concerns of IT specialists in these times. Anomaly detection in systems is one of the solutions to overcome this challenge. Anomaly detection means ¯nding patterns that are not normal or deviate from normal behavior in a system. Anomaly detection has various applications in bio-informatics, image processing, cyber security, security for databases, etc. There are many groups of methods that are used for anomaly detection including statistical methods, neural network methods and information the- oretic methods. In this paper we survey pros and cons of anomaly detection based on information theoretic techniques. Keywords: Anomaly / outlier detection, Entropy, Relative entropy, Local-search heuristic algorithm LSA. 1 . In t r o d u c t io n A n o m a ly o r o u t lie r is a n im p o r t a n t c o n c e p t o f t h e d a t a a n a lys is . A n o m a ly is d e ¯ n e d a s a d e via t io n fr o m n o r m a l d a t a . It m e a n s t h a t t h e d a t a o b je c t is n o t s im ila r t o t h e o t h e r o b s e r va t io n s in t h e d a t a s e t . It is ve r y im p o r t a n t t o d e t e c t t h e s e o b je c t s d u r in g t h e d a t a a n a lys is t o t r e a t t h e m d i®e r e n t ly fr o m t h e o t h e r d a t a . Th e a n o m a ly d e t e c t io n m e t h o d s a r e wid e ly u s e d fo r t h e fo llo win g p u r p o s e s : - Cr e d it c a r d ( a n d m o b ile p h o n e ) fr a u d d e t e c t io n ; - S u s p ic io u s W e b s it e d e t e c t io n ; - W h o le -g e n o m e D N A m a t c h in g ; - E CG-s ig n a l ¯ lt e r in g ; - S u s p ic io u s t r a n s a c t io n d e t e c t io n a n d s o o n . Th e in ve s t ig a t io n o f a n o m a ly d e t e c t io n t e c h n iqu e s is ve r y im p o r t a n t in s u c h ¯ e ld s a s - Me d ic a l a n d p u b lic h e a lt h ; - D e c is io n m a kin g ; - B u s in e s s in t e llig e n c e ; - In d u s t r ia l D a m a g e D e t e c t io n a n d o t h e r s . 2 1 2 2 Information - Theoretic Methods for Anomaly Detection Th e a n o m a ly d e t e c t io n p r o b le m h a s b e c o m e a r e c o g n iz e d r a p id ly-d e ve lo p in g t o p ic o f t h e d a t a a n a lys is . A s ig n ī c a n t n u m b e r o f m e t h o d s h a ve r e c e n t ly b e e n p r o p o s e d . Ma n y s u r ve ys a n d s t u d ie s a r e d e vo t e d t o t h is p r o b le m , s e e fo r e xa m p le [1 ] - [4 ]. Th e y c o n c e r n d i®e r e n t a s p e c t s o f a n o m a ly d e t e c t io n . [1 ] r e vie ws t r a d it io n a l o u t lie r d e t e c t io n a lg o r it h m s , [2 ] o ve r vie ws t h e e xis t in g r e s e a r c h fo r t h e p r o b le m o f d e t e c t in g a n o m a lie s in d is c r e t e s e qu e n c e s , [3 ] is fo c u s e d o n e n s e m b le le a r n in g o n e s , wh ile [4 ] o n ly in vo lve s t h e la t e s t a n d p o p u la r a n o m a ly d e t e c t io n m e t h o d s fo r t h e d a t a wit h h ig h d im e n s io n a lit y a n d m ixe d t yp e s , o n wh ic h t h e c la s s ic a l d e t e c t io n m e t h o d s c a n n o t o p e r a t e ve r y we ll. Th e r e is n o wid e ly a c c e p t a b le fo r m a l d e ¯ n it io n o f t h e a n o m a ly. Th e p r e c is e d e ¯ n it io n o f t h e o u t lie r d e p e n d s o n t h e s p e c i¯ c p r o b le m a n d it s d a t a r e p r e s e n t a t io n . A t a n a b s t r a c t le ve l, a n a n o m a ly is d e ¯ n e d a s a p a t t e r n t h a t d o e s n o t c o n fo r m t o e xp e c t e d n o r m a l b e h a vio r . A s t r a ig h t fo r wa r d a n o m a ly d e t e c t io n a p p r o a c h , t h e r e fo r e , is t o d e ¯ n e a r e g io n r e p r e s e n t in g n o r m a l b e h a vio r a n d d e c la r e a n y o b s e r va t io n in t h e d a t a t h a t d o e s n o t b e lo n g t o t h is n o r m a l r e g io n a s a n a n o m a ly. B u t s e ve r a l fa c t o r s m a ke t h is a p p r o a c h ve r y c h a lle n g in g . S p e c i¯ c fo r m u la t io n o f t h e p r o b le m is d e t e r m in e d b y s e ve r a l d i®e r e n t fa c t o r s s u c h a s t h e n a t u r e o f t h e in p u t d a t a , t h e a va ila b ilit y o f la b e ls , t h e c o n s t r a in t s a n d r e qu ir e m e n t s in d u c e d b y t h e a p p lic a t io n d o m a in . Th is ju s t ī e s t h e n e e d fo r t h e b r o a d s p e c t r u m o f a n o m a ly d e t e c - t io n t e c h n iqu e s . B a s e d o n t h e r e s e a r c h a r e a t h e m a jo r it y o f a n o m a ly d e t e c t io n t e c h n iqu e s c a n b e c a t e g o r iz e d in t o c la s s i¯ c a t io n , n e a r e s t n e ig h b o r , c lu s t e r in g , s t a t is t ic a l a n d in fo r m a t io n t h e o r e t ic a l t e c h n iqu e s . E a c h o f t h e la r g e n u m b e r o f a n o m a ly d e t e c t io n t e c h n iqu e s h a s it 's o wn u n iqu e s t r e n g t h s a n d we a kn e s s e s . It is im p o r t a n t t o kn o w wh ic h a n o m a ly d e t e c t io n t e c h n iqu e is b e s t s u it e d fo r a g ive n p r o b le m . Give n t h e c o m p le xit y o f t h e p r o b le m s p a c e , it is n o t fe a s ib le t o p r o vid e s u c h a n u n d e r s t a n d in g fo r e ve r y a n o m a ly d e t e c t io n p r o b le m . N o n e o f t h e p r e vio u s s u r ve ys , h o we ve r , fo c u s e s o n in fo r m a t io n t h e o r e t ic a l d e t e c t io n m e t h - o d s . In t h is p a p e r we s u r ve y t h e s t a t e -o f-t h e -a r t t e c h n iqu e s o f a n o m a ly d e t e c t io n b a s e d o n in fo r m a t io n t h e o r y. Th e a im is t o e xp lo r e t h e u s e o f t o o ls / p r o p o s a ls o f In fo r m a t io n Th e - o r y, wh ic h is c a p a b le o f b e in g u s e d t o s o lve t h e p r o b le m o f a n o m a ly d e t e c t io n fr o m n e w p e r s p e c t ive s . 2 . S o m e In fo r m a t io n Th e o r e t ic Me a s u r e s Entropy is a n im p o r t a n t c o n c e p t in in fo r m a t io n t h e o r y a n d c o m m u n ic a t io n t h e o r y [5 ]. It m e a s u r e s t h e u n c e r t a in t y o f a c o lle c t io n o f d a t a it e m s : H ( X ) = ¡ X x p ( x ) lo g p ( x) ; fo r a d a t a s e t X o f it e m s x wit h p r o b a b ilit ie s p ( x) . E n t r o p y s p e c i¯ e s t h e n u m b e r o f b it s r e qu ir e d t o e n c o d e a n d t r a n s m it t h e c la s s ī c a t io n o f d a t a it e m . Fo r a n o m a ly d e t e c t io n e n t r o p y c a n b e u s e d a s a m e a s u r e o f t h e r e g u la r it y o f a u d it d a t a . Th e s m a lle r t h e e n t r o p y, t h e m o r e r e g u la r is t h e d a t a s e t . H ig h r e g u la r it y d a t a c a n p r e d ic t fu t u r e e ve n t s b e c a u s e e ve n t s r e p e a t e d m a n y t im e s in t h e c u r r e n t d a t a s e t a r e like ly t o a p p e a r in t h e fu t u r e . Conditional entropy o f X g ive n Y is H ( XjY ) = ¡ X x;y p ( x; y ) lo g p( xjy ) ; M. Haroutunian and T. Badasyan 2 3 wh e r e p ( x; y ) is t h e jo in t p r o b a b ilit y o f x a n d y a n d p ( xjy ) is t h e c o n d it io n a l p r o b a b ilit y o f x g ive n y. Th is m e a s u r e c a n b e u s e d fo r a n o m a ly d e t e c t io n a s t h e va lu e o f d e p e n d e n c e r e g u la r it y. A s in c a s e o f e n t r o p y, t h e s m a lle r t h e c o n d it io n a l e n t r o p y, t h e b e t t e r . Kullback-Leibler divergence a ls o kn o wn a s relative entropy b e t we e n t wo p r o b a b ilit y d is - t r ib u t io n s o n t h e s a m e X is D ( P k Q) = X x p ( x) lo g p ( x ) q ( x ) : Fo r a n o m a ly d e t e c t io n K L d ive r g e n c e c a n b e u s e d t o m e a s u r e t h e d is t a n c e o f r e g u la r it ie s b e t we e n t h e t r a in in g d a t a s e t a n d t h e t e s t d a t a s e t . A g a in , t h e s m a lle r t h e r e la t ive e n t r o p y, t h e b e t t e r . 3 . In fo r m a t io n Th e o r y B a s e d A n o m a ly D e t e c t io n Te c h n iqu e s In fo r m a t io n t h e o r e t ic t e c h n iqu e s a n a lyz e t h e in fo r m a t io n c o n t e n t o f a d a t a s e t u s in g d i®e r e n t in fo r m a t io n t h e o r e t ic m e a s u r e s s u c h a s e n t r o p y, r e la t ive e n t r o p y, a n d s o o n . S u c h t e c h n iqu e s a r e b a s e d o n t h e fo llo win g ke y a s s u m p t io n : a n o m a lie s in d a t a in d u c e ir r e g u la r it ie s in t h e in fo r m a t io n c o n t e n t o f t h e d a t a s e t . In [6 ] t h e in fo r m a t io n -t h e o r e t ic m e a s u r e s a r e p r o p o s e d fo r a n o m a ly d e t e c t io n in t h e fo l- lo win g g e n e r a l a p p r o a c h : - Me a s u r e t h e r e g u la r it y o f t h e d a t a a n d p e r fo r m t h e a p p r o p r ia t e d a t a t r a n s fo r m a t io n . It e r a t e t h is s t e p if n e c e s s a r y s o t h a t t h e d a t a s e t u s e d fo r m o d e lin g h a s h ig h r e g u la r it y. - D e t e r m in e h o w t h e m o d e l s h o u ld b e b u ilt , i.e . h o w t o a c h ie ve t h e b e s t p e r fo r m a n c e o r t h e o p t im a l p e r fo r m a n c e / c o s t t r a d e -o ®, a c c o r d in g t o t h e r e g u la r it y m e a s u r e . - U s e r e la t ive e n t r o p y t o d e t e r m in e wh e t h e r a m o d e l is s u it a b le fo r a n e w d a t a s e t ( e .g ., fr o m a n e w e n vir o n m e n t ) . D i®e r e n t s e qu e n c e le n g t h s a r e u s e d t o s h o w t h e r e la t io n s h ip b e t we e n r e g u la r it y a n d d e - t e c t io n p e r fo r m a n c e . In p r a c t ic e , o n e c a n s im p ly c o m p u t e t h e r e g u la r it y o f a g ive n d a t a s e t a n d d e t e r m in e h o w t o b u ild a m o d e l, b e c a u s e c o m p u t in g r e g u la r it y, in g e n e r a l, is m u c h m o r e e ± c ie n t t h a n c o m p u t in g a m o d e l. Th is is o n e o f t h e a d va n t a g e s o f t h is a p p r o a c h in c o m p a r is o n wit h t h e o t h e r s , wh e r e t h e r e is n o qu id e lin e fo r b u ild in g a m o d e l a n d e xp la in in g it s p e r fo r m a n c e . H o we ve r , t h e r e la t io n s h ip b e t we e n r e g u la r it y a n d d e t e c t io n p e r fo r m a n c e wa s s h o wn fo r t h e c la s s ī e r m o d e l, wh ile t h e r e a r e o t h e r p r o b a b ilis t ic a lg o r it h m s , t h a t c a n b e u s e d fo r a n o m a ly d e t e c t io n . H o w t h e in fo r m a t io n -t h e o r e t ic m e a s u r e s c a n b e u s e d fo r t h e s e a lg o r it h m s is a n o p e n qu e s t io n . S o m e r e a l-life a p p lic a t io n s c o n t a in categorical data, h o we ve r , m o s t o f t h e o u t lie r a lg o - r it h m s a r e fo c u s e d o n n u m e r ic a l d a t a a n d d o n o t p e r fo r m we ll wh e n a p p lie d t o c a t e g o r ic a l d a t a . Th e p r o b le m o f o u t lie r d e t e c t io n in c a t e g o r ic a l d a t a is c o n s id e r e d in [7 ]. H e r e e n t r o p y is u s e d t o m e a s u r e t h e d e g r e e o f d is o r d e r o f a d a t a s e t . Th e o p t im iz a t io n p r o b le m is d e s c r ib e d a s fo llo ws : ¯ n d in g a s u b s e t o f k o b je c t s s u c h t h a t t h e e xp e c t e d e n t r o p y o f t h e r e s u lt a n t d a t a s e t a ft e r t h e r e m o va l o f t h is s u b s e t is m in im iz e d . Th e g r e e d y o p t im iz a t io n s c h e m e t h a t u s e s lo c a l s e a r c h h e u r is t ic is s t u d ie d fo r qu a lit y - t im e t r a d e o ®. A s a r e s u lt t h e Local search algorithm (LSA) wa s p r e s e n t e d , wh ic h h a s b e e n s h o wn t o b e t h e b e s t in d e t e c t in g o u t lie r s b o t h in t e r m s o f a c c u r a c y a n d s p e e d wh e n c o m p a r e d wit h o t h e r t e c h n iqu e s . W o r kin g o f L S A c a n b e d e ¯ n e d a s fo llo ws . - Fo r a d a t a s e t D S , k o u t lie r s a r e t o b e d e t e c t e d u s in g e n t r o p y, - Th e n u m b e r o f o u t lie r s t o b e g e n e r a t e d ( k ) is u s e r d e ¯ n e d . 2 4 Information - Theoretic Methods for Anomaly Detection - Th e in it ia l s e t o f o u t lie r s ( S O) is e m p t y a n d a ll t h e d a t a s e t 's r e c o r d s a r e m a r ke d a s n o n o u t lie r s . - k s c a n s a r e c a r r ie d o u t t o s e le c t k r e c o r d s a s o u t lie r s . - D u r in g e a c h s c a n , e a c h r e c o r d t a g g e d a s a n o n -o u t lie r is t e m p o r a r ily r e m o ve d fr o m t h e d a t a s e t a n d t h e c h a n g e in e n t r o p y is c a lc u la t e d . - Th e r e c o r d t h a t a c h ie ve s t h e m a xim u m d e c r e a s e in e n t r o p y b y r e m o vin g t h a t r e c o r d , is s e le c t e d a s a n o u t lie r a n d a d d e d t o S O. - Th is c o n t in u e s fo r e a c h s c a n u n t il t h e s iz e o f OS r e a c h e s t h e d e ¯ n e d va lu e o f k. A s t h e o p t im a l n u m b e r o f o u t lie r s va r ie s fr o m d a t a s e t t o d a t a s e t , it is n o t p o s s ib le t o p r e d ic t t h e n u m b e r o f o u t lie r s o r t o d e ¯ n e a s t a n d a r d o r ¯ xe d n u m b e r o f o u t lie r s t h a t c a n b e a p p lic a b le fo r e ve r y d a t a s e t . Th is is t h e d is a d va n t a g e o f L S A . In [8 ] a n o u t lie r d e t e c t io n t e c h n iqu e fo r c a t e g o r ic a l d a t a is p r o p o s e d wh ic h is b a s e d o n e n t r o p y a n d c a lle d A u t o m a t e d E n t r o p y V a lu e Fr e qu e n c y ( A E V F) . It r e qu ir e s n o u s e r in p u t a n d will a lwa ys g e n e r a t e t h e o p t im a l n u m b e r o f o u t lie r s . Th is is t h e e xt e n d e d ve r s io n o f t h e L S A , wh ic h in t r o d u c e s t h e n e w t e r m s : e n t r o p y d i®e r e n c e g a p a n d m a x e n t r o p y g a p . Th e entropy difference gap is t h e d i®e r e n c e in t h e va lu e s o f c h a n g e o f e n t r o p y b e t we e n o n e r e c o r d a n d t h e n e xt . Th e max entropy gap is t h e m a xim u m e n t r o p y d i®e r e n c e g a p t h a t c a n e xis t a s d e ¯ n e d b y t h e u s e r . If t h e e n t r o p y d i®e r e n c e g a p b e c o m e s la r g e r t h a n t h e m a x e n t r o p y g a p , t h e a lg o r it h m t e r m in a t e s . Th e a lg o r it h m is a t wo -s t e p p r o c e s s : - Ge n e r a t in g e n t r o p y c h a n g e va lu e s . Th e c h a n g e in e n t r o p y va lu e fo r e a c h r e c o r d in a d a t a s e t is g e n e r a t e d a n d s t o r e d in a t a b le . On c e a ll r e c o r d s h a ve b e e n p r o c e s s e d , t h e t a b le is u p d a t e d s o t h a t t h e r e c o r d wit h t h e m a xim u m e n t r o p y c h a n g e va lu e is a t t h e t o p , fo llo we d b y t h e o t h e r r e c o r d s in d e s c e n d in g o r d e r o f e n t r o p y c h a n g e va lu e s . - Ge n e r a t in g o u t lie r s . Th e e n t r o p y d i®e r e n c e g a p is d e t e r m in e d a n d t h e n c o m p a r e d wit h t h e m a x e n t r o p y g a p fo r e a c h va lu e fr o m t h e t o p o f t h e t a b le d o wn wa r d s . If t h e e n t r o p y d i®e r e n c e g a p is le s s t h a n o r e qu a l t o t h e m a x e n t r o p y g a p t h e n t h e a lg o r it h m c o n t in u e s c a r r yin g o u t c o m p a r is o n s d o wn t h e t a b le , o t h e r wis e t h e a lg o r it h m t e r m in a t e s a n d a ll t h e r e c o r d s u p t o t h a t p o in t a r e a d d e d t o OS , wh ic h is t h e n d is p la ye d a s t h e o u t p u t . Th e d is a d va n t a g e o f t h is a p p r o a c h is t h a t it is n o t p o s s ib le t o d e ¯ n e a s t a n d a r d m a x e n t r o p y g a p wh ic h c o u ld b e a p p lie d t o e ve r y d a t a s e t . In [8 ] a n a u t o m a t e d a lg o r it h m is p r o p o s e d b y c o m b in in g t h e t wo a b o ve a lg o r it h m s , wh ic h r e qu ir e s o n ly t h e d a t a s e t a s in p u t a n d d o e s n o t r e qu ir e e it h e r t h e n u m b e r o f o u t lie r s o r t h e m a x e n t r o p y g a p a s u s e r in p u t . A E V F a lg o r it h m is : - Fin d in g t h e o p t im a l m a x e n t r o p y g a p . Th e c h a n g e in e n t r o p y va lu e fo r e a c h r e c o r d in t h e d a t a s e t is g e n e r a t e d a n d s t o r e d in a t a b le . Th e a ve r a g e e n t r o p y c h a n g e is d e t e r m in e d , wh ic h is s e t a s t h e m a x e n t r o p y g a p . On c e a ll r e c o r d s h a ve b e e n p r o c e s s e d , t h e t a b le is u p d a t e d t o s h o w t h e r e c o r d wit h t h e m a xim u m e n t r o p y c h a n g e a t t h e t o p o f t h e t a b le , fo llo we d b y d e s c e n d in g va lu e s o f e n t r o p y c h a n g e . - Fin d in g t h e o p t im a l n u m b e r o f o u t lie r s . Th e e n t r o p y d i®e r e n c e g a p is d e t e r m in e d a n d t h e n c o m p a r e d wit h t h e m a x e n t r o p y g a p . If t h e e n t r o p y d i®e r e n c e g a p is le s s t h a n o r e qu a l t o t h e m a x e n t r o p y g a p t h e n t h e a lg o r it h m c o n t in u e s wo r kin g d o wn t h e t a b le c a r r yin g o u t t h e c o m p a r is o n . Ot h e r wis e t h e a lg o r it h m t e r m in a t e s a n d a ll t h e r e c o r d s u p t o t h a t p o in t a r e a d d e d t o OS , wh ic h is t h e n d is p la ye d a s t h e o u t p u t . A n o ve l, p a r a m e t e r -fr e e m e t h o d , COMPREX, fo r id e n t ifyin g a n o m a lie s u s in g p a t t e r n - b a s e d compression is in t r o d u c e d in [9 ]. Co m p r e s s io n -b a s e d t e c h n iqu e s h a ve b e e n e xp lo r e d m o s t ly in c o m m u n ic a t io n s t h e o r y, fo r r e d u c e d t r a n s m is s io n c o s t a n d in c r e a s e d t h r o u g h p u t M. Haroutunian and T. Badasyan 2 5 a n d in d a t a b a s e s , fo r r e d u c e d s t o r a g e c o s t a n d in c r e a s e d qu e r y p e r fo r m a n c e . In t h e m e n - t io n e d p a p e r t h e p r o p o s e d m e t h o d ¯ n d s a c o lle c t io n o f d ic t io n a r ie s t h a t d e s c r ib e t h e n o r m o f a d a t a b a s e s u c c in c t ly, a n d s u b s e qu e n t ly ° a g s t h o s e p o in t s d is s im ila r t o t h e n o r m wit h a h ig h c o m p r e s s io n c o s t a s a n o m a lie s . Th is a p p r o a c h e xh ib it s fo u r ke y fe a t u r e s : 1 ) it is p a r a m e t e r fr e e ; it b u ild s d ic t io n a r ie s d ir e c t ly fr o m d a t a , a n d r e qu ir e s n o u s e r - s p e c i¯ e d p a r a m e t e r s s u c h a s d is t a n c e fu n c t io n s o r d e n s it y a n d s im ila r it y t h r e s h o ld s , 2 ) it is g e n e r a l; it wo r ks fo r a b r o a d r a n g e o f c o m p le x d a t a b a s e s , in c lu d in g g r a p h , im a g e a n d r e la t io n a l d a t a b a s e s t h a t m a y c o n t a in b o t h c a t e g o r ic a l a n d n u m e r ic a l fe a t u r e s , 3 ) it is s c a la b le ; it s r u n n in g t im e g r o ws lin e a r ly wit h r e s p e c t t o b o t h d a t a b a s e s iz e a s we ll a s n u m b e r o f d im e n s io n s , a n d 4 ) it is e ®e c t ive ; e xp e r im e n t s o n a b r o a d r a n g e o f d a t a s e t s s h o w la r g e im p r o ve m e n t s in b o t h c o m p r e s s io n , a s we ll a s p r e c is io n in a n o m a ly d e t e c t io n , o u t p e r fo r m in g it s s t a t e -o f-t h e - a r t c o m p e t it o r s . Te c h n iqu e s fo c u s e d o n a n o m a ly d e t e c t io n in graph-based data h a ve r e c e n t ly b e e n t h e s u b - je c t o f a t t e n t io n . A g e n e r a l, c o m p r e h e n s ive s u r ve y o f s t a t e -o f-t h e -a r t m e t h o d s fo r a n o m a ly d e t e c t io n in d a t a r e p r e s e n t e d a s g r a p h s is p r o vid e d in [1 0 ]. H e r e we fo c u s o n ly o n in fo r m a t io n -t h e o r e t ic a p p r o a c h e s . In [1 1 ] t h e a u t h o r s in t r o d u c e d t wo t e c h n iqu e s fo r g r a p h -b a s e d a n o m a ly d e t e c t io n . Th e ¯ r s t , anomalous substructure de- tection, lo o ks fo r s p e c i¯ c , u n u s u a l s u b s t r u c t u r e s wit h in a g r a p h . In t h e s e c o n d m e t h o d , anomalous subgraph detection, t h e g r a p h is p a r t it io n e d in t o d is t in c t s e t s o f ve r t ic e s ( s u b - g r a p h s ) , e a c h o f wh ic h is t e s t e d a g a in s t t h e o t h e r s fo r u n u s u a l p a t t e r n s . U s in g t h e c o n c e p t o f c o n d it io n a l e n t r o p y a m e a s u r e o f g r a p h r e g u la r it y c a lle d conditional substructure entropy h a s b e e n in t r o d u c e d t o d e ¯ n e t h e n u m b e r o f b it s n e e d e d t o d e s c r ib e a n a r b it r a r y s u b s t r u c - t u r e s s u r r o u n d in g s . A s u b s t r u c t u r e is a c o n n e c t e d s u b g r a p h o f t h e o ve r a ll g r a p h . B y s u r - r o u n d in g s , a u t h o r s a r e r e fe r r in g t o t h e e d g e s a n d ve r t ic e s a d ja c e n t t o t h e s u b s t r u c t u r e . Th e s u r r o u n d in g s c a n b e t h o u g h t o f a s a s e t o f e xt e n s io n s t o t h e s u b s t r u c t u r e ; a n e xt e n s io n o f a s u b s t r u c t u r e is d e ¯ n e d t o b e t h e a d d it io n o f e it h e r a s in g le ve r t e x ( a lo n g wit h t h e e d g e c o n n e c t in g it t o t h e s u b s t r u c t u r e ) , o r a s in g le e d g e wit h in t h e s u b s t r u c t u r e . L e t Y b e d e ¯ n e d t o c o n t a in a ll n-ve r t e x s u b s t r u c t u r e s wit h in t h e g r a p h , a n d X c o n t a in a ll e xt e n s io n s o f t h e s u b s t r u c t u r e s in Y . Fo r a g ive n s u b s t r u c t u r e y 2 Y; P ( y ) is d e ¯ n e d a s t h e n u m b e r o f in s t a n c e s o f y in G, d ivid e d b y t h e t o t a l n u m b e r o f in s t a n c e s o f a ll n-ve r t e x s u b s t r u c t u r e s . Fo r p a r t ic u la r s u b s t r u c t u r e s x 2 X; y 2 Y , P ( xjy ) is d e ¯ n e d t o r e p r e s e n t t h e p e r c e n t a g e o f in s t a n c e s o f y t h a t e xt e n d t o a n in s t a n c e o f x. B y a n a lo g y o f t h e c o n d it io n a l e n t r o p y fo r s t r in g d a t a , t h e conditional substructure entropy is d e ¯ n e d a s H ( XjY ) = ¡ X x;y p( y ) [p ( xjy ) lo g p ( xjy ) + ( 1 ¡ p ( xjy ) ) lo g ( 1 ¡ p ( xjy ) ) ]: E n t r o p y a n d K L d ive r g e n c e m e t h o d s h a ve b e e n r e g a r d e d a s e ®e c t ive m e t h o d s fo r d e t e c t - in g abnormal tra± c b a s e d o n IP a d d r e s s -d is t r ib u t io n s t a t is t ic s o r p a c ke t s iz e -d is t r ib u t io n s t a t is t ic s [1 2 ] - [1 4 ]. In [1 5 ] t wo n e w a n d e ®e c t ive a n o m a ly-b a s e d d e t e c t io n m e t r ic s ( g e n e r iliz e d e n t r o p y a n d in fo r m a t io n d is t a n c e ) we r e p r o p o s e d , wh ic h id e n t ify D D o S a t t a c ks e a r ly a n d a c c u r a t e ly. A n e ®e c t ive IP t r a c e b a c k s c h e m e wa s p r o p o s e d b a s e d o n a n in fo r m a t io n d is t a n c e m e t r ic t h a t c a n t r a c e a ll a t t a c ks b a c k t o t h e ir o wn lo c a l a r e a n e t wo r ks ( L A N s ) in a s h o r t t im e . Th e Renyi or generalized entropy o f o r d e r ® is d e ¯ n e d a s fo llo ws : H® ( x) = 1 1 ¡ ® lo g 2 ( X i p®i ) ; ® ¸ 0 ; ® 6= 1 : 2 6 Information - Theoretic Methods for Anomaly Detection Th e Renyi divergence b e t we e n p r o b a b ilit y d is t r ib u t io n s P a n d Q is : D® ( P jjQ ) = 1 ® ¡ 1 lo g 2 ( X i p®i q 1¡® i ) ; ® ¸ 0 : Th e R e n yi d ive r g e n c e , a s t h e g e n e r a liz e d d ive r g e n c e , c a n d e d u c e m a n y c o n c r e t e d ive r - g e n c e fo r m s a c c o r d in g t o d i®e r e n t va lu e s o f o r d e r ®. Th e information distance is t h e s ym m e t r ic iz e d m e t r ic D® ( P; Q) = D® ( P jjQ ) + D® ( QjjP ) : It is s ig n ī c a n t t h a t t h is m e t r ic s c a n im p r o ve t h e s ys t e m s ' d e t e c t io n s e n s it ivit y b y a d - ju s t in g t h e va lu e o f o r d e r ® o f t h e g e n e r a liz e d e n t r o p y a n d in fo r m a t io n d is t a n c e m e t r ic s . A s t h e p r o p o s e d m e t r ic s c a n in c r e a s e t h e in fo r m a t io n d is t a n c e b e t we e n t h e a t t a c k t r a ± c a n d t h e le g it im a t e t r a ± c , t h e y c a n e ®e c t ive ly d e t e c t lo w-r a t e D D o S a t t a c ks e a r ly a n d r e d u c e t h e fa ls e p o s it ive r a t e c le a r ly. A s a r e s u lt [1 5 ] t h e p r o p o s e d in fo r m a t io n m e t r ic s im p r o ve t h e p e r fo r m a n c e o f lo w-r a t e D D o S a t t a c ks d e t e c t io n a n d IP t r a c e b a c k o ve r t h e t r a d it io n a l a p p r o a c h e s . A s in [9 ], u n ive r s a l s o u r c e c o d in g h a s b e e n u s e d fo r a n o m a ly d e t e c t io n in va r io u s p a p e r s m o s t ly b a s e d o n c o m p a r in g c o d e le n g t h , s e e fo r e xa m p le , [1 6 ] - [2 0 ]. Th e c o m p a r is o n is d o n e b y a m e a s u r e o f e n t r o p y r a t e . Th e p r o b le m is t h a t t h e r e a r e m a n y d is s im ila r s o u r c e s t h a t h a ve t h e s a m e e n t r o p y r a t e . To o ve r c o m e t h is is s u e a n e w a p p r o a c h b a s e d o n t h e n o t io n o f atypicality is s u g g e s t e d in [2 1 ]. Mo s t o f t h e va lu e in t h e in fo r m a t io n in s o m e a p p lic a t io n s is in t h e p a r t s t h a t d e via t e fr o m t h e a ve r a g e , t h a t a r e u n u s u a l, a t yp ic a l. A t yp ic a lit y is d e ¯ n e d a s d a t a t h a t c a n b e e n c o d e d wit h fe we r b it s in it s e lf r a t h e r t h a n u s in g t h e c o d e fo r t h e t yp ic a l d a t a . In [2 1 ] t h e a u t h o r s s h o w t h a t t h is d e ¯ n it io n h a s g o o d t h e o r e t ic a l p r o p e r t ie s a n d d e ve lo p a n im p le m e n t a t io n b a s e d o n u n ive r s a l s o u r c e c o d in g . Th is id e a o f ¯ n d in g a lt e r n a t ive e xp la n a t io n s fo r d a t a r a t h e r t h a n m e a s u r in g s o m e kin d o f d i®e r e n c e fr o m t yp ic a l d a t a is wh a t s e p a r a t e s t h is m e t h o d fr o m u s u a l a p p r o a c h e s in o u t lie r d e t e c t io n a n d a n o m a ly d e t e c t io n . A t yp ic a lit y is p u r e ly a p r o p e r t y o f d a t a a n d t h e r e fo r e t h e r e a r e n o m is s e s o r fa ls e a la r m s : d a t a is a t yp ic a l o r n o t . 4 . Co n c lu s io n Th e p r e s e n t wo r k s t u d ie d m a in in fo r m a t io n t h e o r e t ic a p p r o a c h e s fo r a n o m a ly d e t e c t io n a p - p lic a t io n s . In fo r m a t io n t h e o r y p r o vid e s a u n ive r s a l a p p r o a c h in s t e a d o f lo o kin g a t s p e c i¯ c s t a t is t ic s o f d a t a . Th e a d va n t a g e s o f in fo r m a t io n t h e o r e t ic t e c h n iqu e s a r e a s fo llo ws : ( 1 ) Th e y c a n o p e r a t e in a n u n s u p e r vis e d s e t t in g . ( 2 ) Th e y d o n o t m a ke a n y a s s u m p t io n s a b o u t t h e u n d e r lyin g s t a t is t ic a l d is t r ib u t io n fo r t h e d a t a . References [1 ] V .Ch a n d o la , A .B a n e r je e a n d V . K u m a r , \ A n o m a ly d e t e c t io n : a s u r ve y" , ACM Com- puting Surveys, vo l. 4 1 , n o . 3 , p p . 1 5 8 , 2 0 0 9 . [2 ] V .Ch a n d o la , A .B a n e r je e a n d V .K u m a r , \ A n o m a ly d e t e c t io n fo r d is c r e t e s e qu e n c e s : a s u r ve y" , IE E E Trans. K nowl. D ata E ng., vo l. 2 4 , n o . 5 , p p . 8 2 3 - 8 3 9 , 2 0 1 2 . M. Haroutunian and T. Badasyan 2 7 [3 ] C. C. A g g a r wa l a n d S .S a t h e , Ou t lie r E n s e m b le s . A n in t r o d u c t io n | , S p r in g e r , 2 0 1 7 . [4 ] X . X u , H .L iu a n d M.Y a o , R ecent P rogress of Anomaly D etection, H in d a wi, Co m p le xit y, 2 0 1 9 . [5 ] T. M. Co ve r a n d J. A . Th o m a s , E lements of information theory, 2 n d e d it io n , A W iley- Interscience P ublication, 2 0 0 6 . [6 ] W . L e e a n d D .X ia n g , \ In fo r m a t io n -t h e o r e t ic m e a s u r e s fo r a n o m a ly d e t e c t io n ', In P ro- ceedings of the IE E E Symposium on Security and P rivacy, IE E E Co m p u t e r S o c ie t y,p p . 1 3 0 - 1 4 3 , 2 0 0 1 . [7 ] Z.H e , S .D e n g , X u X ., \ A n o p t im iz a t io n m o d e l fo r o u t lie r d e t e c t io n in c a t e g o r ic a l d a t a " , Advances in Intelligent Computing., L e c t u r e N o t e s in Co m p u t e r S c ie n c e , vo l 3 6 4 4 , S p r in g e r , p p . 4 0 0 - 4 0 9 , 2 0 0 5 . [8 ] U . Qa m a r , \ A u t o m a t e d e n t r o p y va lu e fr e qu e n c y ( A E V F) a lg o r it h m fo r o u t lie r d e t e c t io n in c a t e g o r ic a l d a t a " , 12th W SE AS Intern. Conf. on Arti¯cial Intelligence, K nowledge E ngineering and D ata B ases, Ca m b r id g e , U K , 2 0 1 3 . [9 ] L . A ko g lu , H . To n g , J. V r e e ke n a n d C. Fa lo u t s o s , \ Fa s t a n d r e lia b le a n o m a ly d e t e c t io n in c a t e g o r ic a l d a t a , P roc. 21st ACM Int. Conf. Inf. K nowl. M anage., p p . 4 1 5 4 2 4 , 2 0 1 2 . [1 0 ] L . A ko g lu , H . To n g a n d D . K o u t r a , \ Gr a p h b a s e d a n o m a ly d e t e c t io n a n d d e s c r ip t io n : a s u r ve y" , D ata M in K nowl D isc, vo l. 2 9 , p p . 6 2 6 - 6 8 8 , 2 0 1 5 . [1 1 ] C. C. N o b le a n d D . J. Co o k, \ Gr a p h -b a s e d a n o m a ly d e t e c t io n " , P roc. of the 9th ACM - SIGK D D International Conference on K nowledge D iscovery and D ata M ining, A CM P r e s s , p p . 6 3 1 6 3 6 , 2 0 0 3 . [1 2 ] Y . Gu , A . Mc Ca llu m a n d D .To ws le y, \ D e t e c t in g a n o m a lie s in n e t wo r k t r a ± c u s in g m a xim u m e n t r o p y e s t im a t io n " , proc. ACM SIGCOM M Conf. Internet M easurement, p p . 3 4 5 -3 5 0 , 2 0 0 5 . [1 3 ] G. N yc h is , V . S e ka r , D . G. A n d e r s o n , H . K im , H . Zh a n g , \ A n e m p ir ic a l e va lu a t io n o f e n t r o p y-b a s e d t r a ± c a n o m a ly d e t e c t io n " , P roc. of the 8th ACM SIGCOM M Conference on Internet M easurement, Gr e e c e , 2 0 0 8 . [1 4 ] A . W a g n e r a n d B . P la t t n e r , \ E n t r o p y b a s e d wo r m a n d a n o m a ly d e t e c t io n in fa s t IP n e t wo r ks " , P roc. of the W orkshop on E nabling Technologies: Infrastructure for Collab- orative E nterprises, W E T ICE , 2 0 0 5 . [1 5 ] Y . X ia n g , K . L i a n d W . Zh o u , \ L o w-r a t e D D o S a t t a c ks d e t e c t io n a n d t r a c e b a c k b y u s in g n e w in fo r m a t io n m e t r ic s " , IE E E Trans. on information, forensics and security, vo l. 6 , n o . 2 , p p . 4 2 6 - 4 3 7 , 2 0 1 1 . [1 6 ] F. P a n a n d W . W a n g , \ A n o m a ly d e t e c t io n b a s e d -o n t h e r e g u la r it y o f n o r m a l b e h a vio r s , P roc. 1st Int. Symp. Syst. Control Aerosp. Astronaut., p p . 1 0 4 6 -1 1 0 4 6 -6 , 2 0 0 6 . [1 7 ] E . E . E ila n d a n d L . M. L ie b r o c k, \ A n a p p lic a t io n o f in fo r m a t io n t h e o r y t o in t r u s io n d e t e c t io n , P roc. F ourth IE E E Int. W orkshop Inf. Assurance, 2 0 0 6 . [1 8 ] C.-K . H a n a n d H .-K . Ch o i, \ E ®e c t ive d is c o ve r y o f a t t a c ks u s in g e n t r o p y o f p a c ke t d yn a m ic s , IE E E Netw., vo l. 2 3 , n o . 5 , p p . 4 1 2 , 2 0 0 9 . [1 9 ] N . W a n g , J. H a n a n d J.Fa n g , \ A n a n o m a ly d e t e c t io n a lg o r it h m b a s e d o n lo s s le s s c o m - p r e s s io n , P roc. IE E E 7th Int. Conf. Netw. Archit. Storage, p p . 3 1 3 8 , 2 0 1 2 . [2 0 ] H . S h a h r ia r a n d M. Zu lke r n in e , \ In fo r m a t io n -t h e o r e t ic d e t e c t io n o f S QL in je c t io n a t - t a c ks , P roc. IE E E 14th Int. Symp. High-Assurance Syst. E ng., p p . 4 0 4 7 , 2 0 1 2 . 2 8 Information - Theoretic Methods for Anomaly Detection [2 1 ] A . H o s t -Ma d s e n , E . S a b e t i a n d C. W a lt o n , \ D a t a d is c o ve r y a n d a n o m a ly d e t e c t io n u s in g a t yp ic a lit y: t h e o r y" , IE E E Trans. on Inform. Theory, vo l. 6 5 , n o . 9 , p p . 5 3 0 2 - 5 3 2 2 , 2 0 1 9 . Submitted 10.06.2019, accepted 18.10.2019. ²ÝáÙ³ÉdzݻñÇ Ñ³Ûïݳµ»ñÙ³Ý ÇÝýáñÙ³óÇáÝ - ï»ë³Ï³Ý Ù»Ãá¹Ý»ñ سñdz٠º. гñáõÃÛáõÝÛ³Ý ¨ îÇ·ñ³Ý ê. ´³¹³ëÛ³Ý ÐÐ ¶²² ÆÝýáñÙ³ïÇϳÛÇ ¨ ³íïáÙ³ï³óÙ³Ý åñáµÉ»ÙÝ»ñÇ ÇÝëïÇïáõï e-mail: armar@sci.am, tigranbadasyan@gmail.com ²Ù÷á÷áõÙ ²ÝáÙ³ÉdzݻñÇ Ñ³Ûïݳµ»ñáõÙÁ Ý߳ݳÏáõÙ ¿ ѳ½í³·Ûáõï ïíÛ³ÉÝ»ñÇ ÝáõÛݳϳݳ- óáõÙ, áñáÝù ÝáñÙ³É ã»Ý ϳ٠߻ÕíáõÙ »Ý ѳٳϳñ·áõÙ ÝáñÙ³É å³Ñí³ÍùÇó: ²ÝáÙ³ÉdzÛÇ Ñ³Ûïݳµ»ñáõÙÁ áõÝÇ ï³ñµ»ñ ÏÇñ³éáõÃÛáõÝÝ»ñ Ï»Ýë³ÇÝýáñÙ³ïÇϳÛÇ, å³ïÏ»ñÝ»ñÇ Ùß³ÏÙ³Ý, Ïǵ»ñ³Ýíï³Ý·áõÃÛ³Ý, ïíÛ³ÉÝ»ñÇ µ³½³ÛÇ ³Ýíï³Ý·áõÃÛ³Ý ³å³ÑáíÙ³Ý ¨ ³ÛÉ áÉáñïÝ»ñáõÙ: Î³Ý µ³½Ù³ÃÇí Ù»Ãá¹Ý»ñÇ ËÙµ»ñ, áñáÝù û·ï³·áñÍíáõÙ »Ý ³ÝáÙ³ÉdzݻñÇ Ñ³Ûïݳµ»ñÙ³Ý Ñ³Ù³ñ, Ý»ñ³éÛ³É íÇ׳ϳ·ñ³Ï³Ý Ù»Ãá¹Ý»ñÁ, Ý»ÛñáݳÛÇÝ ó³ÝóÇ Ù»Ãá¹Ý»ñÁ ¨ ÇÝýáñÙ³ódzÛÇ ï»ëáõÃÛ³Ý Ù»Ãá¹Ý»ñÁ: ²Ûë Ñá¹í³ÍáõÙ ùÝݳñÏíáõÙ »Ý ÑÇÙÝ³Ï³Ý ÇÝýáñÙ³óÇáÝ-ï»ë³Ï³Ý Ùáï»óáõÙÝ»ñÝ ³ÝáÙ³ÉdzݻñÇ Ñ³Ûïݳµ»ñÙ³Ý ÏÇñ³éáõÃÛáõÝÝ»ñÇ Ñ³Ù³ñ: ÆÝýáñÙ³ódzÛÇ ï»ëáõÃÛáõÝÁ ï³ÉÇë ¿ ѳÙÁݹѳÝáõñ Ùáï»óáõÙ` ïíÛ³ÉÝ»ñÇ íÇ׳ϳ·ñáõÃÛáõÝÁ í»ñÉáõÍ»Éáõ ÷á˳ñ»Ý: ´³Ý³ÉÇ µ³é»ñ` ³ÝáÙ³ÉdzÛÇ Ñ³Ûïݳµ»ñáõÙ, ¿Ýïñáådz, ѳñ³µ»ñ³Ï³Ý ¿Ýïñáådz, ï»Õ³ÛÇÝ áñáÝÙ³Ý ¿íñÇëïÇÏ ³É·áñÇÃÙ LSA: Èíôîðìàöèîííî-òåîðåòè÷åñêèå ìåòîäû äëÿ îáíàðóæåíèÿ àíîìàëèé Ìàðèàì Å. Àðóòþíÿí è Òèãðàí Ñ. Áàäàñÿ Èíñòèòóò ïðîáëåì èíôîðìàòèêè è àâòîìàòèçàöèè ÍÀÍ ÐÀ e-mail: armar@sci.am, tigranbadasyan@gmail.com Àííîòàöèÿ Îáíàðóæåíèå àíîìàëèé îçíà÷àåò îïîçíàâàíèå ðåäêèõ äàííûõ, êîòîðûå íå ÿâëÿþòñÿ íîðìàëüíûìè èëè îòêëîíÿþòñÿ îò íîðìàëüíîãî ïîâåäåíèÿ â ñèñòåìå. Îáíàðóæåíèå àíîìàëèé èìååò ðàçëè÷íûå ïðèìåíåíèÿ â áèîèíôîðìàòèêå, M. Haroutunian and T. Badasyan 2 9 îáðàáîòêå èçîáðàæåíèé, êèáåðáåçîïàñíîñòè, áåçîïàñíîñòè äëÿ áàç äàííûõ è ò. ä. Ñóùåñòâóåò ìíîãî ãðóïï ìåòîäîâ, êîòîðûå èñïîëüçóþòñÿ äëÿ îáíàðóæåíèÿ àíîìàëèé, âêëþ÷àÿ ñòàòèñòè÷åñêèå ìåòîäû, ìåòîäû íåéðîííûõ ñåòåé è òåîðåòèêî-èíôîðìàöèîííûå ìåòîäû.  íàñòîÿùåé ðàáîòå ðàññìàòðèâàþòñÿ îñíîâíûå òåîðåòèêî-èíôîðìàöèîííûå ïîäõîäû äëÿ ïðèëîæåíèé îáíàðóæåíèÿ àíîìàëèé. Òåîðèÿ èíôîðìàöèè ïðåäîñòàâëÿåò óíèâåðñàëüíûé ïîäõîä âìåñòî òîãî, ÷òîáû ñìîòðåòü íà êîíêðåòíûå ñòàòèñòè÷åñêèå äàííûå. Êëþ÷åâûå ñëîâà: Îáíàðóæåíèå àíîìàëèé, ýíòðîïèÿ, îòíîñèòåëüíàÿ ýíòðîïèÿ, ýâðèñòè÷åñêèé àëãîðèòì ëîêàëüíîãî ïîèñêà LSA.