A Self-Organizing Maps Model for Outlier Detection in Call Data from Mobile Telecommunication Networks

Olusola A. Abidogun and Christian W. Omlin
Intelligent Systems Research Group
Department of Computer Science
University of the Western Cape
Private Bag X17, Bellville 7535
Republic of South Africa
{oabidogun, comlin}

Abstract

The growth in the use of mobile telecommunications has resulted in huge amounts of data being produced and collected. Insight into information and knowledge derived from these databases can help operators provide better, customized services for a competitive edge in terms of customer care and retention, marketing and fraud detection. In this paper, we present a Self-Organizing Map (SOM) model for outlier detection in call data from users, over a period of time, in a mobile telecommunication network. Our model forms a topographic analysis of the user call patterns from which outlier call patterns can be isolated within the mobile telecommunication network. The outlier call patterns can later be interpreted and labeled according to specific requirements of the mobile service provider. Thus, suspicious call behaviours are isolated within the mobile telecommunication network, in order to identify abnormal call patterns of users. We give results using masked call data from a real mobile telecommunication network.

Network Management/OSS Category: Subscriber Management

I. INTRODUCTION

A Call Detail Record (CDR) is created for every completed call in a mobile telecommunication network. These data records are referred to as Toll Tickets. Toll Tickets contain a wealth of information about the call made by the subscriber. Some of the salient information includes subscriber number (MSISDN - Mobile Station International Integrated Services Digital Network), date and time of call, the duration, call charge, the originating area, and a number of other fields.
Besides their billing role, Toll Tickets constitute an enormous database from which other useful knowledge about the caller can be extracted. One example is the detection of outliers (anomalous use) in the mobile telecommunication network. Outliers can be described as atypical, infrequent observations: call data patterns which do not appear to follow the characteristic distribution of the rest of the data. Consequently, outliers have to be classified as such, but only in relation to an emerging temporal pattern. Though various other CDR services exist on which outliers can be detected, such as General Packet Radio Service (GPRS) and Multimedia Message Service (MMS), this study is based on Global System for Mobile communications (GSM) CDR.

Over a period of time, an individual handset's Subscriber Identity Module (SIM) card generates a large pattern of use. The pattern of use could include international calls and time-varying call patterns, among others. Anomalous use can be detected within the overall pattern, for example subscribers' abuse of free call services such as emergency services. Anomalous use can be identified as belonging to one of two types [1]:
1) The pattern is intrinsically fraudulent; it will almost never occur in normal use. This type is relatively easy to detect.
2) The pattern is anomalous only relative to the historical pattern established for that phone.

In order to isolate outliers of the second type, it is necessary to have knowledge of the history of SIM card usage. Hence, a descriptive analysis of the call data for each subscriber can be used for knowledge extraction. Interpretation by way of clustering or grouping similar patterns can help in isolating suspicious call behaviour within the mobile telecommunication network. This can also help fraud analysts in their further investigation and call pattern analysis of subscribers.
While call data are recorded for subscribers for billing purposes, no a priori assumptions are made about which data are indicative of fraudulent call patterns. In other words, the call records kept for billing purposes are unlabeled. Further analysis is thus required in order to isolate fraudulent usage. Because of the huge call volumes, this analysis is impossible without sophisticated techniques and tools. Hence there is a need for techniques and tools that intelligently assist humans in analyzing volumes of call data for subscribers in order to facilitate the fraud detection process. One such technique is unsupervised learning; there are many unsupervised learning algorithms suitable for our kind of problem. For further reading we refer to [2] and [3]. In this paper, we present an application of the most popular neural network model for unsupervised learning, the Self-Organizing Map (SOM), originated by Kohonen [4]. It has been successfully applied in clustering and visualization of high-dimensional data.

Similar work on outlier detection in mobile telecommunication was reported in [5]. In that work, the authors reported on an outlier detection engine which detects outliers in an online process through the online unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. The model was applied to network intrusion detection. In a follow-up to [5], the authors of [6] described a new framework for outlier filtering that uses supervised and unsupervised learning techniques iteratively in order to make the detection process more effective and more understandable. Both works differ from ours, which applies a SOM model to outlier detection. In a related work, [7] used brute force techniques to find outliers by studying the behaviour of projections from the data set. Other relevant outlier detection models were reported in [8], [9] and [10]. For an overview of outliers, see [11].
In Section 2, we briefly review the standard Self-Organizing Map algorithm. In Section 3, we describe the mechanism for learning Self-Organizing Maps with our normalized call data. In Section 4, we present experimental results on real data. Section 5 discusses our investigation and observations, while in Section 6 we conclude and present possible further applications.

II. SELF-ORGANIZING MAPS MODEL

The Self-Organizing Map (SOM), developed by Kohonen [4], is a neural network model for the analysis and visualization of high-dimensional data. It projects the nonlinear statistical relationships between high-dimensional data into simple topological relationships on a regular, typically two-dimensional grid of nodes. The SOM thereby compacts information while preserving the most important topological and/or metric relationships of the primary data elements on the two-dimensional plane. SOMs have been successfully applied in the development of adaptive devices for various telecommunications [12]. See [13] and [14] for a bibliography of published papers.

The basic SOM model is a set of prototype vectors with a defined neighbourhood relation. This neighbourhood relation defines a structured lattice of map units, which may be rectangular or hexagonal. The SOM is formed through an unsupervised, competitive learning process. The process begins with a search for the winner unit, which minimizes the Euclidean distance between a data sample $x$ and the map units $m_i$. This unit is described as the best-matching node, signified by the subscript $c$:

$$\|x - m_c\| = \min_i \{\|x - m_i\|\}, \quad \text{or} \quad c = \arg\min_i \{\|x - m_i\|\}. \tag{1}$$

Then, the map units are updated in the topological neighborhood of the winner unit, which is defined in terms of the lattice structure. The update step can be performed by applying

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)] \tag{2}$$

where $t$ is an integer, the discrete-time coordinate, and $h_{ci}(t)$ is the so-called neighbourhood kernel, a function defined over the lattice points. The average width and form of $h_{ci}$ define the stiffness of the elastic surface to be fitted to the data points. In addition, the last term in the square brackets is proportional to the gradient of the squared Euclidean distance $d(x, m_i) = \|x - m_i\|^2$. The learning rate $\alpha(t) \in [0, 1]$ must be a decreasing function of time, and the neighborhood function $h_c(t, i)$ is a non-increasing function around the winner unit defined in the topological lattice of map units. A good candidate is a Gaussian around the winner unit defined in terms of the coordinates $r$ in the lattice of neurons:

$$h_c(t, i) = \exp\left(-\frac{\|r_i - r_c\|^2}{2\sigma(t)^2}\right). \tag{3}$$

Some other neighborhood functions are discussed in [4]. During training, the learning rate and the width of the neighborhood function are decreased, typically in a linear fashion. The map then tends to converge to a stationary distribution, which approximates the probability density of the data.

After the input samples have been presented and the codebook vectors have converged, the map is calibrated. Calibration of the map is done to locate images of different input data items on it. In practical applications for which such maps are used, it may be self-evident how a particular input data set ought to be interpreted and labeled. By inputting a number of typical, manually analysed data sets, looking where the best matches on the map lie, and labeling the map units correspondingly, the map becomes calibrated. Since the mapping is assumed to be continuous along a hypothetical elastic surface, the closest reference vectors approximate the unknown input data. A number of ways of improving the performance of the SOM algorithm and a number of variants of the SOM are presented in [4].
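The winner search (1), the update step (2) and the Gaussian neighbourhood (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in this work: the function names, the rectangular lattice, the initialisation from random data samples, the folding of the learning rate $\alpha(t)$ into the kernel, the default parameter values and the lower bound on $\sigma(t)$ are all choices made for the example.

```python
import math
import random


def best_matching_unit(x, units):
    """Index c of the unit with the smallest Euclidean distance to x (Eq. 1)."""
    return min(range(len(units)), key=lambda i: math.dist(x, units[i]))


def gaussian_kernel(r_i, r_c, sigma):
    """Gaussian neighbourhood around the winner's lattice position (Eq. 3)."""
    return math.exp(-math.dist(r_i, r_c) ** 2 / (2.0 * sigma ** 2))


def train_som(data, rows, cols, n_steps=2000, alpha0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors.

    Assumptions of this sketch: rectangular lattice, units initialised from
    random data samples, learning rate and kernel width decayed linearly.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    units = [list(rng.choice(data)) for _ in coords]
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0

    for t in range(n_steps):
        frac = 1.0 - t / n_steps           # linear decay from 1 towards 0
        alpha = alpha0 * frac              # learning rate alpha(t)
        sigma = max(sigma0 * frac, 0.5)    # floor keeps the kernel well defined
        x = rng.choice(data)               # present one random sample
        c = best_matching_unit(x, units)
        for i, m in enumerate(units):
            # Assumed form of the kernel in Eq. 2: alpha(t) times the Gaussian.
            h = alpha * gaussian_kernel(coords[i], coords[c], sigma)
            for k in range(dim):
                m[k] += h * (x[k] - m[k])
    return coords, units
```

Calling `train_som(vectors, 3, 3)` on a list of feature vectors returns the lattice coordinates and the trained prototype vectors; `best_matching_unit` can then be used to locate the image of any input on the map, as in the calibration step described above.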
The Self-Organizing Map may be visualized by using a unified distance matrix representation [15], where the clustering of the SOM is visualized by calculating distances between the map units locally and representing these with gray levels. Another choice for visualization is Sammon's mapping [16], which projects the high-dimensional map units on a plane by minimizing the global distortion of inter-point distances when applying the mapping. However, in this work we used the Extended Kohonen Map developed by [17].

III. EXPERIMENTS

The SOM model was applied to unsupervised classification of call data for prepaid service subscribers from a real mobile telecommunication network. The data set contains the call data of 500 masked subscribers for a period of 6 months. The data set was first normalized. Subsequently, we extracted from the data set the Mobile Originating Calls (MOC). These are calls that were initiated by the subscribers. Within the 6-month period, a total of 227,318 calls originated from the 500 subscribers. The final training records after normalization contained the following fields:
1) Subscriber Number (MSISDN)
2) Other Party Called
3) Cell ID in use by the subscriber
4) Area Code for the location of the subscriber
5) Date and Time the call was made
6) Duration of the call

The subscriber number was not used in training but to identify the SOM of each subscriber.

A. Feature Extraction

We generated a feature vector for each subscriber. However, there were symbolic data present in our training data. These include the Other Party Called, the Cell ID in use by the subscriber and the Area Code for the location of the subscriber. In order to convert these to numeric values, we constructed a frequency table for each of the symbolic fields per subscriber. Each symbol in the frequency table for each subscriber was then ranked based on its individual frequency. This ranking was then used as the corresponding numeric value for the symbol.
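The frequency-based ranking of a symbolic field can be sketched as follows. This is only an illustration: the function name `rank_encode` is hypothetical, and the text does not state the direction of the ranking or how ties are handled, so the sketch assumes that the most frequent symbol receives rank 1 and that ties are broken by the symbol's sort order.

```python
from collections import Counter


def rank_encode(symbols):
    """Replace each symbol by its frequency rank within one subscriber's calls.

    Assumptions of this sketch: rank 1 is the most frequent symbol, and ties
    are broken by the symbol's sort order so that the result is deterministic.
    """
    counts = Counter(symbols)  # frequency table for this field
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    ranks = {s: r for r, s in enumerate(ordered, start=1)}
    return [ranks[s] for s in symbols], ranks


# Hypothetical cell IDs for one subscriber's calls.
encoded, ranks = rank_encode(["A", "B", "A", "C", "A", "B"])
print(encoded)  # [1, 2, 1, 3, 1, 2]
```

The same function would be applied separately to the Other Party Called, Cell ID and Area Code fields of each subscriber, since the frequency tables are built per field and per subscriber.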
The field for date and time was also transformed. It was converted to peak (7am to 8pm) and off-peak (8pm to 7am) periods. Peak periods were represented by 1 and off-peak periods by 2. Thus we have a five-dimensional feature vector. The number of training examples available for each subscriber varies. The subscriber with the maximum number of training examples had 7007 and the subscriber with the minimum number of examples had 3. The feature vectors for each subscriber were saved in a text file which can be read by the SOM software. Each vector was labeled for traceability.

B. Construction of the Maps

The maps for each subscriber were generated by software that implements the standard SOM algorithm [17]. It took as its input the saved text file of input features. A sample format of the file is shown in Table 1. After training, a number of outputs are generated. A file of the weights is generated with a .map extension (see Table 2). A PostScript file is also generated, which gives a graphical view of the map (see Figure 1).

Table 1. A set of 9 feature vectors for a user's transactions in 6 months. The calls made are labeled C1-C9 for traceability.

Table 2. Trained weight vectors for the input features in Table 1.

IV. RESULTS

A traditional Kohonen map groups similar input vectors together. On such a map, distance implies difference, but nearness does not imply resemblance. The map shown in Figure 1 makes resemblance visible by calculating the squared difference between neighbouring units of the trained map and using this value to colour the edge separating the units. Hence dark lines indicate strong difference and light lines indicate strong resemblance.

Fig. 1. An extended Kohonen Map showing difference and resemblance between map units.

A. Outliers Detection

There is a need to identify outliers from clusters on the map which indicate abnormal call pattern behaviour. Generally, clusters that tend to the corners of the map, especially with large distances (dark lines), seem to indicate abnormal behaviour; for instance, C4 and C8 in Figure 1. On inspection of the original data, it was discovered that these calls actually showed behaviour that was unusual, e.g. unusually long call duration. However, the only way we can validate our assumption is through a feedback procedure in collaboration with the fraud analysts to further investigate the knowledge extracted. The feedback will be used to label the clusters appropriately.

Fig. 2. Extended Kohonen Maps for three runs on the same input vector of size 9, showing the preservation of the topological property of the maps.

V. DISCUSSION

Though we were able to represent the call profile of a mobile phone subscriber using the SOM, it must be pointed out that information on the temporal nature of this particular example was lost in the process. Further, the fact that different map sizes were generated for the different subscribers presents a challenge when it comes to comparing the call profiles of different subscribers. Different maps (of the same size) are generated for the same subscriber with different runs of the algorithm. This is due to the stochastic nature of the SOM. This means that the accuracy of the map depends on the number of iterations of the SOM. Nonetheless, the algorithm preserves the topological property of the map. Observations that are close to each other in the input space (at least locally) remain close to each other in the SOM (see Figure 2).

An important question that arises about the map is its reliability. From our results, the SOM algorithm can be used for gaining insight into the call data and for the initial search for potential dependencies. In general, the findings need to be cross-validated with other principled statistical methods, in order to assess the confidence of the conclusions and to reject those that are not statistically significant.
The works of [17] and [18] offer suggestions in this regard.

VI. CONCLUSION

We presented a descriptive data mining approach for outlier detection in subscribers' call data in a mobile telecommunication network using the Self-Organizing Map algorithm. The ideas presented here may be used for clustering call patterns in order to label them as normal or abnormal. The labeled data set could then be used to learn a time-series classifier, for instance a recurrent neural network model, in a supervised fashion. In addition, combining the advantages of our approach with a recurrent neural network model should result in promising techniques for numerous real-world tasks involving detection of input sequence features spanning extended time periods. We intend to pursue these tracks in future work.

ACKNOWLEDGMENT

The authors wish to acknowledge the support of the Telkom-Cisco Centre of Excellence for IP and Internet Computing for funding this research.

REFERENCES

[1] R. J. Frank, S. P. Hunt and N. Davey, Applications of Neural Networks to Telecommunications Systems, Department of Computer Science, University of Hertfordshire, Hatfield, Herts., UK. AL10 9AB. Available at nngroup/pubs/papers/frank-eufit99.pdf
[2] G. E. Hinton and J. A. Anderson, Parallel Models of Associative Memory, New Jersey, USA: Lawrence Erlbaum Associates,
[3] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge, UK: Cambridge University Press,
[4] T. Kohonen, Self-Organizing Maps, Berlin, Germany: Springer-Verlag,
[5] K. Yamanishi, J. Takeuchi and G. Williams, Online Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. In Proceedings of ACM (KDD 2000), Boston, Massachusetts, USA,
[6] K. Yamanishi and J. Takeuchi, Discovering Outlier Filtering Rules from Unlabeled Data - Combining a Supervised Learner with an Unsupervised Learner. In Proceedings of ACM (KDD 2001), San Francisco, California, USA,
[7] C. C. Aggarwal and P. S. Yu, Outlier Detection for High Dimensional Data. In Proceedings of ACM (SIGMOD 2001), May 21-24, Santa Barbara, California, USA,
[8] P. Wu, W. Peng and M. Chen, Mining Sequential Alarm Patterns in a Telecommunication Database, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan. Available at
[9] E. Hung and D. W. Cheung, Parallel Mining of Outliers in Large Database. In Journal of Distributed and Parallel Databases Vol. 12, 5-26, Dordrecht, Netherlands: Kluwer, 2002.
[10] J. Vesanto and E. Alhoniemi, Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, Vol. 11, No. 3, , May
[11] V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd ed. West Sussex, UK: John Wiley,
[12] T. Kohonen, The Self-Organizing Map. In Proceedings of the IEEE, E. Sanchez-Sinencio and C. Lau (Eds), Vol. 78, No. 9, , [1990], IEEE Press, New York,
[13] S. Kaski, J. Kangas and T. Kohonen, Bibliography of Self-Organizing Map (SOM) papers: , Neural Computing Surveys, Vol. 1, ,
[14] M. Oja, S. Kaski and T. Kohonen, Bibliography of Self-Organizing Map (SOM) papers: Addendum, Helsinki University of Technology, Neural Networks Research Centre. Available at 1.pdf.
[15] A. Ultsch and H. Siemon, Kohonen's Self-Organizing Maps for Exploratory Data Analysis. In Proceedings of the International Neural Network Conference (INNC 90), pages , Dordrecht, Netherlands: Kluwer,
[16] J. W. Sammon Jr., A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers, Vol. C-18, No. 5, , May
[17] P. Kleiweg, An Extended Kohonen Map, University of Groningen, Netherlands. Available at kleiweg/kohonen/kohonen.html#soft,
[18] E. Bodt, M. Cottrell and M. Verleysen, Are they Really Neighbor? A Statistical Analysis of the SOM Algorithm Output. In Proceedings of the International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 35-40, Key West, Florida (USA), 4-7 January,
[19] V. Kreinovich, Outlier Detection Under Interval and Fuzzy Uncertainty: Algorithmic Solvability and Computational Complexity. Available at
[20] T. Kohonen, J. Kangas and J. Laaksonen, SOM PAK: The Self-Organizing Map Program Package, Helsinki University of Technology, Finland. Available at pak.

include foundations of artificial intelligence, neural networks, fuzzy and hybrid systems and their application to signal processing and time series analysis and modeling. His address is

Olusola A. Abidogun is a Masters Student at the University of the Western Cape. He is a member of the Intelligent Systems Research Group of the Department of Computer Science. He i