
A FRAMEWORK FOR SECURITY IN DATA MINING USING INTELLIGENT AGENTS

VOL. 10, NO. 6, APRIL 2015, ISSN 1819-6608
ARPN Journal of Engineering and Applied Sciences
©2006-2015 Asian Research Publishing Network (ARPN). All rights reserved. www.arpnjournals.com

Sharath Kumar J. and Maheswari N.
School of Computing Science and Engineering, VIT University Chennai, Vandalur, Kelambakkam Road, Tamil Nadu, India
E-Mail: sharathkumar.j@vit.ac.in

ABSTRACT
Nowadays a corporation can outsource its data mining needs to a third party. An establishment without sufficient computational resources or expertise can outsource its mining needs. However, the data and the association rules defined over it are the sole property of the company, so privacy and security need to be preserved. Partitioned databases can also reduce the complexity of handling massive data and improve the overall performance of the system. In this paper we devise a scheme that ensures the privacy of data and incorporates database partitioning, giving an efficient and secure system with a new algorithm for privacy preservation. The algorithm combines l-diversity and p-sensitive techniques. Agent technology is also introduced into the system: different agents (mining agent, data agent, task agent, user agent and so on) are used for different tasks, and they communicate with each other and work together to provide a heuristic solution.

Keywords: partitioning, database, cipher, i-privacy, intelligent agents, re-identification, l-diversity, p-sensitive.

INTRODUCTION
With new technologies offering a model for IT services based on the internet and big data centers, the importance of outsourced data and computing services is increasing day by day. Given the data-intensive nature of cloud computing, business intelligence and knowledge discovery services are expected to be among the services externalized to the cloud.
While sophisticated analysis of bulk business data is advantageous, the security issues cannot be overlooked. The main issue is that the server has access to the owner's private data and can gain knowledge from it; in other words, corporate privacy is at risk. Existing privacy-preserving techniques for personal privacy include modifying data through encryption, perturbation and generalization; data mining algorithms with privacy techniques built in; randomized response techniques; and heuristic-based techniques. In this paper we devise a technique in which each transformed item t is made indistinguishable from other items. The main objective is to provide data privacy, for which l-diversity and p-sensitive techniques are used [13].

Agents are the other ingredient. Agents are self-learning, which is why they are used for this task and why the result is a heuristic solution [2]. An agent is defined as a piece of code that is situated in some environment and is capable of autonomous action in that environment in order to meet its design objectives. Agents have several properties: robust (recovers from failure), social (interacts with other agents), reactive (responds to changes in its environment) and autonomous (independent, that is, not controlled externally) [4].

Indistinguishability is achieved by first encrypting the data and then grouping it. Noisy transactions are added to defend against frequency-based attack models. We integrate privacy-preserving transparency techniques with partitioned databases for increased privacy as well as efficient data retrieval. Horizontal and vertical partitioning form an integral part of database design, as they transform the database into a more manageable and efficient system.
Horizontal partitioning: This divides a table, index or view horizontally into multiple smaller sets of rows, each with fewer rows than the original. Two widely used types of horizontal partitioning are hash and range partitioning. Horizontal partitioning is widely used because it makes database server management easier. DBA operations such as backup and restore become easier if the partitions are aligned, i.e. the table and its corresponding index are partitioned identically [9].

Vertical partitioning: This divides a table vertically into multiple tables, each with fewer columns. There are two main types of vertical partitioning: normalization, where repeating attributes are eliminated, and row splitting, by which the original table is divided into tables that contain fewer columns [8].

Hybrid partitioning: The data is partitioned first horizontally and then vertically, or vice versa.

For the main objective, data privacy, the solution uses l-diversity and p-sensitive techniques. Data may be sensitive or secret, and the two words have different meanings: secret data is something like an ATM PIN or a password, whereas sensitive data is something like a disease a particular person has, such as HIV or another health issue. Both types of data should be protected from unauthorised persons. Nowadays the focus is mainly on the k-anonymity technique, but k-anonymity faces the diversity and background knowledge problems [15]. To overcome these problems we combine the l-diversity [13] and p-sensitive [14] techniques. L-diversity divides the data into a sensitive part and a non-sensitive part. Sensitive data is the key attribute, which gives an idea about a particular user; non-sensitive data is every attribute other than the sensitive one.
This attribute is represented with a variety of different arrangements [14]. The p-sensitive technique is then added, which is essentially masking of sensitive data [13]. There are different ways to mask or modify data: generalization, suppression, noise adding, substitution and data swapping [13]. We mask data and also assign a sensitivity level to each attribute: high, medium or low. This new algorithm overcomes the drawbacks of k-anonymity and other existing techniques. The entire task can be done by intelligent agents [2].

RELATED WORK

Privacy preservation
Privacy-preserving data mining is a widely researched topic, since a lot of private data is collected for the purpose of mining. The collector essentially does not protect the privacy of the collected data. Random perturbation techniques can be used to preserve data privacy, and many such techniques have been developed so far. The perturbation is done in such a way that the mined patterns are almost identical to the patterns that could be mined from the original data.

The problem addressed in this paper is the outsourcing of data mining. The main distinction from other data mining problems is that here both the data required for mining and the patterns that are mined should remain private. This paper also looks into privacy preservation on the user side. Authentication and privacy-preserving techniques are used to ensure high security of the underlying data.

Partitioning
Vertically partitioning a table splits it into two or more sub-tables, each containing a subset of the columns. Vertical partitioning substantially reduces the amount of data scanned per query, since many queries access only a small subset of the columns in the table. Sub-tables can be partitioned so that each contains a distinct set of columns, apart from the key attributes.
These key attributes are needed when reconstructing the original table from the sub-tables.

Horizontal partitioning can be applied to tables or sub-tables, a non-clustered index or a view. It is specified using a partitioning method that maps a given row in the database to a partition number. All rows with the same partition number are stored in the same partition. This single-node horizontal partitioning can increase performance as well as manageability. Two kinds of partitioning method are hash and range partitioning.

In hash partitioning, the partition number is generated by applying a hash function to a specified set of columns. It is defined by a tuple (C, n), where C is the set of attribute types and n is the number of partitions. For example, let S be the table to be partitioned, with attributes (c1 int, c2 float, c3 int, c4 date). The hash partition H = ({int, float}, 10) yields 10 partitions, with rows assigned by applying the hash function to the first two columns, c1 and c2, of each row of S [11].

In range partitioning, the partitions are formed on the basis of ranges of values in an attribute. It is defined by a tuple (c, V), where c denotes the attribute type and V is a sorted sequence of values in the domain of c. For example, the range partitioning R = (int, <10, 30>) applied to column c1 of S divides the table into 3 partitions. The first partition contains all rows with a c1 value less than 10, the second those with values between 10 and 30, and the third those with values above 30. Range partitioning is defined on a single column rather than a set of columns [12]. A hybrid partition can be formed by first range partitioning a table and then hash partitioning each range obtained.

By data we mean information in a structured format, that is, in the form of tables. A table consists of rows and columns.
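As a rough illustration of the hash and range definitions above, the following Python sketch partitions a small in-memory table. It is a minimal sketch, not the authors' implementation: the sample rows, the function names `hash_partition`, `range_partition` and `hybrid_partition`, the use of Python's built-in `hash`, and the treatment of the boundary values 10 and 30 (each boundary value is placed in the upper partition) are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical sample of table S with attributes (c1 int, c2 float, c3 int, c4 date).
S = [
    {"c1": 5,  "c2": 1.5, "c3": 7, "c4": "2015-01-03"},
    {"c1": 10, "c2": 2.0, "c3": 1, "c4": "2015-01-09"},
    {"c1": 25, "c2": 0.5, "c3": 4, "c4": "2015-02-11"},
    {"c1": 30, "c2": 9.9, "c3": 2, "c4": "2015-03-01"},
    {"c1": 42, "c2": 3.1, "c3": 8, "c4": "2015-04-20"},
]


def hash_partition(rows, columns, n):
    """H = (C, n): hash the values of the columns in C into one of n partitions."""
    parts = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in columns)
        parts[hash(key) % n].append(row)
    return dict(parts)


def range_partition(rows, column, boundaries):
    """R = (c, V): a sorted boundary list V gives len(V) + 1 partitions.

    Assumption: a value equal to a boundary goes to the upper partition.
    """
    parts = defaultdict(list)
    for row in rows:
        parts[bisect_right(boundaries, row[column])].append(row)
    return dict(parts)


def hybrid_partition(rows, column, boundaries, hash_columns, n):
    """Range partition first, then hash partition each range."""
    ranges = range_partition(rows, column, boundaries)
    return {r: hash_partition(part, hash_columns, n) for r, part in ranges.items()}


by_range = range_partition(S, "c1", [10, 30])          # R = (int, <10, 30>) on c1
by_hash = hash_partition(S, ["c1", "c2"], 10)          # H = ({int, float}, 10)
hybrid = hybrid_partition(S, "c1", [10, 30], ["c1", "c2"], 10)
```

Each partitioning only redistributes rows, so the row count summed over all partitions always equals the size of S.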
The rows and columns of a table are also called tuples and attributes. The attributes are unique [9]. We assume that no two tuples contain information about the same user; a link can then be created between private information and external information. The external information is referred to as the quasi-identifier, which carries data with the same meaning as the original but not the original data itself. When data is released, it is simply linked to this external data and released, which offers privacy protection [12].

In our system we use different agents for particular tasks, such as a user agent, a mining agent and a task agent [2]. They communicate with each other and work together. Adding agents brings a number of advantages over the existing system.

Work on k-anonymity and its drawbacks
K-anonymity is widely discussed because of its simplicity, but it does not guarantee data privacy. In this section we look at the drawbacks of k-anonymity and the existing work on l-diversity and p-sensitive techniques.

Table-1. Original table.

Zip code  Age  Nationality  Condition
12054     21   Russian      Heart disease
12068     23   Indian       Heart disease
12068     24   Japanese     Viral infection
13053     50   Russian      Cancer
13053     55   Japanese     Heart disease
13053     47   Indian       Viral infection
12054     37   Russian      Cancer
12068     36   Indian       Cancer
12068     35   Japanese     Cancer

Table-2. 4-anonymous table.

     Non-Sensitive                     Sensitive
No.  Zip code  Age    Nationality      Condition
1    120**     <30    *                Heart disease
2    120**     <30    *                Heart disease
3    120**     <30    *                Viral infection
4    130**     >=40   *                Cancer
5    130**     >=40   *                Heart disease
6    130**     >=40   *                Viral infection
7    120**     3*     *                Cancer
8    120**     3*     *                Cancer
9    120**     3*     *                Cancer

Drawbacks of K-anonymity
In this section we discuss two simple attacks: the homogeneity attack and the background knowledge attack [15].
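The anonymity level of a released table can be checked mechanically. The sketch below, offered only as an illustration, groups the nine generalized rows shown above (the table with masked zip codes and age ranges) by their quasi-identifier values, then reports the smallest group size and the number of distinct conditions per group. The function names and the tuple layout are assumptions; with the rows as printed, the smallest group holds three records.

```python
from collections import defaultdict

# Generalized rows as printed: (zip code, age, nationality, condition).
ROWS = [
    ("120**", "<30",  "*", "Heart disease"),
    ("120**", "<30",  "*", "Heart disease"),
    ("120**", "<30",  "*", "Viral infection"),
    ("130**", ">=40", "*", "Cancer"),
    ("130**", ">=40", "*", "Heart disease"),
    ("130**", ">=40", "*", "Viral infection"),
    ("120**", "3*",   "*", "Cancer"),
    ("120**", "3*",   "*", "Cancer"),
    ("120**", "3*",   "*", "Cancer"),
]


def equivalence_classes(rows, qi_count=3):
    """Map each quasi-identifier combination to its list of sensitive values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:qi_count]].append(row[qi_count])
    return dict(groups)


def anonymity_level(rows):
    """Largest k for which the table is k-anonymous (smallest group size)."""
    return min(len(values) for values in equivalence_classes(rows).values())


def distinct_sensitive(rows):
    """Number of distinct sensitive values in each equivalence class."""
    return {qi: len(set(values)) for qi, values in equivalence_classes(rows).items()}


k = anonymity_level(ROWS)
diversity = distinct_sensitive(ROWS)
```

A group whose count of distinct conditions is 1 reveals the condition of everyone in it, however large the group is.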
Homogeneity attack
Consider a simple example. Shriya and Krishna are neighbours. One day Krishna falls ill and is admitted to hospital. Seeing the ambulance, Shriya learns that Krishna is ill, and from the following she works out what disease he is suffering from. She looks at the hospital records (Table-2), knowing that one of the records holds Krishna's information. As his neighbour, she knows that he is 37 years old and an Indian male living in zip code 12068. She therefore knows that Krishna's record is 7, 8 or 9. Every patient in that group has the same health condition, so she concludes that Krishna has cancer. This is a lack of diversity in the sensitive attribute: the same sensitive value occurs repeatedly within a group, so particular information is easy to find. This is the homogeneity attack.

Background knowledge attack
Continuing the previous example, Shriya has another friend, Abhay, who is admitted to the same hospital as Krishna. Both records are stored in Table-2. Shriya knows that his age is 24 and his zip code is 12068. Based on this information she narrows the search to records 1, 2 and 3. Without extra information about Abhay, Shriya cannot determine his disease. But she has background knowledge about Abhay, so she concludes that he has a viral infection. K-anonymity therefore does not protect against the background knowledge attack.

Drawback of l-diversity
Here we discuss the l-diversity technique on its own. L-diversity divides the data into sensitive and non-sensitive parts [13]. The sensitive part is not released to anyone; it is sent only to authorised persons. This does not provide sufficient privacy. If a sensitive or key attribute value occurs again and again, it becomes easy to find a particular person's record or information. The sensitive attributes are grouped with the non-sensitive attributes [13]. Consider the following example.

Table-3. Before l-diversity.
Items: Wine, Apple, Butter, Ice-cream, Pregnancy test, HIV test
Abhay    X X
Krish    X X X
Nupur    X X X
Dhanu    X X
Millind  X X X

In this example, if Dhanu and Nupur both used a pregnancy test, it becomes easy to identify the attribute. This happens because the sensitive attribute occurs repeatedly, which is the drawback, as shown in Table-4.

To overcome the drawbacks of both techniques we propose a new technique that combines l-diversity and p-sensitive, described in the next section.

Table-4. After l-diversity.

Items: Wine, Butter, Ice-cream, Apple
Abhay    X X
Krish    X X
Millind  X X X
Dhanu    X X
Nupur    X X

PROPOSED APPROACH
In the proposed system, the database is divided into smaller segments called partitions. The privacy of outsourced data is preserved using the i-privacy scheme, and different agents are used to achieve this.

Architecture
The architecture consists of a server side and a client side. The server here is the data mining server, which mines the data sent to it. User access to the database is depicted in the second architecture. The steps taken to preserve the privacy of the database are encryption, grouping, adding noisy data and finally shuffling (rearranging) the indexes.

Architecture for data mining

Figure-1. Privacy preserved data mining.

The system consists of an intelligent encryption and decryption module that transforms the items in the database C, according to the scheme, into an encrypted database C*. The module first encrypts some items in the database using a substitution cipher, in which a unit of a database item is replaced with another item. The transformation mapping made for each item is stored in a file so that the returned result can be decrypted. After encryption, the module groups the items in the database depending on the total number of items in it.
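The substitution step of the encryption and decryption module, together with the stored mapping that makes the result reversible, can be sketched in a few lines of Python. This is a hedged illustration rather than the module itself. The letter row is the one printed with the crime dataset example in this paper, with one change that is purely an assumption: the printed row maps both 'l' and 'z' to 't', which cannot be reversed, so 'z' is mapped here to 'v', the only unused letter. Digits and all other characters are left unchanged, and the function names are hypothetical.

```python
import string

PLAIN = string.ascii_lowercase
# Printed cipher row, except the final letter: the paper prints 't' twice
# (for 'l' and for 'z'); 'z' -> 'v' is ASSUMED so the mapping is one-to-one.
CIPHER = "zebraywxupqtmjklodncfigshv"

# A substitution is reversible only if the cipher row is a permutation.
assert sorted(CIPHER) == sorted(PLAIN)

# These two tables play the role of the stored transformation mapping.
ENCRYPT = str.maketrans(PLAIN, CIPHER)
DECRYPT = str.maketrans(CIPHER, PLAIN)


def encrypt(text):
    """Replace each lowercase letter by its cipher letter."""
    return text.lower().translate(ENCRYPT)


def decrypt(text):
    """Reverse the substitution using the stored mapping."""
    return text.translate(DECRYPT)


ciphertext = encrypt("crime")
recovered = decrypt(ciphertext)
```

The permutation check is the important design point: a row with a repeated cipher letter encrypts without complaint but loses information on decryption.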
Irrelevant items are then detected and masked by this module. The indexes of each group are jumbled (shuffled) to obtain the transformed database. This encrypted database is sent to the mining server. On receiving a query, the server mines the encrypted data and returns the result. The EN/DN module, on receiving it, decrypts it and recovers the true patterns.

Architecture for user access
The central database is divided into smaller partitions using hybrid partitioning. The partitioned items are first encrypted using the substitution method. The cipher texts are grouped based on the number of records present. The indexes of these partitions are shuffled in a random order.

Figure-2. Authenticated user access.

The order of the shuffling is retained by the index server. When an authenticated user tries to access the database, the intelligent module generates a key based on the user's request and sends it to the user's mobile. This code is the user's key to access the database. It consists of pointers to information such as the access rights of the user, the index entry related to the query raised, the partition identifier and so on. Aided by this code, the intelligent agent retrieves the result and returns it to the user.

A. Partitioning
The central database C is divided into smaller partitions PDB1, PDB2, …, PDBn. This can be done by horizontal, vertical or hybrid partitioning. Hybrid partitioning is done by first partitioning the database horizontally and then dividing each partition vertically.

B. Encryption
This paper introduces a new encryption scheme, i-privacy, consisting of four main steps. Under this scheme the central database (CDB) C is transformed, when encrypted, into C*. The steps are as follows:
a) Plain text is transformed to cipher text using the substitution method.
b) When the database is encrypted for mining purposes, some of the plain texts are transformed to cipher text using the 1-1 substitution method.
c) When the database is stored in partitions for user access, all database items are encrypted using substitution.
d) Items are grouped based on the total number of records present.
e) Duplicate records are added to increase the noise in the database.
f) Indexed files are jumbled up in a random order.

The unwanted fields are masked by the intelligent agent before the encrypted database C* is sent to the data mining server. The most sensitive data, such as credit card credentials or other private information, is also removed before the database is sent for data mining. For user access, irrelevant data is masked based on the query posed by the user. For example, consider a sample crime dataset.

Table-5. Crime dataset.

Applying cipher substitution to the first two rows:

Plain text:   a b c d e f g h i j k l m n o p q r s t u v w x y z
Cipher text:  z e b r a y w x u p q t m j k l o d n c f i g s h t

Plain digit:  0 1 2 3 4 5 6 7 8 9
Cipher text:  c r i m e r a t s n

Table-6. Encrypted dataset.

C. Decryption
The mined result sent back from the server is decrypted by the intelligent decryption module. First the added noise is removed from the mined result. Then the masked data items are restored. The jumbled indexes are reordered with the aid of the file that stores the information about the cipher text, the jumbling and the noise table. Each cipher text item is converted back into plain text by reversing the substitution, to obtain the true patterns that were mined.

D. Grouping items
Items are grouped by considering the odd and even placed records. The number of items in each group is selected on the basis of the number of records present.
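Grouping by odd and even placement leaves some room for interpretation, so the following Python sketch shows only one possible reading. Every run of 2k consecutive cipher items is split into its odd-placed and even-placed members, giving two groups of k items, and a trailing group with fewer than k items is merged into the group before it. The function name `group_items`, the run-of-2k reading and the repetition of the merge until the last group is large enough are assumptions; the rule for choosing k from the number of records is not modelled and k is passed in directly.

```python
def group_items(items, k):
    """Group cipher items by odd/even placement into groups of about k items."""
    if k < 1:
        raise ValueError("k must be at least 1")
    groups = []
    # Walk over the items in runs of 2k and split each run by placement.
    for start in range(0, len(items), 2 * k):
        block = items[start:start + 2 * k]
        for part in (block[0::2], block[1::2]):  # odd-placed, then even-placed
            if part:
                groups.append(list(part))
    # Merge a short trailing group into its predecessor (repeated if needed).
    while len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    return groups


cipher_items = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"]
grouped = group_items(cipher_items, 2)
```

With nine items and k = 2, the lone ninth item cannot form a group of its own and ends up in the last full group.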
Let c1, c2, …, cn be the set of cipher items; the grouping method then forms the groups {c1, c3, …, ck}, {c2, c4, …, c2k} and so on. If the last group has fewer than k elements, it is merged with the previous group.

E. Constructing duplicate records
A noise table is formed specifying the noise N(c) corresponding to each cipher item c. The duplicate records are generated by first dropping all rows with noise value 0. The remaining rows are sorted in descending order of noise. Let c'1, c'2, …, c'm be the sorted order obtained. The duplicate records generated are then:

N(c'1) - N(c'2) instances of transaction {c'1}.
N(c'2) - N(c'3) instances of transaction {c'1, c'2}.
…
N(c'(m-1)) - N(c'm) instances of transaction {c'1, c'2, …, c'(m-1)}.
N(c'm) instances of transaction {c'1, c'2, …, c'm}.

These are added to the database and recorded in the file managed by the intelligent module, so that they can be retraced as and when needed.

F. Shuffling index and data masking
The index items of each partition are shuffled within a group in a random order. The ordering of the shuffling is stored by the index server. The intelligent module interprets the correct order and index when an authenticated user tries to access the database.

Data masking is done by the intelligent module. Based on the query given to the data mining server, as well as on the query raised by the user, data that is sensitive or irrelevant in the context of the query is masked or removed from the dataset. The masked data is stored in a file so that it can be restored when the mining server returns the result, or when a user with write access updates the data and sends it back.

Table-7. Masked response.

For example, consider the crime dataset. If a user, say a magistrate, is accessing the database, the fields relevant to that user will be the crime rates in the city. The fields single and poverty will not be of much importance.
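The query-dependent masking just described, including the side file that allows masked values to be restored later, might look like the following Python sketch. It is only an illustration under stated assumptions: the role-to-fields mapping, the field names other than single and poverty, the sample values and the mask token are all hypothetical, and an in-memory dictionary stands in for the file kept by the intelligent module.

```python
MASK = "***"

# Hypothetical mapping from a user role to the fields relevant to that role.
RELEVANT_FIELDS = {
    "magistrate": {"city", "crime_rate"},
}


def mask_record(record, role):
    """Return a masked copy of the record plus the hidden values.

    The hidden values stand in for the file used to restore masked data.
    """
    allowed = RELEVANT_FIELDS.get(role, set())
    masked, hidden = {}, {}
    for field, value in record.items():
        if field in allowed:
            masked[field] = value
        else:
            masked[field] = MASK
            hidden[field] = value
    return masked, hidden


def restore_record(masked, hidden):
    """Put the stored values back in place of the mask token."""
    return {**masked, **hidden}


# Hypothetical row; the actual crime dataset values are not reproduced here.
row = {"city": "A", "crime_rate": 12.5, "single": 30.1, "poverty": 18.4}
masked_row, stored = mask_record(row, "magistrate")
restored_row = restore_record(masked_row, stored)
```

A role that is not listed sees every field masked, which is the conservative default in this sketch.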
The intelligent module masks this data and customizes the result according to the user trying to access the database.

L-diversity and P-sensitive
The proposed system has several advantages over the drawbacks of the existing systems.
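Returning to the duplicate-record construction of section E, the rule can be written out as a short Python sketch. It follows the steps given there (drop items with noise 0, sort by descending noise, emit nested transactions) and is meant as an illustration only; the function name `duplicate_records`, the dictionary form of the noise table and the sample noise values are assumptions.

```python
from collections import Counter


def duplicate_records(noise):
    """Build the fake transactions for a noise table {cipher item: N(c)}."""
    # Drop items with noise 0, then sort by descending noise.
    ranked = sorted(
        ((item, n) for item, n in noise.items() if n > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    items = [item for item, _ in ranked]
    counts = [n for _, n in ranked] + [0]  # sentinel so the last step emits N(c'm)
    records = []
    for i in range(len(items)):
        repeat = counts[i] - counts[i + 1]
        # Transaction {c'1, ..., c'(i+1)}, repeated N(c'(i+1)) - N(c'(i+2)) times.
        records.extend(list(items[: i + 1]) for _ in range(repeat))
    return records


# Hypothetical noise table over four cipher items.
noise_table = {"b": 3, "d": 1, "u": 0, "m": 2}
fake = duplicate_records(noise_table)
added = Counter(item for record in fake for item in record)
```

Because the transactions are nested, each item c ends up with exactly N(c) added occurrences, which is what allows the noise to be subtracted again during decryption.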