AN EFFECTIVE APPROACH IN LARGE DATASETS- SINGLE INSTANCE STORAGE
- Department of Computer Science and Engineering, Rajagiri School of Engineering and Technology, Kochi, India.
- Abstract
The Single Instance Storage (SIS) framework combines data from multiple sources into one comprehensive, easily manipulated database. Its primary aim is to provide a business with analytics results from data mining, and it is designed as an architecture that makes social data accessible and useful to users. Deduplication is the process of finding duplicate or redundant records when comparing one or more databases or data sets. Such information is costly to acquire, which is why the SIS process is receiving more attention. Removing redundant records from a single database is a difficult step in data cleaning, and an important one, because duplicate data can greatly influence the outcome of large-scale data processing or data mining. As databases grow day by day, the complexity of the matching process is becoming one of the major challenges for the SIS framework. The basic steps in implementing SIS are blocking, selection, and classification. Blocking uses semantic similarity. Selection consists of sample selection and redundancy removal. After feature selection using Principal Component Analysis (PCA), the intermediate subsets are passed to the classifier. Classification is used to efficiently identify the most ambiguous data in the training set.
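To illustrate why blocking reduces the matching cost, the following is a minimal Python sketch of a block-then-compare deduplication pass. It is not the authors' implementation. The abstract describes semantic-similarity blocking, PCA-based feature selection, and a trained classifier; this sketch swaps in a first-letter blocking key and a fixed string-similarity threshold as simple stand-ins. The record fields (`name`, `city`), the function names, and the `0.85` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first letter of the normalized name.

    Stand-in for the semantic-similarity blocking named in the abstract.
    """
    return record["name"].strip().lower()[:1]


def similarity(a, b):
    """String similarity in [0, 1] over the lowercased name and city fields."""
    text_a = f"{a['name']} {a['city']}".lower()
    text_b = f"{b['name']} {b['city']}".lower()
    return SequenceMatcher(None, text_a, text_b).ratio()


def deduplicate(records, threshold=0.85):
    """Return one representative record per cluster of likely duplicates."""
    # Blocking: only records that share a key are ever compared,
    # which avoids comparing every record against every other record.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    # Union-find over record indices merges matched pairs into clusters.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            # Fixed threshold: a stand-in for the trained classifier.
            if similarity(records[i], records[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # The smaller index becomes the root, so each
                    # cluster is represented by its earliest record.
                    parent[max(ri, rj)] = min(ri, rj)

    # Single instance storage: keep only the root record of each cluster.
    return [records[i] for i in range(len(records)) if find(i) == i]
```

Calling `deduplicate(records)` on a list of dictionaries with `name` and `city` keys returns the list with likely duplicates collapsed to their first occurrence. A real system would replace the threshold test with a classifier trained on labeled pairs, as the abstract describes.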
[Catherine Mathew and Varghese S Chooralil (2016); AN EFFECTIVE APPROACH IN LARGE DATASETS- SINGLE INSTANCE STORAGE Int. J. of Adv. Res. 4 (Aug). 120-125] (ISSN 2320-5407). www.journalijar.com