Thursday, October 3, 2019
Approaches to Data Cleaning
In general, data cleaning involves several phases.

Data analysis: A detailed analysis is required to determine which kinds of errors and inconsistencies need to be removed. Analysis programs should be used alongside manual inspection of the data to identify data quality problems and to extract metadata.

Definition of the transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity and the dirtiness of the data, a large number of data transformation and cleaning steps may have to be executed. In some cases a schema translation is required to map the sources to a common data model; for data warehouses, a relational model is typically used. Early data cleaning steps correct single-source instance problems and prepare the data for integration. Later steps deal with schema and data integration and with cleaning multi-source problems, e.g., duplicates. For data warehousing, the control and data flow of these cleaning steps should be specified in a workflow that defines the ETL process. The schema-related data transformations and the cleaning steps should be specified in a declarative query and mapping language as far as possible, so that the transformation code can be generated automatically. It should also be possible to invoke user-written code and special-purpose tools during a transformation and cleaning workflow. User feedback is required for data instances for which there is no built-in cleaning logic.

Verification: The correctness and effectiveness of a transformation workflow and its transformation definitions should be tested and evaluated on a sample of the data, so that the definitions can be improved where necessary. Several iterations of the analysis, design and verification phases may be needed, because some errors only become apparent after certain transformations have been applied.
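As a rough illustration of the workflow and verification phases above, the following Python sketch applies an ordered list of cleaning steps to a small sample and reports the records that still fail a check. The field names, the two steps and the validity check are all hypothetical and chosen only to keep the example short; a real ETL tool would express the same control flow in its own workflow language.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[Record], Record]


def trim_whitespace(record: Record) -> Record:
    """Strip leading and trailing whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


def uppercase_state(record: Record) -> Record:
    """Convert the (hypothetical) 'state' field to upper case."""
    cleaned = dict(record)
    cleaned["state"] = cleaned["state"].upper()
    return cleaned


def run_workflow(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply the cleaning steps to every record in the declared order."""
    result = []
    for record in records:
        for step in steps:
            record = step(record)
        result.append(record)
    return result


def verify_on_sample(
    sample: List[Record], steps: List[Step], is_valid: Callable[[Record], bool]
) -> List[Record]:
    """Run the workflow on a sample and return the records that still fail."""
    return [r for r in run_workflow(sample, steps) if not is_valid(r)]


sample = [
    {"name": " Ann Lee ", "state": "ca"},
    {"name": "Bob Ray", "state": "Texas"},
]
steps = [trim_whitespace, uppercase_state]

# Assumed check: a state must be a two-letter code after cleaning.
failures = verify_on_sample(sample, steps, lambda r: len(r["state"]) == 2)
print(failures)
```

The second record survives both steps with the value "TEXAS", which suggests the workflow needs a further step that maps state names to codes. Adding that step and running the sample again is the kind of iteration the verification phase calls for.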
Transformation: The transformation steps are executed either by running the ETL workflow that loads and refreshes a data warehouse, or while answering queries over heterogeneous sources.

Backflow of cleaned data: Once single-source problems have been resolved, the cleaned data should also replace the dirty data in the original sources, so that legacy applications receive the cleaned data too and the transformation work does not have to be repeated for future data extractions. For data warehousing, the cleaned data is available from the data staging area.

The transformation process requires a large amount of metadata, such as schemas, instance-level data characteristics, transformation mappings and workflow definitions. For reliability, traceability and reusability, this metadata should be kept in a DBMS-based repository.
To support data quality, detailed information about the transformation process should be recorded both in the repository and in the transformed instances, in particular information about the completeness and freshness of the source data and lineage information about the origin of transformed objects and the transformations applied to them. For example, the derived table Customers holds the columns C_ID and C_no, which allow each record to be traced back to its source records.

The following sections describe possible approaches to data analysis, transformation definition and conflict resolution in more detail.

DATA ANALYSIS

Metadata reflected in schemas is usually insufficient to assess the data quality of a source, especially if only a few integrity constraints are enforced. It is therefore necessary to analyze the actual instances to obtain real metadata on data characteristics or unusual value patterns. This metadata helps to find data quality problems. Moreover, it can contribute effectively to identifying attribute correspondences between source schemas (schema matching), from which automatic data transformations can be derived.

There are two related approaches to data analysis: data profiling and data mining. Data mining helps to discover specific data patterns in large data sets, e.g., relationships between several attributes.
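One simple way to examine a suspected relationship between attributes is to measure how often it holds. The sketch below computes the confidence of a candidate rule, taken here as the share of records that satisfy it, and collects the violating records for inspection. It checks a rule that is already given rather than discovering one, which is the job of a mining tool. The rule total_price = total_quantity * price_per_unit, the sample orders, the function names and the half-cent tolerance are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]


def rule_confidence(
    records: List[Record], rule: Callable[[Record], bool]
) -> Tuple[float, List[Record]]:
    """Return the fraction of records satisfying the rule, plus the violators."""
    if not records:
        raise ValueError("need at least one record")
    violators = [r for r in records if not rule(r)]
    confidence = (len(records) - len(violators)) / len(records)
    return confidence, violators


def price_rule(r: Record) -> bool:
    """Candidate rule: total_price = total_quantity * price_per_unit."""
    expected = r["total_quantity"] * r["price_per_unit"]
    # Compare with a small tolerance to absorb floating-point rounding.
    return math.isclose(r["total_price"], expected, abs_tol=0.005)


orders = [
    {"total_quantity": 2, "price_per_unit": 4.50, "total_price": 9.00},
    {"total_quantity": 1, "price_per_unit": 3.25, "total_price": 3.25},
    {"total_quantity": 3, "price_per_unit": 1.10, "total_price": 3.30},
    {"total_quantity": 5, "price_per_unit": 2.00, "total_price": 12.00},
]

confidence, suspects = rule_confidence(orders, price_rule)
print(f"confidence = {confidence:.2f}, records to inspect = {len(suspects)}")
```

On this toy sample one of four orders breaks the rule. On real data, a rule that holds for nearly all records makes the few violators good candidates for closer examination.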
Descriptive data mining covers clustering, summarization, association discovery and sequence discovery. Integrity constraints among attributes, such as functional dependencies or user-defined business rules, can be derived and then used to fill in missing values, correct illegal values and identify duplicate records across data sources. For example, an association rule with high confidence can hint at data quality problems in the records that violate it: a confidence of 99% for the rule "total_price = total_quantity * price_per_unit" indicates that 1% of the records do not satisfy it and may require closer examination.

Data profiling focuses on the instance analysis of individual attributes. It derives information such as the data type, length, value range, discrete values and their frequency, variance, uniqueness, occurrence of null values and typical string patterns (e.g., for addresses), giving an exact view of various quality aspects of the attribute.

Table 3. Examples for the use of reengineered metadata to address data quality problems

Defining data transformations

The data transformation process typically consists of several steps, each of which may perform schema-related and instance-related transformations (mappings). To allow a data transformation and cleaning system to generate transformation code, and thus to reduce the amount of manual programming, the required transformations must be specified in an appropriate language, e.g., supported by a graphical user interface. Many ETL tools offer this functionality through proprietary rule languages. A more general and flexible approach is to use the standard query language SQL to perform the data transformations and to take advantage of application-specific language extensions, in particular the user-defined functions (UDFs) supported in SQL:99. UDFs can be implemented in SQL or in a general-purpose programming language with embedded SQL statements.
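The sketch below gives a rough idea of how a UDF can carry out an instance-level transformation inside a SQL query. It uses Python's sqlite3 module, whose create_function call registers an application-defined function; this stands in for a SQL:99 UDF and is not the same mechanism. The customer table, its columns and the two extraction functions are hypothetical, and the name parsing assumes only the layouts "First Last" and "Last, First".

```python
import sqlite3
from typing import Optional


def last_name_extract(name: Optional[str]) -> Optional[str]:
    """Return the last name from a 'First Last' or 'Last, First' string."""
    if name is None:
        return None
    if "," in name:
        return name.split(",", 1)[0].strip() or None
    parts = name.split()
    return parts[-1] if parts else None


def first_name_extract(name: Optional[str]) -> Optional[str]:
    """Return the first name(s) from a 'First Last' or 'Last, First' string."""
    if name is None:
        return None
    if "," in name:
        return name.split(",", 1)[1].strip() or None
    parts = name.split()
    return " ".join(parts[:-1]) or None


conn = sqlite3.connect(":memory:")
# Register the Python functions so that SQL statements can call them.
conn.create_function("last_name_extract", 1, last_name_extract)
conn.create_function("first_name_extract", 1, first_name_extract)

conn.execute("CREATE TABLE customer (cid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (cid, name) VALUES (?, ?)",
    [(1, "Kristen Smith"), (2, "Doe, John"), (3, None)],
)

# The query splits the free-form name column into two attributes.
rows = conn.execute(
    """
    SELECT cid,
           last_name_extract(name) AS lname,
           first_name_extract(name) AS fname
    FROM customer
    ORDER BY cid
    """
).fetchall()
print(rows)
conn.close()
```

Keeping the parsing logic in functions leaves the SQL statement declarative, while the functions themselves can be reused in other queries against the same connection.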
UDFs allow a wide variety of data transformations to be implemented and can be reused across different transformation and query processing tasks. In addition, their execution by the DBMS can reduce data access cost and thus improve performance. Finally, UDFs are part of the SQL:99 standard and should (eventually) be portable across many platforms and DBMSs.

For example, a transformation step can define a view on which further mappings are carried out. Such a step might perform a schema restructuring, with additional attributes in the view obtained by splitting the name and address attributes of the source. The required data extractions are performed by UDFs. The UDF implementations can contain cleaning logic, e.g., to remove misspellings in city names or to supply missing names.

UDFs may still require substantial implementation effort and do not support all necessary schema transformations. In particular, simple and frequently needed functions such as attribute splitting or merging are not supported generically but often have to be re-implemented in application-specific variations. More complex schema restructurings (e.g., folding and unfolding of attributes) are not supported at all.

Conflict Resolution: A set of transformation steps has to be specified and executed to resolve the various schema-level and instance-level data quality problems reflected in the data sources. Several types of transformations are performed on the individual data sources to deal with single-source problems and to prepare for integration with other sources. In addition to a possible schema translation, these preparatory steps typically include the following.

Extracting values from free-form attributes: Free-form attributes often hold several individual values that should be extracted to obtain a more precise representation and to support later steps such as instance matching and duplicate elimination. Typical examples are name and address fields.
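A minimal sketch of such an extraction for an address field is shown below. It assumes the field follows the layout "city, two-letter state, five-digit zip" (for example "Austin, TX 78701"); the pattern and the function name are illustrative, and real address data would need many more layouts or a dedicated parsing tool.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<city>, <two-letter state> <5-digit zip>".
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})\s*$"
)


def split_city_field(value: str) -> Optional[Dict[str, str]]:
    """Split a free-form city field into city, state and zip attributes.

    Returns None when the value does not follow the assumed layout, so the
    caller can route the record to manual review instead of guessing.
    """
    match = CITY_STATE_ZIP.match(value)
    if match is None:
        return None
    return {
        "city": match.group("city"),
        "state": match.group("state").upper(),
        "zip": match.group("zip"),
    }


print(split_city_field("  Austin ,tx  78701 "))
print(split_city_field("Paris"))
```

Returning None for values that do not fit keeps the step from inventing data; those records are the ones that would need user feedback or a further extraction rule.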
Essential transformations in this step are reordering the values within a field, to deal with word transpositions, and extracting values for attribute splitting.

Validation and correction: This step examines each source instance for data-entry errors and tries to correct them automatically as far as possible. Spell-checking based on dictionary lookup is useful for identifying and correcting misspellings. In addition, dictionaries of geographic names and zip codes help to correct address data. Attribute dependencies (total price and unit price/quantity, birth date and age, city and zip or area code, ...) can be used to detect errors and to fill in missing values or correct wrong ones.

Standardization: To facilitate instance matching and integration, attribute values should be converted to a consistent and uniform format. For example, date and time entries should be brought into a defined format, and names and other string values should be converted to either upper or lower case. Text data may be condensed and unified by stemming and by removing prefixes, suffixes and stop words. In addition, abbreviations and encoding schemes should be resolved consistently by consulting special synonym dictionaries or by applying predefined conversion rules.
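The standardization step can be sketched in a few lines of Python. The list of accepted date formats, the target format (ISO 8601) and the small abbreviation dictionary are assumptions made for the example; in practice they would come from profiling the actual sources.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats, tried in order. Slash-separated dates are read as
# month/day/year, which is a guess that must match the source's convention.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")

# Tiny illustrative synonym dictionary for street-type abbreviations.
ABBREVIATIONS = {
    "st": "street",
    "st.": "street",
    "ave": "avenue",
    "ave.": "avenue",
    "rd": "road",
    "rd.": "road",
}


def standardize_date(value: str) -> Optional[str]:
    """Convert a date string to YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_text(value: str) -> str:
    """Lower-case the text, collapse whitespace and expand abbreviations."""
    words = value.lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(standardize_date("03.10.2019"))
print(standardize_date("10/3/2019"))
print(standardize_text("12  Main St."))
```

Values that match none of the listed formats come back as None rather than being guessed, so they can be reviewed separately.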