Data Mining Concepts: Data Preprocessing
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data, largely because of their huge size. Such low-quality data leads to low-quality mining results, so it is important to preprocess the data before we use it.
While preprocessing the data, several tasks are performed. Let's look at them one by one.
Data Cleaning
The main goal of data cleaning is to fill in missing values, smooth out noisy data, and correct inconsistencies in the data.
Missing Values
In some cases it is worth filling in the missing data. However, if only a small fraction of the records (roughly 1%) have missing values, it is usually better to simply delete those records. At the other extreme, if an attribute is missing most of its values (around 90%), it is better to leave that attribute out of mining altogether than to try to fill it in.
Several methods are used to fill in the missing values in the database.
1. Filling in the missing values manually: Because of the enormous size of a typical database, there are likely to be many missing values, so this method consumes a large amount of time and is not very efficient. It is generally used only when the dataset is very small and only a few values are missing.
2. Filling in a global constant: Another option is to fill every missing value with a constant such as Unknown or $-\infty$. The problem with this approach is that the mining program may mistakenly think these records form an interesting concept, since they all share a value in common.
3. Using the mean or median: We can use the mean for normally distributed data and the median for skewed data. This is a simple yet effective way to fill in missing values.
4. Using a different mean or median for samples belonging to different classes: This method improves on the previous one. For example, if the data has an income-range attribute, the missing values within each income range can be filled with the mean or median of that income range, as in the sketch below.
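Here is a minimal sketch of methods 3 and 4 using pandas. The income_range and credit_score columns are made up for illustration; any DataFrame with a class attribute and a numeric attribute works the same way.

```python
import pandas as pd

# Hypothetical customer table; column names are illustrative only.
df = pd.DataFrame({
    "income_range": ["low", "low", "high", "high", "low", "high"],
    "credit_score": [620, None, 780, 800, 610, None],
})

# Method 3: fill every missing value with one global statistic.
global_median = df["credit_score"].median()
filled_globally = df["credit_score"].fillna(global_median)

# Method 4: fill missing values with the median of the matching class
# (here, the customer's income range).
filled_per_class = (df.groupby("income_range")["credit_score"]
                      .transform(lambda s: s.fillna(s.median())))
```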
Noisy Data
Noise is a random error or variance in a measured variable. Several techniques can be used to smooth noisy data. Let's look at one of them.
Binning: Binning smooths sorted data by consulting the values around each one. The data is partitioned into several bins. In smoothing by bin means, each value in a bin is replaced by the mean of the values in that bin. In smoothing by bin medians, each value is replaced by the bin median.
For example, consider the data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partitioning into bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
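The same smoothing can be done in a few lines of NumPy. This sketch assumes equal-depth bins of size three, matching the example above.

```python
import numpy as np

# Sorted data from the example above.
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Partition into three equal-depth bins.
bins = np.split(data, 3)

# Smoothing by bin means: replace each value with its bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```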
Data Reduction
Data reduction strategies produce a reduced representation of the data that is much smaller in size yet maintains the integrity of the original data. After applying them, mining should be more efficient while producing almost the same results.
Let's look at some of the data reduction strategies.
1. Dimensionality reduction: It is the process of reducing the number of random variables or attributes under consideration. This method transforms the data onto a smaller space.
2. Numerosity reduction: This reduces the original data volume by replacing it with smaller forms of data representation. Examples include clustering, sampling, and data cube aggregation; a small sampling sketch follows this list.
3. Compression: Here the data is "compressed" to obtain a reduced representation of the original data. Compression can be lossy or lossless. In lossy compression, only an approximation of the original data can be recovered from the compressed data. In contrast, in lossless compression, the original data can be reconstructed exactly from the compressed data.
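As a quick illustration of numerosity reduction, here is a minimal sketch of simple random sampling with pandas; the transaction table and its column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical large transaction table, used only for illustration.
rng = np.random.default_rng(42)
transactions = pd.DataFrame({"amount": rng.normal(100, 20, size=1_000_000)})

# Simple random sampling without replacement: keep 1% of the rows.
sample = transactions.sample(frac=0.01, random_state=42)

# The sample is far smaller but preserves the overall distribution closely.
print(transactions["amount"].mean(), sample["amount"].mean())
```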
Data Integration
Data integration is a technique that combines data from multiple sources, and it is usually required before mining. There are two major approaches to data integration.
Tight Coupling
In tight coupling, the data is pulled from different data sources and stored in a single physical location through the process of ETL - Extraction, Transformation and Loading.
This approach provides a uniform interface for querying the data.
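Here is a rough ETL-style sketch in Python, assuming two hypothetical CSV sources and a SQLite file standing in for the central store; the file names, columns, and conversion rate are all made up for illustration.

```python
import sqlite3
import pandas as pd

# Extract: pull data from two hypothetical source files (names are made up).
orders_eu = pd.read_csv("orders_eu.csv")   # assumed columns: order_id, total_eur
orders_us = pd.read_csv("orders_us.csv")   # assumed columns: order_id, total_usd

# Transform: bring both sources to a common schema and currency.
orders_eu = orders_eu.rename(columns={"total_eur": "total"})
orders_eu["total"] = orders_eu["total"] * 1.1   # illustrative EUR-to-USD rate
orders_us = orders_us.rename(columns={"total_usd": "total"})
combined = pd.concat([orders_eu, orders_us], ignore_index=True)

# Load: store the unified table in a single warehouse database.
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("orders", conn, if_exists="replace", index=False)
```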
Loose Coupling
In loose coupling, the data remains in its source databases. An interface transforms each query into a form the source databases can understand, sends it directly to those databases, and gathers the results.
Data Transformation
In data transformation, the data is transformed into forms that are appropriate for mining. Several techniques are used for transforming data. Let's take a look at some of them.
1. Attribute construction: New attributes are constructed from the given set of attributes and added to help the mining process.
2. Aggregation: Here, aggregation operations are applied to the data. For example, the daily sales data can be aggregated to compute monthly sales and annual sales amount.
3. Normalization: This is a commonly used preprocessing technique. The data may be spread from very small to very large values; it is transformed to fit into a smaller range such as $(-1, 1)$ or $(0, 1)$, which makes the data easier to work with.
4. Discretization: In this technique, raw numeric values are replaced by interval labels or conceptual labels. For example, the age attribute can be replaced by intervals like $20-30$ or labels like youth/adult/senior. A small sketch of normalization and discretization follows this list.
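Below is a small sketch of min-max normalization and discretization with pandas; the ages, incomes, and interval boundaries are invented for illustration.

```python
import pandas as pd

# Hypothetical customer ages and incomes, for illustration only.
df = pd.DataFrame({"age": [18, 25, 37, 52, 66, 71],
                   "income": [20_000, 35_000, 58_000, 90_000, 40_000, 30_000]})

# Normalization: min-max scaling maps income into the range (0, 1).
df["income_scaled"] = ((df["income"] - df["income"].min()) /
                       (df["income"].max() - df["income"].min()))

# Discretization: replace raw ages with conceptual labels.
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 20, 40, 60, 120],
                         labels=["youth", "young adult", "adult", "senior"])
print(df)
```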