The Data Mining Step

Most previous work on KDD focused primarily on the data mining step. However, the other steps are equally if not more important for the successful application of KDD in practice. We now focus on the data mining component, which has received by far the most attention in the literature.

Data mining involves fitting models to or determining patterns from observed data. The fitted models play the role of inferred knowledge. Deciding whether or not the models reflect useful knowledge is a part of the overall interactive KDD process for which subjective human judgment is usually required. A wide variety and number of data mining algorithms are described in the literature—from the fields of statistics, pattern  recognition, machine learning, and databases.

Thus, an overview discussion can often consist of long lists of seemingly unrelated, and highly specific algorithms. Here we take a somewhat reductionist viewpoint. Most data mining algorithms can be viewed as compositions of a few basic techniques and principles. In particular, data mining algorithms consist largely of some specific mix of three components:

The model. There are two relevant factors: the function of the model (e.g., classification and clustering) and the representational form of the model (e.g., a linear function of multiple variables and a Gaussian probability density function). A model contains parameters that are to be determined from the data.

The preference criterion. A basis for preference of one model or set of parameters over another, depending on the given data. The criterion is usually some form of goodness-of-fit function of the model to the data, perhaps tempered by a smoothing term to avoid overfitting, or generating a model with too many degrees of freedom to be constrained by the given data.

The search algorithm. The specification of an algorithm for finding particular models and parameters, given data, a model (or family of models), and a preference criterion.

A particular data mining algorithm is usually an instantiation of the model/preference/search components

(e.g., a classification model based on a decision tree representation, model preference based on data likelihood, determined by greedy search using a particular heuristic. Algorithms often differ largely in terms of the model representation (e.g., linear and hierarchical), and model preference or search methods are often similar across different algorithms. The literature on learning algorithms frequently does not state clearly the model representation, preference criterion, or search method used; these are often mixed up in a description of a particular algorithm. The reductionist view clarifies the independent contributions of each component.


Model Functions

The more common model functions in current data mining practice include:

• Classification: maps (or classifies) a data item into one of several predefined categorical classes.

• Regression: maps a data item to a real-value prediction variable.

• Clustering: maps a data item into one of several categorical classes (or clusters) in which the classes must be determined from the data—unlike classification in which the classes are predefined. Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models.

• Summarization: provides a compact description for a subset of data. A simple example would be the mean and standard deviations for all fields. More sophisticated functions involve summary rules, multivariate visualization techniques, and functional relationships between variables. Summarization functions are often used in interactive exploratory data analysis and automated report generation.

• Dependency modeling: describes significant dependencies among variables. Dependency models exist at two levels: structured and quantitative. The structural level of the model specifies (often in graphical form) which variables are locally dependent; the quantitative level specifies the strengths of the dependencies using some numerical scale.

• Link analysis: determines relations between fields in the database (e.g., association rules to describe which items are commonly purchased with other items in grocery stores). The focus is on deriving multifield correlations satisfying support and confidence thresholds.

• Sequence analysis: models sequential patterns (e.g., in data with time dependence, such as time-series analysis). The goal is to model the states of the process generating the sequence or to extract and report deviation and trends over time.


Model Representation

Popular model representations include decision trees and rules, linear models, nonlinear models (e.g., neural networks), example-based methods (e.g., nearest-neighbor and case-based reasoning methods), probabilistic graphical dependency models (e.g., Bayesian networks), and relational attribute models. Model representation determines both the flexibility of the model in representing the data and the interpretability of the model in human terms. Typically, the more complex models may fit the data better but may also be more difficult to understand and to fit reliably. While researchers tend to advocate complex models, practitioners involved in successful applications often use simpler models due to their robustness and interpretability.

Model preference criteria determine how well a particular model and its parameters meet the criteria of the KDD process. Typically, there is an explicit quantitative criterion embedded in the search algorithm (e.g., the maximum likelihood criterion of finding the parameters that maximize the probability of the observed data). Also, an implicit criterion (reflecting the subjective bias of the analyst in terms of which models are initially chosen for consideration) is often used in the outer loops of the KDD process.

Search algorithms are of two types: parameter search, given a model, and model search over model space. Finding the best parameters is often reduced to an optimization problem (e.g., finding the global maximum of a nonlinear function in parameter space).

Data mining algorithms tend to rely on relatively simple optimization techniques (e.g., gradient descent), although in principle more sophisticated optimization techniques are also used. Problems with local minima are common and dealt with in the usual manner (e.g., multiple random restarts and searching for multiple models). Search over model space is usually carried out in a greedy fashion.

An important point is that each technique typically suits some problems better than others. For example, decision-tree classifiers can be very useful for finding structure in high-dimensional spaces and are also useful in problems with mixed continuous and categorical data (since tree methods do not require distance metrics). However, classification trees with univariate threshold decision boundaries may not be suitable for problems where the true decision boundaries are nonlinear multivariate functions. Thus, there is no universally best data mining method; choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the applications effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data mining method.

The high-level goals of data mining tend to be predictive, descriptive, or a combination of predictive and descriptive. A purely predictive goal focuses on accuracy in predictive ability. A purely descriptive goal focuses on understanding the underlying data-generating process—a subtle but important distinction. In prediction, a user may not care whether the model reflects reality as long as it has predictive power (e.g., a model combining current financial indicators in some nonlinear manner to predict future dollar-to-deutsche-mark exchange rates). A descriptive model, on the other hand, is interpreted as a reflection of reality (e.g., a model relating economic and demographic variables to educational achievements used as the basis for social policy recommendations to cause change). In practice, most KDD applications demand some degree of both predictive and descriptive modeling.


Source: COMMUNICATIONS OF THE ACM November 1996/Vol. 39, No. 11