The Handbook of Statistical Analysis & Data Mining Applications


Robert Nisbet, John Elder, IV, and Gary Miner

ISBN: 9780123747655. Pub date: May 29, 2009. Price: $89.95 / €57.95 / £45.99. Page count: c. 864

Excerpt Section: Chapter 2

Theoretical Considerations for Data Mining


In Chapter 1, we explored the historical background of statistical analysis and data mining. Statistical analysis is a relatively old discipline (particularly if you consider its origins in China). But data mining is a relatively new field, which developed during the 1990s and coalesced into a field of its own during the early years of the twenty-first century. It represents a confluence of several well-established fields of interest:

* Traditional statistical analysis

* Artificial intelligence

* Machine learning

* Development of large databases

Traditional statistical analysis follows the deductive method in the search for relationships in data sets. Artificial intelligence (e.g., expert systems) and machine learning techniques (e.g., neural nets and decision trees) follow the inductive method to find faint patterns of relationship in data sets. Deduction (or deductive reasoning) is the Aristotelian process of analyzing detailed data, calculating a number of metrics, and forming some conclusions based (or deduced) solely on the mathematics of those metrics.

Induction is the more Platonic process of using information in a data set as a "springboard" to make general conclusions, which are not wholly contained directly in the input data. The scientific method follows the inductive approach but has strong Aristotelian elements in the preliminary steps.


The scientific method is as follows:

  1. Define the problem.

  2. Gather existing information about a phenomenon.

  3. Form one or more hypotheses.

  4. Collect new experimental data.

  5. Analyze the information in the new data set.

  6. Interpret results.

  7. Synthesize conclusions, based on the old data, new data, and intuition.

  8. Form new hypotheses for further testing.

  9. Do it again (iteration).

Steps 1-5 involve deduction, and steps 6-9 involve induction. Even though the scientific method is based strongly on deductive reasoning, the final products arise through inductive reasoning. Data mining is a lot like that.

In fact, machine learning algorithms used in data mining are designed to mimic the process that occurs in the mind of the scientist. Data mining uses mathematics, but the results are not mathematically determined. This statement may sound somewhat contradictory until you view it in terms of the human brain.

You can describe many of the processes in the human conceptual pathway with various mathematical relationships, but the result of being human goes far beyond the mathematical descriptions of these processes. Women's intuition, a mother's wisdom regarding her offspring, and "gut" level feelings about who should win the next election are all intuitive models of reality created by the human brain. They are based largely on empirical data, but the mind extrapolates beyond the data to form the conclusions, following a purely inductive reasoning process.


Data mining can be defined in several ways, which differ primarily in their focus on different aspects of data mining. One of the earliest definitions is

The non-trivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley et al., 1991).

As data mining developed as a professional activity, it was necessary to distinguish it from the previous activity of statistical modeling and the broader activity of knowledge discovery.

For the purposes of this handbook, we will use the following working definitions:

* Statistical modeling: The use of parametric statistical algorithms to group or predict an outcome or event, based on predictor variables.

* Data mining: The use of machine learning algorithms to find faint patterns of relationship between data elements in large, noisy, and messy data sets, which can lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.).

* Knowledge discovery: The entire process of data access, data exploration, data preparation, modeling, model deployment, and model monitoring. This broad process includes data mining activities, as shown in Figure 2.1.

As the practice of data mining developed further, the focus of the definitions shifted to specific aspects of the information and its sources. In 1996, Fayyad et al. proposed the following:

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

The second definition focuses on the patterns in the data rather than on information in a generic sense. These patterns are faint and hard to distinguish, and they can be sensed only by analysis algorithms that can evaluate nonlinear relationships between predictor variables and their targets, and among the predictors themselves. This form of the definition of data mining developed along with the rise of machine learning tools for use in data mining. Tools like decision trees and neural nets make the analysis of nonlinear patterns in data much easier than is possible with parametric statistical algorithms. The reason is that machine learning algorithms learn the way humans do--by example, not by calculation of metrics based on averages and data distributions.
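Learning "by example" rather than by parametric calculation can be sketched with a toy decision tree built from scratch. Everything below is illustrative, not from the book: the tree recursively splits the examples on whichever predictor-threshold pair best separates the classes, which lets it recover an XOR-style pattern that no single linear threshold on one variable could capture.

```python
# Minimal decision-tree sketch (illustrative only): learn splits from examples.

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = labels.count(1) / n
    return 1.0 - p1 * p1 - (1.0 - p1) ** 2

def best_split(rows, labels):
    """Find the (feature, threshold) pair minimizing weighted impurity."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for t in sorted(set(r[f] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    if len(set(labels)) == 1 or depth == max_depth:
        return max(set(labels), key=labels.count)   # leaf: majority class
    split = best_split(rows, labels)
    if split is None:
        return max(set(labels), key=labels.count)
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(node, row):
    while isinstance(node, tuple):
        f, t, lo, hi = node
        node = lo if row[f] <= t else hi
    return node

# XOR-like pattern: class 1 when exactly one predictor is "high".
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]
tree = build_tree(rows, labels)
print([predict(tree, r) for r in rows])   # → [0, 1, 1, 0]
```

The first split alone leaves both branches mixed; only the second level of splits separates the classes, which is exactly the kind of nonlinear interaction that a single parametric threshold misses.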

The definition of data mining was confined originally to just the process of model building.

But as the practice matured, data mining tool packages (e.g., SPSS-Clementine) included other tools to facilitate the building of models and to evaluate and display them. Soon, the definition of data mining expanded to include the operations shown in Figure 2.1 (and some practitioners include model visualization as well).


The modern Knowledge Discovery in Databases (KDD) process combines the mathematics used to discover interesting patterns in data with the entire process of extracting data and applying the resulting models to other data sets to leverage the information for some purpose. This process blends business systems engineering, elegant statistical methods, and industrial-strength computing power to find structure (connections, patterns, associations, and basis functions) rather than statistical parameters (means, weights, thresholds, knots).

In Chapter 3, we will expand this rather linear organization of data mining processes to describe the iterative, closed-loop system with feedback loops that comprises the modern approach to the practice of data mining.


The evolutionary nature of the definition and focus of data mining occurred primarily as a matter of experience and necessity. A major problem with this development was the lack of a consistent body of theory that could encompass all aspects of what information is, where it comes from, and how it is used. This logical concept is sometimes called a model-theoretic. Model theory links logic with algebraic expressions of structure to describe a system or complex process with a body of terms having a consistent syntax and relationships between them (semantics). Most expressions of data mining activities include inconsistent terms (e.g., attribute and predictor), which may imply different logical semantic relations with the data elements employed. Mannila (2000) summarized a number of criteria that should be satisfied in an approach to develop a model-theoretic for data mining. These criteria include the ability to

* Model typical data mining tasks (clustering, rule discovery, classification)

* Describe data and the inductive generalizations derived from the data

* Express information from a variety of forms of data (relational data, sequences, text, Web)

* Support interactive and iterative processes

* Express comprehensible relationships

* Incorporate users in the process

* Incorporate multiple criteria for defining what is an "interesting" discovery

Mannila describes a number of approaches to developing an acceptable model-theoretic but concludes that none of them satisfies all the above criteria. The closest we can come is to combine the microeconomic approach with the inductive database approach.

Microeconomic Approach

The starting point of the microeconomic approach is that data mining is concerned with finding actionable patterns in data that have some utility in forming a decision aimed at getting something done (e.g., employing interdiction strategies to reduce attrition). The goal is to find the decision that maximizes the total utility across all customers.
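Maximizing total utility across all customers can be sketched in a few lines. The customers, candidate decisions, and utility values below are hypothetical, purely to show the selection rule: per customer, take the decision with the highest utility, then sum.

```python
# Hypothetical utility table: utility[customer][decision].
# "offer_discount" stands in for an interdiction strategy; values are invented.
utility = {
    "cust_a": {"offer_discount": 12.0, "do_nothing": 3.0},
    "cust_b": {"offer_discount": -2.0, "do_nothing": 0.0},
    "cust_c": {"offer_discount": 8.0,  "do_nothing": 5.0},
}

# Pick, for each customer, the decision that maximizes that customer's utility.
decisions = {c: max(opts, key=opts.get) for c, opts in utility.items()}

# Total utility across all customers under those decisions.
total = sum(utility[c][d] for c, d in decisions.items())
print(decisions)
print(total)   # 12.0 + 0.0 + 8.0 = 20.0
```

Note that the best action differs by customer (interdicting cust_b would *lose* utility), which is why the microeconomic view frames data mining as finding actionable, per-case patterns rather than one global rule.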

Inductive Database Approach

An inductive database includes all the data available in a given structure plus all the questions (queries) that could be asked about patterns in the data. Both stored and derived facts are handled in the same way. One of the most important functions of the human brain is to serve as a pattern recognition engine. Detailed data are submerged in the unconscious memory, and actions are driven primarily by the stored patterns.

Mannila suggests that the microeconomic approach can express most of the requirements for a model-theoretic based on stored facts, but the inductive database approach is much better suited to expressing derived facts. One attempt to implement this was taken in the development of the Predictive Model Markup Language (PMML), which is built on the standard Extensible Markup Language (XML). Most data mining packages available today store internal information (e.g., arrays) in XML format...
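The appeal of an XML-based model format is that any tool can write and read model structure as plain text. The fragment below is a simplified, PMML-flavored illustration built with Python's standard library; the element and attribute names are invented for the example and are not the actual PMML schema.

```python
import xml.etree.ElementTree as ET

# A toy, PMML-style description of a one-split decision tree (names illustrative).
doc = """
<Model type="DecisionTree">
  <Node field="balance" threshold="1000">
    <Leaf branch="le" outcome="stay"/>
    <Leaf branch="gt" outcome="churn"/>
  </Node>
</Model>
"""

# Any consumer can recover the model structure by parsing the markup.
root = ET.fromstring(doc)
node = root.find("Node")
print(root.get("type"), node.get("field"), node.get("threshold"))
```

Because the model travels as markup rather than as a package-specific binary, a tree built in one tool can, in principle, be scored or displayed by another, which is the interoperability PMML was created to provide.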
