Sunday, June 23, 2013

Data Mining: What You Might Not Know

The ultimate goal of data mining is not the acquisition of data, but the exploration and analysis of massive amounts of data to uncover patterns, rules, and relationships. One of the key outcomes is the identification of reliable and meaningful patterns.

Meaningful patterns can do the following:

* Model typical behaviors
* Identify atypical behaviors
* Express possible cause and effect relationships
* Explain past behaviors
* Describe current conditions
* Develop a predictive model for the future

While patterns drawn from different types of data can show meaningful relationships, time-series data mining can also be used to postulate the following:

* causality
* life-cycle behaviors
* impacts of proximal or distal relationships
* cluster formation and disaggregation

Characteristics of Data Mining Data

The data may represent changes in behavior and activities over time or, alternatively, the relationships among different types of data collected at a single point in time.

* Same type of data, collected at different points in time
* Different types of data, collected at the same point in time
* Data streams, which involve ordered sequences of items that arrive over time:
    * offline: regular chunked arrivals
    * online: continuous flow

Planning the Data Mining Process

A data mining project can follow a fairly clear process:

1.  Define the problem
2.  Determine the characteristics of the data
3.  Develop a plan for data mining
4.  Review similar data mining projects and algorithms
5.  Become familiar with the data, data issues, and potentially meaningful data subsets
6.  Prepare and condition the data
7.  Develop the model and algorithm
8.  Evaluate the model and compare it with other models
9.  Implement the results, which involves generating reports or continuing to develop ongoing activities

Data Mining: Key Tasks

Key tasks in data mining include the following:

* Identification
* Eliminate unnecessary or distracting frequent item sets
* Definition and differentiation
* Optimize storage and recovery of streams
* Classification
* Cluster recognition
* Segmentation
* Discovery of motifs (sub-sequences)
* Detection of similar clusters or sets
* Detection of outliers and anomalies
* Create predictive models

Data mining that involves continuous streams of data presents unique challenges because of the nature of the data and the types of patterns that are meaningful, given the array of patterns that could be developed. It is also difficult to integrate incoming data with existing databases in order to evaluate patterns qualitatively in a timely way, and to avoid the "concept drift" problem, in which the usefulness and validity of the results degrade over time.
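One simple way to watch for concept drift is to compare how often a model was wrong early on with how often it has been wrong lately. The sketch below is only an illustration under stated assumptions: the function name `drift_detected`, the window size, and the tolerance are all hypothetical choices, not something taken from the literature cited here.

```python
def drift_detected(errors, window=4, tolerance=0.25):
    """Flag possible concept drift in a sequence of prediction outcomes.

    errors: list of 0/1 flags in arrival order, where 1 means the
            model's prediction for that item was wrong.
    window and tolerance are illustrative assumptions.
    """
    if len(errors) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = sum(errors[:window]) / window   # error rate at the start
    recent = sum(errors[-window:]) / window    # error rate most recently
    return recent - baseline > tolerance


stable = [0, 0, 1, 0, 0, 1, 0, 0]     # error rate stays at 0.25
drifting = [0, 0, 1, 0, 1, 1, 0, 1]   # error rate rises from 0.25 to 0.75

print(drift_detected(stable))    # False
print(drift_detected(drifting))  # True
```

A real system would likely use a statistical test instead of a fixed tolerance, but the idea is the same: results that were valid for older data need to be rechecked against newer data.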

In general, data streams come from the following kinds of sources:

* Sensors and surveillance: networks, physical locations, manufacturing, transportation
* Performance monitoring: manufacturing, networks, controls
* Transaction / activity monitoring: retail, web performance, manufacturing


A literature review of algorithms suggests that data mining for data streams is generally performed using three major classes of algorithms, and that they do not yield the same results. The differences could be quite significant, depending on the application.

Landmark Window Based Data Mining

What is analyzed is the data that arrived between a specific time stamp (the landmark) and the present.

Pros: Complete comparison with an a priori property
Cons: The order in which information is considered and placed into sets can lead to errors
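As a minimal sketch of the landmark idea, the example below counts how often each item appears from a chosen landmark time stamp up to the most recent arrival. The function name `landmark_counts` and the `(timestamp, item)` tuple layout are assumptions made for illustration.

```python
from collections import Counter


def landmark_counts(stream, landmark):
    """Count items whose time stamp is at or after the landmark.

    stream: iterable of (timestamp, item) pairs in arrival order
            (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for timestamp, item in stream:
        if timestamp >= landmark:
            counts[item] += 1
    return counts


events = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a")]
result = landmark_counts(events, landmark=3)
print(dict(result))  # {'a': 2, 'c': 1}
```

Everything after the landmark is weighted equally, and nothing is ever discarded, so the counts keep growing for as long as the stream runs.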

Damped Window

This approach privileges new data over old or historical data, which means that older data drops out of consideration when developing sets.

Pros: Efficient use of resources, eliminates old and obsolete information
Cons: Large errors may be made because the information being eliminated may be important for the  rule to be effective
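One common way to privilege new data is to multiply every existing weight by a decay factor each time a new item arrives. The sketch below assumes that approach. The name `damped_counts` and the decay value of 0.5 are illustrative choices only.

```python
def damped_counts(stream, decay=0.5):
    """Keep a decayed count per item so that recent arrivals dominate.

    decay: factor between 0 and 1 applied to all existing weights on
           every arrival (an assumed parameter for this sketch).
    """
    weights = {}
    for item in stream:
        for key in weights:
            weights[key] *= decay          # age everything seen so far
        weights[item] = weights.get(item, 0.0) + 1.0
    return weights


weights = damped_counts(["a", "a", "b", "b"], decay=0.5)
print(weights)  # {'a': 0.375, 'b': 1.5}
```

Both items appeared twice, yet "b" ends up with four times the weight of "a" simply because it arrived later. That is the efficiency and the risk of this approach in one small example.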

Sliding Window

The sliding window approach favors new data (as the damped window approach does), but does not completely eliminate old data. Instead, it incorporates summarized versions of old data and data relations.

Pros:  Can incorporate past data and do so relatively quickly
Cons:  The assumptions made to create summaries of old data sets can be flawed
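The sketch below follows the description above: the most recent items are kept in full, and items that slide out of the window are folded into a summary instead of being thrown away. The class name `SummarizingWindow` and the choice of a simple count as the summary are assumptions. Many formulations of the sliding window simply discard items once they leave the window.

```python
from collections import Counter, deque


class SummarizingWindow:
    """Keep the last `size` items in detail and a count summary of the rest."""

    def __init__(self, size):
        self.size = size
        self.recent = deque()      # detailed view of the newest items
        self.summary = Counter()   # condensed view of everything older

    def add(self, item):
        self.recent.append(item)
        if len(self.recent) > self.size:
            evicted = self.recent.popleft()
            self.summary[evicted] += 1  # only the count survives


window = SummarizingWindow(size=3)
for item in ["a", "b", "a", "c", "c"]:
    window.add(item)

print(list(window.recent))   # ['a', 'c', 'c']
print(dict(window.summary))  # {'a': 1, 'b': 1}
```

The summary keeps only counts, so the order and timing of the older items are lost. That is the kind of assumption that can turn out to be flawed.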

General Observations and Conclusions

At this point in time, the ability to collect data continues to expand, sometimes dramatically, thanks to technological advances in both hardware and software. However, a review of the processes and the literature makes it clear that the algorithms used to process and make meaning of data batches and streams differ widely. Consequently, the results and conclusions reached using data mining techniques (in both collecting and analyzing data) can be highly variable. Thus, decisions based on data mining need to be made carefully, and more than one analytical technique and set of algorithms should be used.


Esling, P., & Agon, C. (2012). Time-Series Data Mining. ACM Computing Surveys, 45(1), 12:1-12:34.

Mala, A. A., & Dhanaseelan, F. (2011). Data Stream Mining Algorithms: A Review of Issues and  Existing Approaches. International Journal On Computer Science & Engineering, 3(7), 2726-2732.

Ramageri, B. M., & Desai, B. L. (2013). Role of data mining in retail sector. International  Journal On Computer Science & Engineering, 5(1), 47-50.
