White Paper: Data Mining with XpertRule® Miner
Organisations are increasingly storing
large amounts of data generated by their operating activities. Such
historical data has buried within it patterns relating to the effectiveness
of the various business processes. Data mining can discover such
patterns in data and is now considered a catalyst for enhancing
business processes through avoiding failure patterns and exploiting
success patterns.
The potential for discovering knowledge buried in data has created
the need for better management of corporate historic data. This
has led to the concept of data warehousing, whereby operational
data is maintained in a database dedicated to providing business
users with online data for business analysis. A data warehouse can
be a large corporate database, a departmental database (data mart)
or a local database on a single client PC. The quality of knowledge
that can be discovered from data is not dependent on the scale and
architecture of the data warehouse. Quality is dependent on having
the right data and the appropriate data mining tools and development
methodology.
The business benefits of data mining have created a scramble by
software suppliers to position their products as data mining tools.
Everything from simple query and reporting products to the most advanced
pattern discovery products has been put forward as "data mining"
tools. This has caused confusion among business users as to what
data mining actually means. There are three technologies for the
discovery of patterns in data:
- Query and reporting tools:
These allow the user to find answers (confirmation) to queries
(patterns) already being suspected. Such tools can be best described
as hypothesis driven data exploration tools, with the user volunteering
all the patterns to be investigated.
- OLAP tools: These are advanced
forms of query & reporting tools which allow large multi-dimensional
databases to be interrogated speedily and graphically. These tools
can be best described as visualisation driven data exploration
tools. The discovery process is still user driven. However, the
user is armed with a multi-dimensional view of the data to drill
down at will, thereby aiding the exploration/discovery process.
- Data Mining Tools: These
automate the process of discovering patterns/knowledge in data.
They enable business goal driven discovery. For example, instead
of the user asking for a report or a graph of sales per region
and product - hoping to detect a pattern - the user can instead
ask for patterns relating to high sales volumes (a business goal).
Discovering patterns from data (also known as Knowledge
Discovery in Databases) is a process that combines all of the above
technologies since it requires hypothesis, exploration and automatic
discovery. It follows that the above technologies are complementary.
In addition to supporting automatic pattern generation, XpertRule
Miner also supports the ability to query/report and to visualise/explore
the data in conjunction with the discovered patterns.
Important considerations when deploying Data Mining
Data mining is emerging as a mature technology which is being incorporated
into mainstream business applications. Data Mining has evolved beyond
the point where the algorithms are the main criteria for assessing
the technology. The important considerations when deploying data
mining in an organization are:
- The need for a data mining process (methodology) supported effectively
by the data mining environment.
- The need for an interactive knowledge discovery environment
in which the business knowledge of the user is combined with the
power of the discovery algorithms in order to derive business
knowledge (patterns) from data.
- The effective and active deployment of the data mining models
and patterns.
- Flexibility in addressing various computing architectures.
- Scalability and performance on large data volumes.
Graphical Support for a Data Mining Process
The effectiveness of data mining as a business intelligence tool
has been demonstrated with a large number of successful applications.
However, in order to give data mining a wider appeal it has become
apparent that a methodology or process is required to allow non
data mining specialists to achieve the same degree of success as
seasoned practitioners. Such a systematic and repeatable process
will allow data mining to be successfully deployed by many people
across organizations. There are a number of initiatives and projects
to develop such a process, two of which are partly funded by the
European Commission. XpertRule Software has been involved directly in
one of these (CRITIKAL) and
is a member of the Special Interest Group set up in conjunction
with the second (CRISP-DM). It is reassuring to see a common data
mining process (methodology) starting to emerge. There is broad
agreement on the main tasks within such a process: data
preparation, data exploration, pattern discovery, pattern validation
and pattern deployment.
XpertRule Miner provides a graphical environment for supporting
all the stages of the data mining process. The click, drag and drop
environment allows non programmers to carry out complex data preparation,
mining and deployment processes.
Data Sources
XpertRule Miner uses data drivers known as CAF servers to read/write
to data sources. The standard ODBC CAF server will support all ODBC
compliant data sources. The open architecture of the CAF drivers
allows the development of additional CAFs using the API of non ODBC
data sources. CAFs for client-server architectures are also available
- for example, the TCP/IP STUB CAF.
Data preparation & Transformation
It is now accepted by most data mining practitioners that between
50% and 80% of the total life cycle of a data mining project can
be taken up by the data preparation stage. The objectives of this
stage are to cleanse the data and to transform it into a format suitable
for the application of pattern discovery techniques.
XpertRule Miner allows non programmers to carry out complex data
transformations using an intuitive drag and drop graphical interface.
It can process data tables with millions of records. The data transformation
operations supported are:
- Data Aggregation: This is
used to summarise detailed data (e.g. aggregating 1 second data
into 5 minute averages) and also to transform time series data
into attribute/value (case) data suitable for tree induction and
cluster analysis.
- Data Table Manipulation:
This includes record filtering, random sampling, merging, joining
and sorting.
- Column Derivation: This
allows the user to define new data columns which are derived from
existing data columns. It is also used for data cleansing (processing
blanks & outliers) and the grouping or banding of field values.
XpertRule Miner supports a comprehensive VB like script for calculations
and string manipulations.
- Data Visualisation & Reporting: XpertRule
Miner can generate field statistics, frequency distribution graphs,
2D and 3D multi field graphs and time series graphs. These graphs
and reports allow the user to get a better understanding of the
raw data, design effective data cleansing and transformation strategies
and validate the transformed data. The prepared data can be browsed
and explored before the pattern discovery process is started.
This gives a better understanding of the data and enables the
user to better interpret the discovered patterns.
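The Data Aggregation operation described in the list above can be illustrated with a minimal Python sketch. It is not XpertRule Miner's own implementation; the function name aggregate_to_windows and the sample readings are assumptions made purely for illustration. It averages 1 second readings into 5 minute windows, as in the example given for that operation.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_windows(readings, window_seconds=300):
    """Average (timestamp_seconds, value) readings over fixed-size windows.

    Returns a list of (window_start_seconds, average_value), sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Integer division maps each timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Ten minutes of hypothetical 1 second sensor data.
readings = [(t, 10.0 if t < 300 else 20.0) for t in range(600)]
five_minute_averages = aggregate_to_windows(readings)
print(five_minute_averages)
```

Each output row is one summarised record, so 600 detailed readings collapse into two rows that are easier to feed into later pattern discovery steps.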
Pattern Discovery
In order to address industry wide data mining needs, XpertRule
Miner supports a basket of knowledge discovery techniques:

Tree Induction: This
is goal driven discovery and is the most widely used technique involving
the induction of patterns (trees) relating to a business event (goal),
such as mortgage arrears, customer attrition, energy consumption,
insurance claims, etc.

Interactive/incremental Data Mining:
This combines automatic tree induction and manual tree construction.
It enables the business user to develop tree patterns in collaboration
with the induction algorithm. At every node (branch) in the tree,
XpertRule Miner shows the importance of the various attributes at
that point. The user is given the opportunity to impart their background
business knowledge and influence the choice of attribute splits
while respecting the information evidence provided by Miner.
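One common way to rank the importance of attributes at a tree node is information gain. The paper does not state which measure XpertRule Miner uses, so the Python sketch below is only an assumed stand-in; the field names (income, region, arrears) and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of goal values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, goal):
    """Reduction in goal entropy obtained by splitting on one attribute."""
    base = entropy([r[goal] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[goal])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(records, attributes, goal):
    """Return (attribute, gain) pairs, most informative first."""
    scores = {a: information_gain(records, a, goal) for a in attributes}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records reaching one node of the tree.
records = [
    {"income": "low", "region": "north", "arrears": "yes"},
    {"income": "low", "region": "south", "arrears": "yes"},
    {"income": "high", "region": "north", "arrears": "no"},
    {"income": "high", "region": "south", "arrears": "no"},
]
ranking = rank_attributes(records, ["income", "region"], "arrears")
print(ranking)
```

In an interactive setting, a ranking like this would be shown to the user, who could accept the top attribute or override it with one that makes more business sense.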

Discovering Association Rules: This is the discovery
of associations between business events. For example, which items
are purchased together in a supermarket (basket analysis), which
product options are taken up together, which faults occur together,
etc. XpertRule Miner supports the discovery of association rules
and frequent item sets from transaction data of items or events.
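The idea of frequent item sets and association rules can be sketched in a few lines of Python. This is a simplified illustration limited to pairs of items, not the algorithm used by XpertRule Miner; the transactions, the function names and the 50% support threshold are all assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs meeting the support threshold."""
    pair_counts = Counter()
    for basket in transactions:
        # Sorting gives each pair a single canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


def rule_confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also hold the consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Hypothetical supermarket baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "tea"},
]
pairs = frequent_pairs(transactions, min_support=0.5)
confidence = rule_confidence(transactions, "butter", "bread")
print(pairs, confidence)
```

Support measures how often the items occur together across all transactions, while confidence measures how reliably one item implies the other.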
Discovering Clusters in data:
This is the discovery of natural clusters or segmentation in data.
An example would be segmenting a mortgage portfolio. XpertRule Miner
generates clusters in 'case' (attribute based) data by discovering
sets of attribute values that are frequently associated with each
other.

Pattern Exploration and Validation
Data visualisation and exploration play an important role throughout
the data mining process. During the tree induction process, XpertRule
Miner allows user defined reports and data graphs to be updated dynamically
as the user is exploring the various nodes and leaves (profiles) of
the discovered tree. In addition to giving the user a method of validating
the accuracy and meaning of tree patterns, the pattern exploration
process helps the user obtain a better understanding of the patterns
being discovered and their implications. XpertRule Miner supports
a number of tree exploration reports: field statistics, frequency
distribution, field propensity/value across profiles and "gain or
lift" graphs.
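The figures behind a "gain or lift" graph can be computed from the tree profiles alone. The Python sketch below is a generic illustration rather than XpertRule Miner's report logic; the profile names and counts are invented, and it assumes every profile holds at least one record and that at least one positive case exists.

```python
def gain_table(profiles):
    """Build cumulative gain and lift rows from tree leaf profiles.

    profiles: list of (name, record_count, positive_count).
    Returns rows of (name, share_of_records, share_of_positives, lift),
    ordered from the highest to the lowest propensity profile.
    """
    total_records = sum(count for _, count, _ in profiles)
    total_positives = sum(positives for _, _, positives in profiles)
    ordered = sorted(profiles, key=lambda p: p[2] / p[1], reverse=True)

    rows, seen_records, seen_positives = [], 0, 0
    for name, count, positives in ordered:
        seen_records += count
        seen_positives += positives
        share_of_records = seen_records / total_records
        share_of_positives = seen_positives / total_positives
        # Lift compares the captured positives with a random selection.
        lift = share_of_positives / share_of_records
        rows.append((name, share_of_records, share_of_positives, lift))
    return rows


# Hypothetical profiles: (name, records in leaf, records matching the goal).
profiles = [("B", 300, 30), ("A", 100, 50), ("C", 600, 20)]
rows = gain_table(profiles)
for row in rows:
    print(row)
```

Plotting share_of_positives against share_of_records gives the gain curve; a lift well above 1 for the first rows indicates that the tree concentrates the goal cases into a small part of the data.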
Pattern deployment
Patterns discovered using data mining can be deployed in a number
of ways to address the relevant business requirements. XpertRule
Miner supports a number of deployment strategies:
- Reporting and Dissemination:
Graphical tree patterns can be generated in Windows Meta File
format which allows them to be easily embedded in other Windows
applications such as Word, Excel and PowerPoint.
- Data Filtering: XpertRule
Miner can generate the discovered patterns as C code, SQL or SAS
procedures. This allows the user to select, for further processing,
data records matching the discovered patterns.
- Decision Support: The tree
patterns discovered in XpertRule Miner can be used as part of
an online decision support system. This can be achieved by generating
the tree patterns as C code or by embedding the tree mining client
ProfilerX (shown in the illustration above) as an ActiveX
component.
- Active Deployment: This is
where a small number of data and business specialists in an organization
can create a specific data mining business scenario (vertical
application) to be deployed to a large number of data mining users
inside or outside the organization. This is achieved using the
tree mining client ProfilerX as an embedded ActiveX component.
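To illustrate the Data Filtering strategy listed above, the Python sketch below turns a small tree pattern into an SQL query that selects the matching records. The nested-dictionary tree format, the table name mortgages and the column names are hypothetical; the actual output format of XpertRule Miner is not described in this paper.

```python
def leaf_conditions(node, path=()):
    """Yield (conditions, outcome) for every leaf of a nested-dict tree."""
    if "outcome" in node:
        yield list(path), node["outcome"]
        return
    for condition, child in node["branches"].items():
        yield from leaf_conditions(child, path + (condition,))


def tree_to_sql(tree, table, outcome):
    """Build a SELECT that returns records falling into leaves with `outcome`.

    Conditions and identifiers are interpolated as-is, so they must come
    from a trusted source (here, the tree itself).
    """
    clauses = [
        "(" + " AND ".join(conditions) + ")"
        for conditions, leaf_outcome in leaf_conditions(tree)
        if leaf_outcome == outcome and conditions
    ]
    # With no matching leaf, fall back to a filter that selects nothing.
    where = " OR ".join(clauses) if clauses else "1 = 0"
    return f"SELECT * FROM {table} WHERE {where}"


# Hypothetical tree pattern for the business goal "mortgage arrears".
tree = {
    "branches": {
        "income < 20000": {
            "branches": {
                "loan_to_value >= 0.9": {"outcome": "arrears"},
                "loan_to_value < 0.9": {"outcome": "ok"},
            }
        },
        "income >= 20000": {"outcome": "ok"},
    }
}
sql = tree_to_sql(tree, "mortgages", "arrears")
print(sql)
```

Each path from the root to a matching leaf becomes one AND group, and the groups are combined with OR, so the query selects exactly the records covered by the chosen pattern.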
Connectivity, scalability and performance
The data mining tools available today fall into one of two distinct
architectures:
- Client based mining: These
data mining tools run on client machines and mine data stored
on the same client or data downloaded from a server to the client
for mining. These tools limit the size of data that can be mined,
typically in the order of tens of thousands of records (table
rows). These limits are imposed by client memory/processor speed
restrictions, as well as network bandwidth restrictions.
- Workstation (server) based mining:
These tools run on workstations with very thin display clients.
While high performance workstations and high bandwidth to the
server overcome the limitations of client based mining tools,
these tools have the disadvantages of high costs and the need
to make copies of the data on the server.
XpertRule Miner resolves all the problems associated with both
client and workstation based data mining by supporting multi-tier
client-server architecture. This is made possible by engineering
the data mining algorithms of Miner to be multi-tier, consisting
of Contingency And Frequency (CAF) servers which summarise
the data and ProfilerX clients which generate and display patterns
interactively. The advantages of this architecture are:
- Scalability: For stand alone
client based data mining, the database, CAF server and ProfilerX
client can all reside on the client PC. For small scale client-server
data mining, the database can reside on a server, while both the
CAF and ProfilerX client can reside on the client PC. For medium
scale client-server data mining, the database and CAF server
can reside on a server, with the ProfilerX client residing on
the client PC. Finally, for large scale client-server data
mining, the database can reside on a high performance data warehouse
server, the CAF server can reside on a middle-tier server and
the ProfilerX client resides on each client PC.
- Performance: The scalability
of the architecture ensures that performance can be optimised
regardless of the scale/architecture of data mining. This is achieved
through a number of innovative features:
- The multi tier architecture allows the CAF server with its
high bandwidth requirement to be placed at the point where it
has the maximum bandwidth to the database server, while the
ProfilerX client, with its low bandwidth requirements, can run
on client machines.
- The CAF server can exploit the high performance (parallelism)
of a database server by mining the data in-situ (i.e. without
moving the data) through the firing of SQL query streams at
the database. These intelligent queries will generate the required
contingency and frequency counts without needing to read all
the source data.
- The CAF server can cache data from the database server using
tokenised highly optimised data structures. This allows data
mining of millions of data records (gigabytes of data) in minutes
on standard specification Windows 95, 98 or NT machines (e.g.
333 MHz Pentium with 64MB RAM).
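The in-situ counting idea can be sketched with a single SQL GROUP BY query, which lets the database return contingency and frequency counts instead of the raw records. The Python example below uses the standard sqlite3 module as a stand-in database; the sales table, its columns and the function contingency_counts are assumptions, and the queries actually issued by a CAF server are not documented here.

```python
import sqlite3


def contingency_counts(connection, table, attribute, goal):
    """Return {(attribute_value, goal_value): count} computed by the database.

    Identifiers are interpolated directly into the query, so they must
    come from trusted code rather than from user input.
    """
    query = (
        f"SELECT {attribute}, {goal}, COUNT(*) "
        f"FROM {table} GROUP BY {attribute}, {goal}"
    )
    return {(a, g): n for a, g, n in connection.execute(query)}


# A small in-memory table standing in for the data warehouse.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, volume TEXT)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", "high"), ("north", "high"), ("north", "low"), ("south", "low")],
)
counts = contingency_counts(connection, "sales", "region", "volume")
print(counts)
```

Only one row per distinct combination of values crosses the network, which is why a component working this way needs far less bandwidth to its clients than to the database.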