Tutorial 1 : Data Mining for Intrusion Detection

Aleksandar Lazarevic, Jaideep Srivastava, Vipin Kumar
University of Minnesota
{aleks, srivasta, kumar}@cs.umn.edu

Introduction

Today computers control power, oil and gas delivery, communication systems, transportation networks, banking and financial services, and various other infrastructure services critical to the functioning of our society. However, as the cost of information processing and Internet access falls, more and more organizations are becoming vulnerable to a wide variety of cyber threats. According to a recent survey by CERT/CC (Computer Emergency Response Team/Coordination Center), the rate of cyber attacks has more than doubled every year in recent times. It has become increasingly important to make our information systems, especially those used for critical functions in the military and commercial sectors, resistant to and tolerant of such attacks.

Intrusion detection, as a special form of cyber threat analysis, involves identifying malicious actions that compromise the integrity, confidentiality, and availability of information resources. Traditional methods for intrusion detection are based on extensive knowledge of signatures of known attacks. The signature database has to be manually revised for each new type of intrusion that is discovered. A significant limitation of signature-based methods is that they cannot detect emerging cyber threats, since by their very nature these threats are launched using previously unknown attacks. These limitations have led to an increasing interest in intrusion detection techniques based on data mining.

The tremendous increase in novel cyber attacks has made data mining based intrusion detection techniques extremely useful. These techniques generally fall into one of two categories: misuse detection and anomaly detection. Both approaches attempt to detect cyber attacks that occur very infrequently but whose consequences may be quite dramatic. In misuse detection, each instance in a data set is labeled as 'normal' or 'attack/intrusion' and a learning algorithm is trained over the labeled data. However, standard data mining techniques are not directly applicable due to issues including (i) skewed class distribution (attacks/intrusions form a class of interest that is much smaller, i.e. rarer, than the class representing normal behavior) and (ii) learning from data streams (attacks/intrusions very often consist of a sequence of events). Anomaly detection, on the other hand, builds models of normal behavior and automatically detects new types of intrusions as deviations from normal usage.
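
As a rough illustration of the anomaly detection idea described above, the following minimal sketch builds a profile of normal behavior from attack-free observations and flags large deviations from it. The single feature (connections per minute), the sample values, and the z-score threshold of 3 are all assumptions chosen for illustration; they do not represent any particular system covered in the tutorial.

```python
from statistics import mean, stdev

def fit_normal_profile(values):
    """Summarize normal behavior by the mean and standard deviation of one feature."""
    return mean(values), stdev(values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value whose z-score against the normal profile exceeds the threshold."""
    mu, sigma = profile
    if sigma == 0:
        # Degenerate profile: any value different from the constant is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical feature: connections per minute during attack-free operation.
normal_traffic = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
profile = fit_normal_profile(normal_traffic)

# New observations are compared only against the model of normal behavior,
# so no labeled attack examples are needed.
new_observations = [14, 13, 95, 12]
flags = [is_anomalous(v, profile) for v in new_observations]
print(flags)
```

Because the model is built from normal data only, a previously unseen attack can still be flagged, at the price of false alarms whenever legitimate behavior drifts away from the profile.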

This tutorial will provide an up-to-date introduction to the increasingly important field of data mining in intrusion detection, as well as an overview of research directions in this field. It will cover the most representative research projects and directions in intrusion detection based on data mining. There is also an ongoing project at our center, funded by the Army, on data mining applications in network intrusion detection. We plan to cover some of these activities in the tutorial as well.

The tutorial will also discuss the applicability of the presented techniques to other similar areas, including credit card and insurance events, cardiac events, telecom circuit overloads, etc. The intended length of the tutorial is approximately 3 hours.

Participant Profile (Audience)

This tutorial will help participants to understand the key practical and research issues related to building a successful intrusion detection system. Several categories of people will benefit from this tutorial:
  • Researchers from the data mining and computer security communities interested in state-of-the-art intrusion detection techniques;
  • Officers from federal organizations interested in stopping cyber threats and information leaks;
  • Officers from military/agency organizations interested in stopping different forms of terrorism;
  • Practitioners from industry and financial organizations concerned with stopping fraud in their information systems.

Key Benefits

This tutorial will help the participants understand the following:
  • What are the benefits of applying data mining techniques to intrusion detection?
  • What are possible types and categories of cyber attacks?
  • What is the typical architecture and design of an intrusion detection system (IDS)?
  • What is the taxonomy of intrusion detection systems (IDSs)?
  • What are the various data mining techniques that can be applied?
  • What are potential difficulties and drawbacks in data mining based IDSs?
  • What are the practical implementations of data mining approaches to intrusion detection?

Tutorial Syllabus

1. Introduction

a. What is Information Assurance?
b. What is Intrusion Detection?
c. What are the general types/categories of cyber attacks?

2. Standard architecture and characteristics of IDSs

a. Architecture and design
b. Evaluation of IDSs
c. Measuring efficiency of IDSs
d. IDS Taxonomy

3. Characteristics of intrusion detection problem

a. Challenges in intrusion detection
b. Host-based vs. Network-based IDSs
c. Possible approaches in intrusion detection
d. Signature-based IDSs

4. Data Mining in Intrusion Detection

a. Motivation in applying data mining techniques
b. Data Mining in host-based intrusion detection
c. Data Mining in network intrusion detection

5. Data Preprocessing for data mining models in intrusion detection

a. Data Collection
b. Data Labeling
c. Feature construction

6. Data Mining Approaches - Misuse Detection

a. Advantages and drawbacks
b. Building classification models for rare classes
b.1. Rule based techniques (RIPPER, association rules, PN rule, ...)
b.2. Tree based approaches
b.3. Neural networks
b.4. Various classifiers (Bayes classifiers, genetic algorithms, LVQ, ...)
b.5. Multiple classifiers (combining classifiers)
c. Cost sensitive modeling
d. Learning from data streams

7. Data Mining Approaches - Anomaly and Outlier Detection

a. Introduction
b. Advantages and drawbacks
c. Supervised anomaly detection
d. Unsupervised anomaly detection

8. Centralized vs. Distributed Intrusion Detection Systems

a. Advantages and drawbacks of centralized and distributed IDSs
b. Agent-based approach for distributed intrusion detection

9. Benchmarking IDSs

a. Proper data collection and labeling
b. Detection rate vs. false alarm rate
c. Single connection attacks vs. multi-connection attacks
d. Operational cost of IDSs
e. Publicly available data sets

10. Applicability of presented data mining approaches to other application domains

11. Conclusion and Discussion

Aleksandar Lazarevic's Profile

Aleksandar Lazarevic is a Research Associate at the Army High Performance Computing Research Center, University of Minnesota. His research interests include data mining, parallel and distributed computing, and intrusion detection. He received B.Sc. and M.Sc. degrees in Computer Science and Engineering from the University of Belgrade, Yugoslavia, in 1994 and 1997 respectively. He received the PhD degree in Computer Science from Temple University in December 2001. During his doctoral studies he authored around 20 research articles. Since January 2002, he has been leading the project related to applications of data mining for network intrusion detection. He will be serving as a Co-Chair for the Workshop on Data Mining for Cyber Threat Analysis at the IEEE International Conference on Data Mining to be held in Japan in December 2002. He has also served as a Program Committee member for the same conference and will be serving as a Program Committee member at the next Pacific Asia Conference on Knowledge Discovery and Data Mining. He also serves as a Publicity Chair for the next SIAM International Conference on Data Mining to be held in May 2003. He is a member of SIAM and ACM.

Contact information:
Research Associate, Computer Science Department, University of Minnesota
200 Union Street SE, 4-192, EE/CSci Building,
University of Minnesota, Minneapolis, MN 55455
Phone: (612) 626-8096; Fax (612) 626-1596
E-mail: aleks@cs.umn.edu; Web Page: http://www.cs.umn.edu/~aleks

Jaideep Srivastava's Profile

Jaideep Srivastava received his B.Tech. from the Indian Institute of Technology, Kanpur, India, in 1983, and M.S. and Ph.D. from the University of California - Berkeley in 1985 and 1988, respectively. Since 1988 he has been on the faculty of the University of Minnesota, where he is a Professor. For over 15 years he has been active as a researcher, educator, and consultant in the areas of databases, data mining, and multimedia. He has established and led a database and multimedia research laboratory, where 16 people have received their doctorates and 37 people have received their master's degrees. Throughout his career Dr. Srivastava has actively collaborated with industry, both for collaborative research and technology transfer. Between 1999 and 2001 Dr. Srivastava was on leave from the University of Minnesota, during which time he worked at Amazon.com (www.amazon.com) as the Chief Data Mining Architect, and at Yodlee Inc. (www.yodlee.com) as Director of Data Analytics. Dr. Srivastava is an often-invited participant in technical as well as technology strategy forums. He has given more than a hundred talks in various industry, academic, and government forums. He is on the editorial boards of the IEEE Transactions on Knowledge & Data Engineering and the WWW Journal, and has been a guest editor for the Data Mining & Knowledge Discovery Journal. He is the program co-chair for PAKDD 2003 and the conference co-chair for the M2003 data mining conferences. The federal government has solicited his opinion on computer science research as an expert witness. He has served in an advisory role to the governments of India and Chile on various software technologies.

Contact information:
Professor, Computer Science Department, University of Minnesota
200 Union Street SE, 4-192, EE/CSci Building,
University of Minnesota, Minneapolis, MN 55455
Phone: (612) 626-8107; Fax (612) 626-1596
E-mail: srivasta@cs.umn.edu

Vipin Kumar's Profile

Vipin Kumar received the B.E. degree in electronics & communication engineering from the University of Roorkee, India, in 1977; the M.E. degree in electronics engineering from Philips International Institute, Eindhoven, Netherlands, in 1979; and the Ph.D. degree in computer science from the University of Maryland, College Park, in 1982. He is currently Director of the Army High Performance Computing Research Center and Professor of Computer Science at the University of Minnesota. Kumar's current research interests include parallel computing, parallel algorithms for scientific computing problems, and data mining. His research has resulted in the development of the concept of the isoefficiency metric for evaluating the scalability of parallel algorithms, as well as highly efficient parallel algorithms and software for sparse matrix factorization (PSPACES), graph partitioning (METIS, ParMetis, hMetis) and dense hierarchical solvers. He has authored over 100 research articles, and coedited or coauthored 5 books including the widely used textbook "Introduction to Parallel Computing" (Publ. Benjamin Cummings/Addison Wesley, 1994). Kumar serves on the editorial boards of IEEE Concurrency, Parallel Computing, and the Journal of Parallel and Distributed Computing, and served on the editorial board of IEEE Transactions on Knowledge and Data Engineering from 1993 to 1997. He is a Fellow of IEEE and a member of SIAM and ACM.

Contact information:
Professor, Computer Science Department, University of Minnesota
200 Union Street SE, 4-192, EE/CSci Building,
University of Minnesota, Minneapolis, MN 55455
Phone: (612) 624-8023; Fax (612) 625-0572
E-mail: kumar@cs.umn.edu; Web Page: http://www.cs.umn.edu/~kumar


Tutorial 2 : Analyzing and Mining Data Streams

Sudipto Guha, Nick Koudas, Kyuseok Shim

Introduction

For many recent applications, the concept of a data stream is more appropriate than a data set. A data stream is an appropriate model when a large volume of data is arriving continuously and it is either unnecessary or impractical to store all of the data in some form of memory. Many applications naturally generate streams of data as opposed to simple data sets. Astronomers, telecommunications companies, banks, stock-market analysts, and news organizations, for example, have vast amounts of data arriving continuously.

Mining data streams is thus a necessary ingredient for many successful applications. The stream view challenges basic assumptions in data mining, such as random access to data. It also raises several fundamental questions, such as whether there are effective techniques for mining streams.
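
To make the single-pass constraint concrete, here is a minimal sketch of reservoir sampling, a classic way to keep a uniform random sample of a stream without storing it or knowing its length in advance. It is offered only as one example of the sampling topic listed in the syllabus below; the function name, the synthetic `readings` generator, and the sample size are assumptions made for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items, reading the stream once."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by choosing a
            # slot in [0, i] and replacing only if it falls in the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

def readings():
    """Hypothetical stream: values are generated on the fly and never stored."""
    for t in range(100_000):
        yield (t * t) % 977

sample = reservoir_sample(readings(), k=5, rng=random.Random(0))
print(len(sample))
```

Memory use depends only on `k`, not on the length of the stream, and each item is examined exactly once, so no random access to earlier data is needed.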

In this tutorial we will present a survey of algorithms and applications related to data streams. An outline of the content is given in the following sections.

Syllabus

1. Introduction to Data Streams
    - Definitions
    - Data Stream Models
    - Novel issues and questions

2. Collecting Stream Statistics
    - Sampling
    - Dimensionality reduction
    - Sketches and signatures
    - Distribution approximations and histograms
    - Order statistics

3. Stream Mining Algorithms
    - Clustering
    - Decision trees
    - Frequent/rare items
    - Association rules
    - Outliers
    - Correlations

4. Future directions and research questions

Intended Audience

We believe that a tutorial on mining streaming data is important for both practitioners and researchers.
This tutorial will provide a self-contained view of the techniques available for mining streams. We expect the material to be of interest to most attendees of PAKDD.

Duration: Roughly three fifty-minute sessions.

Sudipto Guha's Profile

Sudipto Guha is an assistant professor in the Computer Information Sciences Department, University of Pennsylvania. He previously worked at AT&T Shannon Labs Research from 2000 to 2001 as a member of technical staff after receiving his PhD from Stanford University. His research interests are primarily in the design and analysis of algorithms for computation under constrained resources. He has worked in the fields of graph approximation algorithms for NP-hard problems, single pass data stream algorithms, efficient optimization in database query and mining, randomized algorithms and combinatorial optimization.

Nick Koudas's Profile

Nick Koudas is currently a Principal Member of Technical Staff in the database department at AT&T Labs Research. He received his B.Tech in Computer Science from the University of Patras in Greece and his PhD degree from the University of Toronto. His current research interests include XML databases, IP network data management and data quality.

Kyuseok Shim's Profile

Kyuseok Shim is an Assistant Professor and the leader of the Knowledge Discovery and Database Research Laboratory at the School of Electrical Engineering and Computer Science of Seoul National University, Korea. Previously, he was an Assistant Professor in the Computer Science Department of KAIST, Korea. Before joining KAIST, he was a member of technical staff (MTS) and one of the key contributors to the Serendip Data Mining project at Bell Laboratories. He also worked for Rakesh Agrawal's Quest Data Mining project at IBM Almaden Research Center. He received a B.S. degree in Electrical Engineering from Seoul National University in 1986, and an M.S. degree and a Ph.D. in Computer Science from the University of Maryland at College Park in 1988 and 1993, respectively. Kyuseok Shim has been working in the area of databases, focusing on data mining, data warehousing, query processing and query optimization, XML and semi-structured data. He is currently on the editorial boards of the VLDB and KAIS Journals. He has published several research papers in prestigious conferences and journals. He has also served as a program committee member for many international conferences including ICDE, ICDM, PAKDD, SIGMOD, SIGKDD and VLDB. He gave a data mining tutorial with Rajeev Rastogi at ACM SIGKDD'99 and a tutorial with Surajit Chaudhuri on storage and retrieval of XML data using relational DB at VLDB'01.

PAKDD2003 Homepage : http://aitrc.kaist.ac.kr/~pakdd03