- Outlier Detection in High-Dimensional Data
Lecturers:
Arthur Zimek, Erich Schubert, Hans-Peter Kriegel
Abstract:
High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term "curse of dimensionality"; more concrete aspects are the so-called "distance concentration effect", the presence of irrelevant attributes concealing relevant information, and plain efficiency issues. Within just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall mainly into two categories, depending on whether or not they consider subspaces (subsets of attributes) for the definition of outliers. The former specifically address the presence of irrelevant attributes; the latter account for irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this tutorial, we discuss in detail those aspects of the "curse of dimensionality" that matter most for outlier detection, and we survey specialized outlier detection algorithms from both categories.
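To make the distance concentration effect concrete, the following minimal sketch (ours, not part of the tutorial; sample sizes are arbitrary) measures the relative contrast (d_max - d_min) / d_min between a random query point and a uniform sample. As the dimensionality grows, the contrast collapses, so nearest and farthest neighbors become nearly indistinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)

# For uniform random points, measure the relative contrast
# (d_max - d_min) / d_min of distances from a random query point
# to a sample; it shrinks as the dimensionality d grows.
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))          # 1000 uniform points in [0,1]^d
    q = rng.random(d)                  # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```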
Webpage:
http://www.dbs.ifi.lmu.de/cms/Publications/OutlierHighDimensional
- Sampling and Summarization for Social Networks
Lecturers:
Shou-De Lin, Mi-Yen Yeh, and Cheng-Te Li
Abstract:
With the growing popularity of online social network services such as Twitter and Facebook, social network analysis has attracted much attention in recent years. The massive amount of data created every day on social network services has become a great source of information for purposes such as sociological studies and marketing analysis. However, the growing scale of real-world social networks (usually containing millions of nodes and edges) imposes great challenges on information extraction, processing, and analysis, for humans and even for computers. It is usually unrealistic to assume that the full network is known or can be processed efficiently at once. Therefore, it is crucial for data miners to devise effective, efficient, and systematic approaches for networks that are too large to be handled directly.
Generally, there are two mainstream strategies for handling large-scale networks: sampling and summarization. In social network sampling, it is assumed that the full network is unseen or impossible to obtain. Techniques are needed to sample a sub-network in a way that preserves specific properties of the original network. For example, to study the ‘six degrees of separation’ phenomenon in an online social network such as Facebook, one needs to develop a crawler that extracts partial network data for experiments, assuming the complete network data is not open to the public; see the sketch after this paragraph. To draw a convincing conclusion, one then needs to make sure that the sampling process faithfully preserves the distribution of shortest-path lengths between nodes.
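As an illustration of the sampling problem, the sketch below (our own, not from the tutorial; it uses networkx, and both the random-walk sampler and the graph parameters are purely illustrative) crawls a sub-network by a simple random walk and compares its shortest-path length distribution with that of the full network:

```python
import random
import networkx as nx

def random_walk_sample(G, n_nodes, seed=0):
    """Crawl a connected sub-network by a simple random walk."""
    rng = random.Random(seed)
    current = rng.choice(list(G.nodes))
    visited = {current}
    while len(visited) < n_nodes:
        current = rng.choice(list(G.neighbors(current)))
        visited.add(current)
    return G.subgraph(visited).copy()

def path_length_histogram(G):
    """Shortest-path length counts over ordered node pairs."""
    hist = {}
    for _, lengths in nx.shortest_path_length(G):
        for l in lengths.values():
            if l > 0:
                hist[l] = hist.get(l, 0) + 1
    return hist

G = nx.connected_watts_strogatz_graph(1000, k=10, p=0.1, seed=1)
S = random_walk_sample(G, 100)
print("full   :", path_length_histogram(G))
print("sampled:", path_length_histogram(S))
```

How closely the sampled histogram matches the full one depends strongly on the sampling strategy, which is exactly the design question such techniques must answer.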
Social network summarization (which some researchers call social network compression), on the other hand, addresses different issues. In summarization, the entire network structure is known in advance, but it is usually too big or too complex for humans to visualize and for machines to store and process efficiently. Graph summarization algorithms therefore aim at condensing the network as much as possible without losing too much information. The summarized network can be visualized more clearly, stored more compactly, and processed more easily.
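One very simple summarization strategy is to collapse structurally equivalent nodes (nodes with identical neighbor sets) into supernodes; the sketch below is an illustrative baseline of our own, not a specific algorithm from the tutorial:

```python
from collections import defaultdict
import networkx as nx

def summarize_by_neighborhood(G):
    """Collapse nodes that share the exact same neighbor set
    into a single supernode, keeping one edge per supernode pair."""
    groups = defaultdict(list)
    for v in G.nodes:
        groups[frozenset(G.neighbors(v))].append(v)
    # map every node to a representative supernode
    rep = {v: members[0] for members in groups.values() for v in members}
    S = nx.Graph()
    S.add_nodes_from(set(rep.values()))
    for u, v in G.edges:
        if rep[u] != rep[v]:
            S.add_edge(rep[u], rep[v])
    return S

G = nx.complete_bipartite_graph(3, 50)   # many structurally equivalent nodes
S = summarize_by_neighborhood(G)
print(G.number_of_nodes(), "->", S.number_of_nodes())   # 53 -> 2
```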
Webpage:
http://mslab.csie.ntu.edu.tw/tut-pakdd13/
- Mining Multiple Threads of Streaming Data
Lecturers:
Myra Spiliopoulou and Georg Krempl
Abstract:
Stream mining is a mature area of research. However, several applications that require adaptive learning from evolving data do not seem to fit the conventional stream mining paradigm. For example, a bank grants loans to customers and uses their data for model learning; the label (loan paid back: YES or NO) arrives only years later, during which time the market may have changed drastically. Is this a stream mining problem? How many streams are there? We can distinguish between the stream of customers and the stream of their labels, which arrive with a time lag of years.
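The loan example can be made concrete with a toy learner that joins the two streams: it predicts on the customer stream immediately and buffers each instance until its delayed label arrives. The interface below is hypothetical, a minimal sketch of ours rather than the lecturers' framework:

```python
class Perceptron:
    """Tiny online linear classifier (illustrative model only)."""
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def predict(self, x):
        return 1 if sum(w * xi for w, xi in zip(self.w, x)) >= 0 else 0

    def update(self, x, y):
        err = y - self.predict(x)
        self.w = [w + self.lr * err * xi for w, xi in zip(self.w, x)]

class DelayedLabelLearner:
    """Joins a feature stream with a label stream that arrives much later."""
    def __init__(self, model):
        self.model = model
        self.pending = {}               # instance id -> buffered features

    def observe(self, inst_id, x):
        """Customer stream: predict now, buffer x until the label comes."""
        self.pending[inst_id] = x
        return self.model.predict(x)

    def label_arrives(self, inst_id, y):
        """Label stream (years later): join and update the model."""
        self.model.update(self.pending.pop(inst_id), y)

learner = DelayedLabelLearner(Perceptron(dim=2))
print(learner.observe("loan-1", [1.0, -0.5]))   # decision made today
learner.label_arrives("loan-1", 1)              # outcome known years later
```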
As another example, a hospital monitors patients with chronic diseases who come (ir)regularly to the hospital and undergo different tests; the streams of medical recordings and of signals (EEG, fMRI) can be used for learning. The hospital wants to learn a model of how the patients' health evolves in response to the disease and to medications. This problem seems completely different from the previous one, although streams of data are present in both cases.
In this tutorial, we bring together research advances on model learning and adaptation for dynamic applications that collect and analyze multiple sources of dynamic data. In the introductory part of the tutorial, we present the classic stream mining paradigm, sketched below, and summarize the challenges investigated in state-of-the-art research.
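For reference, the classic paradigm is often formalized as prequential (test-then-train) evaluation: each arriving instance is first used to test the current model and then to train it. A minimal sketch, assuming a model with the same predict/update interface as in the previous code block:

```python
def prequential(stream, model):
    """Classic stream mining loop: test on each arriving instance
    first, then train on it, tracking online accuracy."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)   # test first
        model.update(x, y)                      # then train
        total += 1
    return correct / total if total else 0.0
```

The two applications above break exactly the assumption this loop encodes, namely that the label y is available immediately after the prediction.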
Webpage:
http://kmd.cs.ovgu.de/tutorial_pakdd2013.html
- Relevance Feature Discovery
Lecturers:
Yuefeng Li and Ning Zhong
Abstract:
It is a big challenge to guarantee the quality of relevance features discovered in text documents for describing user preferences, because of the large number of terms and patterns and the noise among them. Most existing popular text mining and classification methods have adopted term-based approaches to reduce noisy features. However, they all suffer from the problems of polysemy and synonymy. Moreover, the main drawback of this mechanism is that relationships among words cannot be easily reflected. Over the years, people have often held the hypothesis that pattern-based methods should perform better than term-based ones in describing user preferences. Unlike terms, however, patterns suffer from a low-frequency problem, which affects the quality of the extracted features. This tutorial aims to provide a unifying view of basic and applied relevance feature discovery research in text mining and related areas. In the first part, we will introduce the problem of text mining, including defining user intent and using patterns to extract knowledge. In the second part, we will introduce available approaches and techniques that utilize both high-level patterns and low-level terms to extract high-quality features and solve the low-frequency problem of patterns. In the third part, we will reflect on the past, present, and future of this research and outline future research directions. We will focus on the link between research scenarios and application needs.
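The low-frequency problem of patterns is easy to demonstrate. In the toy corpus below (our own illustration, not the lecturers' data), individual terms remain frequent while multi-term patterns, taken here simply as co-occurring word pairs, quickly become rare:

```python
from collections import Counter
from itertools import combinations

docs = [
    {"data", "mining", "pattern"},
    {"data", "mining", "stream"},
    {"data", "pattern", "discovery"},
    {"mining", "stream"},
]

# Term-based features: individual words occur in many documents...
term_support = Counter(t for d in docs for t in d)

# ...but multi-term patterns (co-occurring word sets) are supported
# by far fewer documents: the low-frequency problem.
pair_support = Counter(p for d in docs for p in combinations(sorted(d), 2))

print(term_support.most_common(3))
print(pair_support.most_common(3))
```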
Webpage:
https://dl.dropbox.com/u/37464765/PAKDD/PAKDD/PAKDD13.html
- Transfer Learning with Applications
Lecturers:
Qiang Yang and Sinno Jialin Pan
Abstract:
Transfer learning has attracted increasing attention in artificial intelligence, machine learning, and many application areas. Unlike traditional machine learning methods, which assume that the training and testing data come from the same task or domain, transfer learning aims to extract common knowledge across domains or tasks, such that a model trained on one domain or task can be adapted to others. In this tutorial, we aim to 1) give an easy-to-follow introduction to this fast-growing research area and discuss the relationships between transfer learning and other learning areas; 2) introduce representative transfer learning methods and their real-world applications to wireless sensor networks, natural language processing (NLP), recommender systems, social networking, etc.; 3) discuss research challenges and future directions in this area.
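To give a flavor of the area, one of the simplest transfer strategies is instance reweighting for covariate shift: a domain classifier estimates how target-like each labeled source instance is, and the task model is then trained on the reweighted source data. The sketch below (scikit-learn, synthetic data) is one common baseline, not necessarily a method covered in the tutorial:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Source and target domains share the labeling rule but differ in
# their input distribution (covariate shift); only the source is labeled.
Xs = rng.normal(0.0, 1.0, (500, 2))
ys = (Xs[:, 0] + Xs[:, 1] > 0).astype(int)
Xt = rng.normal(1.0, 1.0, (500, 2))          # shifted target inputs

# 1) Domain classifier: source (class 0) vs target (class 1).
dom = LogisticRegression().fit(np.vstack([Xs, Xt]),
                               np.r_[np.zeros(500), np.ones(500)])

# 2) Weight each source instance by the estimated density ratio
#    p_target(x) / p_source(x), so the source resembles the target.
p = dom.predict_proba(Xs)
w = p[:, 1] / p[:, 0]

# 3) Train the task model on the reweighted source data.
clf = LogisticRegression().fit(Xs, ys, sample_weight=w)
yt = (Xt[:, 0] + Xt[:, 1] > 0).astype(int)
print("target accuracy:", clf.score(Xt, yt))
```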
Webpage:
http://www1.i2r.a-star.edu.sg/~jspan/tutorials/pakdd13TL.htm