Kernel-based clustering algorithms achieve better performance on real-world data than Euclidean distance-based clustering algorithms. However, kernel-based algorithms pose two important challenges:
- They do not scale well in terms of runtime and memory: their complexity is quadratic in the number of data instances, rendering them inefficient for large data sets containing millions of points.
- The choice of the kernel function is critical to the performance of the algorithm.
In this project, we aim to develop efficient schemes that reduce the running time and memory requirements of these clustering algorithms, and to learn appropriate kernel functions from the data.
Approximate Kernel Clustering
In [1] and [2], we focus on reducing the complexity of the kernel k-means algorithm, which has quadratic runtime and memory complexity and cannot efficiently cluster more than a few thousand points. In [1], we developed an algorithm called Approximate Kernel k-means, which replaces the kernel similarity matrix in kernel k-means with a low-rank approximation, obtained by randomly sampling a small subset of the data set. The approximate kernel matrix is used to obtain the cluster labels for the data in an iterative manner (see Figure 1). The running time of the approximate kernel k-means algorithm is linear in the number of points in the data set, reducing the clustering time by orders of magnitude compared to kernel k-means.
For instance, the approximate kernel k-means algorithm is able to cluster the 80 million images in the Tiny data set in about 8 hours on a single core, whereas kernel k-means would require several weeks. The clustering accuracy of approximate kernel k-means is close to that of kernel k-means, and the gap in accuracy decreases linearly as the number of sampled points increases. The algorithm can also be easily parallelized to further reduce the clustering time.
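The low-rank idea can be illustrated with a short sketch. The code below is not the authors' implementation, but a minimal Nyström-style version of the approach, assuming an RBF kernel and NumPy; all function names and parameters here are our own illustrative choices.

```python
import numpy as np

def approx_kernel_kmeans(X, n_clusters, n_samples, gamma=1.0, n_iter=50, seed=0):
    """Sketch of low-rank approximate kernel k-means.

    A random subset of n_samples points induces a Nystrom-style
    approximation of the full RBF kernel matrix; clustering is then
    done with ordinary Lloyd's k-means in the resulting feature space,
    so the cost is linear in the number of points.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = X[rng.choice(n, size=n_samples, replace=False)]  # sampled subset

    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    K_nm = rbf(X, S)   # n x m cross-kernel between all points and the sample
    K_mm = rbf(S, S)   # m x m kernel on the sampled points only

    # Eigendecomposition of K_mm yields an explicit feature map whose
    # inner products reproduce the low-rank approximation of the kernel.
    w, V = np.linalg.eigh(K_mm)
    keep = w > 1e-10
    Phi = K_nm @ V[:, keep] / np.sqrt(w[keep])  # n x r explicit features

    # Farthest-first initialization, then plain Lloyd's iterations.
    centers = [Phi[rng.integers(n)]]
    for _ in range(1, n_clusters):
        d2 = np.min(((Phi[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(Phi[d2.argmax()])
    centers = np.array(centers)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        labels = ((Phi[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = Phi[labels == k].mean(0)
    return labels
```

Note that only the m x m and n x m kernel blocks are ever formed, never the full n x n matrix, which is where the memory savings come from.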
In [2], random projection is employed to map the data in the kernel space to a low-dimensional space, where the clustering can be performed using the k-means algorithm. Figure 2 shows some sample clusters from the Tiny data set obtained using these approximate algorithms.
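One standard way to obtain such an explicit low-dimensional map for the RBF kernel is random Fourier features, the technique studied in [3]. The sketch below is illustrative rather than the paper's implementation; the function name and parameters are our own.

```python
import numpy as np

def rff_features(X, n_features=64, gamma=1.0, seed=0):
    """Random Fourier features approximating an RBF kernel.

    The inner product z(x)^T z(y) is an unbiased estimate of
    exp(-gamma * ||x - y||^2), so running ordinary k-means on z(X)
    approximates kernel k-means on X at linear cost.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the Fourier transform of the RBF kernel
    # (a Gaussian with standard deviation sqrt(2 * gamma) per dimension).
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

The approximation error shrinks as the number of random features grows, so the feature count trades accuracy against the dimensionality of the space handed to k-means.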


Figure 1. Approximate kernel k-means
Figure 2. Sample clusters from the Tiny data set obtained using the approximate kernel clustering algorithms
Stream Kernel Clustering
Stream data is generated continuously and is unbounded in size. Stream data clustering, particularly kernel-based clustering, is additionally challenging because it is not possible to store all the data and compute the full kernel similarity matrix. In [5], we devise a sampling-based algorithm that selects a subset of the stream data to construct an approximate kernel matrix. Each data point is sampled as it arrives, with probability proportional to its importance in the construction of the low-rank kernel matrix. Clustering is performed in linear time using k-means, in a low-dimensional space spanned by the eigenvectors corresponding to the top eigenvalues of the kernel matrix constructed from the sampled points. Points that are not sampled are assigned labels in real time by selecting the cluster center closest to them (Figure 3).
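As an illustration only, the sample-embed-cluster pipeline can be sketched as follows. This is not the algorithm of [5]: the importance score used here is a crude residual-similarity proxy rather than the importance measure of the paper, the final labeling is done in a batch for brevity, and all names are hypothetical.

```python
import numpy as np

def rbf(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def stream_kernel_cluster(stream, n_clusters, gamma=1.0, seed=0):
    """Sketch of importance-sampled streaming kernel clustering.

    Each arriving point is kept with probability proportional to how
    poorly it is already represented by the sampled set (here: one
    minus its maximum RBF similarity to sampled points).  The sampled
    points define a kernel eigen-embedding that is clustered with
    k-means; any point is then labeled by its nearest center.
    """
    rng = np.random.default_rng(seed)
    sampled = []
    for x in stream:
        if len(sampled) < n_clusters:
            sampled.append(x)  # always keep the first few points
            continue
        S = np.array(sampled)
        p = 1.0 - rbf(x[None], S, gamma).max()  # residual-similarity proxy
        if rng.random() < p:
            sampled.append(x)
    S = np.array(sampled)

    # Embed via the top eigenvectors of the sampled kernel matrix.
    w, V = np.linalg.eigh(rbf(S, S, gamma))
    w, V = w[-n_clusters:], V[:, -n_clusters:]
    emb = lambda X: rbf(X, S, gamma) @ V / np.sqrt(np.maximum(w, 1e-12))
    E = emb(S)

    # Lloyd's k-means on the embedded sampled points (farthest-first init).
    centers = [E[0]]
    for _ in range(1, n_clusters):
        d2 = np.min(((E[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(E[d2.argmax()])
    centers = np.array(centers)
    for _ in range(30):
        lab = ((E[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if np.any(lab == k):
                centers[k] = E[lab == k].mean(0)

    # Unsampled points are labeled by projecting into the same
    # embedding and picking the closest cluster center.
    def assign(X):
        return ((emb(X)[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    return assign
```

Because the embedding and centers are fixed once built, labeling a new arrival costs only one small kernel evaluation against the sampled set, which is what makes real-time assignment feasible.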
We evaluated this algorithm on the Twitter stream, which was clustered with high accuracy at a speed of about 1 Mbps. Figure 4 shows sample tweets from the
cluster representing the #HTML hashtag.


Figure 3. Approximate stream kernel k-means
Figure 4. Cluster representing #HTML from the Twitter stream
Relevant Publication(s)
[1] R. Chitta, R. Jin, T. C. Havens, and A. K. Jain, "Approximate Kernel k-means: Solution to Large Scale Kernel Clustering", KDD, San Diego, CA, August 21-24, 2011.
[2] T. C. Havens, R. Chitta, A. K. Jain, and R. Jin, "Speedup of Fuzzy and Possibilistic Kernel c-Means for Large-Scale Clustering", Proc. IEEE Int. Conf. Fuzzy Systems, Taipei, Taiwan, June 27-30, 2011.
[3] R. Chitta, R. Jin, and A. K. Jain, "Efficient Kernel Clustering using Random Fourier Features", ICDM, Brussels, Belgium, Dec. 10-13, 2012.
[4] R. Chitta, R. Jin, and A. K. Jain, "Scalable Kernel Clustering: Approximate Kernel k-means", arXiv preprint arXiv:1402.3849, 2014.
[5] R. Chitta, R. Jin, and A. K. Jain, "Stream Clustering: Efficient Kernel-Based Approximation using Importance Sampling", (Under Review).