Heuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.
\textit{Clustering problems} often arise in the fields like data mining, machine learning etc. to group a collection of objects into similar groups with respect to a similarity (or dissimilarity) measure. Among the clustering problems, specifically \textit{$k$-means} clustering has got much attention from the researchers. Despite the fact that $k$-means is a very well studied problem its status in the plane is still an open problem. In particular, it is unknown whether it admits a PTAS in the plane. The best known approximation bound in polynomial time is $9+\eps$. In this paper, we consider the following variant of $k$-means. Given a set $C$ of points in $\mathcal{R}^d$ and a real $f > 0$, find a finite set $F$ of points in $\mathcal{R}^d$ that minimizes the quantity $f*|F|+\sum_{p\in C} \min_{q \in F} {||p-q||}^2$. For any fixed dimension $d$, we design a local search PTAS for this problem. We also give a ‘bi-criterion’ local search algorithm for $k$-means which uses $(1+\eps)k$ centers and yields a solution whose cost is at most $(1+\eps)$ times the cost of an optimal $k$-means solution. The algorithm runs in polynomial time for any fixed dimension. The contribution of this paper is two fold. On the one hand, we are being able to handle the square of distances in an elegant manner, which yields near optimal approximation bound. This leads us towards a better understanding of the $k$-means problem. On the other hand, our analysis of local search might also be useful for other geometric problems. This is important considering that very little is known about the local search method for geometric approximation.
Most users of online services have unique behavioral or usage patterns. These behavioral patterns can be exploited to identify and track users by using only the observed patterns in the behavior. We study the task of identifying users from statistics of their behavioral patterns. Specifically, we focus on the setting in which we are given histograms of users’ data collected during two different experiments. We assume that, in the first dataset, the users’ identities are anonymized or hidden and that, in the second dataset, their identities are known. We study the task of identifying the users by matching the histograms of their data in the first dataset with the histograms from the second dataset. In recent works, the optimal algorithm for this user identification task is introduced. In this paper, we evaluate the effectiveness of this method on three different types of datasets and in multiple scenarios. Using datasets such as call data records, web browsing histories, and GPS trajectories, we show that a large fraction of users can be easily identified given only histograms of their data; hence these histograms can act as users’ fingerprints. We also verify that simultaneous identification of users achieves better performance compared to one-by-one user identification. We show that using the optimal method for identification gives higher identification accuracy than heuristics-based approaches in practical scenarios. The accuracy obtained under this optimal method can thus be used to quantify the maximum level of user identification that is possible in such settings. We show that the key factors affecting the accuracy of the optimal identification algorithm are the duration of the data collection, the number of users in the anonymized dataset, and the resolution of the dataset. We analyze the effectiveness of k-anonymization in resisting user identification attacks on these datasets.
We consider the model averaged tail area (MATA) confidence interval proposed by Turek and Fletcher, CSDA, 2012, in the simple situation in which we average over two nested linear regression models. We prove that the MATA for any reasonable weight function belongs to the class of confidence intervals defined by Kabaila and Giri, JSPI, 2009. Each confidence interval in this class is specified by two functions b and s. Kabaila and Giri show how to compute these functions so as to optimize these intervals in terms of satisfying the coverage constraint and minimizing the expected length for the simpler model, while ensuring that the expected length has desirable properties for the full model. These Kabaila and Giri ‘optimized’ intervals provide an upper bound on the performance of the MATA for an arbitrary weight function. This fact is used to evaluate the MATA for a broad class of weights based on exponentiating a criterion related to Mallows’ C_P. Our results show that, while far from ideal, this MATA performs surprisingly well, provided that we choose a member of this class that does not put too much weight on the simpler model.
This work introduces a theoretical foundation for a procedure called testing based forward model selection in regression problems. Forward selection is a general term referring to a model selection procedure which inductively selects covariates that add predictive power into a working statistical model. This paper considers the use of testing procedures, derived from traditional statistical hypothesis testing, as a criterion for deciding which variable to include next and when to stop including variables. Probabilistic bounds for prediction error and number of selected covariates are proved for the proposed procedure. The general result is illustrated by an example with heteroskedastic data where Huber Eicker White standard errors are used to construct tests. The performance of the testing-based forward selection is compared to Lasso and Post-Lasso in simulation studies. Finally, the use of testing- based forward selection is illustrated with an application to estimating the effects of institution quality on aggregate economic output.