Active data screening for efficient feature learning

Waheed Bajwa and Anand Sarwate received a one year $160,000 grant from NSF CIF for the project entitled "Active data screening for efficient feature learning".

The abstract is given below.

Advances in sensing and data acquisition technologies make it easy to generate vast quantities of data that must be stored, communicated, processed, and understood. This data may be of variable quality and the nature of the data may vary over time -- this variability can cause difficulties for existing approaches to efficiently represent the data. Current methods use economical representations of the data in terms of a smaller number of properties, or features, of the raw data. Standard feature representations such as Fourier and wavelet representations may not be efficient at representing the data from these new acquisition technologies. One paradigm to overcome this mismatch is the data-driven approach, in which an algorithm processes the data to learn a novel and efficient feature representation for the given data. While these are more useful, such approaches may not scale well to massive data sets.

This work designs new methods for data-driven feature learning that are scalable and robust to noisy, time-varying data.

It proposes an "active screening" approach to learning new features; the data processing algorithm uses a low-complexity criterion to screen for useful and informative points for feature learning. Advantages of active screening include a reduction in the computational and storage overhead as well as the ability to reject outliers or other spurious and misleading data. The investigators develop active screening methods for consistent estimation under generative models for the data, analyze the tradeoff between representation and classification in active screening for discriminative dictionary learning, and extend the active screening analysis to distributed settings for distributed dictionary learning. They investigate the promise of these methods on two large-scale electrocardiography (ECG) datasets of 170+ patients. This work combines ideas from statistics (feature screening) and machine learning (active learning and selective sampling) to design efficient representations of complex signals from massive data sets and may inform the design of new data acquisition technologies by incorporating screening ideas into the technologies themselves.

Illustration of the benefits of active screening: selecting useful samples leads to better signal representations.