A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features

Goh, Gabriel

doi:10.23915/distill.00019.3

Two Examples of Useful, Non-Robust Features

By kadri alaa / August 6, 2019

Ilyas et al. define a feature as a function $f$ that maps an input $x$ to a real number, and measure it
by its robust usefulness,

$$\mathbf{E}\left[\inf_{\|\delta\|\leq\epsilon}yf(x+\delta)\right],$$

its correlation with the label while under attack. Ilyas et al.
suggest that in addition to the pedestrian, robust features we know and love (such as the color of the
sky), our models may also be taking advantage of useful, non-robust features, some of which may even lie
beyond the threshold of human intuition. This raises the question: what might such non-robust features look
like?

Non-Robust Features in Linear Models

Our search is simplified when we realize the following: non-robust features are not unique to the complex,
nonlinear models encountered in deep learning. As Ilyas et al.
observe, they arise even in the humblest of models — the linear one. Thus, we restrict our attention
to linear features of the form:

$$f(x) = \frac{a^Tx}{\|a\|_\Sigma}\qquad \text{where} \qquad \Sigma = \mathbf{E}[xx^T] \quad \text{and} \quad \mathbf{E}[x] = 0.$$
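The normalization by $\|a\|_\Sigma$ is what gives the feature unit variance. Here is a minimal NumPy sketch of this, assuming the usual reading $\|a\|_\Sigma = \sqrt{a^T \Sigma a}$. The helper names `sigma_norm` and `linear_feature` and the synthetic data are illustrative assumptions, not part of the original article.

```python
import numpy as np

def sigma_norm(a, Sigma):
    """Return ||a||_Sigma, assumed here to mean sqrt(a^T Sigma a)."""
    return float(np.sqrt(a @ Sigma @ a))

def linear_feature(a, X, Sigma):
    """Evaluate f(x) = a^T x / ||a||_Sigma on every row of X."""
    return X @ a / sigma_norm(a, Sigma)

rng = np.random.default_rng(0)
# Synthetic stand-in data (an assumption): 1000 samples, 5 correlated dimensions.
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)        # enforce E[x] = 0 on the sample
Sigma = X.T @ X / len(X)      # empirical Sigma = E[x x^T]

a = rng.normal(size=5)        # an arbitrary direction
f = linear_feature(a, X, Sigma)
print("mean:", f.mean(), "variance:", f.var())
```

On centered data the printed mean should be numerically zero and the variance numerically one, whichever direction `a` is chosen.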

The robust usefulness of a linear feature admits an elegant decomposition into two terms:

$$\underbrace{\mathbf{E}\left[\inf_{\|\delta\|\leq\epsilon}yf(x+\delta)\right]}_{\text{the robust usefulness of a feature}} = \underbrace{\mathbf{E}[yf(x)]}_{\text{the correlation of the feature with the label}} - \underbrace{\epsilon\frac{\|a\|_{*}}{\|a\|_{\Sigma}}}_{\text{the feature's non-robustness}}$$

This follows because $f$ is linear and the norm ball is symmetric:

$$\begin{aligned} \mathbf{E}\left[\inf_{\|\delta\|\leq\epsilon}yf(x+\delta)\right] & =\mathbf{E}\left[yf(x)+\inf_{\|\delta\|\leq\epsilon}yf(\delta)\right]\\ & =\mathbf{E}\left[yf(x)+\inf_{\|\delta\|\leq\epsilon}y\frac{a^{T}\delta}{\|a\|_{\Sigma}}\right]\\ & =\mathbf{E}\left[yf(x)+\frac{\inf_{\|\delta\|\leq\epsilon}a^{T}\delta}{\|a\|_{\Sigma}}\right]\\ & =\mathbf{E}[yf(x)]-\epsilon\frac{\|a\|_{*}}{\|a\|_{\Sigma}} \end{aligned}$$

In the above equation $\|\cdot\|_*$ denotes the dual norm of $\|\cdot\|$.
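The decomposition can be checked numerically. The sketch below assumes an $\ell_\infty$ attack, whose dual norm is $\ell_1$ and whose worst-case perturbation for a linear feature is $\delta = -\epsilon\, y\, \mathrm{sign}(a)$. The data, the direction `a`, and the function names are illustrative assumptions, not taken from the article.

```python
import numpy as np

def usefulness(a, X, y, Sigma):
    """E[y f(x)] for the linear feature f(x) = a^T x / ||a||_Sigma."""
    return float(np.mean(y * (X @ a)) / np.sqrt(a @ Sigma @ a))

def non_robustness(a, Sigma, eps, dual_ord):
    """eps * ||a||_* / ||a||_Sigma, with the dual norm chosen by dual_ord."""
    return float(eps * np.linalg.norm(a, ord=dual_ord) / np.sqrt(a @ Sigma @ a))

rng = np.random.default_rng(0)
n, d, eps = 2000, 4, 0.1
y = rng.choice([-1.0, 1.0], size=n)                 # binary labels
# Synthetic data (an assumption): the first two coordinates carry label signal.
X = rng.normal(size=(n, d)) + 0.5 * y[:, None] * np.array([1.0, 0.5, 0.0, 0.0])
X = X - X.mean(axis=0)
Sigma = X.T @ X / n
a = np.array([1.0, -2.0, 0.5, 3.0])                 # arbitrary linear feature

# Worst-case l_inf perturbation for each sample: push against y * sign(a).
delta = -eps * y[:, None] * np.sign(a)[None, :]
attacked = float(np.mean(y * ((X + delta) @ a)) / np.sqrt(a @ Sigma @ a))

# Closed form from the decomposition, using the l_1 norm as the dual of l_inf.
closed_form = usefulness(a, X, y, Sigma) - non_robustness(a, Sigma, eps, dual_ord=1)
print("attacked:", attacked, "closed form:", closed_form)
```

The two printed numbers should agree up to floating-point error. For an $\ell_2$ attack the dual norm is again $\ell_2$, so `dual_ord=2` would be used instead.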

Plotted below is the binary classification task of separating truck and frog in CIFAR-10 on
the set of features $a_i$

The elusive non-robust useful features, however, seem conspicuously absent in the above plot.
Fortunately, we can construct such features by strategically combining elements of this basis.

We demonstrate two constructions:

It is surprising, thus, that the experiments of Madry et al.
(with deterministic perturbations) do distinguish between the non-robust useful
features generated from ensembles and containments. A succinct definition of a robust feature that peels
these two worlds apart does not yet exist, and finding one remains an open problem for the machine learning community.
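To make the "containment" idea concrete, here is a toy sketch built on the characterization given in the response: a useful, robust feature combined with a useless, non-robust one. Everything in it is an assumption made for illustration and is not the article's CIFAR-10 construction: the two-dimensional synthetic data, the $\ell_2$ attack, and the weight of 20 on the distractor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 5000, 0.1
y = rng.choice([-1.0, 1.0], size=n)

# Coordinate 0: high variance and correlated with y (useful and robust).
# Coordinate 1: tiny variance and independent of y (useless and non-robust).
x0 = y + rng.normal(size=n)
x1 = 0.01 * rng.normal(size=n)
X = np.stack([x0, x1], axis=1)
X = X - X.mean(axis=0)
Sigma = X.T @ X / n

def coords(a):
    """Return (usefulness, non-robustness) of linear feature a under an l2 attack."""
    s = np.sqrt(a @ Sigma @ a)
    return float(np.mean(y * (X @ a)) / s), float(eps * np.linalg.norm(a) / s)

robust = np.array([1.0, 0.0])
distractor = np.array([0.0, 1.0])
contained = robust + 20.0 * distractor   # hypothetical mixing weight

for name, a in [("robust", robust), ("distractor", distractor), ("contained", contained)]:
    u, nr = coords(a)
    print(f"{name:10s} usefulness={u:.3f} non-robustness={nr:.3f}")
```

In this toy setting the combined feature keeps roughly the usefulness of the robust direction, because the distractor adds almost nothing to $\|a\|_\Sigma$. Its non-robustness term grows with the distractor's weight, so its robust usefulness can fall below zero.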

Response Summary: The construction of explicit non-robust features is
very interesting and makes progress towards the challenge of visualizing some of
the useful non-robust features detected by our experiments. We also agree that
non-robust features arising as “distractors” are indeed not precluded by our
theoretical framework, even if they are precluded by our experiments.
This simple theoretical framework sufficed for reasoning about and
predicting the outcomes of our experiments. (We also presented a theoretical
setting where we can analyze things fully rigorously in Section 4 of our paper.)
However, this comment rightly identifies finding a more comprehensive
definition of feature as an important future research direction.

Response: These experiments (visualizing the robustness and
usefulness of different linear features) are very interesting! They both further
corroborate the existence of useful, non-robust features and make progress
towards visualizing what these non-robust features actually look like.

We also appreciate the point made by the provided construction of non-robust
features (as defined in our theoretical framework) that are combinations of
useful+robust and useless+non-robust features. Our theoretical framework indeed
enables such a scenario, even if — as the commenter already notes — our
experimental results do not. (In this sense, the experimental results and our
main takeaway are actually stronger than our theoretical
framework technically captures.) Specifically, in such a scenario, during the
construction of the $\widehat{\mathcal{D}}_{det}$

Overall, our focus while developing our theoretical framework was on
enabling us to formally describe and predict the outcomes of our experiments. As
the comment points out, putting forth a theoretical framework that captures
non-robust features in a very precise way is an important future research
direction in itself.

Acknowledgments

Shan Carter (design overhaul), Preetum (technical discussion), Chris Olah (technical discussion), Ludwig
(overall feedback), Ria (feedback), Aditiya (feedback)

Author Contributions

Research: Alex developed …

Writing & Diagrams: The text was initially drafted by…

References

Adversarial examples are not bugs, they are features
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.

Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution in academic contexts, please cite this work as

Goh, "A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features", Distill, 2019.

BibTeX citation

@article{goh2019a,
  author = {Goh, Gabriel},
  title = {A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features': Two Examples of Useful, Non-Robust Features},
  journal = {Distill},
  year = {2019},
  doi = {10.23915/distill.00019.3}
}
