Statistical Learnability of Strategic Linear Classifiers: A Proof Walkthrough
In the previous article in this series, we examined the concept of strategic VC dimension (SVC) and its connection to the Fundamental Theorem of Strategic Learning. We will make use of both of those in this article, alongside the ideas of achievable labelings and strategic shattering coefficients that we explored in the lead-up to them.
Quantifying the Complexity and Learnability of Strategic Classification Problems
If you haven't read the first article in this series yet, I encourage you to start from there before moving on to the aforementioned article on SVC:
Extending PAC Learning to a Strategic Classification Setting
With the context from the other articles in this series in place, a basic grasp of set theory and geometry will be all you'll need to understand the theorem and its proof.
How an Instance-Wise Cost Function Affects SVC: Stating the Theorem
As we saw, SVC can be used as a tool to estimate the expressive power of a hypothesis class within a strategic classification context. Having carefully defined SVC as a generalization of the canonical VC dimension, we understand that the two have much in common. When, though, does SVC diverge from its canonical counterpart? Can we come up with a scenario in which the strategic aspect of a classification problem significantly increases its complexity? It turns out we can, with relative ease: linear classification.
Linear classification involves determining whether a data point should be positively or negatively classified based on a linear function applied to its features. Geometrically, we can imagine a linear classifier inducing a linear decision boundary in d-dimensional real space (ℝᵈ). Anything on one side of the boundary is positively classified, and anything on the other side is negatively classified. In one-dimensional space, the decision boundary is a threshold (as we saw in the previous article). In two-dimensional space, it's a line dividing the plane. In general, it's a hyperplane.
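To make this concrete, here is a minimal Python sketch (illustrative only, not part of the original construction) of a linear classifier in ℝ², with the hypothetical helper `linear_classifier` labeling a point by the sign of w ⋅ x + b:

```python
import numpy as np

def linear_classifier(w, b):
    """Return h(x) = +1 if w . x + b >= 0, else -1.
    The decision boundary is the line w . x + b = 0."""
    w = np.asarray(w, dtype=float)
    return lambda x: 1 if w @ np.asarray(x, dtype=float) + b >= 0 else -1

h = linear_classifier(w=[1.0, -2.0], b=0.5)   # an arbitrary line in the plane
print(h([3.0, 1.0]), h([0.0, 2.0]))           # prints: 1 -1
```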
In canonical binary classification, the VC dimension of the hypothesis class comprising all linear classifiers in ℝᵈ is d + 1, which is finite. For example, for d = 2 (linear classifiers in ℝ²), the VC dimension is 3. The Fundamental Theorem of Statistical Learning dictates that canonical linear classification is therefore PAC learnable.⁽¹⁾
Intuitively, we might expect the same conclusion to hold for the strategic analog of the problem. After all, linear classifiers are some of the simplest classifiers there are, and reasoning about them can be rather natural.⁽²⁾
However, that simplicity goes out the window as soon as we throw instance-wise cost functions into the mix. As we will prove:
_Given a strategic linear classification problem Sᴛʀᴀᴄ⟨H, R, c⟩, there exists an instance-wise cost function **c(z; x) = ℓₓ(z – x)** such that SVC(H, R, c) = ∞._
Combined with the Fundamental Theorem of Strategic Learning, this means that linear classification in a strategic setting equipped with an instance-wise cost function is not generally PAC learnable. Interestingly, it fails to be PAC learnable even if we strip away as much complexity as we can. In this case, we will do so by focusing on strategic linear classification on the Cartesian plane (X ⊆ ℝ²) with the homogeneous preference class (R = { 1 }).
The more general conclusion will follow from the counterexample we will construct under these simplifying conditions. If strategic linear classification is not PAC learnable in ℝ², there is no way it could be PAC learnable in any higher dimension. Likewise, every other preference class we laid out in our setup is a strict generalization of the homogeneous preference class: if any of them were PAC learnable, the simpler case where R = { 1 } would be as well. Showing that the R = { 1 } case is not PAC learnable therefore rules out all of them.
From Labelings to Points on the Unit Circle: Proof Setup
Based on the assumptions above, we begin by turning our attention to the special case Sᴛʀᴀᴄ⟨Hₗ, { 1 }, c⟩, with Hₗ being the hypothesis class comprising all linear classifiers in ℝ². We then initialize n two-dimensional feature vectors at the origin: ∀ i ≤ n . xᵢ = (0, 0). Since we're using the homogeneous preference class, we have that ∀ i ≤ n . rᵢ = 1. The only difference between the data points will be in how our cost function behaves on each of them. This is where the crux of the proof lies, as we will soon see.
Before we discuss the cost function at length, though, we need to geometrize the possible labelings of our unlabeled data points. As we saw last time, a set of n unlabeled data points must have exactly 2ⁿ possible labelings. Representing a set of labelings (n-tuples) geometrically in ℝ² is relatively straightforward: we simply select an arbitrary point for each possible labeling. In particular, we will choose 2ⁿ such representative points on the unit circle, each assigned to a possible labeling. While the particular coordinates of the representative points themselves are unimportant, we do require that each such point be unique. We also require that no two points be origin-symmetric with respect to one another.
We will denote this set of representative points by S. Having selected our representative points, we use them to define the origin-symmetric set S', i.e., S' = { (-x, –y) : (x, y) ∈ S }. Note that S and S' are disjoint (S ∩ S' = ∅) as a consequence of how we selected the points in S.
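One concrete way to make such a selection (an illustrative choice, not the only one) is to place the 2ⁿ representative points at distinct angles strictly between 0 and π on the unit circle. Every point then lies strictly above the x-axis, so no two of them can be origin-symmetric, and S ∩ S' = ∅ follows immediately. A minimal Python sketch, with `representative_points` as a hypothetical helper name:

```python
import numpy as np
from itertools import product

def representative_points(n):
    """Assign each of the 2^n labelings a distinct point on the open upper
    half of the unit circle (angles strictly between 0 and pi)."""
    labelings = list(product([0, 1], repeat=n))            # all 2^n labelings
    angles = np.linspace(0.1, np.pi - 0.1, len(labelings))
    return {lab: np.array([np.cos(a), np.sin(a)])
            for lab, a in zip(labelings, angles)}

S = representative_points(3)                    # 8 points, one per labeling
S_prime = {lab: -p for lab, p in S.items()}     # the origin-symmetric set S'
assert all(p[1] > 0 for p in S.values())        # hence S and S' are disjoint
```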
For a particular _x_ᵢ, we define Sᵢ as the subset of S that includes only the points that represent labelings in which _x_ᵢ is positively classified. Similarly, we derive the origin-symmetric Sᵢ' ⊂ S' from Sᵢ. In the example below, the points above the x-axis are those representing labelings in which _x_ᵢ is positively classified, i.e., Sᵢ. Those below the x-axis comprise their origin-symmetric set Sᵢ' (with the numbering matching between symmetric pairs of points). Note that the selection of points above the x-axis is completely arbitrary.

We proceed to construct a convex polygon Gᵢ, with its vertices being the points in Sᵢ ∪ Sᵢ'. The Gᵢ for each unlabeled data point will be key in designing an instance-wise cost function c with which we will always be able to achieve all possible labelings, thus showing that SVC(Hₗ, { 1 }, c) = ∞. Towards this end, the convexity of Gᵢ will prove critical, as will its origin symmetry (stemming from our choice of Sᵢ' ).
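In sketch form (same illustrative conventions as above, with `polygon_vertices` a hypothetical helper), the vertex set of Gᵢ is just Sᵢ together with its reflection through the origin; because every point sits on the unit circle, each of them is indeed a vertex of the resulting convex polygon:

```python
import numpy as np
from itertools import product

def polygon_vertices(i, points):
    """Vertex set S_i ∪ S_i' of G_i: the representative points of every
    labeling that classifies x_i positively, plus their origin reflections."""
    S_i = [p for labeling, p in points.items() if labeling[i] == 1]
    return np.array(S_i + [-p for p in S_i])

# Toy usage with n = 2: four labelings mapped to the open upper half-circle,
# exactly as in the previous sketch.
labelings = list(product([0, 1], repeat=2))
angles = np.linspace(0.3, np.pi - 0.3, len(labelings))
points = {lab: np.array([np.cos(a), np.sin(a)]) for lab, a in zip(labelings, angles)}
G_0 = polygon_vertices(0, points)   # convex and origin-symmetric by construction
```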

Turning Polygons into Preferences: Constructing the Cost Function
For each of the n origin-initialized, unlabeled data points we started with, we now have a convex, origin-symmetric polygon that represents the labelings in which it is positively classified. Each Gᵢ can now be used to define the behavior of our instance-wise cost function c on its corresponding xᵢ. We will use Gᵢ to define a seminorm⁽³⁾:
∥ y ∥ɢᵢ = inf { ε ∈ ℝ⁺ : y ∈ εGᵢ }
This definition implies that the distance between xᵢ and some point z is less than 1 if and only if z lies within Gᵢ. I.e.:
∥ z – xᵢ ∥ɢᵢ < 1 ⇔ z ∈ Gᵢ
For the rest of the proof, it is sufficient that we understand this connection between ∥ ⋅ ∥ɢᵢ and a point being inside Gᵢ. (See Footnote (3) for a discussion of why ∥ ⋅ ∥ɢᵢ qualifies as a seminorm and for more details about its geometric interpretation.)
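For readers who want to experiment, this quantity can be computed numerically with a small linear program: writing y = Σⱼ μⱼvⱼ over the vertices vⱼ of Gᵢ with μ ≥ 0, the smallest attainable Σⱼ μⱼ is exactly inf { ε : y ∈ εGᵢ }. The sketch below is an illustrative implementation under that formulation (the `gauge` name and the use of SciPy are illustrative choices, not part of the proof):

```python
import numpy as np
from scipy.optimize import linprog

def gauge(y, vertices):
    """Minkowski gauge of y induced by the convex, origin-symmetric polygon
    whose vertices are the rows of `vertices`.

    Solves:  minimize sum(mu)  subject to  vertices.T @ mu = y,  mu >= 0.
    Writing mu_j = eps * lambda_j shows the optimum equals
    inf { eps > 0 : y in eps * G }.
    """
    V = np.asarray(vertices, dtype=float)        # shape (k, 2)
    res = linprog(c=np.ones(V.shape[0]),
                  A_eq=V.T, b_eq=np.asarray(y, dtype=float),
                  bounds=[(0, None)] * V.shape[0])
    return res.fun if res.success else np.inf    # infinite outside the span of G

# Points strictly inside G have gauge < 1; points outside have gauge > 1.
square = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
print(gauge(np.array([0.5, 0.0]), square))       # 0.5
print(gauge(np.array([2.0, 0.0]), square))       # 2.0
```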
We thus define the instance-wise cost function c:
c(z; xᵢ) = ℓᵢ(z – xᵢ)
Where:
ℓᵢ(z – xᵢ) = ∥ z – xᵢ ∥ɢᵢ
That is, for each unlabeled data point xᵢ, c behaves as ∥ ⋅ ∥ɢᵢ would. Note that this behavior is different for each data point. This is because we constructed a unique Gᵢ for every xᵢ, and each ∥ ⋅ ∥ɢᵢ is derived from its corresponding polygon Gᵢ.
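In the same sketch-level spirit (all names here are illustrative, with `gauge` and `polygon_vertices` as in the earlier snippets), the instance-wise cost function simply dispatches each data point to the seminorm induced by its own polygon:

```python
def make_cost(polygons, xs, norm):
    """Instance-wise cost c(z; x_i) = ||z - x_i||_{G_i}: each data point is
    charged according to the seminorm induced by its own polygon G_i."""
    return lambda z, i: norm(z - xs[i], polygons[i])

# Sketch-level usage, assuming the helpers from the previous snippets and
# n origin-initialized feature vectors xs[i] = (0, 0):
# c = make_cost([polygon_vertices(i, points) for i in range(n)], xs, gauge)
```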
Data Point Best Response as a Result of the Cost Function
With the instance-wise cost function c in place, we may turn our attention to how our data points interact with linear classifiers. Recall that we have constrained our consideration to the homogeneous preference class, meaning that r = 1 for all of our points. I.e., xᵢ stands to gain a reward of magnitude 1 for being positively classified. Given a linear classifier, each data point will thus be willing to incur any cost less than 1 to manipulate its feature vector to ensure it falls on the positive side of the decision boundary. This will guarantee it receives positive utility as a result of the manipulation.
c is designed so that a data point with feature vector xᵢ has to pay ∥ z – xᵢ ∥ɢᵢ to change its feature vector to z. As we saw, as long as z lies inside Gᵢ, this cost will be less than 1.
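The resulting manipulation decision can also be sketched as a small linear program, reusing the substitution z = Σⱼ μⱼvⱼ from the gauge snippet above. This is an illustrative construction that assumes the classifier outputs +1 exactly when w ⋅ z + b ≥ 0: minimize the gauge over the positive half-plane, and manipulate only if that minimum falls below the reward of 1.

```python
import numpy as np
from scipy.optimize import linprog

def manipulation_cost(vertices, w, b):
    """Cheapest gauge-cost for a point sitting at the origin to reach the
    positive side {z : w . z + b >= 0} of a linear classifier.

    With z = vertices.T @ mu and mu >= 0, the minimal attainable sum(mu)
    equals the gauge of z; one inequality keeps z on the positive side.
    """
    V = np.asarray(vertices, dtype=float)      # shape (k, 2)
    w = np.asarray(w, dtype=float)
    res = linprog(c=np.ones(V.shape[0]),
                  A_ub=[-(V @ w)], b_ub=[b],   # encodes w . (V.T @ mu) + b >= 0
                  bounds=[(0, None)] * V.shape[0])
    return res.fun if res.success else np.inf

def should_manipulate(vertices, w, b, reward=1.0):
    """Under the homogeneous preference class (r = 1), the point manipulates
    exactly when crossing the boundary costs strictly less than its reward."""
    return manipulation_cost(vertices, w, b) < reward
```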
Suppose we have a decision boundary that crosses Gᵢ (intersects it at two points) with xᵢ falling on its negative half-plane. As illustrated in Figure 3 below, this creates a sub-polygon such that for any z within it:
- The cost to move to z is less than 1: c(z; xᵢ) = ∥ z – xᵢ ∥ɢᵢ < 1
- The realized reward for moving is precisely 1: