Introducing TernaryMixOE
New to OOD? Read my post “A Beginner’s Guide to Out-of-Distribution Detection”
This project was the crown jewel of my time at the Air Force, and the most involved single project I've taken on to date. It's also one of the things I'm most proud of.
The final paper includes four major achievements:
A restatement of the problem domain for fine-grained out-of-distribution (OOD) detection, including a novel description of two independent axes of granularity
A proposal for two new metrics to evaluate fine-grained OOD performance, Hierarchical AUROC/FPR and Semantic vs. True OOD AUROC/FPR
The introduction of a new training methodology for fine-grained OOD, called TernaryMixOE
The evaluation of our new training methodology against existing models, and the establishment of baselines for our new metrics
Each of these achievements is described in more detail below.
Restating the Problem
First, we should acknowledge that there is so far little theoretical basis for measuring the granularity of OOD detection tasks across datasets. Granularity can be measured empirically by training the same architecture with the same objective on two datasets and comparing their performance, but this method is imprecise and computationally expensive. Our paper does not solve this problem, but it does establish a framework for comparing datasets qualitatively.
Fig. 1 – Our new understanding of fine-grained OOD detection. The horizontal axis shows the difficulty of the OOD detection task. The vertical axis shows the difficulty of separating classes within the in-distribution set, also known as the core task.
For example, separating CIFAR10 from SVHN is comparatively easier than separating CIFAR10 from CIFAR100, as many of CIFAR100's classes are effectively finer-grained subsets of the CIFAR10 classes.
Similarly, separating the classes of FGVC-A is harder than separating the classes of ShipsRSImageNet: the former distinguishes specific models of aircraft (Airbus A330-200 vs. Airbus A330-300), while the latter distinguishes types of ships by function (frigate vs. carrier).
To achieve this, we break the concept down into two parts: in-distribution (ID) granularity and out-of-distribution granularity, shown on the vertical and horizontal axes of Fig. 1, respectively. These correspond to the two tasks inherent in any OOD exercise: the core classification task and the detection of OOD samples. The difficulties of these two tasks scale independently of one another. A task may be relatively easy (coarse-grained) on the in-distribution side but difficult (fine-grained) on the OOD side – e.g. separating dogs from cats as the ID task with OOD samples of wolves thrown into the mix. Conversely, the in-distribution task can be fine-grained while the OOD task is coarse – separating breeds of dog with pictures of airplanes thrown in as OOD. Likewise, both tasks can be easy, or both can be difficult.
Our New Metrics: Hierarchical AUROC and Semantic vs. True OOD AUROC
Hierarchical AUROC
To evaluate models along the OOD granularity axis described above, we use hierarchical datasets and isolate OOD test sets using that hierarchy. For example, using FGVC-A we construct three test sets corresponding to three levels of OOD granularity. The first, Level 1, holds out samples at the broadest level: manufacturer. Level 2 holds out samples at the level of family (e.g. all samples of the Airbus A340: A340-200, A340-300, etc.). Level 3 is the most granular, holding out specific variants within a family.
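As a rough sketch of how such a split might be constructed (the taxonomy entries and holdout choices below are hypothetical placeholders, not the paper's actual split):

```python
# Hypothetical sketch: carving hierarchical OOD holdout levels out of a
# manufacturer -> family -> variant taxonomy like FGVC-Aircraft's.
taxonomy = {
    "Boeing":  {"737": ["737-700", "737-800"], "747": ["747-400"]},
    "Airbus":  {"A330": ["A330-200", "A330-300"], "A340": ["A340-200", "A340-300"]},
    "Embraer": {"ERJ": ["ERJ-145"]},
}

holdout_manufacturers = {"Embraer"}  # Level 1: an entire manufacturer held out
holdout_families = {"A340"}          # Level 2: an entire family held out
holdout_variants = {"737-800"}       # Level 3: a single variant held out

def assign_split(manufacturer, family, variant):
    """Route a class to the ID training set or one of three OOD test levels."""
    if manufacturer in holdout_manufacturers:
        return "ood_level_1"
    if family in holdout_families:
        return "ood_level_2"
    if variant in holdout_variants:
        return "ood_level_3"
    return "in_distribution"

splits = {
    variant: assign_split(maker, family, variant)
    for maker, families in taxonomy.items()
    for family, variants in families.items()
    for variant in variants
}
```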
By dividing our OOD test set this way, we can evaluate each level independently and create a suite of metrics. This is the basis of Hierarchical AUROC: evaluating a model's relative performance on comparatively more and less difficult OOD tasks.
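Given per-sample OOD scores from whatever detector is under test, the suite reduces to one AUROC per holdout level. A minimal sketch with scikit-learn, assuming the convention that higher scores mean "more OOD":

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hierarchical_auroc(scores_id, scores_by_level):
    """AUROC of each OOD holdout level against the in-distribution test set.

    scores_id: OOD scores for in-distribution test samples.
    scores_by_level: dict mapping level name -> OOD scores for that holdout.
    """
    results = {}
    for level, scores_ood in scores_by_level.items():
        # ID samples are the negative class, OOD samples the positive class.
        labels = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
        scores = np.concatenate([scores_id, scores_ood])
        results[level] = roc_auc_score(labels, scores)
    return results
```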
Semantic vs. True AUROC (STAUROC)
To introduce our second metric, we propose a hypothetical OOD object detection model that makes two distinct kinds of errors. The first is what we identify as "semantically meaningful" OOD – in this case, an unknown airplane. These are OOD samples that are relevant to the application of the model and could be passed to a subject-matter expert for identification. In contrast, a piece of jetway misidentified as an airplane is what we call "true OOD". While an ideal object detection model would not make these kinds of mistakes, it is beneficial to have the structural redundancy to ensure the efforts of expensive human experts are not wasted.
Fig. 2 – Our hypothetical object detector could produce two kinds of OOD error. Semantic OOD is the identification of relevant out-of-distribution examples like unknown airplanes; True OOD are irrelevant OOD samples. The line between the two cases is blurry and subjective, but relevant to the application of models to real-world tasks.
To evaluate a model at this task, we ignore the in-distribution task and instead ask the model to separate an approximation of the semantically meaningful set (OOD airplanes) from another, unrelated dataset (Stanford Dogs). A model that performs well on this metric strongly separates the two OOD sets.
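Concretely, this can be computed as a single AUROC between the two sets of OOD scores. A minimal sketch, again assuming higher scores mean "more OOD"; treating the true-OOD set as the positive class is our convention here, not something fixed by the metric itself:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def st_auroc(scores_semantic_ood, scores_true_ood):
    """Semantic vs. True OOD AUROC: how well a detector's scores separate
    relevant (semantic) OOD samples from irrelevant (true) OOD samples.

    A high value means true-OOD samples (e.g. Stanford Dogs) consistently
    receive higher OOD scores than semantic-OOD samples (e.g. held-out
    airplanes), which a downstream expert-triage pipeline could exploit.
    """
    labels = np.concatenate([
        np.zeros(len(scores_semantic_ood)),  # semantic OOD: negative class
        np.ones(len(scores_true_ood)),       # true OOD: positive class
    ])
    scores = np.concatenate([scores_semantic_ood, scores_true_ood])
    return roc_auc_score(labels, scores)
```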
TernaryMixOE
Existing methods MixOE-Linear and MixOE-Cut, outlined in Zhang et al., use mixing operations to randomly blend images from the in-distribution set with outlier images. This creates a virtual outlier set that lies in feature space between the two, improving fine-grained performance. The theoretical justification is that fine-grained OOD samples exhibit aspects of both in- and out-of-distribution samples. We extend this method with an additional virtual set, created by randomly blending in-distribution images with one another.
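A minimal PyTorch sketch of the mixing step, assuming a mixup-style linear blend and MixOE's general convention of interpolating targets toward the uniform distribution; the function name, the Beta(α, α) mixing weight, and the batch-shuffle pairing are illustrative choices, not our exact implementation:

```python
import torch
import torch.nn.functional as F

def ternary_mix(x_id, y_id, x_oe, num_classes, alpha=1.0):
    """Build two virtual batches: ID/outlier mixes (as in MixOE-Linear)
    and ID/ID mixes (the additional virtual set described above).

    x_id: batch of in-distribution images; y_id: their integer labels.
    x_oe: batch of outlier-exposure images (no labels needed).
    Returns (images, soft-label targets) for each virtual set.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    one_hot = F.one_hot(y_id, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)

    # ID <-> outlier mix: the target interpolates toward uniform,
    # reflecting that the blended image is partially out-of-distribution.
    x_mix_oe = lam * x_id + (1 - lam) * x_oe
    y_mix_oe = lam * one_hot + (1 - lam) * uniform

    # ID <-> ID mix: blend each image with a shuffled copy of the batch,
    # interpolating between the two ground-truth labels.
    perm = torch.randperm(x_id.size(0))
    x_mix_id = lam * x_id + (1 - lam) * x_id[perm]
    y_mix_id = lam * one_hot + (1 - lam) * one_hot[perm]

    return (x_mix_oe, y_mix_oe), (x_mix_id, y_mix_id)
```

Both virtual batches can then be scored with a soft-label cross-entropy term alongside the standard loss on the unmixed in-distribution batch.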