Both papers tackle the problem of image retrieval and explore different ways to learn deep visual representations for this task. In both cases, a CNN extracts a feature map that a global-aggregation layer* pools into a compact, fixed-length representation. This representation is then projected with a fully-connected (FC) layer and L2-normalized, so that images can be compared efficiently with a dot product. All components of this network, including the aggregation layer, are differentiable, which makes it end-to-end trainable for the target task. In , a Siamese architecture combining three streams with a triplet loss was proposed to train this network. In , this work was extended by replacing the triplet loss with a new loss that directly optimizes Average Precision.
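To make the shared pipeline concrete, the following is a minimal NumPy sketch of the inference path described above (feature-map aggregation, FC projection, L2 normalization, dot-product comparison) together with a triplet loss on the resulting descriptors. All function and variable names are illustrative, not taken from either paper; mean pooling stands in for whatever global-aggregation layer is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_aggregate(feature_map):
    """Pool a C x H x W CNN feature map into a C-dim vector (here: mean pooling,
    a stand-in for the papers' global-aggregation layer)."""
    return feature_map.mean(axis=(1, 2))

def embed(feature_map, W, b):
    """Project the pooled vector with an FC layer, then L2-normalize so that
    dot products between descriptors equal cosine similarities."""
    v = W @ global_aggregate(feature_map) + b
    return v / np.linalg.norm(v)

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss on unit-norm descriptors: push the anchor closer
    to the positive than to the negative by at least `margin`."""
    return max(0.0, margin + anchor @ negative - anchor @ positive)

# Illustrative dimensions: channels, spatial size, output descriptor size.
C, H, Wd, D = 512, 7, 7, 128
W = rng.normal(size=(D, C)) / np.sqrt(C)
b = np.zeros(D)

# Stand-in feature maps; in the real system these come from the CNN backbone.
query    = embed(rng.normal(size=(C, H, Wd)), W, b)
database = embed(rng.normal(size=(C, H, Wd)), W, b)

similarity = float(query @ database)  # efficient comparison via dot product
loss = triplet_loss(query, database, embed(rng.normal(size=(C, H, Wd)), W, b))
```

Because every step (pooling, FC projection, normalization, loss) is differentiable, the same computation graph can be trained end to end in an autodiff framework.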