GrokNet: Unified Computer Vision Model Trunk and Embeddings for Commerce

Aug 27, 2020

GrokNet uses multi-task learning to train a single computer vision trunk.
Motivation: listings are often missing information like color, material, brand or year of production.
In MSURU, search log interaction data is used to train image classifiers with large-scale weakly supervised data.

GrokNet

Trained on human annotations, user-generated tags and noisy search engine interaction data.
Predicts the following: object category (e.g. bar stool), home attributes (object color, material, decor style), fashion attributes (style, color, material, sleeve length), vehicle attributes (make, model, external color, decade), search queries (text phrases users are likely to type to find the product) and an image embedding (256-bit hash).
Serves various applications such as Feed and Catalog.
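
One plausible use of a 256-bit hash is nearest-neighbor lookup by Hamming distance. The sketch below is only an illustration under stated assumptions, not GrokNet's retrieval code: it assumes each hash is stored as a Python integer, and the names `hamming_distance` and `nearest`, along with the choice of Hamming distance itself, are hypothetical. These notes do not say how the hash is computed or searched.

```python
HASH_BITS = 256  # embedding size stated in the notes


def hamming_distance(a, b):
    """Number of bit positions in which two integer-encoded hashes differ."""
    return bin(a ^ b).count("1")


def nearest(query, index):
    """Return (key, distance) of the hash in `index` closest to `query`.

    `index` maps an item key to its integer-encoded hash (assumed layout).
    """
    best_key = min(index, key=lambda k: hamming_distance(query, index[k]))
    return best_key, hamming_distance(query, index[best_key])


if __name__ == "__main__":
    full = (1 << HASH_BITS) - 1           # hash with all 256 bits set
    print(hamming_distance(full, full ^ 0b1011))
    print(nearest(0b1110, {"stool": 0b1111, "lamp": 0b0000}))
```

A linear scan like this is only practical for small indexes; a production system would likely use a dedicated approximate nearest-neighbor index.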

Training Data

Object Categories: 566 labels such as chair, bracelet and bicycle.
Attributes (Fashion, Home and Vehicles)
Product Identities - Catalog.
Weakly Supervised Data Augmentation - Object detection, top match if distance is below threshold.
Marketplace Search Queries - (Query, Image) pairs which have initiated a message. 45k most common queries for an image.
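
The weakly supervised augmentation step ("top match if distance is below threshold") can be sketched as follows. This is a minimal illustration, assuming the detected object crop has already been embedded and that the catalog is a small in-memory dictionary; `pseudo_label`, the Euclidean metric and the threshold value are all hypothetical choices for this example, not details given in the notes.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pseudo_label(query, catalog, threshold):
    """Return the product ID of the top catalog match, or None.

    `catalog` maps product ID -> embedding (assumed layout). The label is
    accepted only when the best distance is below `threshold`.
    """
    best_id, best_dist = None, math.inf
    for product_id, embedding in catalog.items():
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_id, best_dist = product_id, dist
    return best_id if best_dist < threshold else None


if __name__ == "__main__":
    catalog = {"sofa": [0.0, 0.0], "lamp": [1.0, 0.0]}
    print(pseudo_label([0.1, 0.0], catalog, threshold=0.5))  # close to "sofa"
    print(pseudo_label([5.0, 5.0], catalog, threshold=0.5))  # nothing close enough
```

The threshold trades label coverage against label noise: a looser threshold yields more training pairs but more wrong product identities.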

Trunk Architecture

ResNeXt-101 32x4d - 101 layers, 32 groups and group width 4
GeM Pooling - Generalized Mean Pooling (instead of average pooling).
Loss Functions - Softmax and Multi-label Softmax, 3 metric learning losses for embedding (ArcFace, Multi-label ArcFace and Pairwise Embedding Loss).
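
To make two of the components above concrete, here is a small pure-Python sketch of GeM pooling over one feature channel and of the ArcFace margin applied to cosine logits. It is a simplified illustration, not GrokNet's implementation: the function names are hypothetical, `p=3`, `margin=0.5` and `scale=64` are illustrative defaults rather than values from these notes, and the ArcFace part omits the handling that full implementations add for angles pushed past pi.

```python
import math


def gem_pool(values, p=3.0, eps=1e-6):
    """Generalized mean of one channel's activations.

    p = 1 gives average pooling; larger p moves the result toward max pooling.
    Activations are clamped to `eps` so the power is well defined.
    """
    clamped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


def arcface_logits(cosines, target, margin=0.5, scale=64.0):
    """Scaled logits with an additive angular margin on the target class.

    `cosines[i]` is the cosine between the normalized embedding and the
    normalized weight vector of class i.
    """
    logits = []
    for i, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))                # keep acos in its domain
        if i == target:
            c = math.cos(math.acos(c) + margin)   # penalize the true class
        logits.append(scale * c)
    return logits


if __name__ == "__main__":
    print(gem_pool([1.0, 2.0, 3.0], p=1.0))   # equals the plain average
    print(gem_pool([1.0, 2.0, 3.0], p=3.0))   # between average and max
    print(arcface_logits([0.8, 0.1], target=0))
```

The margin lowers the target-class logit, so the embedding must sit closer in angle to its class center to reach the same softmax probability, which tends to produce tighter clusters for retrieval.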
