GrokNet: Unified Computer Vision Model Trunk and Embeddings for Commerce
Introduction
- Multi-task learning to train a single computer vision trunk.
- Product listings are often missing information such as color, material, brand or year of production.
- The earlier MSURU system used search log interaction data to train image classifiers on large-scale weakly-supervised data.
GrokNet
- Trained on human annotations, user-generated tags and noisy search engine interaction data.
- Predicts the following: object category (e.g. bar stool), home attributes (object color, material, decor style), fashion attributes (style, color, material, sleeve length), vehicle attributes (make, model, external color, decade), search queries (text phrases users are likely to type to find the product) and an image embedding (256-bit hash).
- Serves various product needs such as Feed and Catalog.
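A minimal sketch of how a 256-bit hash embedding could be used for lookup. It assumes the hashes are compared by Hamming distance, which these notes do not state; the function names and the toy index are hypothetical.

```python
from typing import Dict, Tuple


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length binary hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def nearest(query: bytes, index: Dict[str, bytes]) -> Tuple[str, int]:
    """Return (item_id, distance) for the closest hash in a small in-memory index."""
    return min(
        ((item_id, hamming_distance(query, h)) for item_id, h in index.items()),
        key=lambda pair: pair[1],
    )


# Toy example: a 256-bit hash is 32 bytes.
query_hash = bytes(32)
toy_index = {
    "stool": bytes([0b00000001]) + bytes(31),  # differs from the query in 1 bit
    "lamp": bytes([0xFF]) * 32,                # differs from the query in all 256 bits
}
closest_id, closest_dist = nearest(query_hash, toy_index)
```

A linear scan like this only suits tiny indexes; it is here to show the distance computation, not a serving design.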
Architecture
Training Data
- Object Categories: 566 labels such as chair, bracelet and bicycle.
- Attributes (Fashion, Home and Vehicles)
- Product Identities - Catalog.
- Weakly Supervised Data Augmentation - Run object detection and keep the top match only if its distance is below a threshold.
- Marketplace Search Queries - (query, image) pairs from searches that initiated a message. The 45k most common queries are used.
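The last two data sources can be sketched roughly as follows. This is an illustration under assumptions, not the paper's pipeline: it assumes each detected object has already been turned into an embedding, uses Euclidean distance, and treats each image's in-vocabulary queries as a multi-label target. All names (`top_match_label`, `build_query_vocab`, `image_targets`) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence, Set, Tuple


def top_match_label(
    crop_emb: Sequence[float],
    catalog: Dict[str, Sequence[float]],
    max_distance: float,
) -> Optional[str]:
    """Return the closest catalog id, or None if even the top match is too far away."""
    best_id, best_dist = None, math.inf
    for product_id, emb in catalog.items():
        d = math.dist(crop_emb, emb)  # Euclidean distance (an assumption)
        if d < best_dist:
            best_id, best_dist = product_id, d
    return best_id if best_dist < max_distance else None


def build_query_vocab(pairs: Sequence[Tuple[str, str]], vocab_size: int) -> List[str]:
    """Keep only the `vocab_size` most frequent queries (45k in the notes above)."""
    counts = Counter(query for query, _ in pairs)
    return [query for query, _ in counts.most_common(vocab_size)]


def image_targets(
    pairs: Sequence[Tuple[str, str]], vocab: Sequence[str]
) -> Dict[str, Set[int]]:
    """Map each image id to the set of vocabulary indices of its queries."""
    index = {query: i for i, query in enumerate(vocab)}
    targets: Dict[str, Set[int]] = {}
    for query, image_id in pairs:
        if query in index:  # queries outside the vocabulary are dropped
            targets.setdefault(image_id, set()).add(index[query])
    return targets


# Toy (query, image) pairs; real data would come from search logs.
pairs = [
    ("bar stool", "img1"), ("bar stool", "img2"), ("bar stool", "img4"),
    ("stool", "img1"), ("stool", "img4"),
    ("rare chair", "img3"),
]
vocab = build_query_vocab(pairs, vocab_size=2)
targets = image_targets(pairs, vocab)
```

With the toy data, the rare query falls outside the vocabulary, so `img3` gets no query label at all.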
Trunk Architecture
- ResNeXt-101 32x4d - 101 layers, 32 groups and group width 4.
- GeM Pooling - Generalized Mean Pooling (instead of average pooling).
- Loss Functions - Softmax and Multi-label Softmax for classification, plus 3 metric learning losses for the embedding (ArcFace, Multi-label ArcFace and Pairwise Embedding Loss).
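To make two of the pieces above concrete, here is a small pure-Python sketch of GeM pooling over one channel's activations and of the additive angular margin that ArcFace applies to the target-class logit. The exponent `p`, `scale` and `margin` values are illustrative defaults, not values taken from GrokNet, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def gem_pool(activations: Sequence[float], p: float = 3.0, eps: float = 1e-6) -> float:
    """Generalized mean: (mean(x^p))^(1/p). p=1 is average pooling; large p approaches max pooling."""
    clamped = [max(a, eps) for a in activations]  # keep values positive before the power
    return (sum(a ** p for a in clamped) / len(clamped)) ** (1.0 / p)


def _l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(
    embedding: Sequence[float],
    class_weights: Sequence[Sequence[float]],
    target: int,
    scale: float = 64.0,
    margin: float = 0.5,
) -> List[float]:
    """Cosine logits where the target class angle is increased by `margin` radians."""
    e = _l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, _l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if j == target:
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    return logits


pooled_avg = gem_pool([1.0, 2.0, 3.0], p=1.0)  # behaves like average pooling
pooled_gem = gem_pool([1.0, 2.0, 3.0], p=3.0)  # pulled toward the largest activation
logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target=0)
```

The resulting logits would then go through an ordinary softmax cross-entropy. The margin makes the target class harder to satisfy, which pushes embeddings of the same class closer together.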
Kaushik Rangadurai