Augmented object intelligence with XR-Objects


XR-Objects is implemented in four steps: (1) detecting objects, (2) localizing and anchoring onto them, (3) coupling each object with a multimodal large language model (MLLM) for metadata retrieval, and (4) executing actions and displaying output in response to user input. We bring these steps together in Unity, using its AR Foundation framework, to build a system that augments real-world objects with functional context menus.
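The four steps above can be sketched as a per-frame loop. This is an illustrative Python sketch, not the actual XR-Objects code (which runs in Unity); the class and function names are invented for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class DetectedObject:
    label: str                              # class label, e.g. "bottle"
    bbox: tuple                             # 2D bounding box (x, y, w, h) in pixels
    anchor: tuple = None                    # 3D world position once localized
    metadata: dict = field(default_factory=dict)

class XRObjectsPipeline:
    """Illustrative per-frame loop mirroring the four steps described above."""
    def __init__(self, detector, localizer, mllm):
        self.detector = detector            # step 1: object detection
        self.localizer = localizer          # step 2: 2D box + depth -> 3D anchor
        self.mllm = mllm                    # step 3: MLLM metadata retrieval

    def process_frame(self, frame, depth_map):
        objects = self.detector(frame)                        # (1) detect
        for obj in objects:
            obj.anchor = self.localizer(obj.bbox, depth_map)  # (2) anchor
            obj.metadata = self.mllm(frame, obj)              # (3) couple with MLLM
        return objects                                        # (4) ready for user actions
```

In the real system the detector, localizer, and MLLM session would be backed by MediaPipe, AR Foundation raycasting, and a model endpoint respectively; here they are plain callables so the control flow stands alone.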

Object detection: XR-Objects uses an object detection module powered by MediaPipe, which leverages a mobile-optimized convolutional neural network for real-time classification. The system detects objects, assigns them class labels (e.g., “bottle,” “monitor”), and generates 2D bounding boxes that serve as spatial anchors for AR content. It recognizes the 80 object categories of the COCO dataset. To prioritize privacy and data efficiency, only relevant object regions are processed, excluding, for example, people detected in a scene.
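The privacy filtering described above can be illustrated with a small sketch: detections are dropped before any further processing if they belong to an excluded class or fall below a confidence threshold. Both the record format and the threshold value are assumptions for illustration, not the system's actual parameters.

```python
# Hypothetical detection record: (class_label, confidence, bbox)
EXCLUDED_CLASSES = {"person"}   # privacy: never crop or process people
MIN_CONFIDENCE = 0.5            # illustrative threshold, not the system's actual value

def filter_detections(detections):
    """Keep only confident, non-excluded detections whose regions may be processed."""
    return [
        det for det in detections
        if det[0] not in EXCLUDED_CLASSES and det[1] >= MIN_CONFIDENCE
    ]
```

Only the regions that survive this filter would be cropped and passed downstream, so image data of bystanders never leaves the detection stage.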

Localization and anchoring: Once an object is detected, XR-Objects anchors AR menus by combining its 2D bounding box with depth data, converting the box into precise 3D coordinates via raycasting. A semi-transparent “bubble” signals that an object is interactable, and the full menu appears only when the bubble is tapped, reducing visual clutter. Safeguards prevent a detected object from being anchored more than once.
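A minimal sketch of the 2D-to-3D step, assuming a standard pinhole camera model rather than AR Foundation's raycasting API: the bounding-box center is unprojected using the depth sampled at that pixel, and a distance check guards against anchoring the same object twice. The intrinsics and the duplicate-distance threshold are illustrative values.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in camera space
    using the pinhole model: x = (u - cx) * depth / fx, and similarly for y."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def anchor_from_bbox(bbox, depth_at, intrinsics):
    """Place the anchor at the bounding-box center, sampling the depth map there."""
    x, y, w, h = bbox
    u, v = x + w / 2, y + h / 2
    return unproject(u, v, depth_at(u, v), *intrinsics)

def is_duplicate(anchor, existing, min_dist=0.15):
    """Reject anchors closer than min_dist meters to an existing one (illustrative)."""
    return any(
        sum((a - b) ** 2 for a, b in zip(anchor, e)) ** 0.5 < min_dist
        for e in existing
    )
```

In the Unity implementation the same effect is achieved by raycasting from the bounding box into the AR depth representation; the pinhole math above is the underlying geometry.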

MLLM coupling: Each detected object is paired with its own MLLM session, which analyzes a cropped image of the object to provide detailed information, such as product specs or reviews. For instance, it can identify a “bottle” as “Superior dark soy sauce” and retrieve metadata such as prices or ratings using PaLI.
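The coupling step amounts to cropping the object's region and composing a query for the model. The sketch below shows that flow with a plain nested-list "image" and an invented prompt format; the actual PaLI interface and prompt wording are not shown here.

```python
def crop_region(frame, bbox):
    """Crop the detected object's region from a frame (a 2D list of pixels here)."""
    x, y, w, h = bbox
    return [row[x:x + w] for row in frame[y:y + h]]

def build_query(class_label, user_question=None):
    """Compose an MLLM prompt from the detector's class label (wording is illustrative)."""
    base = f"Identify this {class_label} and list its name, price, and rating."
    return f"{base} Also answer: {user_question}" if user_question else base
```

The cropped region and the query would then be sent to the MLLM session together, so the model answers about the specific object (e.g., refining “bottle” to “Superior dark soy sauce”) rather than the whole scene.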
