The image above is a screenshot captured directly from Meta’s new Segment Anything demo application. I didn’t draw that blue outline on the white horse. The AI model did after I clicked on it. See the original image immediately below.
The AI model can identify all of the objects in the image, and you can save them separately (see the example below). The image is from the example gallery provided by Meta, but you can also upload your own image when you try it out.
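For readers who want to try the same click-to-mask interaction outside the demo, Meta's released segment_anything Python package exposes a point-prompt interface. The sketch below is a minimal, hedged example: the image filename, checkpoint filename, and click coordinates are placeholders, not values from the demo.

```python
# Minimal sketch: click-to-mask with Meta's released segment_anything package.
# Assumes the package is installed (pip install segment-anything) and a ViT-H
# checkpoint has been downloaded; the filenames below are assumptions.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("horses.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical image file

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)  # one-time image embedding; prompting it afterwards is cheap

# A single foreground click (label 1) roughly where the object of interest sits.
point = np.array([[850, 420]])  # placeholder coordinates
label = np.array([1])

masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # SAM proposes several masks when a click is ambiguous
)
best_mask = masks[np.argmax(scores)]  # boolean HxW array outlining the clicked object
```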
An experienced designer could do a more precise job with the object extraction, but this happened with two clicks. Meta acknowledged imperfections in some object segmentation in the research paper it released with the model announcement.
While SAM performs well in general, it is not perfect. It can miss fine structures, hallucinate small disconnected components, and does not produce boundaries as crisply as more computationally intensive methods that “zoom-in”, e.g. [18]. In general, we expect dedicated interactive segmentation methods to outperform SAM when a large number of points is provided, e.g. [67]. Unlike these methods, SAM is designed for generality and breadth of use rather than high IoU interactive segmentation.
Notice that the paper mentions hallucination as a shortcoming. The propensity of large AI models to “invent” artifacts is not limited to language.
In addition, Meta acknowledges other techniques are already available that may outperform Segment Anything. The company says the difference is that other models make trade-off decisions to optimize for certain types of segmentation tasks, while Segment Anything sacrifices precision on some tasks to be “a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks via prompt engineering…An important distinction in our work is that a model trained for promptable segmentation can perform a new, different task at inference time by acting as a component in a larger system.”
Meta explains that the versatility of its model rests on its architecture, training approach, and large dataset. To further the field and provide a general-purpose model, Meta is releasing the model under the permissive Apache 2.0 license and making the dataset available to researchers.
A Segmentation Foundation Model
You have heard about foundation models such as GPT-3 and Stable Diffusion for text and image generation, respectively. Meta says Segment Anything could become the first foundation model for image segmentation.
Segmentation — identifying which image pixels belong to an object — is a core task in computer vision and is used in a broad array of applications, from analyzing scientific imagery to editing photos. But creating an accurate segmentation model for specific tasks typically requires highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data…
Reducing the need for task-specific modeling expertise, training compute, and custom data annotation for image segmentation is at the core of the Segment Anything project. To realize this vision, our goal was to build a foundation model for image segmentation: a promptable model that is trained on diverse data and that can adapt to specific tasks, analogous to how prompting is used in natural language processing models. However, the segmentation data needed to train such a model is not readily available online or elsewhere, unlike images, videos, and text, which are abundant on the internet. Thus, with Segment Anything, we set out to simultaneously develop a general, promptable segmentation model and use it to create a segmentation dataset of unprecedented scale.
SAM has learned a general notion of what objects are, and it can generate masks for any object in any image or any video, even including objects and image types that it had not encountered during training. SAM is general enough to cover a broad set of use cases and can be used out of the box on new image “domains” — whether underwater photos or cell microscopy — without requiring additional training (a capability often referred to as zero-shot transfer). In the future, SAM could be used to help power applications in numerous domains that require finding and segmenting any object in any image.
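To make “promptable” a little more concrete: the same predictor from the earlier sketch can take a bounding box instead of a click, with no retraining. The coordinates below are placeholders.

```python
# Continuing the earlier sketch: prompt the same predictor with a box instead of a point.
box = np.array([700, 300, 1000, 600])  # x0, y0, x1, y1 — placeholder coordinates

masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=False,  # a box is usually unambiguous, so ask for a single mask
)
```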
The image below will give you a sense of what the AI model sees when it processes a picture.
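That “everything at once” view can be reproduced programmatically with the package's automatic mask generator, which covers the image with a grid of point prompts and keeps the resulting masks. A brief sketch, reusing the sam model loaded earlier:

```python
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(sam)  # samples a grid of point prompts internally
all_masks = mask_generator.generate(image)       # list of dicts, one per detected object

# Each entry carries the mask plus some bookkeeping, e.g.:
#   all_masks[0]["segmentation"]  -> boolean HxW array
#   all_masks[0]["area"]          -> mask area in pixels
#   all_masks[0]["predicted_iou"] -> the model's own quality estimate
print(f"SAM found {len(all_masks)} masks in the image")
```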
New Models, New Applications
You may wonder whether this is how the magic-eraser features that remove unsightly objects or people from personal photographs work. A Meta spokesperson told me that this model could be used as an input for an eraser application, but the demo focuses on the extraction feature.
With that said, the ability to extract an object from one image and then use it on its own or add it to a new picture is useful. You may be even more impressed after you see it working in video.
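As a rough illustration of that extract-and-reuse workflow, the following sketch turns the best_mask from the earlier example into a transparent cutout and pastes it onto another picture. The filenames and paste position are assumptions.

```python
from PIL import Image
import numpy as np

# Turn the SAM mask into an RGBA cutout of the object.
cutout = Image.fromarray(image).convert("RGBA")
alpha = Image.fromarray(best_mask.astype(np.uint8) * 255, mode="L")
cutout.putalpha(alpha)  # fully transparent outside the mask
cutout.save("horse_cutout.png")  # hypothetical output filename

# Drop the extracted object into a different photo.
background = Image.open("new_scene.jpg").convert("RGBA")  # hypothetical background image
background.paste(cutout, (0, 0), mask=cutout)  # the cutout's alpha channel acts as the paste mask
background.save("composited.png")
```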
The Age of Alignment
Meta didn’t have to introduce the Segment Anything demo. The company is offering the model and dataset to researchers who are already familiar with the technologies, and the demo seems more like a consumer application. However, that is the point. Making it more accessible helps everyone better understand what Meta is trying to accomplish. That may influence researchers to try it out today and start looking at the model instead of waiting until a more convenient time.
It will also drive better understanding among the broader audience of AI observers, which is very large today given the popularity of generative AI solutions. That, in turn, will help promote awareness of Meta’s activity in the field.
Meta is one of the leaders pushing the boundaries of AI today. However, not many people outside of the research community know about its work because it is not yet surfaced in consumer applications.
I suspect Segment Anything, or a customized version, could soon appear as a feature on Instagram and Facebook. The alignment we saw today between the model and a consumer-style user interface suggests it could be applied easily and in a user-friendly manner. You can imagine how creatively this could be used to add elements to photos as opposed to erasing them.