In less than a week, the computer-vision field has seen a burst of new models, dramatically lowering the barrier to image recognition:
Meta released Segment Anything, a tool that can accurately identify objects in images; the model and its dataset are all open source.
The vision team of China's Zhiyuan Research Institute (the Beijing Academy of Artificial Intelligence) also proposed the general-purpose segmentation model SegGPT (Segmenting Everything in Context), the first general vision model that uses visual context to complete a variety of segmentation tasks.
Meta's release includes the Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), which the company says is the largest segmentation dataset ever assembled.
It was this SAM model that caused a stir in the industry:
As the name "Segment Anything" suggests, the model can segment anything in an image, including objects never seen in its training data.
In terms of interaction, SAM accepts prompts (clicks, boxes, text) to specify what to segment in an image, which means the prompt paradigm familiar from natural language processing is starting to be used in computer vision.
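To make the point-prompt idea concrete, here is a deliberately simplified toy sketch, not the real SAM model or its API: a "click" selects the connected region of similar pixel values around it, loosely mimicking how a point prompt tells a promptable segmenter *what* to segment.

```python
# Toy illustration of point-prompt segmentation (hypothetical, NOT SAM's
# actual method): flood-fill from the clicked pixel, grouping neighbors
# whose values are close to it, and return a binary mask.

def segment_from_point(image, seed, tol=10):
    """Flood-fill from `seed` (row, col); pixels within `tol` of the
    clicked pixel's value join the mask."""
    rows, cols = len(image), len(image[0])
    target = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    stack = [seed]
    while stack:
        r, c = stack.pop()
        if not (0 <= r < rows and 0 <= c < cols):
            continue
        if mask[r][c] or abs(image[r][c] - target) > tol:
            continue
        mask[r][c] = 1
        stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return mask

# A 4x4 grayscale "image" with a bright object in the top-left corner.
img = [
    [200, 200,  10,  10],
    [200, 200,  10,  10],
    [ 10,  10,  10,  10],
    [ 10,  10,  10, 200],
]
mask = segment_from_point(img, (0, 0))  # "click" on the bright object
print(sum(map(sum, mask)))              # prints 4: only the clicked blob
```

The real SAM replaces this hand-written heuristic with a large image encoder plus a prompt encoder, but the interaction contract is the same: a point in, a mask out.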
For objects in video, SAM can also accurately identify them, quickly label their type, name, and size, and automatically record and classify them with IDs.
Jim Fan, an AI scientist at Nvidia, calls Meta's research one of the "GPT-3 moments" of computer vision: its general-purpose segmentation method enables zero-shot generalization to unfamiliar objects and images, an early demonstration of multimodal techniques and their ability to generalize.
Furthermore, SAM can be flexibly integrated into larger AI systems: understanding the visual and textual content of web pages; in AR/VR, using the user's gaze as a prompt to select an object and then "lift" it into 3D; extracting image regions for collage or video editing by content creators; or locating animals and objects in video for scientific study and tracking.
The SegGPT model from the Zhiyuan Research Institute's vision team, on the other hand, focuses on batch labeling and segmentation: whether in images or video, a user can mark one instance of an object, and the model then identifies and segments all similar objects in bulk.
For example, if you label a rainbow in one image, rainbows in other images can then be identified and segmented in batch as well.
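The "one example labels many images" workflow can be sketched as follows. This is a hypothetical toy, not SegGPT's actual in-context mechanism: it reduces the user's example mask to a pixel-value prototype and then marks similar pixels in new images.

```python
# Toy sketch of one-shot batch segmentation (hypothetical, not SegGPT's
# real method): average the pixels the user marked in one example image,
# then mask pixels in other images whose values are close to that average.

def prototype_from_example(image, mask):
    """Mean value of the pixels the user marked in the example image."""
    vals = [image[r][c] for r in range(len(image))
            for c in range(len(image[0])) if mask[r][c]]
    return sum(vals) / len(vals)

def segment_like_example(image, prototype, tol=15):
    """Mark every pixel whose value is within `tol` of the prototype."""
    return [[1 if abs(v - prototype) <= tol else 0 for v in row]
            for row in image]

# The user marks the bright "rainbow" pixels once in an example image...
example = [[180, 20], [20, 185]]
example_mask = [[1, 0], [0, 1]]
proto = prototype_from_example(example, example_mask)

# ...and the same kind of object is picked out in new images automatically.
new_image = [[20, 190], [175, 20]]
print(segment_like_example(new_image, proto))  # prints [[0, 1], [1, 0]]
```

SegGPT itself conditions a transformer on the example image-mask pair rather than on a pixel average, but the user-facing flow is the same: label once, segment everywhere.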