Summing Coin Values in Images using Lang-SAM and Deep Learning
Image segmentation has seen impressive progress in recent computer vision research. A standout example is the “Segment Anything” Model (SAM), a deep-learning model that predicts object masks from an image and input prompts. Thanks to its encoder-decoder design, SAM can handle diverse segmentation challenges, which makes it valuable for both researchers and developers.
Lang-SAM is a project built on SAM. Given a text prompt, it extracts masks for all instances of the named object in an image. By incorporating textual descriptions, it bridges natural language processing and computer vision, so segmentation can be driven by a plain-language description rather than by manually placed points or boxes.
Having explored the capabilities of the SAM model, I found an exemplary use-case: estimating the total value of coins in an image that also includes various other objects. Let’s dive deeper into how SAM operates and examine how I’ve incorporated it into my coin summing project for both generating datasets and testing the neural network.
1. Segment-Anything
Meta’s (formerly Facebook’s) research lab, FAIR, introduced its segmentation model, SAM, in April 2023. What’s amazing about SAM is that it can segment objects and image types it wasn’t specifically trained on.
Figure 1: SAM’s model architecture provided by Meta, downloaded from https://segment-anything.com/
At its core, SAM has three main parts: an image encoder that turns the image into an embedding, a prompt encoder that embeds the prompt (points, boxes, or masks), and a lightweight mask decoder that combines the two to produce a mask. To train SAM, Meta built SA-1B, which at the time of release was the largest segmentation dataset, with over 1 billion masks on 11 million images. It was collected through a three-stage process that moved from model-assisted manual annotation to fully automatic mask generation. Architecturally, SAM is Transformer-based, like many other popular vision models. When a prompt is ambiguous, SAM predicts several candidate masks, each with a confidence score, so that the best one can be selected. Meta evaluated SAM zero-shot on a suite of 23 segmentation datasets, where it performed competitively with, and often better than, earlier interactive segmentation models. They also combined SAM with an object detector, feeding the detected boxes to SAM as prompts to obtain instance segmentation.
Although SAM was trained with a text encoder for text prompts, Meta hasn’t released weights that include the text encoder. Therefore, only box or point prompts are available in the current public model.
2. Language-Segment-Anything (Lang-SAM)
To work around the missing text prompts in SAM, Luca Medeiros created an open-source project called Language-Segment-Anything (Lang-SAM). Lang-SAM runs GroundingDINO and SAM sequentially. GroundingDINO is a text-to-bounding-box model: the user feeds it an image and a text prompt, and it returns bounding boxes for the objects that match the prompt. These bounding boxes are then used as input prompts for the SAM model, which generates precise segmentation masks for the identified objects.
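The two-stage idea can be sketched independently of the real models. The snippet below is only an illustration of the control flow: `segment_by_text`, `toy_detector`, and `toy_segmenter` are hypothetical names, and the toy functions are stand-ins for GroundingDINO and SAM, not their actual APIs.

```python
def segment_by_text(image, text_prompt, detect_boxes, segment_box):
    """Stage 1: text -> boxes. Stage 2: each box -> one mask."""
    boxes = detect_boxes(image, text_prompt)
    masks = [segment_box(image, box) for box in boxes]
    return masks, boxes


def toy_detector(image, text_prompt):
    # Stand-in for a text-to-box model: "finds" one fixed box for the prompt "coin"
    return [(0, 0, 2, 2)] if text_prompt == "coin" else []


def toy_segmenter(image, box):
    # Stand-in for a box-prompted segmenter: marks every pixel inside the box
    x_min, y_min, x_max, y_max = box
    height, width = len(image), len(image[0])
    return [
        [x_min <= x < x_max and y_min <= y < y_max for x in range(width)]
        for y in range(height)
    ]


# A 3x4 dummy "image" (nested lists are enough for the illustration)
image = [[0] * 4 for _ in range(3)]
masks, boxes = segment_by_text(image, "coin", toy_detector, toy_segmenter)
print(len(masks), boxes)
```

Because the detector output is the only thing the segmenter depends on, either stage could be swapped for another model without touching the other.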
The following is the code snippet for running Lang-SAM in Python:
from PIL import Image
from lang_sam import LangSAM
from lang_sam.utils import draw_image

# Initialize LangSAM model
model = LangSAM()
# Load the image and convert it to RGB
image_pil = Image.open('./assets/image.jpeg').convert("RGB")
# Set the text prompt for the segmentation
text_prompt = 'bicycle'
# Perform prediction to obtain masks, bounding boxes, labels, and logits
masks, boxes, labels, logits = model.predict(image_pil, text_prompt)
# Draw segmented image using the utility function
image = draw_image(image_pil, masks, boxes, labels)
Using the code above, I tested bicycle segmentation on an image. The results are shown in the figure below.
Figure 2: Example segmentation result of Lang-SAM — Image by the author
Use-case for Lang-SAM: Counting the sum of coins
Figure 3: Workflow of coin summing — Image by the author
Let’s first decide on the workflow for coin counting. The general idea is that we have images containing various coins.
As the first step of the workflow, we segment each coin in the input image. Lang-SAM makes this easy, because we can simply enter ‘coin’ as the text prompt. After getting the coin masks, we use a convolutional neural network to estimate the class of each coin. This can be a custom network trained on a dataset that is itself generated with Lang-SAM. The architecture and the training procedure are described in Step 2 below. In the last step, the values of the estimated classes are simply summed up.
Step 1: Usage of Lang-SAM
To segment the coins in an image, I wrote the following function, which takes an image as input and uses the Lang-SAM model to return the mask and box of each coin. The boxes are used only for visualisation later on, so they are not important for now.
import warnings

import requests
from lang_sam import LangSAM


def find_coin_masks(image):
    """Return (masks, boxes) as lists of numpy arrays, one entry per detected coin."""
    # Suppress warning messages
    warnings.filterwarnings("ignore")
    text_prompt = "coin"
    try:
        model = LangSAM()
        masks, boxes, _, _ = model.predict(image, text_prompt)
        if len(masks) == 0:
            print(f"No objects of the '{text_prompt}' prompt detected in the image.")
            # Return empty lists so that callers can still unpack the result
            return [], []
        # Convert masks and boxes to numpy arrays
        masks_np = [mask.squeeze().cpu().numpy() for mask in masks]
        boxes_np = [box.squeeze().cpu().numpy() for box in boxes]
        return masks_np, boxes_np
    except (requests.exceptions.RequestException, IOError) as e:
        print(f"Error: {e}")
        return [], []
Given an input image containing multiple coins, the function above produces the segmentation masks shown in the image below. However, these masks have the original resolution of the image. Roughly 95% of each segmented image is empty, which is redundant information and makes the data unnecessarily expensive to feed into a neural network during training. To address this, I introduce a second function that crops each segmented image down to the relevant region.
Figure 4: input and outputs of find_coin_masks function — Image by the author
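To make the “mostly empty” observation concrete, the share of pixels covered by a mask can be measured directly. This is a small sketch on a synthetic mask, not on the author’s data; `mask_coverage` is a hypothetical helper name.

```python
import numpy as np


def mask_coverage(mask):
    """Fraction of pixels that belong to the mask, between 0 and 1."""
    return float(np.asarray(mask, dtype=bool).mean())


# Synthetic example: a 200x200 "coin" region inside a 1000x1000 image
mask = np.zeros((1000, 1000), dtype=bool)
mask[400:600, 400:600] = True

coverage = mask_coverage(mask)
print(f"coin pixels: {coverage:.1%}, empty pixels: {1 - coverage:.1%}")
```

A coverage of a few percent means that almost all of the full-resolution array carries no information about the coin, which is the motivation for cropping.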
I’ve created another function named generate_coin_images. It starts by calling find_coin_masks to get the masks at their original size. Next, it crops out the black area around each mask. If the cropped coin is smaller than 500×500 pixels in both dimensions, it is zero-padded to 500×500; otherwise it is resized to 500×500. Either way, we get a consistent input size for the next steps.
import cv2
import numpy as np
from PIL import Image


def generate_coin_images(image_dir):
    # Load the image and convert it to RGB format
    image = Image.open(image_dir).convert("RGB")
    # Use the previously defined function to obtain masks and bounding boxes
    masks, boxes = find_coin_masks(image)
    # Convert image to a numpy array for further processing
    image = np.array(image)
    # List to store final coin images
    coins = []
    for coin_mask in masks:
        # Apply mask to image and obtain relevant segment
        mask = np.broadcast_to(np.expand_dims(coin_mask, -1), image.shape)
        masked_image = mask * image
        # Find the bounding box coordinates for the non-zero pixels in the masked image
        nonzero_indices = np.array(np.nonzero(masked_image[:, :, 0]))
        # find_boundary_of_coin is a helper defined elsewhere in the project (not shown in this post)
        y_min, y_max, x_min, x_max = find_boundary_of_coin(nonzero_indices)
        # Crop the masked image to the bounding box size
        masked_image = masked_image[y_min:y_max, x_min:x_max]
        height, width = y_max - y_min, x_max - x_min
        if height < 500 and width < 500:
            # Small crop: zero-pad to 500x500, centering the coin
            # (for an odd difference, the extra pixel goes to the bottom/right)
            pad_y, pad_x = 500 - height, 500 - width
            masked_image = np.pad(
                masked_image,
                [(pad_y // 2, pad_y - pad_y // 2), (pad_x // 2, pad_x - pad_x // 2), (0, 0)],
            )
            coins.append(masked_image)
        else:
            # Large crop: resize to 500x500
            resized_masked_image = cv2.resize(masked_image, (500, 500), interpolation=cv2.INTER_AREA)
            coins.append(resized_masked_image)
    return coins, boxes
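The helper find_boundary_of_coin is called above but not listed in this post. A minimal sketch of what it might look like is given below, assuming it receives the stacked output of np.nonzero (row indices first, column indices second) and returns bounds suitable for slicing; the author’s actual implementation may differ.

```python
import numpy as np


def find_boundary_of_coin(nonzero_indices):
    """Hypothetical version: tight bounds around the non-zero pixels.

    nonzero_indices has shape (2, N): row (y) indices, then column (x) indices.
    The max bounds are exclusive, so they can be used directly in slices.
    """
    rows, cols = nonzero_indices
    y_min, y_max = int(rows.min()), int(rows.max()) + 1
    x_min, x_max = int(cols.min()), int(cols.max()) + 1
    return y_min, y_max, x_min, x_max


# Synthetic mask with a 3x4 block of non-zero pixels
mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1

bounds = find_boundary_of_coin(np.array(np.nonzero(mask)))
y_min, y_max, x_min, x_max = bounds
print(bounds, mask[y_min:y_max, x_min:x_max].shape)
```

Returning exclusive upper bounds is a design choice made here so that the crop keeps the last row and column of the coin.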
The generate_coin_images function produces coin images like the ones shown below. We will use this function later, both to create the dataset for training the neural network and in our test pipeline, which makes it the backbone of the project.
Figure 5: input and outputs of generate_coin_images function — Image by the author
Step 2: Creating a Coin Estimator Neural Network
Step 2.1: Dataset Generation
Recognizing the absence of a dedicated dataset for European coins, I took the initiative to create one for my project. I sourced photos of six distinct European coin denominations from this GitHub page: 2 euros, 1 euro, 50 cents, 20 cents, 10 cents, and 5 cents. Each image contained only a single coin, ensuring consistency in the dataset.
Using the generate_coin_images function described earlier, I extracted and saved a masked version of each coin. These images were then organized into folders based on their denominations.
For clarity, the training dataset consists of 2,739 images, distributed across the six classes as follows:
2 euros: 292 images
1 euro: 301 images
50 cents: 747 images
20 cents: 444 images
10 cents: 662 images
5 cents: 293 images
And the validation set consists of 73 images, distributed across the six classes as follows:
2 euros: 5 images
1 euro: 12 images
50 cents: 8 images
20 cents: 17 images
10 cents: 16 images
5 cents: 15 images

import os

import numpy as np
from PIL import Image

output_dir = "coin_dataset/training/"
dataset_dir = "coin_images/"
subfolders = os.listdir(dataset_dir)
for subfolder in subfolders:
    # Keep only image files (this also skips entries such as .DS_Store and .git)
    files = [
        file for file in os.listdir(os.path.join(dataset_dir, subfolder))
        if file.endswith((".jpg", ".png"))
    ]
    target_dir = os.path.join(output_dir, subfolder)
    os.makedirs(target_dir, exist_ok=True)
    for file in files:
        # Generate coin images with generate_coin_images function and loop through them
        padded_coins, boxes = generate_coin_images(os.path.join(dataset_dir, subfolder, file))
        for padded_coin in padded_coins:
            # Convert the numpy array image back to PIL Image object
            image = Image.fromarray(padded_coin.astype(np.uint8))
            # Remove macOS metadata so that it does not interfere with the index numbering
            ds_store = os.path.join(target_dir, ".DS_Store")
            if os.path.exists(ds_store):
                os.remove(ds_store)
            # find_last_index is a helper defined elsewhere in the project (not shown in this post)
            last_index = find_last_index(os.listdir(target_dir))
            image_name = f"img_{last_index+1}.png"
            image.save(os.path.join(target_dir, image_name))
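The find_last_index helper used above is not shown in this post either. One possible sketch, assuming files follow the img_<number>.png naming used in the loop and that an empty folder should yield 0 (both assumptions, not confirmed by the source):

```python
import re


def find_last_index(file_names):
    """Hypothetical version: highest N among names of the form img_N.png, or 0 if none."""
    pattern = re.compile(r"^img_(\d+)\.png$")
    indices = []
    for name in file_names:
        match = pattern.match(name)
        if match:
            indices.append(int(match.group(1)))
    return max(indices, default=0)


print(find_last_index(["img_1.png", "img_12.png", "notes.txt"]))
print(find_last_index([]))
```

Parsing the number instead of sorting the names avoids the classic pitfall where "img_10.png" sorts before "img_2.png" as a string.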
The image below provides a visual representation of our segmentation procedure, showing the processing of a 1 euro coin photo for dataset creation. Post-segmentation, the individual coin images are stored in the ‘1e/’ directory.
Figure 6: Dataset generation workflow’s input and outputs — Image by the author
Step 2.2: Training
The architecture of the neural network comprises two main components: several convolutional layers designed to extract spatial features from the input image, and two dense layers responsible for the final classification.
Specifically, the network takes an RGB input image of shape 500x500x3. As the image progresses through the convolutional layers, the number of channels increases, and each convolution is followed by a ReLU activation. Because these layers use a stride of 2, the spatial dimensions of the feature maps are roughly halved at each stage, which results in an encoding effect.
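The effect of the stride on the feature-map size can be worked out with the standard convolution output formula, out = floor((in + 2*padding - kernel) / stride) + 1. In the sketch below only the 500-pixel input and the stride of 2 come from the text; the kernel size of 3, the padding of 1, and the count of four layers are assumptions chosen for illustration.

```python
def conv_output_size(size, kernel_size=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Track the side length of the feature map through four assumed stride-2 layers
sizes = [500]
for _ in range(4):
    sizes.append(conv_output_size(sizes[-1]))

print(sizes)  # [500, 250, 125, 63, 32]
```

Each layer roughly halves the side length, so the number of spatial positions shrinks by about a factor of four per stage while the channel count grows.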
Following the convolutional stages, the spatial features are then flattened and passed to two fully connected layers. The output from the final layer provides a probability distribution across the classes, with a softmax activation.
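As a reminder of what the softmax step does, here is a small pure-Python sketch that turns six raw scores into a probability distribution and picks the most likely class. The logit values are made up for illustration and do not come from the trained network.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for the six coin classes
probs = softmax([2.0, 1.0, 0.1, -1.0, 0.5, 0.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether the argmax is taken before or after it.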
Figure 7: NN architecture of coin estimator — Image by the author
I used the Adam optimizer and cross-entropy loss to train the model. Training continued until the validation loss stopped improving, which happened at the 15th epoch.
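In this project the stopping point was chosen by watching the validation loss. If you prefer to automate that decision, a patience-based check is a common option. The sketch below is an assumption about how one might do it, not the procedure used here; `first_plateau_epoch`, `patience`, and `min_delta` are hypothetical names, and the loss values are invented.

```python
def first_plateau_epoch(val_losses, patience=3, min_delta=1e-3):
    """Return the 1-based epoch at which the loss has failed to improve
    by at least min_delta for `patience` epochs in a row, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Invented validation losses: improvement stalls after the third epoch
stop_epoch = first_plateau_epoch([1.0, 0.8, 0.7, 0.7, 0.71, 0.7])
print(stop_epoch)
```

A larger patience tolerates noisy validation curves, at the cost of a few extra epochs of training.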
Step 2.3: Performance benchmark
After training, I benchmarked the checkpoint from the final epoch with the script below. The compute_accuracy function takes the model, a data loader, and the device as arguments and returns the percentage of correct predictions on the given dataset.
import torch


def compute_accuracy(model, data_loader, device):
    correct_predictions = 0
    total_predictions = 0
    # Switch to evaluation mode so that layers such as dropout behave deterministically
    model.eval()
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward pass
            outputs = model(inputs)
            # Get the predicted class index by finding the max value in the output tensor along dimension 1
            predicted = outputs.argmax(dim=1)
            total_predictions += labels.size(0)
            # Update correct predictions count:
            # Sum up all instances where the predicted class index equals the true class index
            correct_predictions += (predicted == labels).sum().item()
    return (correct_predictions / total_predictions) * 100


# Compute the accuracy on the training and validation sets
train_accuracy = compute_accuracy(model, train_loader, device)
val_accuracy = compute_accuracy(model, val_loader, device)
print(f"Training set accuracy: {train_accuracy:.2f}%")
print(f"Validation set accuracy: {val_accuracy:.2f}%")
The resulting accuracies on the training and validation sets are as follows:
training set: 87%
validation set: 95%
The validation accuracy exceeded the training accuracy, potentially due to the relatively smaller size of the validation set. It’s vital to note that the primary intent of this project is to illustrate a potential application for new segmentation models rather than to construct a high-performance coin estimation network. Therefore, a deeper analytical dive into these observations will not be conducted.
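The effect of the set size can be made tangible with a rough binomial standard error, sqrt(p * (1 - p) / n). This is a back-of-the-envelope sketch that treats each prediction as an independent trial, which is an assumption; the accuracies and set sizes are the ones reported above.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate (accuracy as a fraction)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


val_se = accuracy_standard_error(0.95, 73)
train_se = accuracy_standard_error(0.87, 2739)
print(f"validation: +/- {val_se * 100:.1f} points, training: +/- {train_se * 100:.1f} points")
```

With only 73 validation images, a single image changes the accuracy by about 1.4 percentage points, and the standard error is around 2.6 points, compared with roughly 0.6 points for the training set. The validation figure is therefore a much noisier estimate.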
Step 3: Pipeline of Coin Counting
Having trained the coin estimator network, all steps of the workflow, outlined in Figure 3, are now complete. Now, let’s come up with the pipeline that leverages both Lang-SAM and our custom Neural Network (NN) from end to end, aiming to calculate the total value of coins in an image.
I created a Python notebook named coin_counter.ipynb which guides through the counting steps. Just like in the process we used to create our dataset, initially the generate_coin_images function is employed to make a mask for each coin in the image. Then, each of these masks is individually fed into the coin estimator network. Finally, the predicted coin values are added up to find the total amount of money in the image.
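The final summing step is simple enough to sketch on its own. The mapping from class index to denomination below is an assumption (the actual class order depends on how the dataset folders are loaded), and `total_value_in_euros` is a hypothetical name. Working in integer cents avoids accumulating floating-point rounding errors.

```python
# Assumed class order: index 0 = 2 euros ... index 5 = 5 cents
COIN_VALUES_IN_CENTS = [200, 100, 50, 20, 10, 5]


def total_value_in_euros(predicted_classes):
    """Sum the values of the predicted coin classes and return euros."""
    total_cents = sum(COIN_VALUES_IN_CENTS[index] for index in predicted_classes)
    return total_cents / 100


# Example: one 2-euro, one 1-euro, one 50-cent and two 5-cent coins
total = total_value_in_euros([0, 1, 2, 5, 5])
print(f"Total: {total:.2f} EUR")
```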
Test Results
The coin image shown in Figure 3 was fed into the coin counting pipeline. The image below shows the result, with the estimated class overlaid on each coin. While there are some incorrect estimations, the overall performance is acceptable.
As mentioned earlier, the primary goal of this project is to demonstrate a potential application for new segmentation models that can take text input rather than to build a high-performance coin estimation network.
Here is my GitHub repo, where you can find the code used throughout this blog.
Thanks for reading my blog!
Coin Counting using Lang-SAM was originally published in Towards Data Science on Medium.