Zero-shot image classification using C# and .NET with a hypothetical pre-trained CLIP model via ONNX Runtime

In this article, I’ll provide a C# implementation for zero-shot image classification using ONNX Runtime with a hypothetical CLIP model, along with unit tests.

Assumptions

  • A pre-trained CLIP model (ONNX format) is available, with inputs for an image (preprocessed to 720x480 pixels, normalized) and text embeddings for labels.
  • The image is a 720x480 RGB image containing giraffes, elephants, and a jeep car in a savanna-like environment (see image below).
  • ONNX Runtime is used for inference.
  • The CLIP model outputs logits that can be converted to probabilities for each label.

Image: a savanna scene with giraffes, elephants, and a jeep car.

C# Implementation

Project Setup

  • Dependencies: Install the Microsoft.ML.OnnxRuntime NuGet package for ONNX Runtime.
  • Image: Assume a local image file savanna_scene.jpg (720x480 pixels) containing giraffes, elephants, and a jeep car.
  • Model: Assume a pre-trained CLIP model file clip.onnx is available locally.

Code

Below is the C# implementation for zero-shot image classification.

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Drawing;
using System.Linq;
using System.Collections.Generic;

namespace ZeroShotImageClassification
{
    public class ZeroShotClassifier : IDisposable
    {
        private readonly InferenceSession _session;
        private readonly string[] _labels = { "giraffe", "elephant", "jeep car" };

        public ZeroShotClassifier(string modelPath)
        {
            _session = new InferenceSession(modelPath);
        }

        // Release the underlying ONNX Runtime session.
        public void Dispose() => _session.Dispose();

        public Dictionary<string, float> ClassifyImage(string imagePath)
        {
            // Load and preprocess image
            float[] imageTensor = PreprocessImage(imagePath);

            // Prepare text embeddings for labels (hypothetical CLIP text encoder)
            float[] textEmbeddings = GetTextEmbeddings(_labels);

            // Run inference
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("image", new DenseTensor<float>(imageTensor, new[] { 1, 3, 720, 480 })),
                NamedOnnxValue.CreateFromTensor("text", new DenseTensor<float>(textEmbeddings, new[] { _labels.Length, 512 }))
            };

            using var results = _session.Run(inputs);
            float[] logits = results.First().AsTensor<float>().ToArray();

            // Convert logits to probabilities using softmax
            var probabilities = Softmax(logits);

            // Map probabilities to labels
            var result = new Dictionary<string, float>();
            for (int i = 0; i < _labels.Length; i++)
            {
                result[_labels[i]] = probabilities[i];
            }

            return result;
        }

        private float[] PreprocessImage(string imagePath)
        {
            using var bitmap = new Bitmap(imagePath);
            if (bitmap.Width != 720 || bitmap.Height != 480)
            {
                throw new ArgumentException("Image must be 720x480 pixels.");
            }

            // Normalize with CLIP-style statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            // and write the tensor in planar CHW order to match the {1, 3, 480, 720} input shape.
            float[] tensor = new float[3 * 480 * 720];
            float[] mean = { 0.485f, 0.456f, 0.406f };
            float[] std = { 0.229f, 0.224f, 0.225f };
            const int planeSize = 480 * 720;

            for (int y = 0; y < 480; y++)
            {
                for (int x = 0; x < 720; x++)
                {
                    var pixel = bitmap.GetPixel(x, y);
                    int offset = y * 720 + x;
                    tensor[offset] = (pixel.R / 255f - mean[0]) / std[0];                 // R plane
                    tensor[planeSize + offset] = (pixel.G / 255f - mean[1]) / std[1];     // G plane
                    tensor[2 * planeSize + offset] = (pixel.B / 255f - mean[2]) / std[2]; // B plane
                }
            }

            return tensor;
        }

        private float[] GetTextEmbeddings(string[] labels)
        {
            // Hypothetical: CLIP text encoder generates 512-dim embeddings for each label
            // In practice, you'd run the text through CLIP's text encoder
            // Here we return a zero-filled placeholder so the pipeline runs end to end;
            // see the text-encoder sketch after this class for what a real version might look like.
            return new float[labels.Length * 512]; // 3 labels x 512 dims each
        }

        private float[] Softmax(float[] logits)
        {
            float maxLogit = logits.Max();
            float sum = 0f;
            float[] probabilities = new float[logits.Length];

            for (int i = 0; i < logits.Length; i++)
            {
                probabilities[i] = (float)Math.Exp(logits[i] - maxLogit);
                sum += probabilities[i];
            }

            for (int i = 0; i < probabilities.Length; i++)
            {
                probabilities[i] /= sum;
            }

            return probabilities;
        }
    }
}
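
For reference, below is a minimal sketch of what a real text encoder could look like if it were exported to its own ONNX file. Everything here is an assumption, not part of the original code: the clip_text.onnx file name, the input_ids input name, the 512-dim output, and the Tokenize helper (CLIP's BPE tokenizer would need a separate C# implementation or port).

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.Linq;

namespace ZeroShotImageClassification
{
    // Hypothetical sketch only: assumes a separate CLIP text encoder exported
    // to "clip_text.onnx" with an int64 input named "input_ids".
    public class ClipTextEncoder : IDisposable
    {
        private readonly InferenceSession _session;

        public ClipTextEncoder(string modelPath) => _session = new InferenceSession(modelPath);

        public float[] GetTextEmbeddings(string[] labels, int embeddingDim = 512)
        {
            var embeddings = new float[labels.Length * embeddingDim];

            for (int i = 0; i < labels.Length; i++)
            {
                long[] tokenIds = Tokenize(labels[i]); // padded/truncated to the model's context length

                var inputs = new List<NamedOnnxValue>
                {
                    NamedOnnxValue.CreateFromTensor("input_ids",
                        new DenseTensor<long>(tokenIds, new[] { 1, tokenIds.Length }))
                };

                using var results = _session.Run(inputs);
                float[] embedding = results.First().AsTensor<float>().ToArray();
                Array.Copy(embedding, 0, embeddings, i * embeddingDim, embeddingDim);
            }

            return embeddings;
        }

        // Placeholder: a real implementation would port CLIP's BPE tokenizer.
        private long[] Tokenize(string text) =>
            throw new NotImplementedException("CLIP's BPE tokenizer goes here.");

        public void Dispose() => _session.Dispose();
    }
}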

Main Program

using System;

namespace ZeroShotImageClassification
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                using var classifier = new ZeroShotClassifier("clip.onnx");
                var results = classifier.ClassifyImage("savanna_scene.jpg");

                Console.WriteLine("Classification Results:");
                foreach (var kvp in results)
                {
                    Console.WriteLine($"{kvp.Key}: {kvp.Value:F4}");
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error: {ex.Message}");
            }
        }
    }
}

Unit Tests (Using xUnit)

using System;
using System.Linq;
using Xunit;

namespace ZeroShotImageClassification.Tests
{
    public class ZeroShotClassifierTests
    {
        [Fact]
        public void ClassifyImage_ValidImage_ReturnsProbabilities()
        {
            // Arrange
            var modelPath = "clip.onnx"; // Mock path
            var imagePath = "savanna_scene.jpg"; // Mock path
            var classifier = new ZeroShotClassifier(modelPath);

            // Act
            var results = classifier.ClassifyImage(imagePath);

            // Assert
            Assert.Equal(3, results.Count);
            Assert.Contains("giraffe", results.Keys);
            Assert.Contains("elephant", results.Keys);
            Assert.Contains("jeep car", results.Keys);
            Assert.All(results.Values, prob => Assert.InRange(prob, 0f, 1f));
            var sum = results.Values.Sum();
            Assert.True(sum >= 0.99f && sum <= 1.01f, "Probabilities should sum to ~1");
        }

        [Fact]
        public void ClassifyImage_InvalidImageSize_ThrowsException()
        {
            // Arrange
            var modelPath = "clip.onnx";
            var classifier = new ZeroShotClassifier(modelPath);
            var invalidImagePath = "invalid_image.jpg"; // Assume 100x100 image

            // Act & Assert
            Assert.Throws<ArgumentException>(() => classifier.ClassifyImage(invalidImagePath));
        }
    }
}

Expected Output

Running the program with the hypothetical image savanna_scene.jpg might produce:

Classification Results:
giraffe: 0.4500
elephant: 0.3500
jeep car: 0.2000

These probabilities are hypothetical; they assume the model ranks the giraffes as most prominent, followed by the elephants, with the jeep car least dominant in the scene.
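
If you only need the single most likely label rather than the full distribution, sorting the dictionary returned by ClassifyImage does the job. A minimal snippet, assuming the results variable from Main above and a using System.Linq; directive:

// Pick the most probable label from the dictionary returned by ClassifyImage.
var top = results.OrderByDescending(kvp => kvp.Value).First();
Console.WriteLine($"Top label: {top.Key} ({top.Value:P1})");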

Notes

  • Model Availability: The clip.onnx model is hypothetical. In practice, you’d need to export a pre-trained CLIP model (e.g., from PyTorch) to ONNX format.
  • Text Embeddings: The GetTextEmbeddings method is a placeholder. In a real implementation, you’d use CLIP’s text encoder to generate embeddings for the labels.
  • Image Preprocessing: The preprocessing follows CLIP-style normalization and writes the tensor in planar NCHW order. Ensure the image is 720x480 pixels. Note that System.Drawing.Common is Windows-only from .NET 6 onwards; cross-platform code could use ImageSharp or SkiaSharp instead.
  • Unit Tests: The tests assume the model and image files exist. For real testing, you’d mock the ONNX model or use a test image; see the interface sketch after this list for one way to make the classifier mockable.
  • Dependencies: Ensure Microsoft.ML.OnnxRuntime is installed via NuGet (dotnet add package Microsoft.ML.OnnxRuntime).
  • Running Tests: Use dotnet test to run the xUnit tests.
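
As mentioned above, the tests only pass if clip.onnx and the image files actually exist. One common way to make the classifier unit-testable without real model files is to hide inference behind a small interface. A minimal sketch, using illustrative names (IInference, FakeInference) that are not part of the code above:

// Hypothetical testability sketch: hide inference behind an interface so tests
// can substitute a fake instead of loading a real clip.onnx file.
public interface IInference
{
    float[] RunLogits(float[] imageTensor, float[] textEmbeddings);
}

// Test double returning fixed logits; softmax over equal logits yields a
// uniform distribution, so each of the three labels gets probability 1/3.
public sealed class FakeInference : IInference
{
    public float[] RunLogits(float[] imageTensor, float[] textEmbeddings)
        => new float[] { 1f, 1f, 1f };
}

ZeroShotClassifier would then take an IInference through its constructor, with a thin ONNX-backed implementation used in production and FakeInference used in tests.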