ONNX (Open Neural Network Exchange) is an open-source format for representing deep learning models. You can use Netron to visualize ONNX models.
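For example, if the netron Python package is installed, you can launch the viewer from a script (a minimal sketch; the model path is a placeholder):
import netron
# Open the Netron viewer in the browser for a local ONNX file
netron.start("model.onnx")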
ONNX Runtime supports efficient inference on many platforms, including desktop and mobile CPUs, GPUs, and even WebGPU, and it provides APIs in many programming languages.
As described in the official documentation, here's how you'd run inference in Python:
import onnxruntime as ort
import numpy as np
from PIL import Image
# Load the image; preprocessing (dtype, scaling, layout) must match the model's expected input
x = np.array(Image.open("image.jpg"))
ort_sess = ort.InferenceSession("model.onnx")
# None requests all outputs; the dict maps input names (here "input") to numpy arrays
outputs = ort_sess.run(None, {"input": x})
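The input name ("input" above) and the expected shape depend on how the model was exported; if you are unsure, you can query them from the session:
# Inspect the model's declared inputs and outputs
for node in ort_sess.get_inputs():
    print(node.name, node.shape, node.type)
for node in ort_sess.get_outputs():
    print(node.name, node.shape, node.type)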
TensorRT provides significant inference acceleration on NVIDIA GPUs. ONNX Runtime supports TensorRT as an execution provider (i.e. a backend used to run ONNX models):
# Providers are tried in order: TensorRT first, falling back to CUDA if TensorRT is unavailable
ort_sess = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
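You can also run the model directly with the TensorRT runtime instead of going through ONNX Runtime. This requires building a serialized engine from the ONNX model first. Here is a minimal sketch using the TensorRT Python builder API; the file names are placeholders, and the default builder config is an assumption (precision, workspace size, and dynamic shapes may need extra configuration depending on your model and TensorRT version):
import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# ONNX models must be parsed into an explicit-batch network
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse the ONNX model")
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
with open("model.trt", "wb") as f:
    f.write(serialized_engine)
The trtexec command-line tool that ships with TensorRT can also build engines from ONNX models. Once you have a serialized engine, you can deserialize it and run inference with PyCUDA: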
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context
import numpy as np
from PIL import Image
# Deserialize the engine
with open("model.trt", "rb") as f:
    engine = trt.Runtime(trt.Logger(trt.Logger.WARNING)).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# Prepare the input in page-locked host memory (dtype and shape must match the engine's input binding)
x = np.array(Image.open("image.jpg"))
inp = cuda.pagelocked_empty(x.size, x.dtype)
out = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), np.float32)
np.copyto(inp, x.ravel())
# Allocate device memory for the input and output bindings
d_inp = cuda.mem_alloc(inp.nbytes)
d_out = cuda.mem_alloc(out.nbytes)
# Copy the input to the device, run inference, and copy the output back to the host
cuda.memcpy_htod(d_inp, inp)
context.execute_v2(bindings=[int(d_inp), int(d_out)])
cuda.memcpy_dtoh(out, d_out)
Our models are trained with PyTorch, and we provide the weights as a state dict that you can load for inference or further training. To load the weights, instantiate the model class with the same hyperparameters that were used during training:
from sihl import SihlLightningModule
import torch
# Instantiate the model with the same hyperparameters used during training
model = SihlLightningModule(**hyperparameters)
model.load_state_dict(torch.load("model.pt", weights_only=True))
model.eval()  # switch to evaluation mode for inference
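If you then want an ONNX file for the runtimes above, torch.onnx.export can trace the loaded model. This is a minimal sketch; the dummy input shape is a placeholder assumption and must match what the model expects:
# Export the model to ONNX by tracing the forward pass with a dummy input
dummy_input = torch.randn(1, 3, 224, 224)  # placeholder shape
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"], output_names=["output"])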
[coming soon]