There seem to be quite a few possible ways to do this:
- PyTorch Eager Mode Quantization TensorRT Acceleration, which seems a bit cumbersome:
  - torchao quantization
  - ONNX conversion
  - graph surgery (changing some ops in the ONNX graph)
  - TensorRT conversion
- Not sure if this works, but it would be ideal:
  - torch.export
  - torchao quantization
  - TensorRT conversion
- Less ideal would be:
  - torchao quantization
  - torch.export
  - TensorRT conversion
  - I’ve already sort of tried this using the VGG PTQ example from TensorRT, but torch.export complained that it couldn’t translate the quantized operations.