Vespa provides native support for ONNX (Open Neural Network Exchange) models, enabling you to deploy machine learning models from PyTorch, TensorFlow, scikit-learn, and other frameworks.

Overview

ONNX models can be used for:
  • Ranking - Score documents during search
  • Embeddings - Generate vector representations
  • Feature extraction - Transform data for downstream tasks
  • Stateless inference - Serve predictions via REST API
Vespa evaluates ONNX models using ONNX Runtime, providing high-performance inference on CPU and GPU.

Adding ONNX Models

1. Export Your Model to ONNX

Convert your trained model to ONNX format:
import torch
import torch.onnx

# Load your PyTorch model
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Create dummy input
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "my_model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    },
    opset_version=14
)

2. Add Model to Application Package

Place the ONNX file in your application’s models/ directory:
my-app/
├── services.xml
├── schemas/
│   └── doc.sd
└── models/
    └── my_model.onnx
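
Before deploying, it can help to confirm that every file: path in a schema points at a file that actually exists in the package. The following is a minimal sketch of such a check, assuming paths are resolved relative to the application package root as the layout above suggests. The helper names missing_model_files and demo are hypothetical and not part of any Vespa tooling.

```python
import re
import tempfile
from pathlib import Path

# Matches schema lines such as "file: models/my_model.onnx".
FILE_LINE = re.compile(r"^\s*file:\s*(\S+)\s*$", re.MULTILINE)


def missing_model_files(app_dir, schema_text):
    """Return the 'file:' paths in schema_text that do not exist under app_dir."""
    root = Path(app_dir)
    return [p for p in FILE_LINE.findall(schema_text) if not (root / p).is_file()]


def demo():
    """Build the layout shown above in a temporary directory and check it twice."""
    schema = (
        "schema doc {\n"
        "    onnx-model my_model {\n"
        "        file: models/my_model.onnx\n"
        "    }\n"
        "}\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        app = Path(tmp) / "my-app"
        (app / "schemas").mkdir(parents=True)
        (app / "models").mkdir()
        (app / "schemas" / "doc.sd").write_text(schema)
        before = missing_model_files(app, schema)  # model file not copied yet
        # Empty placeholder file, not a real ONNX model.
        (app / "models" / "my_model.onnx").write_bytes(b"")
        after = missing_model_files(app, schema)
    return before, after
```

Calling missing_model_files on a real package directory and the text of its schema returns an empty list when every referenced model file is in place.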

3. Declare Model in Schema

Reference the model in your schema file:
schema doc {
    onnx-model my_model {
        file: models/my_model.onnx
        input input: my_input_expression
        output output: my_output
    }
}

4. Use Model in Ranking

Reference the model in rank profiles:
rank-profile with_onnx {
    function my_input_expression() {
        # Shape [1, 10] matches the exported model input (batch_size, 10)
        expression: tensor<float>(d0[1],d1[10]):[[1,2,3,4,5,6,7,8,9,10]]
    }
    
    first-phase {
        expression: onnx(my_model).my_output
    }
}

Model Configuration

Basic Declaration

Declare an ONNX model in your schema:
onnx-model classifier {
    file: models/classifier.onnx
}

Input Mapping

Map ONNX input names to Vespa expressions:
onnx-model scorer {
    file: models/scorer.onnx
    
    # Map ONNX inputs to Vespa features
    input input_ids: tokenSequence
    input attention_mask: tokenMask
    input segment_ids: tokenTypes
}
// From: config-model/src/main/java/com/yahoo/schema/OnnxModel.java:57-84
private String validateInputSource(String source) {
    var optRef = Reference.simple(source);
    if (optRef.isPresent()) {
        Reference ref = optRef.get();
        // input can be one of:
        // attribute(foo), query(foo), constant(foo)
        if (FeatureNames.isSimpleFeature(ref)) {
            return ref.toString();
        }
        // or a function (evaluated by backend)
        if (ref.isSimpleRankingExpressionWrapper()) {
            var arg = ref.simpleArgument();
            if (arg.isPresent()) {
                return ref.toString();
            }
        }
    } else {
        // otherwise it must be an identifier
        Reference ref = Reference.fromIdentifier(source);
        return ref.toString();
    }
    // invalid input source
    throw new IllegalArgumentException("invalid input for ONNX model " + getName() + ": " + source);
}
Valid input sources:
  • attribute(field_name) - Document attribute
  • query(param_name) - Query parameter
  • constant(const_name) - Ranking constant
  • Function names defined in the rank profile
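
The rules listed above can be approximated in a few lines of Python. This is only a rough sketch of the classification, not a port of the Java method. The regular expressions, the assumption that the function wrapper is spelled rankingExpression(name), and the helper name classify_input_source are all illustrative.

```python
import re

IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"
# attribute(foo), query(foo) or constant(foo)
SIMPLE_FEATURE = re.compile(rf"^(attribute|query|constant)\(({IDENTIFIER})\)$")
# Assumed spelling of a wrapped function reference.
FUNCTION_WRAPPER = re.compile(rf"^rankingExpression\(({IDENTIFIER})\)$")
# A bare identifier is treated as the name of a rank-profile function.
BARE_NAME = re.compile(rf"^{IDENTIFIER}$")


def classify_input_source(source):
    """Return 'attribute', 'query', 'constant' or 'function' for an input source."""
    source = source.strip()
    match = SIMPLE_FEATURE.match(source)
    if match:
        return match.group(1)
    if FUNCTION_WRAPPER.match(source) or BARE_NAME.match(source):
        return "function"
    raise ValueError(f"invalid input for ONNX model: {source}")
```

In this sketch, tokenSequence and query(user_score) are accepted, while a feature with an output suffix such as fieldMatch(title).completeness is rejected and would need to be wrapped in a function first.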

Output Mapping

Map ONNX output names to Vespa identifiers. The ONNX name comes first, followed by the name used in ranking expressions:
onnx-model encoder {
    file: models/encoder.onnx
    output last_hidden_state: embeddings
    output pooler_output: pooled
}
Reference outputs in ranking:
rank-profile semantic {
    first-phase {
        expression: onnx(encoder).embeddings
    }
}

ONNX Runtime Configuration

Configure ONNX Runtime execution in services.xml:
<container id="default" version="1.0">
  <component id="ai.vespa.modelintegration.evaluator.OnnxRuntime" 
             bundle="model-integration">
    <config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
      <!-- Execution mode: sequential or parallel -->
      <executionMode>sequential</executionMode>
      
      <!-- Number of threads for parallel execution -->
      <interOpThreads>1</interOpThreads>
      
      <!-- Intra-op threads: -4 means CPUs/4, 0 means CPUs, >0 is explicit count -->
      <intraOpThreads>-4</intraOpThreads>
      
      <!-- GPU device: 0+ for GPU device ID, -1 for CPU -->
      <gpuDevice>-1</gpuDevice>
    </config>
  </component>
</container>
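
The intraOpThreads comment in the configuration above packs three cases into one integer. The sketch below spells them out. The rounding behavior (round up, never below one thread) is an assumption made for illustration, and resolve_intra_op_threads is a hypothetical helper rather than a Vespa function.

```python
import math


def resolve_intra_op_threads(setting, cpus):
    """Translate an intraOpThreads setting into a thread count.

    setting > 0  -> explicit thread count
    setting == 0 -> one thread per CPU
    setting < 0  -> CPUs divided by abs(setting)
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    if setting > 0:
        return setting
    if setting == 0:
        return cpus
    # Assumption: round up and keep at least one thread.
    return max(1, math.ceil(cpus / -setting))
```

Under these assumptions, the default of -4 on a 16-core node yields four intra-op threads per evaluation.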

Execution Modes

Sequential mode runs the model's operators one at a time on a single inter-op thread, which suits low-latency inference:
<executionMode>sequential</executionMode>
<interOpThreads>1</interOpThreads>

GPU Acceleration

Enable GPU inference with CUDA:
<gpuDevice>0</gpuDevice>  <!-- Use first GPU -->
GPU support requires an ONNX Runtime build with the CUDA execution provider. Ensure your deployment environment has compatible CUDA drivers.

Using ONNX Models

In Ranking Expressions

Reference ONNX models in rank profiles:
schema product {
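    document product {
        field title type string {
            indexing: summary | index
        }
        field price type float {
            indexing: summary | attribute
        }
        field category type string {
            indexing: summary | attribute
        }
        # popularity and timestamp are read by the rank profile below
        field popularity type float {
            indexing: attribute
        }
        field timestamp type long {
            indexing: attribute
        }
    }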
    
    onnx-model ranker {
        file: models/ranker.onnx
        input features: featureVector
    }
    
    rank-profile ml_ranking {
        function featureVector() {
            expression: tensor<float>(d0[5]):[
                attribute(price),
                query(user_score),
                fieldMatch(title).completeness,
                attribute(popularity),
                freshness(timestamp)
            ]
        }
        
        first-phase {
            expression: onnx(ranker).output
        }
    }
}

With Multiple Outputs

Access specific model outputs:
onnx-model multi_output {
    file: models/multi.onnx
    output output_scores: scores
    output output_embeddings: embeddings
}

rank-profile combined {
    first-phase {
        expression: onnx(multi_output).scores
    }
    
    second-phase {
        expression: sum(onnx(multi_output).embeddings * query(q_vec))
    }
}
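
The second-phase expression above multiplies two tensors cell by cell and sums the result. Assuming both tensors are dense and share a single dimension, that is a plain dot product, which the following sketch reproduces outside Vespa. The function name dot and the sample values are illustrative only.

```python
def dot(embedding, query_vector):
    """Multiply matching cells and sum them, like sum(a * b) over one shared dimension."""
    if len(embedding) != len(query_vector):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(embedding, query_vector))


# Hypothetical document embedding and query vector.
score = dot([1.0, 2.0, 3.0], [0.5, 0.0, 2.0])
```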

Stateless Evaluation API

Use the ModelsEvaluator API for stateless inference:
// From: model-evaluation/src/main/java/ai/vespa/models/evaluation/ModelsEvaluator.java:17-24
/**
 * Evaluates machine-learned models added to Vespa applications and available as config form.
 * Usage:
 * <code>Tensor result = evaluator.bind("foo", value).bind("bar", value).evaluate()</code>
 *
 * @author bratseth
 */
public class ModelsEvaluator extends AbstractComponent {
    public FunctionEvaluator evaluatorOf(String modelName, String ... names) {
        return requireModel(modelName).evaluatorOf(names);
    }
}
Access via REST API:
curl 'http://localhost:8080/model-evaluation/v1/my_model/eval' \
  -d '{"input": [1.0, 2.0, 3.0, 4.0, 5.0]}'

Model Optimization

Model Quantization

Reduce model size and improve performance with quantization:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QUInt8
)
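
To see why 8-bit weights shrink a model at a small cost in precision, here is a simplified per-tensor affine quantization in pure Python. It is a sketch of the general idea only and does not reproduce ONNX Runtime's actual algorithm. The function names and sample weights are made up for illustration.

```python
def quantize_uint8(values):
    """Map floats to integers in 0..255 using a scale and a zero point."""
    lo = min(min(values), 0.0)  # the range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.55]  # hypothetical float32 weights
quantized, scale, zero_point = quantize_uint8(weights)
restored = dequantize(quantized, scale, zero_point)
```

Each weight now needs one byte instead of four, and every restored value lies within half a quantization step of the original.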

Model Simplification

Simplify ONNX graphs:
import onnx
from onnxsim import simplify

# Load and simplify model
model = onnx.load("model.onnx")
model_simplified, check = simplify(model)
assert check, "Simplified model is invalid"

onnx.save(model_simplified, "model_simplified.onnx")

Dynamic Shapes

Support variable batch sizes and sequence lengths:
torch.onnx.export(
    model,
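    dummy_input,  # tuple with one example tensor per input: (input_ids, attention_mask)
    "model.onnx",
    # dynamic_axes keys only take effect for names declared here
    input_names=['input_ids', 'attention_mask'],
    output_names=['output'],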
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'output': {0: 'batch', 1: 'sequence'}
    }
)
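
Named dynamic axes act like variables: any concrete shape is acceptable as long as axes that share a name end up with the same size. The sketch below checks that convention for a set of inputs. It is an illustration of the idea, not how ONNX Runtime validates shapes, and bind_dynamic_axes is a hypothetical helper.

```python
def bind_dynamic_axes(symbolic_shapes, concrete_shapes):
    """Map each named axis to a size, raising ValueError on any mismatch."""
    bound = {}
    for name, symbolic in symbolic_shapes.items():
        concrete = concrete_shapes[name]
        if len(concrete) != len(symbolic):
            raise ValueError(f"{name}: expected rank {len(symbolic)}, got {len(concrete)}")
        for axis, size in zip(symbolic, concrete):
            if isinstance(axis, int):
                # Fixed axis: the size must match exactly.
                if axis != size:
                    raise ValueError(f"{name}: expected size {axis}, got {size}")
            elif bound.setdefault(axis, size) != size:
                # Named axis already bound to a different size.
                raise ValueError(f"{name}: axis '{axis}' is {size}, expected {bound[axis]}")
    return bound


symbolic = {
    "input_ids": ("batch", "sequence"),
    "attention_mask": ("batch", "sequence"),
}
```

With the shapes above, a batch of 2 sequences of length 16 binds batch to 2 and sequence to 16, while an attention mask of a different length is rejected.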

OnnxEvaluator Interface

The core evaluation interface:
// From: model-integration/src/main/java/ai/vespa/modelintegration/evaluator/OnnxEvaluator.java:10-29
/**
 * Evaluator for ONNX models.
 *
 * @author bjorncs
 */
public interface OnnxEvaluator extends AutoCloseable {

    record IdAndType(String id, TensorType type) { }

    Tensor evaluate(Map<String, Tensor> inputs, String output);
    Map<String, Tensor> evaluate(Map<String, Tensor> inputs);

    Map<String, OnnxEvaluator.IdAndType> getInputs();
    Map<String, OnnxEvaluator.IdAndType> getOutputs();
    Map<String, TensorType> getInputInfo();
    Map<String, TensorType> getOutputInfo();

    @Override void close();
}

Common Model Types

Classification Models

onnx-model classifier {
    file: models/classifier.onnx
    input features: featureVector
    output output: logits
}

rank-profile classify {
    function featureVector() {
        expression: tensor<float>(d0[100]):[...]
    }
    
    first-phase {
        expression: onnx(classifier).logits
    }
}
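
A classifier like this typically returns raw logits rather than probabilities. When class probabilities are needed on the client side, a softmax converts one into the other. The sketch below is a generic, numerically stable softmax and is not tied to any Vespa API.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])  # hypothetical logits for three classes
```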

Reranking Models

onnx-model cross_encoder {
    file: models/cross_encoder.onnx
    input input_ids: inputSequence
    input attention_mask: inputMask
}

rank-profile rerank {
    first-phase {
        expression: bm25(title) + bm25(body)
    }
    
    second-phase {
        expression: onnx(cross_encoder).logits{d0:0,d1:0}
        rerank-count: 100
    }
}
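
The rerank profile above follows a common pattern: score everything with a cheap function, then apply the expensive model only to the best candidates. The following sketch mimics that flow on a single in-memory list. It simplifies what Vespa does (for example, it ignores distribution across content nodes and score rescaling), and the function and field names are hypothetical.

```python
def two_phase_rank(docs, first_phase, second_phase, rerank_count):
    """Order docs by first_phase, then re-order the top rerank_count by second_phase."""
    ranked = sorted(docs, key=first_phase, reverse=True)
    head, tail = ranked[:rerank_count], ranked[rerank_count:]
    # Only the head is scored by the expensive second-phase function.
    head = sorted(head, key=second_phase, reverse=True)
    return head + tail


# Hypothetical documents with a cheap text score and a cross-encoder score.
docs = [
    {"id": "a", "bm25": 3.0, "cross_encoder": 0.10},
    {"id": "b", "bm25": 2.0, "cross_encoder": 0.90},
    {"id": "c", "bm25": 1.0, "cross_encoder": 0.99},
]
result = two_phase_rank(
    docs,
    first_phase=lambda d: d["bm25"],
    second_phase=lambda d: d["cross_encoder"],
    rerank_count=2,
)
```

Document c keeps its last place despite the highest cross-encoder score, because it never reaches the second phase. This is the trade-off that rerank-count controls.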

Embedding Models

See the Embeddings page for embedding-specific models.

Troubleshooting

Model Validation

Vespa validates models at deployment:
vespa deploy
# Check for errors like:
# "Model does not contain required input: 'input_ids'"
# "Model contains: input_tokens, attention_scores"
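
The same kind of name check can be run locally before deploying. The sketch below compares the input names declared in a schema against the names found in the model and produces messages modeled on the ones above. The exact wording of Vespa's errors may differ, and validate_inputs is a hypothetical helper.

```python
def validate_inputs(declared, model_inputs):
    """Return one error message per declared input that the model does not have."""
    available = ", ".join(sorted(model_inputs))
    return [
        f"Model does not contain required input: '{name}'. Model contains: {available}"
        for name in declared
        if name not in model_inputs
    ]


errors = validate_inputs(["input_ids"], ["input_tokens", "attention_scores"])
```

The list of model input names can be obtained with the onnx package, as shown in the next section.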

Inspect Model Inputs/Outputs

Use the onnx Python package:
import onnx

model = onnx.load("model.onnx")

print("Inputs:")
for model_input in model.graph.input:
    print(f"  {model_input.name}: {model_input.type}")

print("Outputs:")
for model_output in model.graph.output:
    print(f"  {model_output.name}: {model_output.type}")

Performance Issues

Slow inference:
  • Reduce model size through quantization
  • Use dynamic batching for throughput
  • Enable GPU acceleration
  • Optimize intra-op thread count
High memory usage:
  • Use model quantization (int8, uint8)
  • Limit number of concurrent evaluations
  • Monitor model size vs available RAM
Incorrect results:
  • Verify input tensor shapes and types
  • Check input/output name mappings
  • Validate preprocessing matches training
  • Test model with onnxruntime directly

Examples

TensorFlow to ONNX

import tensorflow as tf
import tf2onnx

# Load TensorFlow model
model = tf.keras.models.load_model('model.h5')

# Convert to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
output_path = "model.onnx"

model_proto, _ = tf2onnx.convert.from_keras(
    model, 
    input_signature=spec,
    opset=14,
    output_path=output_path
)

scikit-learn to ONNX

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import RandomForestClassifier

# Train sklearn model (X_train and y_train are your own training data)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Convert to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

Next Steps

  • Embeddings - Use ONNX models for text embeddings
  • Model Evaluation - Stateless vs ranking evaluation
  • RAG Applications - Combine models with retrieval
  • Performance Tuning - Optimize model inference