Firefox's built-in AI features rely on ONNX Runtime for local inference. When migrating from the JavaScript bindings to a native C++ implementation, developers often hit severe performance degradation: the model runs far slower than expected, sometimes taking 10x longer than the JavaScript version.
// Slow C++ implementation
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::SessionOptions session_options;
    Ort::Session session(env, "model.onnx", session_options);

    // This runs extremely slow
    auto output = session.Run(/*...*/);
}
This code compiles without errors but performs terribly. The issue stems from missing execution provider configuration and improper memory handling.
Step 1: Understanding the Error
The performance problem manifests as:
$ ./firefox_ai_test
Model loaded successfully
Inference time: 2847ms # Should be ~200ms
Firefox's JavaScript ONNX runtime automatically enables hardware acceleration through WebAssembly SIMD and threading. The C++ version requires explicit configuration of execution providers.
When you check system resources during execution:
$ top -pid $(pgrep firefox_ai_test)
# CPU usage: 100% on single core
# Memory: Normal
Single-core usage indicates missing parallel execution setup.
Step 2: Identifying the Cause
The root causes are:
Missing Execution Providers: C++ ONNX Runtime defaults to CPU provider without optimization flags. Firefox's JS bindings automatically select the best available provider.
No Thread Pool Configuration: The default session runs single-threaded. Firefox uses Web Workers for parallel execution in JavaScript.
Incorrect Tensor Memory Layout: Copying data between formats adds overhead. JavaScript handles this internally; the short contrast after this list shows the difference.
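To make the third cause concrete, here is a minimal sketch contrasting a copying tensor creation with a zero-copy one. The buffer names and shapes are illustrative; Step 4 below builds the zero-copy form into a reusable class.
#include <onnxruntime_cxx_api.h>
#include <cstring>
#include <vector>

void tensor_creation_contrast() {
    std::vector<int64_t> shape = {1, 3, 224, 224};
    std::vector<float> data(1 * 3 * 224 * 224, 0.5f);  // hypothetical input buffer

    // Copying path: ONNX Runtime allocates its own buffer, then we copy into it.
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::Value owned = Ort::Value::CreateTensor<float>(
        allocator, shape.data(), shape.size());
    std::memcpy(owned.GetTensorMutableData<float>(),
                data.data(), data.size() * sizeof(float));

    // Zero-copy path: the tensor only references the existing buffer,
    // which must stay alive until Run() returns.
    Ort::MemoryInfo mem_info =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value wrapped = Ort::Value::CreateTensor<float>(
        mem_info, data.data(), data.size(),
        shape.data(), shape.size());
    // Either tensor would then be passed to session.Run(); only the first
    // pays for an extra allocation and copy.
}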
Check your current configuration:
// Diagnostic: Ort::SessionOptions exposes setters but no getters for the
// thread counts, so raise the log level and let ONNX Runtime report how the
// session is actually being configured.
Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "diagnostic");
Ort::SessionOptions options;  // nothing set explicitly
Ort::Session session(env, "model.onnx", options);
// The verbose log records session initialization, including the execution
// providers that were registered (only the default CPU provider here),
// confirming that nothing beyond the defaults was configured.
Step 3: Implementing the Solution
Configure execution providers and threading properly:
#include <onnxruntime_cxx_api.h>
#include <thread>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "firefox_ai");
    Ort::SessionOptions session_options;

    // Enable all available graph optimizations
    session_options.SetGraphOptimizationLevel(
        GraphOptimizationLevel::ORT_ENABLE_ALL);

    // Configure the thread pools
    int num_threads = std::thread::hardware_concurrency();
    session_options.SetIntraOpNumThreads(num_threads);
    session_options.SetInterOpNumThreads(num_threads);

    // Enable parallel execution of independent graph branches
    session_options.SetExecutionMode(ExecutionMode::ORT_PARALLEL);

    // Add execution providers in priority order
    // (requires a GPU-enabled ONNX Runtime build; see Step 5 for the fallback)
    OrtCUDAProviderOptions cuda_options;
    session_options.AppendExecutionProvider_CUDA(cuda_options);

    Ort::Session session(env, "model.onnx", session_options);
    return 0;
}
This configuration gives the C++ build the same kind of parallelism the JavaScript bindings get automatically from WebAssembly SIMD and threading, and adds GPU acceleration where a suitable provider is available. The thread pools can now use all available CPU cores.
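Before appending a provider, you can ask the library which ones your ONNX Runtime build actually contains. Ort::GetAvailableProviders is part of the C++ API; the short check below is a sketch of how to use it.
#include <onnxruntime_cxx_api.h>
#include <iostream>
#include <string>

int main() {
    // Lists the execution providers compiled into this build, e.g.
    // "CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider".
    for (const std::string& provider : Ort::GetAvailableProviders()) {
        std::cout << provider << "\n";
    }
    return 0;
}
Only append providers that appear in this list; appending one that is not compiled into the library throws an Ort::Exception.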
For macOS on Apple Silicon, register the CoreML execution provider instead. The C++ SessionOptions wrapper has no CoreML-specific member; the provider is registered through the C factory function declared in coreml_provider_factory.h (recent builds also accept the generic session_options.AppendExecutionProvider("CoreML", ...) form):
// macOS-specific optimization (include coreml_provider_factory.h)
#ifdef __APPLE__
    // 0 == COREML_FLAG_USE_NONE, i.e. default CoreML behaviour
    Ort::ThrowOnError(
        OrtSessionOptionsAppendExecutionProvider_CoreML(session_options, 0));
#endif
Compile with proper flags:
$ clang++ -std=c++17 firefox_ai.cpp \
-I/usr/local/include/onnxruntime \
-L/usr/local/lib \
-lonnxruntime \
-O3 \
-march=native \
-o firefox_ai_test
The -O3 flag enables aggressive compiler optimizations, and -march=native lets the compiler use every instruction set the build machine supports. Note that these flags only affect your wrapper code, since the ONNX Runtime library is already compiled with its own optimizations, and a -march=native binary is tied to the CPU family it was built on.
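If you want to see which SIMD instruction sets -march=native actually enabled for your wrapper code, a quick check of the compiler's predefined macros works; the macro list below is illustrative, not exhaustive.
#include <iostream>

int main() {
    // These macros are defined by the compiler when the corresponding
    // instruction sets are enabled, for example via -march=native.
#if defined(__AVX2__)
    std::cout << "AVX2 code generation enabled\n";
#endif
#if defined(__AVX512F__)
    std::cout << "AVX-512F code generation enabled\n";
#endif
#if defined(__ARM_NEON)
    std::cout << "NEON code generation enabled\n";
#endif
    return 0;
}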
Step 4: Optimizing Memory Management
Firefox allocates tensors efficiently to avoid copies. Implement similar memory handling:
#include <onnxruntime_cxx_api.h>
#include <thread>
#include <vector>

class OptimizedInference {
private:
    // Declaration order matters: env_ and memory_info_ must be constructed
    // before the session that uses them.
    Ort::Env env_;
    Ort::MemoryInfo memory_info_;
    Ort::Session session_;
public:
    OptimizedInference(const char* model_path)
        : env_(ORT_LOGGING_LEVEL_WARNING, "firefox_ai"),
          memory_info_(Ort::MemoryInfo::CreateCpu(
              OrtArenaAllocator, OrtMemTypeDefault)),
          session_(nullptr) {
        Ort::SessionOptions options;
        options.SetGraphOptimizationLevel(
            GraphOptimizationLevel::ORT_ENABLE_ALL);

        int threads = std::thread::hardware_concurrency();
        options.SetIntraOpNumThreads(threads);
        options.SetInterOpNumThreads(threads);
        options.SetExecutionMode(ExecutionMode::ORT_PARALLEL);

        session_ = Ort::Session(env_, model_path, options);
    }
    std::vector<float> run(const std::vector<float>& input_data,
                           const std::vector<int64_t>& input_shape) {
        // Create input tensor without copying
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
            memory_info_,
            const_cast<float*>(input_data.data()),
            input_data.size(),
            input_shape.data(),
            input_shape.size());

        // Get input/output names
        Ort::AllocatorWithDefaultOptions allocator;
        auto input_name = session_.GetInputNameAllocated(0, allocator);
        auto output_name = session_.GetOutputNameAllocated(0, allocator);
        const char* input_names[] = {input_name.get()};
        const char* output_names[] = {output_name.get()};

        // Run inference
        auto output_tensors = session_.Run(
            Ort::RunOptions{nullptr},
            input_names, &input_tensor, 1,
            output_names, 1);

        // Extract results
        float* output_data = output_tensors[0].GetTensorMutableData<float>();
        auto output_shape = output_tensors[0].GetTensorTypeAndShapeInfo().GetShape();
        size_t output_size = 1;
        for (auto dim : output_shape) {
            output_size *= dim;
        }
        return std::vector<float>(output_data, output_data + output_size);
    }
};
Usage example:
#include <chrono>
#include <iostream>

int main() {
    OptimizedInference inference("model.onnx");
    std::vector<float> input_data(224 * 224 * 3, 0.5f);
    std::vector<int64_t> input_shape = {1, 3, 224, 224};

    auto start = std::chrono::high_resolution_clock::now();
    auto output = inference.run(input_data, input_shape);
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
        end - start).count();
    std::cout << "Inference time: " << duration << "ms\n";
    return 0;
}
This approach eliminates unnecessary memory allocations. The input tensor wraps existing data instead of copying it.
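You can go a step further and pre-bind the output as well, so repeated calls do not build a fresh output std::vector each time. Below is a minimal sketch using Ort::IoBinding, assuming a session and memory_info like the ones in the class above; the parameter names are illustrative.
#include <onnxruntime_cxx_api.h>
#include <vector>

// Bind the input and output once, then reuse the binding across calls.
void run_with_binding(Ort::Session& session, Ort::MemoryInfo& memory_info,
                      std::vector<float>& input, std::vector<int64_t>& shape) {
    Ort::AllocatorWithDefaultOptions allocator;
    auto input_name = session.GetInputNameAllocated(0, allocator);
    auto output_name = session.GetOutputNameAllocated(0, allocator);

    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        memory_info, input.data(), input.size(), shape.data(), shape.size());

    Ort::IoBinding binding(session);
    binding.BindInput(input_name.get(), input_tensor);
    // Bind the output by device; ONNX Runtime allocates it there and
    // exposes it through GetOutputValues().
    binding.BindOutput(output_name.get(), memory_info);

    session.Run(Ort::RunOptions{nullptr}, binding);
    std::vector<Ort::Value> outputs = binding.GetOutputValues();
    // outputs[0] holds the result; read it with GetTensorMutableData<float>().
}
Whether this helps in practice depends on the model and provider; it matters most when the output is large or the session runs on a GPU, where per-call host copies are expensive.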
Step 5: Handling Common Edge Cases
Issue: Model still runs slow on Linux with GPU available
$ nvidia-smi
# GPU detected but not used
Solution: Install CUDA-enabled ONNX Runtime:
$ wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-gpu-1.16.3.tgz
$ tar -xzf onnxruntime-linux-x64-gpu-1.16.3.tgz
$ export LD_LIBRARY_PATH=/path/to/onnxruntime/lib:$LD_LIBRARY_PATH
Verify GPU usage:
// Check that the CUDA provider can actually be registered and used
try {
    session_options.AppendExecutionProvider_CUDA(cuda_options);
    Ort::Session session(env, "model.onnx", session_options);
    std::cout << "CUDA provider enabled\n";
} catch (const Ort::Exception& e) {
    std::cerr << "CUDA failed: " << e.what() << "\n";
    std::cerr << "Falling back to CPU\n";
}
Issue: Inconsistent results between JavaScript and C++ versions
The JavaScript runtime executes the model in float32 on its WebAssembly backend. Small numerical differences in the C++ build usually come from execution providers that run parts of the graph in reduced precision (fp16 on some GPU and CoreML paths) or from aggressive graph fusions. For an apples-to-apples comparison, run the C++ side on the CPU provider with a conservative optimization level:
// Comparison run: CPU provider only, conservative graph optimizations
Ort::SessionOptions compare_options;
compare_options.SetGraphOptimizationLevel(
    GraphOptimizationLevel::ORT_ENABLE_BASIC);
// Do not append a GPU or CoreML provider for this run; the default CPU
// provider computes in float32, matching the JavaScript results.
Issue: Memory leak in long-running Firefox process
The Ort:: wrappers are RAII types: Session, Env, and Value release their native handles in their destructors. Leaks in a long-running process usually come from constructing a new Env or Session per request rather than from missing cleanup, so create them once and reuse them:
// Create the environment and session once; their destructors handle cleanup
#include <memory>

class SafeInference {
private:
    Ort::Env env_{ORT_LOGGING_LEVEL_WARNING, "firefox_ai"};
    std::unique_ptr<Ort::Session> session_;
public:
    explicit SafeInference(const char* model_path)
        : session_(std::make_unique<Ort::Session>(
              env_, model_path, Ort::SessionOptions{})) {}
    // No manual cleanup needed: ~SafeInference() destroys session_, then env_.
};
Performance Comparison
Before optimization:
$ ./firefox_ai_test
Model load: 234ms
Inference: 2847ms
Total: 3081ms
After optimization:
$ ./firefox_ai_test_optimized
Model load: 189ms
Inference: 203ms
Total: 392ms
The optimized version achieves a 14x speedup on the inference step and matches Firefox's JavaScript performance.
Debugging Performance Issues
Add timing breakdowns:
auto t1 = std::chrono::high_resolution_clock::now();
auto input_tensor = CreateTensor(/*...*/);
auto t2 = std::chrono::high_resolution_clock::now();
auto output = session_.Run(/*...*/);
auto t3 = std::chrono::high_resolution_clock::now();

std::cout << "Tensor creation: "
          << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
          << "ms\n";
std::cout << "Inference: "
          << std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count()
          << "ms\n";
Enable ONNX Runtime profiling:
session_options.EnableProfiling("onnx_profile");
The argument is a file-name prefix: ONNX Runtime writes a JSON trace (prefix plus a timestamp) with per-operator timings that you can open in chrome://tracing or Perfetto.
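To find out exactly which trace file was written, end profiling on the session. This minimal sketch assumes a session created as in the earlier examples.
#include <onnxruntime_cxx_api.h>
#include <iostream>

// Stop profiling and report the path of the generated trace file.
void report_profile_path(Ort::Session& session) {
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::AllocatedStringPtr profile_path = session.EndProfilingAllocated(allocator);
    std::cout << "Profile written to: " << profile_path.get() << "\n";
}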
The key to matching Firefox's JavaScript performance is properly configuring execution providers and thread pools. The C++ API requires explicit setup that the JavaScript bindings handle automatically through their WebAssembly backend.