TBLIS for Rust: Einstein Summation for Tensors

⚓ Rust    📅 2025-09-26    👤 surdeus    👁️ 9      



Introduction

This crate series is a Rust wrapper of the C++ library TBLIS (Tensor BLIS, The Tensor-Based Library Instantiation Software).

TBLIS can perform various tensor operations (multiplication, addition, reduction, transposition, etc.) efficiently on a single-node CPU. It can serve as the underlying driver for performing einsum (Einstein summation).

The TBLIS (C++) source code is available on GitHub, from Devin Matthews' research group.

This repository is not an official wrapper project. It is originally intended to serve the Rust tensor toolkit RSTSR and the Rust electronic structure toolkit REST, developed at the research groups of Xin Xu (徐昕) and Igor Ying Zhang (张颖).

Crates of this work

  • tblis: the safe Rust wrapper;
  • tblis-ffi: raw FFI bindings to the TBLIS C interface;
  • tblis-src: helper crate for linking the TBLIS library.

Example (using ndarray)

The following example performs the contraction

G_{pqrs} = Σ_{μνκλ} C_{μp} C_{νq} E_{μνκλ} C_{κr} C_{λs}

This tensor contraction is used in electronic structure theory to transform the electronic integral E_{μνκλ} from the atomic orbital basis to the molecular orbital basis G_{pqrs}.
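To make explicit what this einsum computes, here is a naive reference implementation with plain nested loops (a sketch with a hypothetical helper `ao2mo_naive`, using flat row-major `Vec<f64>` storage, not how TBLIS evaluates it):

```rust
// Naive reference for
//   G[p,q,r,s] = Σ_{μνκλ} C[μ,p] C[ν,q] E[μ,ν,κ,λ] C[κ,r] C[λ,s]
// over flat row-major buffers. This is O(nao^4 * nmo^4) and only
// meant to spell out the summation, not to be fast.
fn ao2mo_naive(c: &[f64], e: &[f64], nao: usize, nmo: usize) -> Vec<f64> {
    let mut g = vec![0.0; nmo.pow(4)];
    for p in 0..nmo {
        for q in 0..nmo {
            for r in 0..nmo {
                for s in 0..nmo {
                    let mut acc = 0.0;
                    for mu in 0..nao {
                        for nu in 0..nao {
                            for ka in 0..nao {
                                for la in 0..nao {
                                    acc += c[mu * nmo + p]
                                        * c[nu * nmo + q]
                                        * e[((mu * nao + nu) * nao + ka) * nao + la]
                                        * c[ka * nmo + r]
                                        * c[la * nmo + s];
                                }
                            }
                        }
                    }
                    g[((p * nmo + q) * nmo + r) * nmo + s] = acc;
                }
            }
        }
    }
    g
}

fn main() {
    // Same dummy data as the TBLIS example below.
    let (nao, nmo) = (3usize, 2usize);
    let c: Vec<f64> = (0..nao * nmo).map(|x| x as f64).collect();
    let e: Vec<f64> = (0..nao.pow(4)).map(|x| x as f64).collect();
    let g = ao2mo_naive(&c, &e, nao, nmo);
    println!("{:?}", g);
}
```

In practice this transformation is done as four sequential "quarter transformations" (one index at a time, each an O(n^5) GEMM-like step), which is exactly the kind of contraction path an einsum driver chooses automatically.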

To run the following code, you may need to

  • activate the crate feature ndarray, which enables conversion between ndarray::{Array, ArrayView, ArrayViewMut} and tblis::TblisTensor;
  • properly link libtblis.so in your project (see the crates tblis-ffi and tblis-src for more information).

The following code snippet performs this contraction.

// Declare the crate `tblis-src` if you want to link tblis dynamically.
// You can also call the following code in `build.rs`, instead of using crate `tblis-src`:
//     println!("cargo:rustc-link-lib=tblis");
extern crate tblis_src;

use ndarray::prelude::*;
use tblis::prelude::*;

// dummy setting of matrix C and tensor E
let (nao, nmo): (usize, usize) = (3, 2);
let vec_c: Vec<f64> = (0..nao * nmo).map(|x| x as f64).collect();
let vec_e: Vec<f64> = (0..nao * nao * nao * nao).map(|x| x as f64).collect();

let arr_c = ArrayView2::from_shape((nao, nmo), &vec_c).unwrap();
let arr_e = ArrayView4::from_shape((nao, nao, nao, nao), &vec_e).unwrap();

/// # Parameters
/// - `arr_c`: coefficient matrix $C_{\mu p}$
/// - `arr_e`: electronic integral $E_{\mu \nu \kappa \lambda}$ (in atomic orbital basis)
///
/// # Returns
/// - `arr_g`: electronic integral $G_{pqrs}$ (in molecular orbital basis)
fn ao2mo(arr_c: ArrayView2<f64>, arr_e: ArrayView4<f64>) -> Array4<f64> {
    let view_c = arr_c.view().into_dyn();
    let view_e = arr_e.view().into_dyn();
    let operands = [&view_c, &view_c, &view_e, &view_c, &view_c];
    let arr_g = tblis_einsum_ndarray(
        "μp,νq,μνκλ,κr,λs->pqrs", // einsum subscripts
        &operands,                // tensors to be contracted
        "optimal",                // contraction strategy (see crate opt-einsum-path)
        None,                     // memory limit (None means no limit, see crate opt-einsum-path)
        true,                     // row-major (true) or col-major (false)
        None,                     // pre-allocated output tensor (None to allocate internally)
    )
    .unwrap();

    // transform to 4-dimensional array
    arr_g.into_dimensionality().unwrap()
}

let arr_g = ao2mo(arr_c, arr_e);
println!("{:?}", arr_g);

Why TBLIS?

TBLIS can perform many types of einsum (tensor contraction), as well as tensor transposition, addition and reduction.

Some einsum tasks can be transformed into matrix multiplication (GEMM) tasks. For those, TBLIS may not be the best choice (this depends on the efficiency of BLIS and some other factors).
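For instance, the contraction `abxy, xycd -> abcd` reduces to a single GEMM once each tensor is viewed as a matrix: the first reshaped to (a·b, x·y), the second to (x·y, c·d). A minimal sketch (assuming row-major flat storage and a common extent `n` for every index; `contract_as_gemm` is an illustrative name, not part of any crate):

```rust
// "abxy, xycd -> abcd" as a plain matrix multiplication:
// fuse (a,b) into rows, (x,y) into the summed dimension, (c,d) into columns.
fn contract_as_gemm(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let rows = n * n;  // fused (a, b)
    let inner = n * n; // fused (x, y), the summed indices
    let cols = n * n;  // fused (c, d)
    let mut out = vec![0.0; rows * cols];
    for i in 0..rows {
        for k in 0..inner {
            let aik = a[i * inner + k];
            for j in 0..cols {
                out[i * cols + j] += aik * b[k * cols + j];
            }
        }
    }
    out
}

fn main() {
    let n = 2;
    let a: Vec<f64> = (0..n * n * n * n).map(|x| x as f64).collect();
    let g = contract_as_gemm(&a, &a, n);
    println!("{:?}", g);
}
```

No data movement is needed here because the fused indices are already adjacent and contiguous; the "complicated" cases in the benchmarks below are precisely those where the index order prevents such a free reshape.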

However, TBLIS can be extremely useful if

  • the contraction is complicated enough that plain GEMM or batched GEMM is not suitable to handle it;
  • the layout of your tensor is strided (not contiguous) in memory.

As an example, here are some benchmarks on my personal computer (AMD Ryzen 7945HX, estimated FP64 peak of 1.1 TFLOP/sec with 16 cores). The shape of each input tensor is (96, 96, 96, 96). For the strided case, the stride of each dimension is 128.
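Assuming the strided tensors here are (96, 96, 96, 96) logical views inside a (128, 128, 128, 128) row-major allocation, the element offsets are determined by a strides vector rather than by the shape, which is the (shape, strides) descriptor style that strided tensor libraries consume. A small sketch of the index arithmetic:

```rust
fn main() {
    // Logical shape 96^4 viewed inside a 128^4 row-major allocation:
    // element strides are (128^3, 128^2, 128, 1).
    let shape = [96usize; 4];
    let strides = [128usize * 128 * 128, 128 * 128, 128, 1];

    // Offset of logical element (i, j, k, l) in the flat buffer.
    let offset = |idx: [usize; 4]| -> usize {
        idx.iter().zip(&strides).map(|(i, s)| i * s).sum()
    };

    assert!(shape.iter().all(|&d| d <= 128)); // view must fit the allocation
    println!("offset of (1, 0, 0, 0): {}", offset([1, 0, 0, 0])); // 2097152
    println!("offset of (0, 0, 0, 95): {}", offset([0, 0, 0, 95])); // 95
}
```

A GEMM-based backend must first pack such a view into contiguous buffers, while TBLIS works on the strided layout directly, which is why the gap widens in the strided benchmarks.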

Benchmark of contiguous case

| case | description | FLOPs | TBLIS | NumPy (MKL) | PyTorch (CPU) |
| --- | --- | --- | --- | --- | --- |
| `abxy, xycd -> abcd` | naive GEMM | 2 n^6 | 1.90 sec (767 GFLOP/sec) | 2.13 sec (683 GFLOP/sec) | 1.98 sec (736 GFLOP/sec) |
| `axyz, xyzb -> ab` | naive GEMM | 2 n^5 | 132.3 msec (112 GFLOP/sec) | 63.1 msec (241 GFLOP/sec) | 63.4 msec (240 GFLOP/sec) |
| `axyz, bxyz -> ab` | naive SYRK | n^5 | 96.9 msec (77 GFLOP/sec) | 293.2 msec (26 GFLOP/sec) | 37.4 msec (203 GFLOP/sec) |
| `axyz, ybzx -> ab` | complicated GEMM | 2 n^5 | 120.7 msec (126 GFLOP/sec) | 207.7 msec (73 GFLOP/sec) | 211.1 msec (72 GFLOP/sec) |
| `axby, yacx -> abc` | batched complicated GEMM | 2 n^5 | 124.1 msec (122 GFLOP/sec) | 29.7 sec (0.5 GFLOP/sec) | 179.2 msec (85 GFLOP/sec) |
| `xpay, aybx -> ab` | trace then complicated GEMM | 2 n^4 | 36.4 msec (4.3 GFLOP/sec) | 33.9 sec (0.0 GFLOP/sec) | 106.9 msec (1.5 GFLOP/sec) |

Benchmark of strided case

| case | description | FLOPs | TBLIS | NumPy (MKL) | PyTorch (CPU) |
| --- | --- | --- | --- | --- | --- |
| `abxy, xycd -> abcd` | naive GEMM | 2 n^6 | 2.02 sec (722 GFLOP/sec) | 7.30 sec (200 GFLOP/sec) | 2.10 sec (694 GFLOP/sec) |
| `axyz, xyzb -> ab` | naive GEMM | 2 n^5 | 133.1 msec (114 GFLOP/sec) | 776.8 msec (20 GFLOP/sec) | 204.4 msec (74 GFLOP/sec) |
| `axyz, bxyz -> ab` | naive SYRK | n^5 | 98.3 msec (77 GFLOP/sec) | 455.5 msec (17 GFLOP/sec) | 211.4 msec (36 GFLOP/sec) |
| `axyz, ybzx -> ab` | complicated GEMM | 2 n^5 | 144.7 msec (105 GFLOP/sec) | 725.0 msec (21 GFLOP/sec) | 406.7 msec (37 GFLOP/sec) |
| `axby, yacx -> abc` | batched complicated GEMM | 2 n^5 | 142.7 msec (106 GFLOP/sec) | 27.1 sec (0.6 GFLOP/sec) | 263.6 msec (58 GFLOP/sec) |
| `xpay, aybx -> ab` | trace then complicated GEMM | 2 n^4 | 232.3 msec (0.7 GFLOP/sec) | 248.5 sec (0.0 GFLOP/sec) | 147.3 msec (1.1 GFLOP/sec) |

Citation

This work, TBLIS for Rust, is not the original work of TBLIS.

Please cite TBLIS as:

Matthews, D. A. High-Performance Tensor Contraction without Transposition. SIAM J. Sci. Comput. 2018, 40 (1), C1–C24. DOI: 10.1137/16M108968X. arXiv: 1607.00291.

Related work is:

Huang, J.; Matthews, D. A.; van de Geijn, R. A. Strassen’s Algorithm for Tensor Contraction. SIAM J. Sci. Comput. 2018, 40 (3), C305–C326. DOI: 10.1137/17M1135578. arXiv: 1704.03092.
