I have implemented a few statistical operations in Concrete and I'm testing them on a GPU machine, but I noticed that the performance is better when not using the GPU (use_gpu=False).
I have tensorized my code for calculating the variance (numerator only):
from concrete import fhe
import numpy as np

@fhe.compiler({"array": "encrypted"})
def calculate_variance_numerator_tensorized(array):
    n = array.size
    # Vectorized operations
    array_sq = array * array           # elementwise square
    array_sum = np.sum(array)          # sum of all elements
    array_sq_sum = np.sum(array_sq)    # sum of squares
    # list_into_sum = Σ(array[i] * array_sum)
    list_into_sum = np.sum(array * array_sum)
    component_one = array_sq_sum * (n * n)
    component_two = list_into_sum * (n * 2)
    component_three = array_sum * array_sum * n
    return fhe.refresh((component_three + component_one) - component_two)

# lrange, lsize, parallelize and gpu are defined earlier in my script
inputset = [np.random.randint(0, lrange, size=lsize) for _ in range(5)]
inputset.append(np.full(lsize, lrange))
circuit = calculate_variance_numerator_tensorized.compile(
    inputset, dataflow_parallelize=parallelize, use_gpu=gpu
)
Could you please confirm why the performance on CPU is better than on GPU? I also noticed that dataflow_parallelize doesn't work on tensorized code; is it because the code is already parallelized?
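For reference, this is roughly the kind of comparison I mean; the timing harness below is only an illustrative sketch (not my exact benchmark script), reusing the same inputset, lrange and lsize as above:

import time
import numpy as np

# Illustrative only: compile the same circuit with and without the GPU and
# time a single encrypted run on the same input.
sample = np.random.randint(0, lrange, size=lsize)
for gpu_flag in (False, True):
    circuit = calculate_variance_numerator_tensorized.compile(inputset, use_gpu=gpu_flag)
    circuit.keygen()                    # generate keys outside the timed section
    encrypted = circuit.encrypt(sample)
    start = time.perf_counter()
    circuit.run(encrypted)
    print(f"use_gpu={gpu_flag}: {time.perf_counter() - start:.3f} s")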
It's hard to say why the CPU is faster than the GPU in this case. It often depends on how much of the computation subgraph can be offloaded to the GPU and on the actual GPU workload: if the workload is not large enough, you can lose more time transferring data to the GPU than you save compared to keeping everything on the CPU.
What do you mean by "dataflow_parallelize doesn't work on tensorized code"? The two should actually be orthogonal.
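They are two separate configuration options, so you should be able to combine them freely; for example (just a sketch, assuming the same option names you already pass to compile are also valid Configuration fields):

from concrete import fhe

# Sketch: dataflow parallelism and GPU offloading are controlled independently.
configuration = fhe.Configuration(dataflow_parallelize=True, use_gpu=True)
circuit = calculate_variance_numerator_tensorized.compile(inputset, configuration)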
I have a dedicated AWS instance for testing the code with the GPU, and it currently has one GPU. Does that mean I need more GPUs, or is it just the offloading (data transfer) that's increasing the time?
As for dataflow_parallelize: when I try to enable it on the tensorized code (variance numerator), I get the following error.
Does it have something to do with the NumPy methods?
For a separate statistical operation (five-point summary), I am using the np.max() and np.min() methods. With dataflow_parallelize enabled, it returns:
python3: symbol lookup error: /tmp/tmpz5k6t7o4/sharedlib.so: undefined symbol: _dfr_make_ready_future
You can try with more GPUs, but my guess is that the GPU's better performance is not being exploited because the overhead of transferring data to the GPU is greater than the gain from using it.
Using the GPU only leads to better performance if the workload is large enough to compensate for the data transfer.
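One rough way to check this is to grow the input size and see whether the GPU run eventually catches up. A sketch only (the sizes and value range are arbitrary, and the measurement includes encryption/decryption time):

import time
import numpy as np

# Sketch: compare CPU and GPU latency as the tensor size grows.
for size in (64, 256, 1024, 4096):
    inputset = [np.random.randint(0, 16, size=size) for _ in range(5)]
    sample = np.random.randint(0, 16, size=size)
    for gpu_flag in (False, True):
        circuit = calculate_variance_numerator_tensorized.compile(inputset, use_gpu=gpu_flag)
        circuit.keygen()  # keep key generation outside the timed section
        start = time.perf_counter()
        circuit.encrypt_run_decrypt(sample)
        print(f"size={size} use_gpu={gpu_flag}: {time.perf_counter() - start:.2f} s")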
Both errors you are getting look like internal bugs in the compiler: the first one happens during the compilation pipeline, and the second one when trying to run the code. "_dfr_make_ready_future" is a symbol that should be provided by the Concrete runtime. Which release are you using?
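If you are not sure, something like this should print the installed release (assuming the package was installed from PyPI as concrete-python):

import importlib.metadata
print(importlib.metadata.version("concrete-python"))  # installed Concrete Python release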