I have implemented a few statistical operations in Concrete and I'm testing them on a GPU machine, but I noticed that the performance is better when not using the GPU (use_gpu=False).
I have tensorized my code for calculating the variance (numerator only):
from concrete import fhe
import numpy as np

@fhe.compiler({"array": "encrypted"})
def calculate_variance_numerator_tensorized(array):
    n = array.size
    # Vectorized operations
    array_sq = array * array           # elementwise square
    array_sum = np.sum(array)          # sum of all elements
    array_sq_sum = np.sum(array_sq)    # sum of squares
    # list_into_sum = Σ(array[i] * array_sum)
    list_into_sum = np.sum(array * array_sum)
    component_one = array_sq_sum * (n * n)
    component_two = list_into_sum * (n * 2)
    component_three = array_sum * array_sum * n
    return fhe.refresh((component_three + component_one) - component_two)

# lrange, lsize, parallelize and gpu are defined earlier in my script
inputset = [np.random.randint(0, lrange, size=lsize) for _ in range(5)]
inputset.append(np.full(lsize, lrange))
circuit = calculate_variance_numerator_tensorized.compile(
    inputset, dataflow_parallelize=parallelize, use_gpu=gpu
)
Could you please confirm why the performance on CPU is better than on GPU? I also noticed that dataflow_parallelize doesn't work on tensorized code; is it because the code is already parallelized?
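For reference, this is roughly the kind of comparison I mean; the timing harness below is only an illustrative sketch (not my exact benchmark script), reusing the same inputset, lrange and lsize as above:

import time
import numpy as np

# Illustrative only: compile the same circuit with and without the GPU and
# time a single encrypted run on the same input.
sample = np.random.randint(0, lrange, size=lsize)
for gpu_flag in (False, True):
    circuit = calculate_variance_numerator_tensorized.compile(inputset, use_gpu=gpu_flag)
    circuit.keygen()                    # generate keys outside the timed section
    encrypted = circuit.encrypt(sample)
    start = time.perf_counter()
    circuit.run(encrypted)
    print(f"use_gpu={gpu_flag}: {time.perf_counter() - start:.3f} s")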
It's hard to say why the CPU is faster than the GPU in this case. It often depends on how much of the computation subgraph can be offloaded to the GPU and on the actual GPU workload: if the workload is not large enough, you can lose more time transferring data to the GPU than you save compared to keeping everything on the CPU.
What do you mean by "dataflow_parallelize doesn't work on tensorized code"? The two should actually be orthogonal.
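They are two separate configuration options, so you should be able to combine them freely; for example (just a sketch, assuming the same option names you already pass to compile are also valid Configuration fields):

from concrete import fhe

# Sketch: dataflow parallelism and GPU offloading are controlled independently.
configuration = fhe.Configuration(dataflow_parallelize=True, use_gpu=True)
circuit = calculate_variance_numerator_tensorized.compile(inputset, configuration)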
I have a dedicated AWS instance for testing the code with the GPU, and it currently has one GPU. Does that mean I need more GPUs, or is it just the offloading (data transfer) that's increasing the time?
As for dataflow_parallelize: when I try to enable it on the tensorized code (variance numerator), I get the following error.
Does it have something to do with the NumPy methods?
For a separate statistical operation (five-point summary), I am using the np.max() and np.min() methods. With dataflow_parallelize enabled, it returns:
python3: symbol lookup error: /tmp/tmpz5k6t7o4/sharedlib.so: undefined symbol: _dfr_make_ready_future
You can try with more GPUs, but my guess is that the GPU's better performance is not being exploited because the overhead of transferring data to the GPU is greater than the gain from using it.
Using the GPU only leads to better performance if the workload is large enough to compensate for the data transfer.
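One rough way to check this is to grow the input size and see whether the GPU run eventually catches up. A sketch only (the sizes and value range are arbitrary, and the measurement includes encryption/decryption time):

import time
import numpy as np

# Sketch: compare CPU and GPU latency as the tensor size grows.
for size in (64, 256, 1024, 4096):
    inputset = [np.random.randint(0, 16, size=size) for _ in range(5)]
    sample = np.random.randint(0, 16, size=size)
    for gpu_flag in (False, True):
        circuit = calculate_variance_numerator_tensorized.compile(inputset, use_gpu=gpu_flag)
        circuit.keygen()  # keep key generation outside the timed section
        start = time.perf_counter()
        circuit.encrypt_run_decrypt(sample)
        print(f"size={size} use_gpu={gpu_flag}: {time.perf_counter() - start:.2f} s")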
Both errors you are getting look like internal bugs in the compiler: the first one happens during the compilation pipeline, and the second one when trying to run the code. "_dfr_make_ready_future" is a symbol that should be provided by the Concrete runtime. Which release are you using?
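If you are not sure, something like this should print the installed release (assuming the package was installed from PyPI as concrete-python):

import importlib.metadata
print(importlib.metadata.version("concrete-python"))  # installed Concrete Python release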