Find the Bottleneck: Speed Up ML Pipelines by 10%–800% | ML Prague 2026
Last updated: 2026-05-04 by Michal
- Please fill out participation form.
- Presentation slides link
- Lab data link
- Please fill out survey feedback form
Then please complete the following setup steps so we can save time during the workshop. The tools we will use require downloading a number of large files from the internet, which can be difficult to do using WiFi at the location. Thank you!
If you run into any issues, please write to michal (you know what) minfx.ai.
1. Setup
Install the following tools on the laptop you will be using at the workshop. You do not need to have a GPU on your laptop.
1.1. Install Nsight Systems and Nsight Compute:
- Nsight Systems: System-wide performance profiling.
The easiest way to install Nsight Systems is to use the NVIDIA package manager (Ubuntu - desktop version).
Note: If the auto-detection of the OS version fails, you can try to find the most appropriate version manually in the NVIDIA repository.
- Nsight Compute: Detailed kernel-level performance analysis.
The easiest way to install Nsight Compute is to download the NVIDIA binaries directly from the linked website.
1.2. Install the following packages:
python3 -m virtualenv venv
source venv/bin/activate
# nvtx: annotations for Nsight Systems
# viztracer: event-based performance profiling
pip install torch numpy nvtx viztracer
Note: The latest torch==2.11.0 may require updating NVIDIA drivers. If you do not want to do that, you can use an older torch version, e.g. torch==2.10.0.
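To confirm that the packages from step 1.2 are importable before the workshop, a small check like the following may help. It is a minimal sketch: the helper name missing_packages is hypothetical, and it assumes the import names match the pip package names listed above.

```python
import importlib.util

# Packages installed in step 1.2 (import names assumed equal to pip names).
REQUIRED = ["torch", "numpy", "nvtx", "viztracer"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All workshop packages are importable.")
```

Run it inside the activated venv. It only checks that each package can be located; it does not import torch or verify the NVIDIA driver version.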
1.3. Clone the following repository:
git clone https://github.com/minfx-ai/mlprague2026
1.4. There is an example/example.py script and profiling traces in the example/ directory:
cd example/
unzip example-traces.zip
2. Open profiling traces
Nsight Systems
2.1. Open Nsight Systems:
nsys-ui
You may need to specify the full path, e.g. /usr/local/cuda/bin/nsys-ui; please refer to the NVIDIA installation guides.
Then use File > Open to load the profiling trace example/example.nsys-rep
Nsight Compute
2.2. Open Nsight Compute:
ncu-ui
You may need to specify the full path, e.g. /usr/local/cuda/bin/ncu-ui or /usr/local/NVIDIA-Nsight-Compute-2026.1/ncu-ui; please refer to the NVIDIA installation guides.
Then use File > Open to load the profiling trace example/example.ncu-rep
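If nsys-ui or ncu-ui is not found on your PATH, a short script can check the locations mentioned in this guide. This is a sketch under assumptions: the helper name locate is hypothetical, and the fallback paths are only the examples given above, so your installation may place the binaries elsewhere.

```python
import os
import shutil

# Example locations taken from this guide; adjust them for your install.
FALLBACKS = {
    "nsys-ui": ["/usr/local/cuda/bin/nsys-ui"],
    "ncu-ui": [
        "/usr/local/cuda/bin/ncu-ui",
        "/usr/local/NVIDIA-Nsight-Compute-2026.1/ncu-ui",
    ],
}


def locate(tool, fallbacks=FALLBACKS):
    """Return a path to `tool`, searching PATH first, then the fallbacks."""
    found = shutil.which(tool)
    if found:
        return found
    for candidate in fallbacks.get(tool, []):
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    for tool in FALLBACKS:
        print(f"{tool}: {locate(tool) or 'not found'}")
```

If a tool is reported as not found, please refer to the NVIDIA installation guides for the install location on your system.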
Perfetto (viztracer)
2.3. Open the Perfetto (viztracer) profile:
vizviewer example/example.viztracer.json
Note: vizviewer might not work well with Firefox; it should work fine with Chrome.
3. (optional) Generate profiling traces
This section requires access to an NVIDIA GPU. At the workshop, we will provide captured profile traces so we can focus on analysis rather than trace generation. You do not need to complete this step for the workshop, but you will need it later to analyze your own code.
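When you later profile your own code, NVTX ranges make named regions show up in the Nsight Systems timeline and in its NVTX summary. The following is a minimal sketch, not the workshop's example.py: the range names demo:main and demo:step and the functions are hypothetical, and the fallback branch is only there so the sketch also runs where nvtx is not installed.

```python
import contextlib

try:
    import nvtx  # installed in step 1.2

    annotate = nvtx.annotate
except ImportError:
    # Fallback (assumption for this sketch): a no-op context manager.
    def annotate(message=None, **kwargs):
        return contextlib.nullcontext()


def step(n):
    # Each call appears as one "demo:step" range in the timeline.
    with annotate("demo:step", color="green"):
        return sum(i * i for i in range(n))


def main():
    # The outer range encloses all ten inner ranges.
    with annotate("demo:main", color="blue"):
        return [step(1000) for _ in range(10)]


if __name__ == "__main__":
    main()
```

The ranges are only recorded when the script runs under a profiler that traces NVTX; run on its own, the script behaves as if the annotations were absent.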
Nsight Systems
3.1. Run the following command to get the performance data:
nsys profile \
--output="my-example-trace" \
--force-overwrite=true \
--trace=cuda,nvtx,osrt,python-gil \
--sample=process-tree \
--cpuctxsw=process-tree \
--cuda-memory-usage=true \
--stats=true \
--inherit-environment=false \
python "example.py"
You should see the following new files:
my-example-trace.nsys-rep
my-example-trace.sqlite
and output in the terminal similar to the following:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------ ------- ---------------------
46,0 1 359 277 097 25 137 54 074,0 1 370,0 150 307 475 029 2 428 147,0 PushPop GIL Trace:Holding GIL
26,0 777 629 839 1 777 629 839,0 777 629 839,0 777 629 839 777 629 839 0,0 PushPop :main
15,0 443 563 028 1 443 563 028,0 443 563 028,0 443 563 028 443 563 028 0,0 PushPop :alloc
6,0 197 782 868 10 19 778 286,0 19 515 334,0 19 498 648 20 780 227 458 710,0 PushPop matmul:init
4,0 133 764 385 10 13 376 438,0 32 010,0 30 441 133 439 468 42 185 849,0 PushPop matmul:step
0,0 2 283 261 10 228 326,0 210 399,0 203 814 336 797 43 521,0 PushPop matmul:move
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------ ----------------------
56,0 887 345 109 17 52 196 771,0 17 060 412,0 1 150 286 512 797 74 352 657,0 poll
31,0 500 071 138 1 500 071 138,0 500 071 138,0 500 071 138 500 071 138 0,0 pthread_cond_timedwait
8,0 133 642 961 561 238 222,0 17 150,0 1 200 13 361 097 816 313,0 ioctl
0,0 12 105 437 5 280 2 292,0 2 070,0 1 040 15 070 768,0 stat64
0,0 10 713 099 7 856 1 363,0 1 330,0 1 060 9 881 321,0 lstat64
0,0 6 446 811 1 164 5 538,0 1 870,0 1 000 211 665 13 342,0 read
0,0 5 936 260 998 5 948,0 3 040,0 1 570 2 130 649 67 988,0 mmap64
0,0 5 057 676 1 5 057 676,0 5 057 676,0 5 057 676 5 057 676 0,0 nanosleep
0,0 4 034 569 929 4 342,0 4 240,0 2 710 15 931 1 356,0 munmap
0,0 3 927 183 1 083 3 626,0 3 440,0 1 720 14 210 907,0 open64
0,0 1 864 202 33 56 491,0 56 872,0 25 591 68 362 6 629,0 sleep
0,0 803 957 39 20 614,0 16 270,0 1 090 69 742 16 033,0 fgets
0,0 248 747 48 5 182,0 3 695,0 1 370 27 931 5 014,0 fopen
0,0 178 513 15 11 900,0 4 170,0 2 300 97 002 23 810,0 mmap
0,0 177 075 9 19 675,0 18 960,0 14 221 30 271 5 314,0 sem_timedwait
0,0 124 754 3 41 584,0 41 151,0 30 921 52 682 10 887,0 pthread_create
0,0 104 472 1 104 472,0 104 472,0 104 472 104 472 0,0 pthread_cond_wait
0,0 78 823 47 1 677,0 1 570,0 1 070 5 510 735,0 fclose
0,0 76 992 16 4 812,0 4 065,0 2 030 14 751 3 090,0 open
0,0 64 500 53 1 217,0 1 070,0 1 000 6 840 804,0 fstat64
0,0 25 610 4 6 402,0 5 120,0 1 950 13 420 4 986,0 fread
0,0 20 200 11 1 836,0 1 530,0 1 060 5 180 1 156,0 write
0,0 19 662 6 3 277,0 3 170,0 1 590 5 540 1 526,0 fopen64
0,0 15 800 3 5 266,0 6 190,0 3 150 6 460 1 838,0 pipe2
0,0 13 030 2 6 515,0 6 515,0 5 100 7 930 2 001,0 socket
0,0 10 650 1 10 650,0 10 650,0 10 650 10 650 0,0 connect
0,0 8 170 3 2 723,0 2 100,0 1 810 4 260 1 338,0 pthread_cond_broadcast
0,0 6 070 1 6 070,0 6 070,0 6 070 6 070 0,0 pthread_cond_signal
0,0 3 810 2 1 905,0 1 905,0 1 460 2 350 629,0 fwrite
0,0 2 040 1 2 040,0 2 040,0 2 040 2 040 0,0 bind
0,0 1 050 1 1 050,0 1 050,0 1 050 1 050 0,0 fflush
0,0 1 030 1 1 030,0 1 030,0 1 030 1 030 0,0 sigaction
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ---------------------------------
68,0 24 024 482 14 1 716 034,0 1 600 751,0 544 522 3 633 724 1 067 958,0 cuLibraryLoadData
17,0 6 252 944 2 3 126 472,0 3 126 472,0 2 993 769 3 259 175 187 670,0 cudaGetDeviceProperties_v2_v12000
3,0 1 291 650 11 117 422,0 8 341,0 7 710 1 177 237 351 577,0 cudaLaunchKernel
3,0 1 273 199 10 127 319,0 123 952,0 121 703 154 723 10 024,0 cudaMemcpyAsync
2,0 939 561 1 939 561,0 939 561,0 939 561 939 561 0,0 cudaFree
1,0 621 053 10 62 105,0 61 816,0 60 732 64 042 982,0 cudaStreamSynchronize
0,0 297 637 4 74 409,0 97 682,0 2 850 99 422 47 713,0 cudaMalloc
0,0 122 103 838 145,0 120,0 60 730 81,0 cuGetProcAddress_v2
0,0 42 762 14 3 054,0 3 650,0 290 4 541 1 514,0 cuLibraryGetKernel
0,0 9 400 18 522,0 250,0 200 1 780 578,0 cudaEventCreateWithFlags
0,0 8 150 3 2 716,0 890,0 230 7 030 3 750,0 cudaStreamIsCapturing_v10000
0,0 3 940 4 985,0 900,0 770 1 370 278,0 cuInit
0,0 1 200 3 400,0 90,0 70 1 040 554,0 cuModuleGetLoadingMode
0,0 720 2 360,0 360,0 190 530 240,0 cudaGetDriverEntryPoint_v11030
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
98,0 789 759 10 78 975,0 78 960,0 78 848 79 231 112,0 ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_32x3_nn
1,0 11 168 1 11 168,0 11 168,0 11 168 11 168 0,0 void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- --------- -------- -------- ----------- ----------------------------
100,0 1 175 353 10 117 535,0 116 896,0 116 319 120 447 1 448,0 [CUDA memcpy Host-to-Device]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ----------------------------
20,972 10 2,097 2,097 2,097 2,097 0,000 [CUDA memcpy Host-to-Device]
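The summary tables can be combined into a few quick sanity numbers. The sketch below copies three values from the example output above and derives the effective host-to-device bandwidth and the ratio of copy time to GEMM kernel time. It assumes the report's MB means 10^6 bytes; the variable and function names are hypothetical.

```python
# Values copied from the example report above.
h2d_total_mb = 20.972      # cuda_gpu_mem_size_sum: Total (MB)
h2d_total_ns = 1_175_353   # cuda_gpu_mem_time_sum: Total Time (ns)
gemm_total_ns = 789_759    # cuda_gpu_kern_sum: GEMM kernel Total Time (ns)


def bandwidth_gb_per_s(megabytes, nanoseconds):
    """Convert MB transferred in a given number of ns to GB/s.

    Assumes 1 MB = 1e6 bytes and 1 GB = 1e9 bytes.
    """
    bytes_moved = megabytes * 1e6
    seconds = nanoseconds * 1e-9
    return bytes_moved / seconds / 1e9


h2d_gbps = bandwidth_gb_per_s(h2d_total_mb, h2d_total_ns)
copy_vs_gemm = h2d_total_ns / gemm_total_ns

print(f"Host-to-device bandwidth: {h2d_gbps:.1f} GB/s")
print(f"Copy time / GEMM time: {copy_vs_gemm:.2f}")
```

For this example the copies run at roughly 17.8 GB/s and take about 1.5 times as long as the GEMM kernels, which suggests that data movement, not compute, is the first thing to look at in this trace.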
Nsight Compute
3.2. Nsight Compute requires elevated privileges to access GPU performance counters. Run it as root or grant your user the necessary permissions. See also NVIDIA: Performance Counters.
Run the following command to get the performance data:
ncu -o my-example-trace python example.py
You should see the following new file:
my-example-trace.ncu-rep
and output in the terminal similar to the following:
==PROF== Connected to process 38129 (/usr/bin/python3.12)
==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 1: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 2: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 3: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 4: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 5: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 6: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 7: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 8: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 9: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 10: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 38129
==PROF== Report: my-example-trace.ncu-rep
Perfetto (viztracer)
3.3. Run the following command to get the performance data:
viztracer -o my-example-trace.viztracer.json example.py
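Besides opening the result in vizviewer, the JSON file can be summarized with a few lines of Python. The sketch below assumes the file follows the Chrome Trace Event layout (a top-level "traceEvents" list whose complete events have "ph": "X" and a "dur" in microseconds); the helper names and the tiny synthetic trace are hypothetical and exist only so the example runs without a real profile.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a trace file such as my-example-trace.viztracer.json."""
    with open(path) as f:
        return json.load(f)


def top_functions(trace, limit=5):
    """Sum durations (microseconds) of complete events by name, largest first."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":  # complete event with a duration
            totals[event["name"]] += event.get("dur", 0.0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Synthetic stand-in for a real trace (assumption for this sketch).
demo = {
    "traceEvents": [
        {"ph": "X", "name": "matmul", "ts": 0, "dur": 120.0},
        {"ph": "X", "name": "load", "ts": 130, "dur": 30.0},
        {"ph": "X", "name": "matmul", "ts": 170, "dur": 80.0},
        {"ph": "M", "name": "process_name"},  # metadata event, ignored
    ]
}

print(top_functions(demo))
```

The durations are inclusive, so time spent in nested calls is also counted in their callers. To try it on a real profile, pass the result of load_trace("my-example-trace.viztracer.json") to top_functions.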