Find the Bottleneck: Speed Up ML Pipelines by 10%–800% | ML Prague 2026

Last updated: 2026-05-04 by Michal

Please complete the following setup steps before the workshop so that we can save time during the session. The tools we will use require downloading several large files, which can be difficult over the venue WiFi. Thank you!

If you run into any issues, please write to michal (you know what) minfx.ai.

1. Setup

Install the following tools on the laptop you will be using at the workshop. You do not need to have a GPU on your laptop.

1.1. Install Nsight Systems and Nsight Compute by following the NVIDIA installation guides.

1.2. Install the following packages:

python3 -m virtualenv venv
source venv/bin/activate
# nvtx: annotations for Nsight Systems
# viztracer: event-based performance profiling
pip install torch numpy nvtx viztracer

Note: The latest torch==2.11.0 may require updating NVIDIA drivers. If you do not want to do that, you can use an older torch version, e.g. torch==2.10.0.
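To confirm that step 1.2 succeeded before the workshop, a short check like the following can be run inside the activated virtual environment. It is a minimal sketch: the helper name check_packages is hypothetical, and it only tests that each package is discoverable and reports its version, not that the package works with your drivers.

```python
import importlib.util
from importlib import metadata

# Packages installed in step 1.2.
REQUIRED = ["torch", "numpy", "nvtx", "viztracer"]


def check_packages(names):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in names:
        # find_spec locates the package without importing it.
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata (e.g. a stdlib module).
            report[name] = "unknown"
    return report


if __name__ == "__main__":
    for name, version in check_packages(REQUIRED).items():
        print(f"{name:10s} {version or 'MISSING'}")
```

Any line that prints MISSING means the corresponding pip install did not complete in the current environment.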

1.3. Clone the following repository:

git clone https://github.com/minfx-ai/mlprague2026

1.4. The repository contains an example/example.py script and an archive of profiling traces in the example/ directory. Unpack the traces:

cd mlprague2026/example/
unzip example-traces.zip

2. Open profiling traces

Nsight Systems

2.1. Open Nsight Systems:

nsys-ui

You may need to specify the full path, e.g. /usr/local/cuda/bin/nsys-ui; please refer to the NVIDIA installation guides.

Then use File > Open to load the profiling trace example/example.nsys-rep.

Nsight Compute

2.2. Open Nsight Compute:

ncu-ui

You may need to specify the full path, e.g. /usr/local/cuda/bin/ncu-ui or /usr/local/NVIDIA-Nsight-Compute-2026.1/ncu-ui; please refer to the NVIDIA installation guides.

Then use File > Open to load the profiling trace example/example.ncu-rep.

Perfetto (viztracer)

2.3. Open the Perfetto (viztracer) profile:

vizviewer example/example.viztracer.json

Note: vizviewer might not work well with Firefox; it should work fine with Chrome.
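The same trace file can also be inspected from a script, which is handy for a quick ranking before opening the viewer. The sketch below assumes the file follows the Chrome Trace Event layout: a top-level "traceEvents" list whose complete events have "ph" equal to "X", a "name", and a "dur" in microseconds. The function name top_functions and the tiny inline trace are made up for illustration.

```python
import json
from collections import Counter


def top_functions(trace, n=5):
    """Return the n event names with the largest summed duration (microseconds)."""
    totals = Counter()
    for event in trace.get("traceEvents", []):
        # Only complete events ("X") carry a duration; skip metadata and others.
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return totals.most_common(n)


# Synthetic stand-in for a real trace, so the sketch runs without any file.
sample = {
    "traceEvents": [
        {"ph": "M", "name": "process_name"},
        {"ph": "X", "name": "load", "dur": 120.0},
        {"ph": "X", "name": "step", "dur": 40.0},
        {"ph": "X", "name": "step", "dur": 45.0},
    ]
}

for name, dur_us in top_functions(sample):
    print(f"{dur_us:10.1f} us  {name}")

# For a real trace, replace `sample` with:
#   with open("example/example.viztracer.json") as f:
#       trace = json.load(f)
```

Durations are inclusive, so a caller's total also contains the time spent in the functions it calls.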

3. (optional) Generate profiling traces

This section requires access to an NVIDIA GPU. At the workshop, we will provide captured profile traces so we can focus on analysis rather than trace generation. You do not need to complete this step for the workshop, but you will need it later to analyze your own code.
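When you later profile your own code, NVTX ranges are what give the Nsight Systems timeline readable labels. The sketch below shows one way to wrap phases of a workload with nvtx.annotate from the nvtx package installed in step 1.2. The annotate helper, the "workload" domain, and the toy computation are hypothetical and are not taken from example.py. The fallback to a no-op context manager is only there so the sketch also runs where nvtx is not installed.

```python
import contextlib

try:
    import nvtx  # installed in step 1.2
except ImportError:  # allows the sketch to run without the package
    nvtx = None


def annotate(message, domain=None):
    """Return an NVTX range context manager, or a no-op if nvtx is unavailable."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(message=message, domain=domain)


def run(steps=3):
    """Toy workload with one outer range and two nested ranges per iteration."""
    results = []
    with annotate("main"):
        for _ in range(steps):
            with annotate("init", domain="workload"):
                data = list(range(1000))
            with annotate("step", domain="workload"):
                results.append(sum(x * x for x in data))
    return results


if __name__ == "__main__":
    print(run())
```

Outside of a profiler the ranges have no visible effect; they show up only when the script runs under nsys with NVTX tracing enabled, as in the command in step 3.1.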

Nsight Systems

3.1. Run the following command to get the performance data:

nsys profile \
    --output="my-example-trace" \
    --force-overwrite=true \
    --trace=cuda,nvtx,osrt,python-gil \
    --sample=process-tree \
    --cpuctxsw=process-tree \
    --cuda-memory-usage=true \
    --stats=true \
    --inherit-environment=false \
    python "example.py"

You should see the following new files:

my-example-trace.nsys-rep
my-example-trace.sqlite

and output in the terminal similar to the following:

    Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)    Style           Range        
    --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  -------  ---------------------
        46,0    1 359 277 097     25 137       54 074,0        1 370,0          150  307 475 029   2 428 147,0  PushPop  GIL Trace:Holding GIL
        26,0      777 629 839          1  777 629 839,0  777 629 839,0  777 629 839  777 629 839           0,0  PushPop  :main                
        15,0      443 563 028          1  443 563 028,0  443 563 028,0  443 563 028  443 563 028           0,0  PushPop  :alloc               
         6,0      197 782 868         10   19 778 286,0   19 515 334,0   19 498 648   20 780 227     458 710,0  PushPop  matmul:init          
         4,0      133 764 385         10   13 376 438,0       32 010,0       30 441  133 439 468  42 185 849,0  PushPop  matmul:step          
         0,0        2 283 261         10      228 326,0      210 399,0      203 814      336 797      43 521,0  PushPop  matmul:move          
   
   [4/8] Executing 'osrt_sum' stats report
   
    Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)            Name         
    --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------
        56,0      887 345 109         17   52 196 771,0   17 060 412,0        1 150  286 512 797  74 352 657,0  poll                  
        31,0      500 071 138          1  500 071 138,0  500 071 138,0  500 071 138  500 071 138           0,0  pthread_cond_timedwait
         8,0      133 642 961        561      238 222,0       17 150,0        1 200   13 361 097     816 313,0  ioctl                 
         0,0       12 105 437      5 280        2 292,0        2 070,0        1 040       15 070         768,0  stat64                
         0,0       10 713 099      7 856        1 363,0        1 330,0        1 060        9 881         321,0  lstat64               
         0,0        6 446 811      1 164        5 538,0        1 870,0        1 000      211 665      13 342,0  read                  
         0,0        5 936 260        998        5 948,0        3 040,0        1 570    2 130 649      67 988,0  mmap64                
         0,0        5 057 676          1    5 057 676,0    5 057 676,0    5 057 676    5 057 676           0,0  nanosleep             
         0,0        4 034 569        929        4 342,0        4 240,0        2 710       15 931       1 356,0  munmap                
         0,0        3 927 183      1 083        3 626,0        3 440,0        1 720       14 210         907,0  open64                
         0,0        1 864 202         33       56 491,0       56 872,0       25 591       68 362       6 629,0  sleep                 
         0,0          803 957         39       20 614,0       16 270,0        1 090       69 742      16 033,0  fgets                 
         0,0          248 747         48        5 182,0        3 695,0        1 370       27 931       5 014,0  fopen                 
         0,0          178 513         15       11 900,0        4 170,0        2 300       97 002      23 810,0  mmap                  
         0,0          177 075          9       19 675,0       18 960,0       14 221       30 271       5 314,0  sem_timedwait         
         0,0          124 754          3       41 584,0       41 151,0       30 921       52 682      10 887,0  pthread_create        
         0,0          104 472          1      104 472,0      104 472,0      104 472      104 472           0,0  pthread_cond_wait     
         0,0           78 823         47        1 677,0        1 570,0        1 070        5 510         735,0  fclose                
         0,0           76 992         16        4 812,0        4 065,0        2 030       14 751       3 090,0  open                  
         0,0           64 500         53        1 217,0        1 070,0        1 000        6 840         804,0  fstat64               
         0,0           25 610          4        6 402,0        5 120,0        1 950       13 420       4 986,0  fread                 
         0,0           20 200         11        1 836,0        1 530,0        1 060        5 180       1 156,0  write                 
         0,0           19 662          6        3 277,0        3 170,0        1 590        5 540       1 526,0  fopen64               
         0,0           15 800          3        5 266,0        6 190,0        3 150        6 460       1 838,0  pipe2                 
         0,0           13 030          2        6 515,0        6 515,0        5 100        7 930       2 001,0  socket                
         0,0           10 650          1       10 650,0       10 650,0       10 650       10 650           0,0  connect               
         0,0            8 170          3        2 723,0        2 100,0        1 810        4 260       1 338,0  pthread_cond_broadcast
         0,0            6 070          1        6 070,0        6 070,0        6 070        6 070           0,0  pthread_cond_signal   
         0,0            3 810          2        1 905,0        1 905,0        1 460        2 350         629,0  fwrite                
         0,0            2 040          1        2 040,0        2 040,0        2 040        2 040           0,0  bind                  
         0,0            1 050          1        1 050,0        1 050,0        1 050        1 050           0,0  fflush                
         0,0            1 030          1        1 030,0        1 030,0        1 030        1 030           0,0  sigaction             
   
   [5/8] Executing 'cuda_api_sum' stats report
   
    Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                Name               
    --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ---------------------------------
        68,0       24 024 482         14  1 716 034,0  1 600 751,0    544 522  3 633 724  1 067 958,0  cuLibraryLoadData                
        17,0        6 252 944          2  3 126 472,0  3 126 472,0  2 993 769  3 259 175    187 670,0  cudaGetDeviceProperties_v2_v12000
         3,0        1 291 650         11    117 422,0      8 341,0      7 710  1 177 237    351 577,0  cudaLaunchKernel                 
         3,0        1 273 199         10    127 319,0    123 952,0    121 703    154 723     10 024,0  cudaMemcpyAsync                  
         2,0          939 561          1    939 561,0    939 561,0    939 561    939 561          0,0  cudaFree                         
         1,0          621 053         10     62 105,0     61 816,0     60 732     64 042        982,0  cudaStreamSynchronize            
         0,0          297 637          4     74 409,0     97 682,0      2 850     99 422     47 713,0  cudaMalloc                       
         0,0          122 103        838        145,0        120,0         60        730         81,0  cuGetProcAddress_v2              
         0,0           42 762         14      3 054,0      3 650,0        290      4 541      1 514,0  cuLibraryGetKernel               
         0,0            9 400         18        522,0        250,0        200      1 780        578,0  cudaEventCreateWithFlags         
         0,0            8 150          3      2 716,0        890,0        230      7 030      3 750,0  cudaStreamIsCapturing_v10000     
         0,0            3 940          4        985,0        900,0        770      1 370        278,0  cuInit                           
         0,0            1 200          3        400,0         90,0         70      1 040        554,0  cuModuleGetLoadingMode           
         0,0              720          2        360,0        360,0        190        530        240,0  cudaGetDriverEntryPoint_v11030   
   
   [6/8] Executing 'cuda_gpu_kern_sum' stats report
   
    Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
    --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
        98,0          789 759         10  78 975,0  78 960,0    78 848    79 231        112,0  ampere_fp16_s16816gemm_fp16_256x128_ldg8_f2f_stages_32x3_nn                                         
         1,0           11 168          1  11 168,0  11 168,0    11 168    11 168          0,0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
   
   [7/8] Executing 'cuda_gpu_mem_time_sum' stats report
   
    Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Operation          
    --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ----------------------------
       100,0        1 175 353     10  117 535,0  116 896,0   116 319   120 447      1 448,0  [CUDA memcpy Host-to-Device]
   
   [8/8] Executing 'cuda_gpu_mem_size_sum' stats report
   
    Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation          
    ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
    20,972         10  2,097     2,097     2,097     2,097     0,000        [CUDA memcpy Host-to-Device]   
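A large gap between Avg and Med in these tables usually means that one call dominates the total. The sketch below recomputes this for two rows of the sample output above by removing the slowest call (the Max column) and averaging the rest. The helper name split_outlier is hypothetical. Reading the slow call as a one-time warm-up cost is an assumption and should be confirmed in the timeline.

```python
def split_outlier(total_ns, instances, max_ns):
    """Return (share of total spent in the slowest call, mean of the other calls in ns)."""
    if instances < 2:
        raise ValueError("need at least two instances to separate an outlier")
    share = max_ns / total_ns
    rest_mean = (total_ns - max_ns) / (instances - 1)
    return share, rest_mean


# (Total Time, Instances or Num Calls, Max) copied from the sample tables, in ns.
rows = {
    "matmul:step": (133_764_385, 10, 133_439_468),
    "cudaLaunchKernel": (1_291_650, 11, 1_177_237),
}

for name, (total, count, worst) in rows.items():
    share, rest_mean = split_outlier(total, count, worst)
    print(f"{name}: slowest call is {share:.1%} of total, others average {rest_mean:,.0f} ns")
```

For matmul:step, the remaining nine calls average about 36 µs, close to the reported median of about 32 µs, so the 13 ms average says little about a typical step.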

Nsight Compute

3.2. Nsight Compute requires elevated privileges to access GPU performance counters. Either run it as root or grant your user permission to use the counters. See also NVIDIA: Performance Counters
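On Linux, the current restriction can often be checked before running ncu. The sketch below assumes that the NVIDIA driver exposes its parameters in /proc/driver/nvidia/params and that a line of the form "RmProfilingAdminOnly: 1" means profiling is limited to administrators. Both the path and the key are assumptions to verify against the NVIDIA documentation for your driver version. The function name profiling_admin_only is hypothetical.

```python
from pathlib import Path

# Assumed location of the driver parameters on Linux.
PARAMS_FILE = Path("/proc/driver/nvidia/params")


def profiling_admin_only(params_text):
    """Return True or False for the RmProfilingAdminOnly setting, or None if absent."""
    for line in params_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "RmProfilingAdminOnly":
            return value.strip() == "1"
    return None


if __name__ == "__main__":
    if PARAMS_FILE.exists():
        state = profiling_admin_only(PARAMS_FILE.read_text())
        print({True: "admin only", False: "all users", None: "setting not found"}[state])
    else:
        print(f"{PARAMS_FILE} not found (no NVIDIA driver loaded, or not Linux)")
```

If the result is "admin only", either run ncu as root or change the driver setting as described by NVIDIA.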

Run the following command to get the performance data:

ncu -o my-example-trace python example.py

You should see the following new file:

my-example-trace.ncu-rep

and output in the terminal similar to the following:

==PROF== Connected to process 38129 (/usr/bin/python3.12)
==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 1: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 2: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 3: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 4: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 5: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 6: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 7: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 8: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 9: 0%....50%....100% - 9 passes
==PROF== Profiling "ampere_fp16_s16816gemm_fp16_2..." - 10: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 38129
==PROF== Report: my-example-trace.ncu-rep

Perfetto (viztracer)

3.3. Run the following command to get the performance data:

viztracer -o my-example-trace.viztracer.json example.py