https://cis.temple.edu/~latecki/Papers/JoPanECAI2024.pdf
4.4 Settings

testing subset of the FlowLearn dataset, which included assessments of 500 scientific flowcharts and 2,000 simulated flowcharts. Due to cost constraints and API limitations, we limited our evaluations to 100 samples per ta 3-Opus, GPT-4V, and Step-1V. All other evaluations were conducted using an NVIDIA A100 80GB GPU.