Recent papers in flowchart understanding with VLMs

Summaries of recent papers about the flowchart-to-code task. Adapted from my knowledge sharing session at Continental-NTU Corplab.

This post does not contain my own research, just my takeaways from reading these papers.

Flowcharts

Flowcharts are documents most frequently used to depict the sequence of steps in a process. For example, the following flowchart describes an algorithm to find the first occurrence of a number within a sorted array. The image is from the FlowVQA dataset.

The elements of a flowchart can be thought of as nodes and edges, the same as in a graph or network. Each node has a shape that defines its role (rectangle for a step, diamond for a conditional, and so on) and a textual label. Each edge (or link) is usually a directed arrow that shows the flow of steps, and occasionally has a label for the condition (e.g., Yes/No). Simple enough for the human eye, but infinitely tricky for the machine.
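To make the node/edge view concrete, here is a minimal sketch of how a flowchart could be represented as a graph in Python (the class and field names are my own choices, not taken from any of the papers below):

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str   # e.g., "A"
    shape: str     # e.g., "rectangle", "diamond", "stadium"
    label: str     # the text inside the node

@dataclass
class Edge:
    source: str    # node_id of the source node
    target: str    # node_id of the destination node
    label: str = ""  # optional condition label, e.g., "Yes" / "No"

@dataclass
class Flowchart:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

# A tiny slice of the example flowchart above
chart = Flowchart()
chart.nodes["E"] = Node("E", "diamond", "left <= right?")
chart.nodes["F"] = Node("F", "rectangle", "Calculate mid as (left + right) // 2")
chart.edges.append(Edge("E", "F", label="Yes"))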

Vision-language models (VLMs) – such as GPT-4o, Claude 3.7 Sonnet, Qwen-2.5-VL – are AI models that can take both images and text as input. They provide a promising avenue for automating analytical tasks on flowchart images.

Flowchart Visual Question Answering

One popular research direction is flowchart Visual Question Answering (VQA). For example, you might ask the VLM questions about the flowchart's structure (how many nodes it has, which step follows a given node) or about the process it describes.

This topic has been covered in papers such as FlowVQA, FlowLearn, TextFlow, etc. The general consensus is that off-the-shelf VLMs are useful to some extent in “understanding” and answering such questions on simple flowcharts, but they struggle with large and complex images (flowcharts with lots of nodes and edges, etc.). This also seems to be the case in related structured document understanding tasks, such as VQA on data visualizations (ChartQA). However, data visualization VQA capabilities have been massively improved by work proposing specialized or fine-tuned models, such as ChartT5, UniChart, ChartVE, etc., while in the domain of flowcharts this has remained unexplored.

Some papers suggest that the inherent complexity of flowcharts (i.e., truly understanding the visual relations between nodes in the image) is the key challenge for VLMs. A recent paper makes the case that VLMs don’t actually read the images at all; they merely rely on their pre-learnt knowledge to answer questions. One of their findings is that VLMs perform okay on real diagrams but significantly worse on synthetic diagrams, suggesting that they may not be able to answer when the knowledge in the images is not in their training data.

Flowchart Image-to-Code

Another research direction, one that has received attention only very recently, is the task of converting flowchart images to code (image-to-flow, flowchart-to-code). “Code” here means diagramming languages – such as Mermaid, Graphviz, TikZ, etc. – that programmatically define the elements of a flowchart image. For example, this is the original Mermaid code that creates the flowchart at the start.

flowchart TD
    A(["Start"]) --> B["Define function find_first_occurrence with parameters A and x"]
    B --> C["Initialize left and right pointers"]
    C --> D["Initialize result to -1"]
    D --> E{"left <= right?"}
    E -->|"Yes"| F["Calculate mid as (left + right) // 2"]
    F --> G{"A[mid] == x?"}
    G -->|"Yes"| H["Set result to mid"]
    H --> I["Update right to mid - 1"]
    I --> E
    G -->|"No"| J{"A[mid] > x?"}
    J -->|"Yes"| K["Update right to mid - 1"]
    K --> E
    J -->|"No"| L["Update left to mid + 1"]
    L --> E
    E -->|"No"| M[/"Return result"/]
    M --> N(["End"])

You can input the image into a VLM and prompt it to generate the Mermaid code. For example, this is the code generated by Qwen-2.5-VL-32B for the same image, which renders into a slightly different image but overall retains the correct flow.

flowchart TD
    A(["Start"]) --> B[/"Define function find_first_occurrence with parameters A and x"/]
    B --> C[/"Initialize left and right pointers"/]
    C --> D[/"Initialize result to -1"/]
    D --> E{"left <= right?"}
    E -->|No| F[/"Return result"/]
    F --> G(["End"])
    E -->|Yes| H[/"Calculate mid as (left + right) // 2"/]
    H --> I{"A[mid] == x?"}
    I -->|Yes| J[/"Set result to mid"/]
    J --> K[/"Update right to mid - 1"/]
    K --> E
    I -->|No| L{"A[mid] > x?"}
    L -->|Yes| M[/"Update right to mid - 1"/]
    M --> E
    L -->|No| N[/"Update left to mid + 1"/]
    N --> E

I am personally more interested in the flowchart-to-code task, because its code output can be used in more ways than simple VQA, such as incorporation into RAG pipelines or automated checking (i.e., checking whether a process is logically sound). The TextFlow paper actually explores using this code to improve performance on the flowchart VQA task (compared to using only the image). Besides, my intuition is that it is easier for AI models to reason over text/code inputs than over visual inputs (which was also the intuition behind the improvement in the TextFlow paper).

However, findings on this topic are generally similar to those for flowchart VQA: VLMs can be useful for small and simple images, but massively underperform in more complex cases.

Recent advances

While working on the VLM topic, I came across some very recent papers (2 ICLR papers and 1 arXiv preprint, all from May 2025) that involve flowchart understanding tasks. I will summarize what I found interesting/applicable to my use case below.

BigDocs - Image2Flow (Rodriguez et al., 2025)

The BigDocs paper is actually about another, more general problem, which is to provide an open dataset for document understanding tasks (as you can see in the picture). The flowchart-to-code task makes up a small subset of this.

This paper is notable because it is one of the first to formally define “flowchart to code” as a task (they call it Image2Flow) and define a metric for benchmarking (they call it Length-Shape Triplet F1 score). Previous papers, like TextFlow, only considered it an intermediate step for VQA.

The findings that are relevant to me are as follows:

First, the Image2Flow subset is created by asking an LLM (LLaMa 3.1) to generate GraphViz code/JSON for a flowchart, given some random parameters (such as the number of nodes, their shapes, etc.), plus some constraints to make it a plausible graph. The result is 10k pairs of image-GraphViz code plus 10k pairs of image-JSON (I count a total of 5 such datasets, including 2 shown here). It is also really interesting that the flowcharts are random and not real processes, which I’ll explain later.
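As a rough illustration of this recipe (I don’t have the exact prompts or parameter ranges used in BigDocs, so everything below is an assumption-heavy sketch), the generation step boils down to sampling random structural parameters and asking an LLM to fill in plausible content:

import random

SHAPES = ["rectangle", "diamond", "stadium", "parallelogram"]

def sample_flowchart_spec(max_nodes: int = 12) -> dict:
    """Sample random structural parameters for a synthetic flowchart."""
    n_nodes = random.randint(4, max_nodes)
    shapes = [random.choice(SHAPES) for _ in range(n_nodes)]
    # Constraint: keep the graph connected by chaining the nodes, then add a few extra edges
    edges = [(i, i + 1) for i in range(n_nodes - 1)]
    edges += [(random.randrange(n_nodes), random.randrange(n_nodes)) for _ in range(2)]
    return {"n_nodes": n_nodes, "shapes": shapes, "edges": edges}

spec = sample_flowchart_spec()
prompt = (
    f"Write GraphViz (DOT) code for a flowchart with {spec['n_nodes']} nodes, "
    f"using these shapes: {spec['shapes']} and these directed edges: {spec['edges']}. "
    "Invent short, plausible labels for each node."
)
# `prompt` would then be sent to an LLM such as LLaMA 3.1, and the returned DOT code
# rendered to an image to form one (image, code) training pair.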

Second, performance on the flowchart-to-code task is measured by Length-Shape Triplet F1 (LST F1), where the F1 score is calculated in the same way as for machine learning classifiers, but over the flowchart “triplets” (s, e, d), i.e., (source node, edge, destination node).

So a generated triplet only counts as “correct” if its source node, edge, and destination match exactly with those in the ground-truth code.
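In other words, both the predicted and the ground-truth code are reduced to sets of (source, edge, destination) triplets and a standard F1 is computed over them. A minimal sketch of how I understand the metric (the paper’s exact matching and normalization rules may differ):

def triplet_f1(pred_triplets: set, gold_triplets: set) -> float:
    """F1 over (source, edge_label, destination) triplets.

    A predicted triplet counts as a true positive only if it matches a
    ground-truth triplet exactly.
    """
    tp = len(pred_triplets & gold_triplets)
    if not pred_triplets or not gold_triplets or tp == 0:
        return 0.0
    precision = tp / len(pred_triplets)
    recall = tp / len(gold_triplets)
    return 2 * precision * recall / (precision + recall)

gold = {("E", "Yes", "F"), ("E", "No", "M"), ("F", "", "G")}
pred = {("E", "Yes", "F"), ("E", "No", "M"), ("F", "", "H")}  # one wrong destination
print(round(triplet_f1(pred, gold) * 100, 1))  # 66.7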

Third, they fine-tuned off-the-shelf VLMs on BigDocs (the whole dataset, not just the Image2Flow subset). This led to significant performance gains on the flowchart-to-code task. For example, most open-source models (including big ones like Qwen2-VL-72B and Llama-3.2-90B) achieved near-0 F1 scores on the GraphViz subset, while closed-source models (GPT-4o, Claude 3.5) got around 10-20 (note that F1 scores range from 0 to 100). With fine-tuning, even smaller models like Phi3.5-v-4B achieved up to a 63 F1 score, showing huge performance gains. But then again, 63 is not that high an F1 score, which suggests room for further improvement.

This result, however, contradicts an earlier finding from TextFlow, where they also measured flowchart-to-code performance using F1 scores (but for individual nodes and edges) on another dataset (FlowVQA). Notably, Qwen2-VL-72B, GPT-4o and Claude 3.5 Sonnet (the same models as above) achieved very good F1 scores (90+).

What might cause this discrepancy? My guess is that it’s because the flowcharts in FlowVQA are real processes while Image2Flow contains fake ones, so the VLMs cannot use pre-learnt knowledge for reasoning – which, if true, is consistent with the findings from the paper mentioned earlier.

Sketch2Diagram - Img2TikZ (Saito et al., 2025)

The Sketch2Diagram paper deals with a similar task: converting images of hand-drawn diagrams into computer-rendered images. They do this by first converting the sketches to code, which is then rendered with a standard tool.

They first create SkeTikZ, a dataset of 3k pairs of hand-drawn sketches and corresponding TikZ code (TikZ is another diagramming language, used in LaTeX documents). (They collect the TikZ source code first, render it, then hire humans to draw the rendered diagrams on paper/whiteboards.)

They then augment this dataset (adding noise, varying backgrounds and brightness, etc.) to use as training data for their Img2TikZ model (similar to our flowchart-to-code task, but with TikZ as the target language).

They employ a 3-part architecture similar to LLaVA 1.5 – a vision encoder, an adapter, and a code-generating LLM (CodeLLM) – but they switch out key components with different models.

During training, they update the Adapter (stage 1) and then the Adapter & CodeLLM together (stage 2) using the SkeTikZ dataset described above. During inference, they use both iterative generation (IG: keep regenerating a single code version until it compiles) and multi-candidate generation (MCG: generate multiple code versions and use another model to choose the best one as the output). The model they use as the selector for the MCG strategy is D-SigLIP (Diagram-Specialized SigLIP), which is SigLIP with an additional layer that they fine-tune via contrastive learning, also using data from the SkeTikZ set.
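MCG is essentially best-of-n sampling with a learned ranker. A minimal sketch of the selection loop (the generator, renderer, and D-SigLIP scorer below are placeholder callables, not the paper’s actual interfaces):

def multi_candidate_generation(sketch_image, generate_code, render, score, n: int = 5):
    """Generate n TikZ candidates, keep those that compile, and return the one
    whose rendered output best matches the input sketch.

    `generate_code`, `render`, and `score` stand in for the VLM, a TikZ-to-image
    renderer, and a D-SigLIP-style image similarity scorer.
    """
    best_code, best_score = None, float("-inf")
    for _ in range(n):
        code = generate_code(sketch_image)   # sample one TikZ candidate
        rendered = render(code)              # None if compilation fails
        if rendered is None:
            continue
        s = score(sketch_image, rendered)    # similarity between sketch and rendering
        if s > best_score:
            best_code, best_score = code, s
    return best_code  # None if no candidate compiled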

For evaluation, they use both automatic metrics and human-annotated scores.

Automatic metrics include:

And for subjective, human judgement:

The results are as follows. Some notable points:

So overall, while models trained on diagram images performed better than off-the-shelf models, the output quality (as judged by humans) of even the best-performing VLMs still leaves a lot to be desired. Not to mention, the new Img2TikZ models scored worse than Claude 3.5 on both human metrics.

As a bonus, they also analyzed model performance when training with and without the data augmentation pre-processing, and found that the models trained with augmentation perform better, suggesting (unsurprisingly) that augmentation can be a useful technique when dealing with visual data.
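For illustration, this kind of photometric augmentation takes only a few lines with Pillow and NumPy; the specific transforms and parameter ranges below are my own guesses, not the paper’s.

import random
import numpy as np
from PIL import Image, ImageEnhance

def augment(sketch: Image.Image) -> Image.Image:
    """Apply simple photometric augmentation: brightness jitter plus Gaussian noise."""
    img = ImageEnhance.Brightness(sketch).enhance(random.uniform(0.7, 1.3))
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, 10, arr.shape)  # additive Gaussian noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

augmented = augment(Image.open("sketch_0001.png").convert("RGB"))
augmented.save("sketch_0001_aug.png")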

Arrow-guided VLM (Omasa et al., 2025)

The last paper I’ll cover proposes a unique idea. Whereas previous papers relied solely on the image as input for the VLM, the Arrow-guided VLM paper suggests using OCR and an object detection model to explicitly identify the relations between nodes and edges in the image before feeding all that information into a VLM. Intuitively, the reason this idea works (and works only on flowcharts) is the distinctive structure of nodes and directed edges found in these diagrams.

The idea is just as simple as it sounds: detect the node shapes and arrows with an object detection model, read the text with OCR, assemble the detected node-edge relations into a structured textual description, and feed that description together with the image into the VLM.
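A very rough sketch of how such a pipeline could look (the detector and OCR components are generic stand-ins, and the endpoint-to-node matching heuristic is my own assumption, not necessarily what the paper does):

def flowchart_to_text(image, detect_shapes, detect_arrows, ocr) -> str:
    """Turn a flowchart image into a textual edge list to prepend to the VLM prompt.

    `detect_shapes`, `detect_arrows`, and `ocr` stand in for an object detector
    (node bounding boxes), an arrow detector (head/tail keypoints with optional
    labels), and an OCR engine.
    """
    nodes = detect_shapes(image)   # e.g., [{"box": (x1, y1, x2, y2), "shape": "diamond"}, ...]
    arrows = detect_arrows(image)  # e.g., [{"tail": (x, y), "head": (x, y), "label": "Yes"}, ...]

    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    def nearest_node(point):
        # Match an arrow endpoint to the node whose box center is closest
        return min(nodes, key=lambda n: (center(n["box"])[0] - point[0]) ** 2
                                        + (center(n["box"])[1] - point[1]) ** 2)

    lines = []
    for a in arrows:
        src, dst = nearest_node(a["tail"]), nearest_node(a["head"])
        label = f' [{a["label"]}]' if a.get("label") else ""
        lines.append(f'"{ocr(image, src["box"])}" -->{label} "{ocr(image, dst["box"])}"')
    return "\n".join(lines)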

On the 90-question, 30-image dataset they tested on, this technique improved VQA accuracy (as judged by humans) significantly in most cases (see Ocr-Dec vs No-Ocr-Dec).

However, a shortcoming of this paper is that they only experimented with a very small and closed dataset (90 questions). They are also solving the VQA task, not the flowchart-to-code task per se. That said, I still think this idea may prove useful for a wide variety of flowchart understanding tasks.

Conclusion

So those are my takeaways after a few months of reading on the flowchart understanding topic. Do let me know if I got something wrong. The papers I mentioned can be found in the reference section below.

References