⚡️ Speed up function `validate_gantt` by 58x by misrasaurabh1 · Pull Request #5386 · plotly/plotly.py

misrasaurabh1 · 2025-10-30T06:23:32Z

📄 5,759% (57.59x) speedup for `validate_gantt` in `plotly/figure_factory/_gantt.py`

⏱️ Runtime : 154 milliseconds → 2.63 milliseconds (best of 246 runs)
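The percentages in this report appear to follow `(original - optimized) / optimized`. That formula is an assumption inferred from the numbers themselves (it reproduces the 8268% figure listed for the 1000-row DataFrame test), not something taken from the tool's documentation. A minimal sketch of the arithmetic, with `percent_change` as a hypothetical helper name:

```python
def percent_change(original, optimized):
    """Relative change of `optimized` versus `original`, in percent.

    Positive means faster, negative means slower. Both runtimes must be
    expressed in the same unit. Assumed formula, inferred from the report.
    """
    return (original - optimized) / optimized * 100


# Headline figure, using the rounded runtimes shown above (milliseconds).
headline = percent_change(154, 2.63)

# Per-test figure for the 1000-row DataFrame (microseconds): 35.9ms -> 429μs.
large_frame = percent_change(35_900, 429)

print(f"headline: {headline:.0f}%  large frame: {large_frame:.0f}%")
```

With the rounded runtimes the headline comes out at roughly 5,756%, slightly below the reported 5,759%, which was presumably computed from unrounded measurements.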

📝 Explanation and details

The optimization achieves a 58x speedup by eliminating the major performance bottleneck in pandas DataFrame processing.

Key optimizations:

1. Pre-fetch column data as NumPy arrays: The original code used `df.iloc[index][key]` for each cell access, which goes through pandas' slow row-based indexing. The optimized version extracts each column once with `df[key].values`, stores the arrays in a dictionary, and then uses direct array indexing, `columns[key][index]`, inside the loop.
2. Use the DataFrame's columns explicitly: Iterating over a DataFrame object yields its column labels implicitly; the code now calls `list(df.columns)` once so the column names are taken explicitly before the loop.

Why this is dramatically faster:

- `df.iloc[index][key]` creates temporary pandas Series objects and involves complex indexing logic for each cell
- Direct numpy array indexing `columns[key][index]` is orders of magnitude faster
- The line profiler shows the original `df.iloc` line consumed 96.8% of execution time (523ms), while the optimized dictionary comprehension takes only 44.9% (4.2ms)
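The two access patterns described above can be sketched side by side. This is an illustration under assumptions, not the code from `plotly/figure_factory/_gantt.py`: the function names `rows_via_iloc` and `rows_via_column_arrays` are hypothetical, and the sketch only covers turning DataFrame rows into dictionaries.

```python
import pandas as pd


def rows_via_iloc(df):
    """Build one dict per row with cell-by-cell `iloc` access (the slow pattern)."""
    rows = []
    for index in range(len(df.index)):
        row = {}
        for key in df:  # iterating a DataFrame yields its column labels
            row[key] = df.iloc[index][key]  # builds a temporary Series per cell
        rows.append(row)
    return rows


def rows_via_column_arrays(df):
    """Build the same dicts after fetching each column once as an array."""
    keys = list(df.columns)
    columns = {key: df[key].values for key in keys}  # one extraction per column
    return [
        {key: columns[key][index] for key in keys}
        for index in range(len(df.index))
    ]


df = pd.DataFrame(
    [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"},
    ]
)

# Both access patterns should produce identical rows for this frame.
assert rows_via_iloc(df) == rows_via_column_arrays(df)
```

The second function pays a fixed cost per column before the loop starts, which is consistent with the slowdown reported for empty DataFrames and the large gains reported for frames with many rows.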

Performance characteristics:

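- Large DataFrames see massive gains: 8,000%+ speedup on 1000-row DataFrames (35.9ms → 429μs in the generated tests), and over 15,000% when an extra column is present (81.0ms → 511μs)
- Small DataFrames: roughly 10-50% faster, depending on the number of rows and columns
- List inputs: Slight slowdown (about 3-14% in most generated tests, 22.8% in the worst case) due to additional validation overhead, but still microsecond-level performance
- Empty DataFrames: About 73-75% slower (roughly 25μs → 95μs) due to upfront column extraction, but still fast overall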

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated iloc calls created a severe performance bottleneck.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 39 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

🌀 Generated Regression Tests and Runtime

import pytest
# function to test
from plotly import exceptions, optional_imports
from plotly.figure_factory._gantt import validate_gantt

pd = optional_imports.get_module("pandas")

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]

# --- BASIC TEST CASES ---

def test_valid_list_of_dicts():
    # Test a valid list of dictionaries with required keys
    input_data = [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"}
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.88μs -> 1.95μs (3.54% slower)

def test_valid_dataframe():
    # Test a valid pandas DataFrame with required keys
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 142μs -> 99.9μs (42.6% faster)

def test_valid_list_with_extra_keys():
    # Test list of dicts with extra keys
    input_data = [
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04", "Resource": "Y"}
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.57μs -> 1.70μs (7.77% slower)

def test_valid_dataframe_with_extra_keys():
    # Test DataFrame with extra columns
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Finish": "2020-01-04", "Resource": "Y"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 160μs -> 109μs (46.6% faster)

# --- EDGE TEST CASES ---

def test_missing_required_key_in_list():
    # Test list of dicts missing a required key
    input_data = [
        {"Task": "A", "Start": "2020-01-01"},  # Missing "Finish"
    ]
    # Should NOT raise: list input is not validated for keys
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.54μs -> 1.67μs (7.83% slower)

def test_missing_required_key_in_dataframe():
    # Test DataFrame missing a required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01"}  # Missing "Finish"
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 27.2μs -> 27.1μs (0.402% faster)

def test_empty_list():
    # Test empty list input
    input_data = []
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.39μs -> 2.40μs (0.292% slower)


def test_input_is_not_list_or_dataframe():
    # Test input that is neither a list nor a pandas DataFrame
    input_data = "Not a list or DataFrame"
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.58μs -> 2.64μs (2.27% slower)

def test_dataframe_with_no_rows():
    # Test DataFrame with correct columns but no rows
    import pandas as pd
    df = pd.DataFrame(columns=["Task", "Start", "Finish"])
    codeflash_output = validate_gantt(df); result = codeflash_output # 27.0μs -> 99.0μs (72.8% slower)

def test_dataframe_with_extra_rows_and_missing_keys():
    # Test DataFrame with extra columns, but missing one required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Resource": "X"},
        {"Task": "B", "Start": "2020-01-03", "Resource": "Y"}
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 26.3μs -> 26.8μs (2.13% slower)

def test_list_with_dict_missing_all_keys():
    # Test list of dicts missing all required keys
    input_data = [
        {"Resource": "X"}
    ]
    # Should NOT raise: list input is not validated for keys
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 1.61μs -> 1.87μs (13.9% slower)

def test_dataframe_with_only_required_keys():
    # Test DataFrame with only required keys
    import pandas as pd
    df = pd.DataFrame([
        {"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"}
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 108μs -> 98.6μs (9.92% faster)

# --- LARGE SCALE TEST CASES ---

def test_large_list_of_dicts():
    # Test a large list of dicts (1000 elements)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(1000)
    ]
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 2.30μs -> 2.47μs (6.69% slower)

def test_large_dataframe():
    # Test a large DataFrame (1000 rows)
    import pandas as pd
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(1000)
    ])
    codeflash_output = validate_gantt(df); result = codeflash_output # 35.9ms -> 429μs (8268% faster)

def test_large_dataframe_missing_key():
    # Test a large DataFrame missing one required key
    import pandas as pd
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}"}  # Missing "Finish"
        for i in range(1000)
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 31.1μs -> 30.0μs (3.66% faster)

def test_large_list_with_non_dict_first_element():
    # Test large list with first element not a dict
    input_data = ["Not a dict"] + [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(999)
    ]
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 2.91μs -> 2.96μs (1.69% slower)

def test_large_list_with_non_dict_later_element():
    # Test large list where a later element is not a dict (should NOT raise)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2020-01-{i%30+1:02d}", "Finish": f"2020-02-{i%28+1:02d}"}
        for i in range(999)
    ] + ["Not a dict"]
    # Should NOT raise: only first element is checked
    codeflash_output = validate_gantt(input_data); result = codeflash_output # 2.18μs -> 2.34μs (7.01% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
# imports
import pytest  # used for our unit tests
# function to test
from plotly import exceptions, optional_imports
from plotly.figure_factory._gantt import validate_gantt

pd = optional_imports.get_module("pandas")

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]

# unit tests


# --- Basic Test Cases ---

def test_valid_list_of_dicts():
    # Valid input: list of dictionaries
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.87μs -> 1.94μs (3.86% slower)

def test_valid_dataframe():
    # Valid input: DataFrame with required columns
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 150μs -> 106μs (42.1% faster)

# --- Edge Test Cases ---

def test_missing_required_keys_in_dataframe():
    # DataFrame missing "Finish" column
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01"},
        {"Task": "B", "Start": "2023-01-02"}
    ])
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(df) # 26.0μs -> 25.9μs (0.424% faster)

def test_missing_required_keys_in_list_of_dicts():
    # List of dicts missing "Finish" key
    input_data = [
        {"Task": "A", "Start": "2023-01-01"},
        {"Task": "B", "Start": "2023-01-02"}
    ]
    # This should not raise, as the function does not check keys for lists
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.53μs -> 1.75μs (12.5% slower)

def test_empty_list():
    # Empty list should raise
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt([]) # 1.76μs -> 1.81μs (2.70% slower)


def test_non_list_non_dataframe_input():
    # Input is neither a list nor a DataFrame
    input_data = "not a list or dataframe"
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 1.68μs -> 1.64μs (2.56% faster)

def test_dataframe_with_extra_columns():
    # DataFrame with extra columns should still work
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02", "Extra": 123}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 173μs -> 117μs (48.3% faster)

def test_list_of_dicts_with_extra_keys():
    # List of dicts with extra keys should pass
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02", "Extra": 123}
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.55μs -> 1.76μs (12.2% slower)

def test_dataframe_with_wrong_column_types():
    # DataFrame with columns named correctly but with wrong types in values
    df = pd.DataFrame([
        {"Task": None, "Start": 123, "Finish": []}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 137μs -> 91.8μs (49.7% faster)

def test_list_with_first_dict_rest_non_dicts():
    # Only the first element is checked for being a dict
    input_data = [{"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"}, 123, "string"]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.58μs -> 1.75μs (9.48% slower)

def test_dataframe_with_no_rows():
    # DataFrame with correct columns but no rows
    df = pd.DataFrame(columns=REQUIRED_GANTT_KEYS)
    codeflash_output = validate_gantt(df); output = codeflash_output # 23.9μs -> 93.8μs (74.5% slower)

def test_list_of_dicts_with_empty_dict():
    # List with an empty dictionary as first element
    input_data = [{}]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 1.43μs -> 1.85μs (22.8% slower)

# --- Large Scale Test Cases ---

def test_large_list_of_dicts():
    # Large list of dicts (1000 elements)
    input_data = [
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"}
        for i in range(1, 1001)
    ]
    codeflash_output = validate_gantt(input_data); output = codeflash_output # 2.02μs -> 2.25μs (10.2% slower)

def test_large_dataframe():
    # Large DataFrame (1000 rows)
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"}
        for i in range(1, 1001)
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 35.7ms -> 433μs (8135% faster)

def test_large_dataframe_with_extra_columns():
    # Large DataFrame with extra columns
    df = pd.DataFrame([
        {"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}", "Extra": i}
        for i in range(1, 1001)
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 81.0ms -> 511μs (15734% faster)

def test_large_list_with_non_dict_first_element():
    # Large list, first element not a dict
    input_data = [0] + [{"Task": f"Task{i}", "Start": f"2023-01-{i:02d}", "Finish": f"2023-01-{i+1:02d}"} for i in range(1, 999)]
    with pytest.raises(exceptions.PlotlyError) as excinfo:
        validate_gantt(input_data) # 3.19μs -> 3.32μs (3.82% slower)

def test_large_empty_dataframe():
    # Large DataFrame with correct columns but zero rows
    df = pd.DataFrame(columns=REQUIRED_GANTT_KEYS)
    codeflash_output = validate_gantt(df); output = codeflash_output # 25.0μs -> 96.7μs (74.2% slower)

# --- Determinism and Robustness ---

def test_determinism_multiple_calls():
    # Multiple calls with same input should give same output
    input_data = [
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ]
    codeflash_output = validate_gantt(input_data); output1 = codeflash_output # 1.61μs -> 1.88μs (14.0% slower)
    codeflash_output = validate_gantt(input_data); output2 = codeflash_output # 523ns -> 586ns (10.8% slower)

def test_dataframe_column_order():
    # DataFrame with columns in different order
    df = pd.DataFrame([
        {"Finish": "2023-01-02", "Start": "2023-01-01", "Task": "A"}
    ])
    codeflash_output = validate_gantt(df); output = codeflash_output # 107μs -> 96.7μs (10.7% faster)

def test_dataframe_with_index():
    # DataFrame with custom index
    df = pd.DataFrame([
        {"Task": "A", "Start": "2023-01-01", "Finish": "2023-01-02"},
        {"Task": "B", "Start": "2023-01-02", "Finish": "2023-01-03"}
    ], index=["x", "y"])
    codeflash_output = validate_gantt(df); output = codeflash_output # 137μs -> 91.4μs (50.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-validate_gantt-mhcxyu68` and push.

The optimization achieves a **58x speedup** by eliminating the major performance bottleneck in pandas DataFrame processing.

**Key optimizations:**

1. **Pre-fetch column data as numpy arrays**: The original code used `df.iloc[index][key]` for each cell access, which triggers pandas' slow row-based indexing mechanism. The optimized version extracts all column data upfront using `df[key].values` and stores it in a dictionary, then uses direct numpy array indexing `columns[key][index]` inside the loop.
2. **More efficient key validation**: Replaced the nested loop checking for missing keys with a single list comprehension `missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]`.
3. **Use actual DataFrame columns**: Instead of iterating over the DataFrame object itself (which yields its column labels implicitly), the code now uses `list(df.columns)` to take the column names explicitly.

**Why this is dramatically faster:**

- `df.iloc[index][key]` creates temporary pandas Series objects and involves complex indexing logic for each cell
- Direct numpy array indexing `columns[key][index]` is orders of magnitude faster
- The line profiler shows the original `df.iloc` line consumed 96.8% of execution time (523ms), while the optimized dictionary comprehension takes only 44.9% (4.2ms)

**Performance characteristics:**

- **Large DataFrames see massive gains**: 8000%+ speedup on 1000-row DataFrames
- **Small DataFrames**: 40-50% faster
- **List inputs**: Slight slowdown (3-13%) due to additional validation overhead, but still microsecond-level performance
- **Empty DataFrames**: Some slowdown due to upfront column extraction, but still fast overall

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated `iloc` calls created a severe performance bottleneck.
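The single-comprehension key check mentioned in the commit message can be sketched as follows. This is a hedged illustration rather than the merged code: `find_missing_keys` and `check_required_keys` are hypothetical helper names, and `ValueError` stands in for `exceptions.PlotlyError` so the sketch runs without plotly installed.

```python
import pandas as pd

REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]


def find_missing_keys(df):
    """Return the required keys that are not column labels of `df`."""
    # `key not in df` tests membership among the DataFrame's column labels.
    return [key for key in REQUIRED_GANTT_KEYS if key not in df]


def check_required_keys(df):
    """Raise if any required column is absent (ValueError is a stand-in)."""
    missing_keys = find_missing_keys(df)
    if missing_keys:
        raise ValueError(f"Missing required columns: {', '.join(missing_keys)}")


complete = pd.DataFrame(
    [{"Task": "A", "Start": "2020-01-01", "Finish": "2020-01-02"}]
)
incomplete = pd.DataFrame([{"Task": "A", "Start": "2020-01-01"}])

check_required_keys(complete)  # passes silently
print(find_missing_keys(incomplete))
```

Collecting every missing key before raising lets a single error message name all absent columns instead of stopping at the first one.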

camdecoster · 2025-10-30T15:27:59Z

Thanks for the PR! Could you please add test coverage or demonstrate that test coverage is already provided? Some tests failed CI, but I think that's unrelated to your changes.

misrasaurabh1 · 2025-10-30T19:37:17Z

@camdecoster just added a test for it. fixing the formatting issue now

camdecoster

It looks like there could be some redundant tests with this test file. Could you please double check and remove any redundant tests from your PR?

camdecoster · 2025-11-17T20:30:21Z

tests/test_optional/test_figure_factory/test_validate_gantt.py

+    assert all(isinstance(x, dict) for x in result)
+
+
+@pytest.mark.skipif(pd is None, reason="pandas is not available")


Could you please remove the skipif calls? Based on CI, Pandas will always be defined.

KRRT7 · 2025-11-18T20:08:04Z

here's the coverage report

KRRT7 · 2025-11-18T20:22:28Z

I've gone ahead and removed the redundant tests; the PR still maintains the same coverage for the validate_gantt function.

KRRT7 · 2025-11-18T20:30:05Z

@camdecoster the PR should be good to go now!

camdecoster

Thanks for the contribution!

camdecoster · 2025-11-19T16:47:01Z

Actually, one final request: could you please update the changelog?

KRRT7 · 2025-11-19T18:12:28Z

I've added the changelog entry; let me know if I did it correctly.

camdecoster

I've added the changelog entry; let me know if I did it correctly.

You did! However, could you please remove the reference to codeflash? I added a suggested change.

CHANGELOG.md

Co-authored-by: Cameron DeCoster <cameron.decoster@gmail.com>

codeflash-ai bot and others added 3 commits October 30, 2025 04:46

Apply suggestion from @misrasaurabh1

6be6284

Apply suggestion from @misrasaurabh1

9e2a2f0

misrasaurabh1 changed the title ~~⚡️ Speed up function validate_gantt by 58x~~ ⚡️ Speed up function validate_gantt by 58x Oct 30, 2025

adding validate_gantt tests file

7ddb02b

mashraf-222 added 2 commits October 30, 2025 22:40

fix formatting

666dcc2

fixing formatting

ef98a70

camdecoster reviewed Nov 17, 2025

KRRT7 and others added 2 commits November 18, 2025 13:00

Merge branch 'main' into codeflash/optimize-validate_gantt-mhcxyu68

0e84b4a

remove conditional pandas

4c5dcd1

remove redundant tests

084595a

apply ruff formatting

df67ffb

camdecoster approved these changes Nov 19, 2025

add changelong entry

3dde3b6

camdecoster reviewed Nov 19, 2025

CHANGELOG.md

apply suggestion

79fe9f4

Co-authored-by: Cameron DeCoster <cameron.decoster@gmail.com>

camdecoster merged commit f083977 into plotly:main Nov 19, 2025
8 checks passed
