fix(mcp-docugen): preprocessor HTML→Markdown per output Word leggibile

Il DOCX prodotto dalla versione precedente emetteva i div Tielogic (`<div class="cover">`, `<div class="info-col">`, `<div class="acceptance">`, `<div class="status-card">`) come testo grezzo: Pandoc non sa interpretare il CSS-flavoured HTML del PDF e li copia letteralmente nel documento Word. Anche le tabelle `<table class="financial">` finivano spezzate cella per cella. Il fix introduce un preprocessor dedicato che riscrive tutta la HTML Tielogic-flavoured in Markdown nativo prima di passare il documento a Pandoc. - docx_preprocessor.py: nuovo modulo basato su BeautifulSoup. Strippa frontmatter e <style>, poi rewrite di: * <div class="cover"> → titoli H1/H2, paragrafi, tabella pipe 2-col FORNITORE/CLIENTE, validità in italic, \newpage finale * <table class="financial"> → tabella pipe Markdown con riga total-row in **bold** * <div class="acceptance"> → heading H2 + intro + tabella pipe con riga firma `_____________________` + luogo/data * <div class="status-card"> → paragrafo "**name** — descrizione" * <span class="badge ..."> → testo **bold** * <div class="page-break"> → \newpage Pandoc-friendly - docx_renderer.py: deferisce tutto il preprocessing al nuovo modulo (più compatto, niente regex sparse). - pyproject.toml + uv.lock: aggiunta dipendenza beautifulsoup4>=4.12. - 8 nuovi test unit per il preprocessor (cover, tabelle, badge, acceptance, idempotenza, niente div residui, badge standalone). Adattati i test esistenti agli import dal nuovo modulo. 101 verde. Smoke E2E via MCP: l'offerta TieMeasureFlow esce in DOCX leggibile con tabelle Word native, heading colorati Tielogic e firme in tabella.
2026-04-26 11:26:52 +02:00
parent c783fff040
commit 54bf41efd6
7 changed files with 464 additions and 28 deletions
@@ -13,7 +13,7 @@ Due pezzi, stesso repo:
 | Servizio | Stato | Funzione |
 |---|---|---|
-| `mcp-docugen` | Implementato, 92 test verde, deploy Docker via gateway Caddy (porta 8090), **8 tool MCP** esposti (CRUD template + `document_generate` + `document_to_pdf` + `document_to_docx`), template seed versionati, CSS Tielogic iniettato inline, render server-side **PDF** via Chromium/Playwright e **DOCX** via Pandoc con reference `tielogic-reference.docx` | Genera Markdown formale da template + LLM (OpenRouter) e converte in PDF o Word. Vedi [`docs/mcp-docugen-design.md`](docs/mcp-docugen-design.md) + [`docs/mcp-docugen-implementation.md`](docs/mcp-docugen-implementation.md). |
+| `mcp-docugen` | Implementato, 101 test verde, deploy Docker via gateway Caddy (porta 8090), **8 tool MCP** (CRUD template + `document_generate` + `document_to_pdf` + `document_to_docx`), template seed versionati, CSS Tielogic iniettato inline, render server-side **PDF** via Chromium/Playwright e **DOCX** via Pandoc con preprocessor che riscrive HTML/CSS Tielogic in Markdown nativo + reference `tielogic-reference.docx` | Genera Markdown formale da template + LLM (OpenRouter) e converte in PDF o Word. Vedi [`docs/mcp-docugen-design.md`](docs/mcp-docugen-design.md) + [`docs/mcp-docugen-implementation.md`](docs/mcp-docugen-implementation.md). |
 | `mcp-convert` | Da progettare | Conversione Markdown → PDF / DOCX / HTML (pandoc/typst backend). |
 | `mcp-inbox` | Da progettare | Ingest da Telegram (+ STT opzionale via Whisper) verso draft inbox consumati da Claude Code desktop. |
@@ -96,7 +96,12 @@ Conversione Markdown→PDF: tre strade, in ordine di comodità.
 Il CSS Tielogic non viene mai referenziato come path esterno nel Markdown prodotto dal servizio: il `Renderer` lo legge da `themes/tielogic.css` (copiato nell'immagine Docker in `/app/themes/`) e lo inietta come blocco `<style>` inline subito dopo il frontmatter. Il file `.md` risultante è quindi **autocontenuto e portabile** — chi lo riceve può convertirlo in PDF stilizzato anche senza avere il CSS sull'host.
-Per il **formato Word (.docx)** il servizio espone il tool MCP `document_to_docx` (oppure `output_format ∈ {docx, all}` su `document_generate`). La conversione passa per Pandoc invocato come subprocess, con `themes/tielogic-reference.docx` come reference: heading colors (blu Tielogic), font Inter e dimensioni di carattere replicano l'identità del PDF nei limiti di quello che il formato `.docx` permette di stilizzare. La cover grafica con sfondo scuro, le card colorate, le badge e i bordi del CSS rimangono solo nel PDF (sono effetti CSS che non hanno equivalente nativo Word). Il reference `.docx` viene generato dallo script `scripts/build-reference-docx.py` partendo dal default Pandoc e riscrivendo `word/styles.xml`.
+Per il **formato Word (.docx)** il servizio espone il tool MCP `document_to_docx` (oppure `output_format ∈ {docx, all}` su `document_generate`). Il pipeline è:
 1. **Preprocessor** (`docx_preprocessor.py`, basato su BeautifulSoup) riscrive l'HTML Tielogic-flavoured presente nel Markdown generato — `<div class="cover">`, `<table class="financial">`, `<div class="acceptance">`, `<div class="status-card">`, `<span class="badge ...">` e `<div class="page-break">` — in costrutti Markdown nativi (heading, tabelle pipe, paragrafi bold, `\newpage`). Senza questo passaggio Pandoc emetterebbe i `<div>` come testo grezzo nel DOCX.
 2. **Pandoc** converte il Markdown ripulito in DOCX usando `themes/tielogic-reference.docx` come template di stili: heading colors (blu Tielogic), font Inter e dimensioni di carattere replicano l'identità del PDF nei limiti di quello che il formato `.docx` permette.
 La cover grafica con sfondo scuro, le card colorate, i bordi e gli sfondi delle status-card restano solo nel PDF (sono effetti CSS che non hanno equivalente nativo Word). Il reference `.docx` viene generato dallo script `scripts/build-reference-docx.py` partendo dal default Pandoc e riscrivendo `word/styles.xml`.
 ## Remote
@@ -17,6 +17,7 @@ dependencies = [
    "python-multipart>=0.0.9",
    "playwright>=1.48",
    "markdown-it-py[plugins]>=3.0",
    "beautifulsoup4>=4.12",
 ]
 [project.optional-dependencies]
@@ -0,0 +1,252 @@
 """Convert Tielogic-flavoured HTML inside the generated Markdown into
 native Markdown elements before passing the document to Pandoc.
 The PDF pipeline depends on `<div class="cover">`, `<table class="financial">`,
 `<div class="acceptance">`, etc. — these are CSS-styled in Chromium but
 Pandoc has no idea what to do with them and would otherwise emit them as
 raw text in the DOCX. This module rewrites those structures as headings,
 pipe tables and paragraphs so the Word output is readable and structured.
 """
 from __future__ import annotations
 import re
 from bs4 import BeautifulSoup, Tag
 _FRONTMATTER_DELIM = "---"
 _STYLE_BLOCK_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.DOTALL | re.IGNORECASE)
 def _strip_style_blocks(text: str) -> str:
    return _STYLE_BLOCK_RE.sub("", text)
 def _strip_frontmatter(text: str) -> str:
    if not text.startswith(_FRONTMATTER_DELIM):
        return text
    end_marker = f"\n{_FRONTMATTER_DELIM}\n"
    idx = text.find(end_marker, len(_FRONTMATTER_DELIM))
    if idx == -1:
        return text
    return text[idx + len(end_marker) :].lstrip()
 def _text(el: Tag | None) -> str:
    return el.get_text(" ", strip=True) if el is not None else ""
 def _info_col_lines(col: Tag) -> list[str]:
    """Extract the rows of an info-col block (FORNITORE/CLIENTE), skipping
    the label (used as table header) and bolding the company name."""
    lines: list[str] = []
    for child in col.find_all("div", recursive=False):
        classes = set(child.get("class") or [])
        if "info-label" in classes:
            continue
        txt = child.get_text(" ", strip=True)
        if not txt:
            continue
        if "info-name" in classes:
            lines.append(f"**{txt}**")
        else:
            lines.append(txt)
    return lines
 def _convert_cover(soup: BeautifulSoup) -> None:
    cover = soup.find("div", class_="cover")
    if not isinstance(cover, Tag):
        return
    brand = _text(cover.find(class_="brand"))
    tagline = _text(cover.find(class_="brand-tagline"))
    title = _text(cover.find(class_="doc-title"))
    product = _text(cover.find(class_="doc-product"))
    ref = _text(cover.find(class_="doc-ref"))
    validity = _text(cover.find(class_="doc-validity"))
    info_box = cover.find(class_="info-box")
    info_cols = (
        info_box.find_all("div", class_="info-col") if isinstance(info_box, Tag) else []
    )
    blocks: list[str] = []
    if brand:
        blocks.append(f"# {brand}")
    if tagline:
        blocks.append(f"*{tagline}*")
    blocks.append("---")
    if title:
        blocks.append(f"## {title}")
    if product:
        blocks.append(f"**{product}**")
    if ref:
        blocks.append(ref)
    if len(info_cols) == 2:
        col_a, col_b = info_cols
        label_a = _text(col_a.find(class_="info-label")) or "FORNITORE"
        label_b = _text(col_b.find(class_="info-label")) or "CLIENTE"
        rows_a = _info_col_lines(col_a)
        rows_b = _info_col_lines(col_b)
        height = max(len(rows_a), len(rows_b))
        rows_a += [""] * (height - len(rows_a))
        rows_b += [""] * (height - len(rows_b))
        table_lines = [
            f"| **{label_a}** | **{label_b}** |",
            "|---|---|",
        ]
        for a, b in zip(rows_a, rows_b):
            table_lines.append(f"| {a} | {b} |")
        blocks.append("\n".join(table_lines))
    if validity:
        blocks.append(f"*{validity}*")
    replacement = "\n\n".join(blocks) + "\n\n\\newpage\n"
    cover.replace_with(BeautifulSoup(replacement, "html.parser"))
 def _convert_acceptance(soup: BeautifulSoup) -> None:
    acceptance = soup.find("div", class_="acceptance")
    if not isinstance(acceptance, Tag):
        return
    title_el = acceptance.find(class_="acceptance-title")
    intro_el = acceptance.find(class_="acceptance-intro")
    sig_grid = acceptance.find(class_="signature-grid")
    place_date = acceptance.find(class_="place-date")
    title = _text(title_el) or "ACCETTAZIONE"
    intro = _text(intro_el)
    blocks = [f"## {title}"]
    if intro:
        blocks.append(intro)
    if isinstance(sig_grid, Tag):
        cols = sig_grid.find_all("div", class_="sig-col")
        if len(cols) == 2:
            party_a = _text(cols[0].find(class_="sig-party"))
            party_b = _text(cols[1].find(class_="sig-party"))
            line_a = _text(cols[0].find(class_="sig-line")) or "Firma e timbro"
            line_b = _text(cols[1].find(class_="sig-line")) or "Firma e timbro"
            blocks.append(
                "\n".join(
                    [
                        f"| **{party_a}** | **{party_b}** |",
                        "|---|---|",
                        "| _____________________ | _____________________ |",
                        f"| {line_a} | {line_b} |",
                    ]
                )
            )
    if isinstance(place_date, Tag):
        blocks.append(_text(place_date))
    replacement = "\n\n".join(blocks)
    acceptance.replace_with(BeautifulSoup(replacement, "html.parser"))
 def _convert_status_cards(soup: BeautifulSoup) -> None:
    for card in soup.find_all("div", class_="status-card"):
        if not isinstance(card, Tag):
            continue
        name_el = card.find(class_="name")
        name = _text(name_el)
        # remaining text (sibling divs after the name)
        body_parts: list[str] = []
        for child in card.find_all("div", recursive=False):
            if child is name_el:
                continue
            txt = child.get_text(" ", strip=True)
            if txt:
                body_parts.append(txt)
        body = " ".join(body_parts)
        block = f"**{name}** — {body}" if body else f"**{name}**"
        card.replace_with(BeautifulSoup(block, "html.parser"))
 def _convert_badges(soup: BeautifulSoup) -> None:
    for span in soup.find_all("span", class_="badge"):
        if not isinstance(span, Tag):
            continue
        txt = span.get_text(" ", strip=True)
        span.replace_with(f"**{txt}**" if txt else "")
 def _convert_page_breaks(soup: BeautifulSoup) -> None:
    for el in soup.find_all("div", class_="page-break"):
        if not isinstance(el, Tag):
            continue
        el.replace_with(BeautifulSoup("\n\n\\newpage\n\n", "html.parser"))
 def _convert_financial_tables(soup: BeautifulSoup) -> None:
    """Rewrite `<table class="financial">` (with custom td/tr classes) as
    a clean pipe-table Markdown block. Pandoc handles raw HTML <table>
    inconsistently when extra attributes/classes are present."""
    for table in soup.find_all("table", class_="financial"):
        if not isinstance(table, Tag):
            continue
        header_cells: list[str] = []
        thead = table.find("thead")
        if isinstance(thead, Tag):
            for th in thead.find_all("th"):
                header_cells.append(th.get_text(" ", strip=True))
        rows: list[list[str]] = []
        body = table.find("tbody") or table
        for tr in body.find_all("tr"):
            classes = set(tr.get("class") or [])
            cells = [td.get_text(" ", strip=True) for td in tr.find_all(["td", "th"])]
            if not cells:
                continue
            if "total-row" in classes:
                cells = [f"**{c}**" for c in cells]
            # Skip if this row was already pulled in via thead
            if cells == header_cells:
                continue
            rows.append(cells)
        if not header_cells and rows:
            # Use the first row as header if no thead provided.
            header_cells, rows = rows[0], rows[1:]
        if not header_cells:
            continue
        ncols = len(header_cells)
        rows = [r + [""] * (ncols - len(r)) for r in rows]
        lines = ["| " + " | ".join(header_cells) + " |"]
        lines.append("|" + "|".join(["---"] * ncols) + "|")
        for r in rows:
            lines.append("| " + " | ".join(r) + " |")
        block = "\n".join(lines)
        table.replace_with(BeautifulSoup(block, "html.parser"))
 def preprocess_for_docx(markdown_text: str) -> str:
    """Apply the full pipeline of transformations needed to render the
    Tielogic Markdown documents in DOCX via Pandoc."""
    text = _strip_style_blocks(markdown_text)
    text = _strip_frontmatter(text)
    soup = BeautifulSoup(text, "html.parser")
    _convert_cover(soup)
    _convert_acceptance(soup)
    _convert_status_cards(soup)
    _convert_financial_tables(soup)
    _convert_badges(soup)
    _convert_page_breaks(soup)
    out = str(soup)
    # Collapse 3+ blank lines into 2 to keep the document tidy.
    out = re.sub(r"\n{3,}", "\n\n", out)
    return out.strip() + "\n"
@@ -2,14 +2,12 @@ from __future__ import annotations
 import asyncio
 import logging
 import re
 from dataclasses import dataclass
 from pathlib import Path
-logger = logging.getLogger(__name__)
+from mcp_docugen.docx_preprocessor import preprocess_for_docx
-_STYLE_BLOCK_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.DOTALL | re.IGNORECASE)
+logger = logging.getLogger(__name__)
 _FRONTMATTER_DELIM = "---"
 class DocxRenderError(Exception):
@@ -22,26 +20,13 @@ class DocxRenderResult:
    size_bytes: int
 def _strip_style_blocks(markdown_text: str) -> str:
    """Remove `<style>...</style>` blocks: they're meaningless in DOCX and
    Pandoc would otherwise embed them as raw text."""
    return _STYLE_BLOCK_RE.sub("", markdown_text)
 def _strip_frontmatter(markdown_text: str) -> str:
    """Remove the YAML frontmatter so it doesn't appear as a body table in
    the DOCX. Frontmatter values were meant for the PDF renderer."""
    if not markdown_text.startswith(_FRONTMATTER_DELIM):
        return markdown_text
    end_marker = f"\n{_FRONTMATTER_DELIM}\n"
    idx = markdown_text.find(end_marker, len(_FRONTMATTER_DELIM))
    if idx == -1:
        return markdown_text
    return markdown_text[idx + len(end_marker) :].lstrip()
 def _preprocess(markdown_text: str) -> str:
-    return _strip_style_blocks(_strip_frontmatter(markdown_text))
+    """Strip the bits of the document that only make sense in the PDF
    pipeline (YAML frontmatter, inline `<style>`) and rewrite the
    Tielogic-flavoured HTML widgets (cover, status cards, financial
    tables, signatures, badges) as native Markdown so Pandoc can produce a
    clean DOCX."""
    return preprocess_for_docx(markdown_text)
 async def render_markdown_to_docx(
@@ -0,0 +1,167 @@
 from __future__ import annotations
 import textwrap
 from mcp_docugen.docx_preprocessor import preprocess_for_docx
 SAMPLE_DOC = textwrap.dedent(
    """\
    ---
    pdf_options:
      format: A4
    ---
    <style>body { color: red; }</style>
    <div class="cover">
      <div class="brand">TIELOGIC</div>
      <div class="brand-tagline">Soluzioni Software Industriali</div>
      <div class="brand-divider"></div>
      <div class="doc-title">OFFERTA PRODOTTO E INTEGRAZIONE</div>
      <div class="doc-product">TieMeasureFlow</div>
      <div class="doc-ref">Rif. OFF-2026-022 | 23 marzo 2026</div>
      <div class="info-box">
        <div class="info-col">
          <div class="info-label">FORNITORE</div>
          <div class="info-name">Tielogic SRL</div>
          <div>Via Villanova 39, 36020 Solagna (VI)</div>
          <div>P.IVA / C.F. 03954890244</div>
        </div>
        <div class="info-col">
          <div class="info-label">CLIENTE</div>
          <div class="info-name">Ricerca e Misure s.r.l.</div>
          <div>Via Brigata Julia 21, 35020 Pernumia (PD)</div>
          <div>Rif. Menoncin</div>
        </div>
      </div>
      <div class="doc-validity">Validità offerta: 23 aprile 2026</div>
    </div>
    # TieMeasureFlow
    Sistema web SPC.
    ## Costo di setup iniziale
    <table class="financial">
    <thead><tr><th>Voce</th><th class="num">Importo</th></tr></thead>
    <tbody>
    <tr><td>Setup</td><td class="num">€ 3.500,00</td></tr>
    <tr class="total-row"><td>TOTALE SETUP</td><td class="num">€ 3.500,00</td></tr>
    </tbody>
    </table>
    <div class="status-card drift">
      <div class="name">TEST Z +50MM <span class="badge badge-drift">DRIFT</span></div>
      <div>Errore cumulativo da 7.8mm a 11.5mm.</div>
    </div>
    <div class="acceptance">
    <h2 class="acceptance-title">ACCETTAZIONE</h2>
    <div class="acceptance-intro">Per accettazione, restituire copia firmata.</div>
    <div class="signature-grid"><div class="sig-col"><div class="sig-party">Per Tielogic SRL</div><div class="sig-line">Firma e timbro</div></div><div class="sig-col"><div class="sig-party">Per Ricerca e Misure s.r.l.</div><div class="sig-line">Firma e timbro</div></div></div>
    <div class="place-date">Luogo e data: ____________ 23 marzo 2026</div>
    </div>
    """
 )
 def test_preprocessor_strips_style_and_frontmatter():
    out = preprocess_for_docx(SAMPLE_DOC)
    assert "<style>" not in out
    assert "pdf_options" not in out
    assert not out.startswith("---")
 def test_preprocessor_converts_cover_to_markdown():
    out = preprocess_for_docx(SAMPLE_DOC)
    assert '<div class="cover">' not in out
    assert "# TIELOGIC" in out
    assert "*Soluzioni Software Industriali*" in out
    assert "## OFFERTA PRODOTTO E INTEGRAZIONE" in out
    assert "**TieMeasureFlow**" in out
    assert "Rif. OFF-2026-022 | 23 marzo 2026" in out
    # Info table 2-col
    assert "| **FORNITORE** | **CLIENTE** |" in out
    assert "**Tielogic SRL**" in out
    assert "**Ricerca e Misure s.r.l.**" in out
    assert "Via Villanova 39, 36020 Solagna (VI)" in out
    assert "*Validità offerta: 23 aprile 2026*" in out
    # newpage after cover
    assert "\\newpage" in out
 def test_preprocessor_converts_financial_table_to_pipe():
    out = preprocess_for_docx(SAMPLE_DOC)
    assert '<table class="financial">' not in out
    assert "| Voce | Importo |" in out
    assert "| Setup | € 3.500,00 |" in out
    # total row bolded
    assert "| **TOTALE SETUP** | **€ 3.500,00** |" in out
 def test_preprocessor_converts_status_card_to_paragraph():
    out = preprocess_for_docx(SAMPLE_DOC)
    assert 'class="status-card' not in out
    assert "**TEST Z +50MM" in out
    assert "Errore cumulativo da 7.8mm a 11.5mm" in out
 def test_preprocessor_converts_badges_to_bold():
    out = preprocess_for_docx(SAMPLE_DOC)
    assert "badge-drift" not in out
    # Badge inside status-card name gets absorbed into the bold name string;
    # checking that the badge text survives somewhere inside a bold span.
    assert "DRIFT" in out
    assert "<span" not in out
 def test_preprocessor_converts_standalone_badges_to_bold():
    md = "Verdetto: <span class=\"badge badge-fattibile\">FATTIBILE</span>"
    out = preprocess_for_docx(md)
    assert "<span" not in out
    assert "**FATTIBILE**" in out
 def test_preprocessor_converts_acceptance_to_table():
    out = preprocess_for_docx(SAMPLE_DOC)
    assert 'class="acceptance"' not in out
    assert "## ACCETTAZIONE" in out
    assert "Per accettazione, restituire copia firmata." in out
    assert "| **Per Tielogic SRL** | **Per Ricerca e Misure s.r.l.** |" in out
    assert "| _____________________ | _____________________ |" in out
    assert "| Firma e timbro | Firma e timbro |" in out
    assert "Luogo e data" in out
 def test_preprocessor_idempotent_on_clean_markdown():
    md = "# Title\n\nA paragraph.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    out = preprocess_for_docx(md)
    # No frontmatter to strip, no Tielogic widgets to rewrite.
    assert "# Title" in out
    assert "| a | b |" in out
    assert "<style" not in out
 def test_preprocessor_no_div_classes_left_in_output():
    out = preprocess_for_docx(SAMPLE_DOC)
    for forbidden in (
        "<div class=\"cover\"",
        "<div class=\"info-box\"",
        "<div class=\"info-col\"",
        "<div class=\"info-label\"",
        "<div class=\"info-name\"",
        "<div class=\"doc-title\"",
        "<div class=\"doc-product\"",
        "<div class=\"doc-ref\"",
        "<div class=\"doc-validity\"",
        "<div class=\"acceptance\"",
        "<div class=\"signature-grid\"",
        "<div class=\"sig-col\"",
        "<div class=\"sig-party\"",
        "<div class=\"sig-line\"",
        "<div class=\"place-date\"",
        "<div class=\"status-card",
    ):
        assert forbidden not in out, f"{forbidden!r} still present in output"
@@ -4,11 +4,13 @@ import shutil
 import pytest
-from mcp_docugen.docx_renderer import (
+from mcp_docugen.docx_preprocessor import (
    DocxRenderError,
    _preprocess,
    _strip_frontmatter,
    _strip_style_blocks,
    preprocess_for_docx as _preprocess,
 )
 from mcp_docugen.docx_renderer import (
    DocxRenderError,
    render_markdown_to_docx,
 )
@@ -75,6 +75,19 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/64/b4/17d4b0b2a2dc85a6df63d1157e028ed19f90d4cd97c36717afef2bc2f395/attrs-26.1.0-py3-none-any.whl", hash = "sha256:c647aa4a12dfbad9333ca4e71fe62ddc36f4e63b2d260a37a8b83d2f043ac309", size = 67548, upload-time = "2026-03-19T14:22:23.645Z" },
 ]
 [[package]]
 name = "beautifulsoup4"
 version = "4.14.3"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "soupsieve" },
    { name = "typing-extensions" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/c3/b0/1c6a16426d389813b48d95e26898aff79abbde42ad353958ad95cc8c9b21/beautifulsoup4-4.14.3.tar.gz", hash = "sha256:6292b1c5186d356bba669ef9f7f051757099565ad9ada5dd630bd9de5fa7fb86", size = 627737, upload-time = "2025-11-30T15:08:26.084Z" }
 wheels = [
    { url = "https://files.pythonhosted.org/packages/1a/39/47f9197bdd44df24d67ac8893641e16f386c984a0619ef2ee4c51fbbc019/beautifulsoup4-4.14.3-py3-none-any.whl", hash = "sha256:0918bfe44902e6ad8d57732ba310582e98da931428d231a5ecb9e7c703a735bb", size = 107721, upload-time = "2025-11-30T15:08:24.087Z" },
 ]
 [[package]]
 name = "certifi"
 version = "2026.2.25"
@@ -587,6 +600,7 @@ source = { editable = "services/mcp-docugen" }
 dependencies = [
    { name = "aiofiles" },
    { name = "aiosqlite" },
    { name = "beautifulsoup4" },
    { name = "fastapi" },
    { name = "httpx" },
    { name = "markdown-it-py", extra = ["plugins"] },
@@ -612,6 +626,7 @@ dev = [
 requires-dist = [
    { name = "aiofiles", specifier = ">=24.0" },
    { name = "aiosqlite", specifier = ">=0.20" },
    { name = "beautifulsoup4", specifier = ">=4.12" },
    { name = "fastapi", specifier = ">=0.115" },
    { name = "httpx", specifier = ">=0.27" },
    { name = "markdown-it-py", extras = ["plugins"], specifier = ">=3.0" },
@@ -1257,6 +1272,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/63/b6/aeadee5443e49baa2facd51131159fd6301cc4ccfc1541e4df7b021c37dd/ruff-0.15.11-py3-none-win_arm64.whl", hash = "sha256:063fed18cc1bbe0ee7393957284a6fe8b588c6a406a285af3ee3f46da2391ee4", size = 11032614, upload-time = "2026-04-16T18:46:34.487Z" },
 ]
 [[package]]
 name = "soupsieve"
 version = "2.8.3"
 source = { registry = "https://pypi.org/simple" }
 sdist = { url = "https://files.pythonhosted.org/packages/7b/ae/2d9c981590ed9999a0d91755b47fc74f74de286b0f5cee14c9269041e6c4/soupsieve-2.8.3.tar.gz", hash = "sha256:3267f1eeea4251fb42728b6dfb746edc9acaffc4a45b27e19450b676586e8349", size = 118627, upload-time = "2026-01-20T04:27:02.457Z" }
 wheels = [
    { url = "https://files.pythonhosted.org/packages/46/2c/1462b1d0a634697ae9e55b3cecdcb64788e8b7d63f54d923fcd0bb140aed/soupsieve-2.8.3-py3-none-any.whl", hash = "sha256:ed64f2ba4eebeab06cc4962affce381647455978ffc1e36bb79a545b91f45a95", size = 37016, upload-time = "2026-01-20T04:27:01.012Z" },
 ]
 [[package]]
 name = "sse-starlette"
 version = "3.3.4"