Automating Word Document Processing in Python with python-docx: this article shows how to copy paragraph styles, convert HTML tables to Word tables, and generate customizable templates in code.
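I. Introduction
As demand for office automation grows, the python-docx library gives Python fine-grained control over Word documents. This article shows how to copy paragraph styles, convert HTML tables to Word tables, and generate customizable templates in code.
II. Core Modules
1. Copying paragraph styles and pictures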
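    import io

    def copy_inline_shapes(new_doc, img):
        """Copy a paragraph's inline shapes (usually pictures) into a new paragraph."""
        new_para = new_doc.add_paragraph()
        for image_bytes, w, h in img:
            # Insert each picture at the width and height recorded from the source;
            # pass a fixed value such as Inches(1.25) instead to override the size
            new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)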
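Purpose: extracts pictures from the old document and copies them into the new one; width and height can be customized.
Use case: documents that mix text and pictures and need to keep their original formatting.
2. Converting HTML tables to Word tables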
    def docx_table_to_html(word_table):
        # Table conversion logic between Word and HTML, including merged-cell handling
        ...
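Purpose: converts a parsed HTML table structure into a table in the Word document, including horizontal and vertical merges. The companion function docx_table_to_html goes the other way and serializes a Word table to HTML.
Key points:
- Parse the HTML with BeautifulSoup
- Handle cell styles, borders and background colors
- Support style inheritance for multi-level headings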
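The placement step is the part of this conversion that is easiest to get wrong, so here is a minimal sketch of it in isolation. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages; TableGridParser and place_cells are hypothetical names used only for illustration and are not part of the article's module. The sketch computes which grid position each HTML cell lands on once colspan and rowspan are taken into account, which is the information needed before calling merge on the Word side.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect text, colspan and rowspan for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell dicts per <tr>
        self._cell = None   # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attr = dict(attrs)
            self._cell = {
                "text": "",
                "colspan": int(attr.get("colspan") or 1),
                "rowspan": int(attr.get("rowspan") or 1),
            }
            self.rows[-1].append(self._cell)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None


def place_cells(html):
    """Return (n_rows, n_cols, placements); each placement is
    (row, col, rowspan, colspan, text) on a rectangular grid."""
    parser = TableGridParser()
    parser.feed(html)
    n_rows = len(parser.rows)
    # Estimate the column count from the widest row (same heuristic as the article)
    n_cols = max((sum(c["colspan"] for c in row) for row in parser.rows), default=0)
    used = [[False] * n_cols for _ in range(n_rows)]
    placements = []
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            # Skip grid slots already covered by a rowspan from an earlier row
            while c < n_cols and used[r][c]:
                c += 1
            if c >= n_cols:
                break
            rs = min(cell["rowspan"], n_rows - r)
            cs = min(cell["colspan"], n_cols - c)
            for rr in range(r, r + rs):
                for cc in range(c, c + cs):
                    used[rr][cc] = True
            placements.append((r, c, rs, cs, cell["text"]))
            c += cs
    return n_rows, n_cols, placements
```

Each placement maps directly onto a Word-side call of the form table.cell(row, col).merge(table.cell(row + rowspan - 1, col + colspan - 1)) whenever either span is greater than one.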
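3. Template generation and dynamic styles
    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    def generate_template():
        doc = Document()
        for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
            for bold_flag in [True, False]:
                # Create paragraphs in each style for this alignment and bold setting
                ...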
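Purpose: dynamically generates a template document covering several alignments (left, right, centered, unset).
Advantage: new styles are quick to add, so the template adapts to different scenarios.
III. Complete Examples
Example 1: Copying paragraph styles and pictures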
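    def clone_document(old_s, old_p, old_ws, new_doc_path):
        new_doc = Document()
        for para in old_p:
            if "Image_None" in para:
                # Entries that carry an "image" key have more than three fields
                copy_inline_shapes(new_doc, [i["image"] for i in old_s if len(i) > 3][0])
            elif "table" in para:
                html_table_to_docx(new_doc, para["table"])
            else:
                # Abridged: the full version also passes the text, the new document and the style entries
                clone_paragraph(para)
        new_doc.save(new_doc_path)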
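The outline above is easier to follow once the shape of its inputs is spelled out. The sketch below uses plain strings as stand-ins for python-docx style and run objects. The dictionary layout (a style entry with "run_style", "style", "alignment" and an optional "image" key, plus a content list of single-key dictionaries and "br" markers) is inferred from the full listings later in this article; find_style_entry and plan_actions are hypothetical helpers that exist only for this illustration.

```python
def find_style_entry(style_entries, style_key):
    """Return the first entry whose "style" mapping contains style_key, else None."""
    for entry in style_entries:
        if style_key in entry.get("style", {}):
            return entry
    return None


def plan_actions(old_s, old_p):
    """Turn the content list into (action, payload) pairs without touching python-docx."""
    actions = []
    for item in old_p:
        if item == "br":
            # Page-break marker emitted by the extraction step
            actions.append(("page_break", None))
            continue
        for key, value in item.items():
            if key == "Image_None":
                actions.append(("image", None))
            elif key == "table":
                actions.append(("table", value))
            else:
                # key is the run text, value is the "<style name>_<alignment>" lookup key
                entry = find_style_entry(old_s, value)
                actions.append(("paragraph", (key, entry is not None)))
    return actions


# Stand-in data: real entries hold python-docx objects where strings appear here
old_s = [
    {"run_style": ["run-0"], "style": {"Heading 1_CENTER": "style-obj"},
     "alignment": {"Heading 1_CENTER": "CENTER"}},
    {"run_style": ["run-1"], "style": {"Normal_None": "style-obj"},
     "alignment": {"Normal_None": None}, "image": [(b"bytes", 100, 50)]},
]
old_p = [
    {"Weekly report": "Heading 1_CENTER"},
    {"Body text": "Normal_None"},
    {"Image_None": "Image_None"},
    {"table": "<table></table>"},
    "br",
]

for action in plan_actions(old_s, old_p):
    print(action)
```

Keeping the dispatch separate from the document writing makes it possible to check the extracted structure before any Word file is produced.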
Example 2: HTML table to Word
    from bs4 import BeautifulSoup

    def html_table_to_docx(doc, html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        tables = soup.find_all('table')
        for table in tables:
            # Handle merged cells and style conversion...
            ...
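IV. Key Implementation Details
1. Style copying strategy
Inheritance: font, alignment and related attributes are carried through the run_style and style fields.
Page breaks: is_page_break decides whether a page break is needed after a paragraph or table.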
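To make the page-break check concrete, here is a minimal sketch that works on raw WordprocessingML with the standard library only. It is not the article's is_page_break; has_page_break and the w helper are illustrative names. Two details matter: the w:br element sits inside a run (w:r), not directly under the paragraph, and its type attribute lives in the w namespace.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Build the namespace-qualified name for a w:-prefixed tag or attribute."""
    return f"{{{W_NS}}}{tag}"


def has_page_break(paragraph_el):
    """Return True if any descendant w:br of the paragraph has w:type="page"."""
    return any(br.get(w("type")) == "page" for br in paragraph_el.iter(w("br")))


PAGE_BREAK_XML = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t>End of section</w:t></w:r>'
    '<w:r><w:br w:type="page"/></w:r>'
    '</w:p>'
)
LINE_BREAK_XML = f'<w:p xmlns:w="{W_NS}"><w:r><w:br/></w:r></w:p>'

print(has_page_break(ET.fromstring(PAGE_BREAK_XML)))
print(has_page_break(ET.fromstring(LINE_BREAK_XML)))
```

With python-docx the same test can be written against paragraph._element, using qn('w:br') and qn('w:type') from docx.oxml.ns in place of the w helper.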
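2. Table conversion
Merged-cell detection: horizontal and vertical merges are identified from the cell's tcPr element.
Style migration: visual attributes such as borders and background color are preserved.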
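As a rough illustration of what reading merges from tcPr involves, the sketch below parses a small hand-written table with the standard library. cell_spans is a hypothetical helper, not a python-docx API. A horizontal merge is stored as w:gridSpan on a single cell, while a vertical merge marks the first cell with w:vMerge w:val="restart" and each covered cell with a w:vMerge whose val is "continue" or, very commonly, omitted, in which case it defaults to "continue".

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Build the namespace-qualified name for a w:-prefixed tag or attribute."""
    return f"{{{W_NS}}}{tag}"


def cell_spans(tc):
    """Return (grid_span, v_merge) for a w:tc element.

    v_merge is None, 'restart' or 'continue'; a bare <w:vMerge/> counts as 'continue'.
    """
    tc_pr = tc.find(w("tcPr"))
    if tc_pr is None:
        return 1, None
    grid_span_el = tc_pr.find(w("gridSpan"))
    grid_span = int(grid_span_el.get(w("val"), "1")) if grid_span_el is not None else 1
    v_merge_el = tc_pr.find(w("vMerge"))
    v_merge = None if v_merge_el is None else v_merge_el.get(w("val"), "continue")
    return grid_span, v_merge


TABLE_XML = f"""
<w:tbl xmlns:w="{W_NS}">
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p/></w:tc>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p/></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p/></w:tc>
    <w:tc><w:p/></w:tc>
  </w:tr>
</w:tbl>
"""

tbl = ET.fromstring(TABLE_XML)
spans = [[cell_spans(tc) for tc in tr.findall(w("tc"))] for tr in tbl.findall(w("tr"))]
for row in spans:
    print(row)
```

One practical consequence: python-docx's row.cells returns the origin cell again for every merged position, so merge detection of this kind has to walk the underlying w:tc elements instead of the cell proxies.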
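3. Dynamic template generation
Multiple styles: the template is built by iterating over every paragraph style, which keeps it extensible.
Flexible configuration: page-break positions and style parameters can be customized.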
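V. Use Cases
Scenario | Solution |
---|---|
Paragraph layout | Copy styles automatically and keep the formatting |
Data table export | Convert HTML to Word tables, including merged cells |
Report template generation | Dynamically create template files containing multiple styles |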
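VI. Summary
With python-docx we have built a complete workflow from style copying to table conversion, and the dynamically generated templates make document processing more flexible. Whether the task is complex text-and-picture layout or quickly producing documents in several styles, this approach offers a workable implementation path.
Recommendation: in real projects, iterate over all elements of the python-docx Document object for finer control. Catching exceptional cases, such as malformed images, is also an important part of making the code robust.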
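One way to act on the robustness advice is to screen image bytes before handing them to add_picture. The sketch below is an assumption-laden illustration, not python-docx behavior: sniff_image_format and split_valid_images are hypothetical helpers that look only at well-known file signatures, and the (bytes, width, height) tuples mirror the image lists used in this article.

```python
# Leading byte signatures of a few common raster formats
_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
    "tiff": (b"II*\x00", b"MM\x00*"),
}


def sniff_image_format(blob):
    """Return a format name based on the leading bytes, or None if unrecognized."""
    for name, prefixes in _SIGNATURES.items():
        if blob.startswith(prefixes):
            return name
    return None


def split_valid_images(images):
    """Split (bytes, width, height) tuples into usable images and rejected indexes."""
    valid, rejected = [], []
    for index, (blob, width, height) in enumerate(images):
        if sniff_image_format(blob) is None:
            rejected.append(index)
        else:
            valid.append((blob, width, height))
    return valid, rejected


images = [
    (b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, 100, 50),
    (b"not an image", 10, 10),
    (b"\xff\xd8\xff\xe0" + b"\x00" * 8, 200, 80),
]
valid, rejected = split_valid_images(images)
print(len(valid), rejected)
```

A signature check catches only obviously wrong input; a truncated or corrupt file can still fail later, so wrapping the add_picture call in a try/except remains worthwhile.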
VII. Extended Material
Generating a document from template styles
    import io

    from docx import Document
    from docx.enum.text import WD_BREAK
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    # Local module providing the extraction-side clone_document and html_table_to_docx
    from wan_neng_copy_word import clone_document as get_para_style, html_table_to_docx

    W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


    def copy_inline_shapes(new_doc, img):
        """Copy a paragraph's inline shapes (usually pictures) into a new paragraph."""
        new_para = new_doc.add_paragraph()
        for image_bytes, w, h in img:
            # Insert each picture at the width and height recorded from the source
            new_para.add_run().add_picture(io.BytesIO(image_bytes), width=w, height=h)


    def copy_paragraph_style(run_from, run_to):
        """Copy character formatting from one run to another."""
        run_to.bold = run_from.bold
        run_to.italic = run_from.italic
        run_to.underline = run_from.underline
        run_to.font.size = run_from.font.size
        run_to.font.color.rgb = run_from.font.color.rgb
        run_to.font.name = run_from.font.name
        run_to.font.all_caps = run_from.font.all_caps
        run_to.font.strike = run_from.font.strike
        run_to.font.shadow = run_from.font.shadow


    def is_page_break(element):
        """Return True if a paragraph contains a page break, or a table is followed by one."""
        def has_page_break(p):
            # w:br sits inside a run, so search descendants rather than direct children
            return any(br.get(qn('w:type')) == 'page' for br in p.iter(qn('w:br')))

        if element.tag == qn('w:p'):
            return has_page_break(element)
        if element.tag == qn('w:tbl'):
            # A page break after a table lives in the paragraph that follows it
            next_element = element.getnext()
            return (next_element is not None
                    and next_element.tag == qn('w:p')
                    and has_page_break(next_element))
        return False


    def clone_paragraph(para_style, text, new_doc, para_style_ws):
        """Create a new paragraph from a source style entry and a template style entry."""
        new_para = new_doc.add_paragraph()
        para_style_ws = list(para_style_ws["style"].values())[0]
        para_style_data = list(para_style["style"].values())[0]
        # Carry the source font size over to the template style before applying it
        para_style_ws.font.size = para_style_data.font.size
        new_para.style = para_style_ws
        new_run = new_para.add_run(text)
        copy_paragraph_style(para_style["run_style"][0], new_run)
        new_para.alignment = list(para_style["alignment"].values())[0]
        return new_para


    def copy_cell_borders(old_cell, new_cell):
        """Copy a cell's border settings."""
        old_tc = old_cell._tc
        new_tc = new_cell._tc
        old_borders = old_tc.xpath('.//w:tcBorders')
        if old_borders:
            old_border = old_borders[0]
            new_border = OxmlElement('w:tcBorders')
            border_types = ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']
            for border_type in border_types:
                old_element = old_border.find(f'.//w:{border_type}', namespaces={'w': W_NS})
                if old_element is not None:
                    new_element = OxmlElement(f'w:{border_type}')
                    for attr, value in old_element.attrib.items():
                        new_element.set(attr, value)
                    new_border.append(new_element)
            tc_pr = new_tc.get_or_add_tcPr()
            tc_pr.append(new_border)


    def clone_table(old_table, new_doc):
        """Create a new table from an existing one."""
        new_table = new_doc.add_table(rows=len(old_table.rows), cols=len(old_table.columns))
        if old_table.style:
            new_table.style = old_table.style
        for i, old_row in enumerate(old_table.rows):
            for j, old_cell in enumerate(old_row.cells):
                new_cell = new_table.cell(i, j)
                # Remove the empty default paragraph before copying content
                for paragraph in new_cell.paragraphs:
                    new_cell._element.remove(paragraph._element)
                for old_paragraph in old_cell.paragraphs:
                    new_paragraph = new_cell.add_paragraph()
                    for old_run in old_paragraph.runs:
                        new_run = new_paragraph.add_run(old_run.text)
                        copy_paragraph_style(old_run, new_run)
                    new_paragraph.alignment = old_paragraph.alignment
                copy_cell_borders(old_cell, new_cell)
        for i, col in enumerate(old_table.columns):
            if col.width is not None:
                new_table.columns[i].width = col.width
        return new_table


    def clone_document(old_s, old_p, old_ws, new_doc_path):
        new_doc = Document()
        # One image list per source paragraph that contained pictures, in document order
        image_sets = iter([i["image"] for i in old_s if "image" in i])
        # Copy the body content
        for para in old_p:
            if para == "br":
                # Page-break marker emitted by the extraction step
                new_doc.add_paragraph().add_run().add_break(WD_BREAK.PAGE)
                continue
            for k, v in para.items():
                if k == "Image_None":
                    copy_inline_shapes(new_doc, next(image_sets))
                elif k == "table":
                    html_table_to_docx(new_doc, v)
                else:
                    # k is the run text, v is the "<style name>_<alignment>" lookup key
                    style = [i for i in old_s if "style" in i and v in i["style"]]
                    style_ws = [i for i in old_ws if "style" in i and v in i["style"]]
                    clone_paragraph(style[0], k, new_doc, style_ws[0])
        new_doc.save(new_doc_path)


    # Usage example
    if __name__ == "__main__":
        body_ws, _ = get_para_style('demo_template.docx')
        body_s, body_p = get_para_style("南山三防工作专报1.docx")
        clone_document(body_s, body_p, body_ws, 'cloned_example.docx')
Separating template styles from text
    import string

    from bs4 import BeautifulSoup
    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn
    from docx.table import _Cell

    # CSS equivalents for common w:val border styles; anything else falls back to solid
    BORDER_STYLES = {'single': 'solid', 'double': 'double', 'dashed': 'dashed',
                     'dotted': 'dotted', 'nil': 'none', 'none': 'none'}


    def _grid_span(tc):
        """Number of grid columns a w:tc spans (w:gridSpan, default 1)."""
        tc_pr = tc.find(qn('w:tcPr'))
        grid_span = tc_pr.find(qn('w:gridSpan')) if tc_pr is not None else None
        return int(grid_span.get(qn('w:val'), '1')) if grid_span is not None else 1


    def _v_merge(tc):
        """Return 'restart', 'continue' or None; a bare <w:vMerge/> means 'continue'."""
        tc_pr = tc.find(qn('w:tcPr'))
        v_merge = tc_pr.find(qn('w:vMerge')) if tc_pr is not None else None
        return v_merge.get(qn('w:val'), 'continue') if v_merge is not None else None


    def _tc_at(tr, grid_col):
        """Return the w:tc that starts at grid column grid_col in row tr, or None."""
        col = 0
        for tc in tr.findall(qn('w:tc')):
            if col == grid_col:
                return tc
            col += _grid_span(tc)
        return None


    def docx_table_to_html(word_table):
        """Serialize a Word table to an HTML table string, keeping merges and basic cell styling."""
        soup = BeautifulSoup(features='html.parser')
        html_table = soup.new_tag('table')
        trs = [row._tr for row in word_table.rows]

        for row_idx, tr in enumerate(trs):
            html_tr = soup.new_tag('tr')
            col_idx = 0
            # Walk the raw w:tc elements: row.cells repeats the origin cell for merged positions
            for tc in tr.findall(qn('w:tc')):
                colspan = _grid_span(tc)
                v_merge = _v_merge(tc)
                # Skip cells covered by a vertical merge that started in a row above
                if v_merge == 'continue':
                    col_idx += colspan
                    continue

                td = soup.new_tag('td')
                # Text content
                td.string = _Cell(tc, word_table).text.strip()
                # Accumulated inline CSS
                td_style = ''

                tc_pr = tc.find(qn('w:tcPr'))
                if tc_pr is not None:
                    # Background color
                    shd = tc_pr.find(qn('w:shd'))
                    if shd is not None:
                        bg_color = shd.get(qn('w:fill'))
                        if bg_color and bg_color != 'auto':
                            td_style += f'background-color:#{bg_color};'

                    # Alignment
                    jc = tc_pr.find(qn('w:jc'))
                    if jc is not None:
                        align = jc.get(qn('w:val'))
                        if align == 'center':
                            td_style += 'text-align:center;'
                        elif align == 'right':
                            td_style += 'text-align:right;'
                        else:
                            td_style += 'text-align:left;'

                    # Borders
                    borders = tc_pr.find(qn('w:tcBorders'))
                    if borders is not None:
                        for border_type in ['top', 'left', 'bottom', 'right']:
                            border = borders.find(qn(f'w:{border_type}'))
                            if border is not None:
                                color = border.get(qn('w:color'), '000000')
                                if color == 'auto':
                                    color = '000000'
                                # w:sz is measured in eighths of a point
                                size = int(border.get(qn('w:sz'), '4'))
                                css_style = BORDER_STYLES.get(border.get(qn('w:val'), 'single'), 'solid')
                                td_style += f'border-{border_type}:{max(1, round(size / 8))}px {css_style} #{color};'

                # Horizontal merge (colspan)
                if colspan > 1:
                    td['colspan'] = str(colspan)

                # Vertical merge (rowspan): count the 'continue' cells directly below
                if v_merge == 'restart':
                    rowspan = 1
                    for next_tr in trs[row_idx + 1:]:
                        next_tc = _tc_at(next_tr, col_idx)
                        if next_tc is None or _v_merge(next_tc) != 'continue':
                            break
                        rowspan += 1
                    if rowspan > 1:
                        td['rowspan'] = str(rowspan)

                # Apply the style plus a default padding
                td['style'] = td_style + "padding: 5px;"
                html_tr.append(td)
                # Advance the grid column index
                col_idx += colspan

            html_table.append(html_tr)

        soup.append(html_table)
        return str(soup)


    def set_cell_background(cell, color_hex):
        """Set a cell's background color from a 6-digit hex string such as '#FFCC00'."""
        color_hex = color_hex.lstrip('#')
        if len(color_hex) != 6 or any(ch not in string.hexdigits for ch in color_hex):
            raise ValueError(f"unsupported color value: {color_hex!r}")
        shading_elm = OxmlElement('w:shd')
        shading_elm.set(qn('w:val'), 'clear')
        shading_elm.set(qn('w:fill'), color_hex)
        cell._tc.get_or_add_tcPr().append(shading_elm)


    def html_table_to_docx(doc, html_content):
        """
        Convert the tables in an HTML string into tables in a Word document.
        :param doc: python-docx Document instance
        :param html_content: HTML string
        """
        align_map = {
            'left': WD_ALIGN_PARAGRAPH.LEFT,
            'center': WD_ALIGN_PARAGRAPH.CENTER,
            'right': WD_ALIGN_PARAGRAPH.RIGHT,
        }
        soup = BeautifulSoup(html_content, 'html.parser')
        tables = soup.find_all('table')

        for html_table in tables:
            # Number of rows
            trs = html_table.find_all('tr')
            rows = len(trs)

            # Estimate the maximum column count (taking colspan into account)
            cols = 0
            for tr in trs:
                col_count = 0
                for cell in tr.find_all(['td', 'th']):
                    col_count += int(cell.get('colspan', 1))
                cols = max(cols, col_count)
            if rows == 0 or cols == 0:
                continue

            # Create the Word table
            table = doc.add_table(rows=rows, cols=cols)
            table.style = 'Table Grid'

            # Track grid positions already taken (needed for merges)
            used_cells = [[False for _ in range(cols)] for _ in range(rows)]

            for row_idx, tr in enumerate(trs):
                cells = tr.find_all(['td', 'th'])
                col_idx = 0
                for cell in cells:
                    while col_idx < cols and used_cells[row_idx][col_idx]:
                        col_idx += 1
                    if col_idx >= cols:
                        break  # avoid running past the last column

                    # colspan and rowspan
                    colspan = int(cell.get('colspan', 1))
                    rowspan = int(cell.get('rowspan', 1))

                    # Text content
                    text = cell.get_text(strip=True)

                    # Background color and alignment from the inline style
                    bg_color = None
                    css_align = None
                    for s in cell.get('style', '').split(';'):
                        name, _, value = s.partition(':')
                        name, value = name.strip().lower(), value.strip()
                        if name in ('background-color', 'background') and bg_color is None:
                            bg_color = value
                        elif name == 'text-align':
                            css_align = value
                    # The align attribute takes precedence over the inline style
                    align = cell.get('align') or css_align

                    # Target Word cell
                    word_cell = table.cell(row_idx, col_idx)

                    # Merge cells
                    if colspan > 1 or rowspan > 1:
                        end_row = min(row_idx + rowspan - 1, rows - 1)
                        end_col = min(col_idx + colspan - 1, cols - 1)
                        merged_cell = table.cell(row_idx, col_idx).merge(table.cell(end_row, end_col))
                        word_cell = merged_cell

                    # Set the text
                    para = word_cell.paragraphs[0]
                    para.text = text

                    # Set the alignment
                    if align in align_map:
                        para.alignment = align_map[align]

                    # Set the background color
                    if bg_color:
                        try:
                            set_cell_background(word_cell, bg_color)
                        except ValueError:
                            pass  # ignore color values that are not 6-digit hex

                    # Mark the covered grid positions as used
                    for r in range(row_idx, min(row_idx + rowspan, rows)):
                        for c in range(col_idx, min(col_idx + colspan, cols)):
                            used_cells[r][c] = True

                    # Move to the next free column
                    col_idx += colspan

            # Add an empty paragraph as a separator
            doc.add_paragraph()

        return doc


    def copy_inline_shapes(old_paragraph):
        """Collect [bytes, width, height] for every inline picture in a paragraph."""
        images = []
        for shape in old_paragraph._element.xpath('.//w:drawing'):
            blip = shape.find('.//a:blip', namespaces={'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
            if blip is not None:
                rId = blip.attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed']
                image_part = old_paragraph.part.related_parts[rId]
                image_bytes = image_part.image.blob
                images.append([image_bytes, image_part.image.width, image_part.image.height])
        return images


    def is_page_break(element):
        """Return True if a paragraph contains a page break, or a table is followed by one."""
        def has_page_break(p):
            # w:br sits inside a run, so search descendants rather than direct children
            return any(br.get(qn('w:type')) == 'page' for br in p.iter(qn('w:br')))

        if element.tag == qn('w:p'):
            return has_page_break(element)
        if element.tag == qn('w:tbl'):
            # A page break after a table lives in the paragraph that follows it
            next_element = element.getnext()
            return (next_element is not None
                    and next_element.tag == qn('w:p')
                    and has_page_break(next_element))
        return False


    def clone_paragraph(old_para):
        """Split a paragraph into a style entry and a list of {text: style key} items."""
        style = {"run_style": []}
        # Lookup key "<style name>_<alignment>", e.g. "Heading 1_CENTER" or "Normal_None"
        style_key = old_para.style.name + "_" + str(old_para.alignment).split()[0]
        if old_para.style:
            # The style is stored so heading levels can later be recognized by their fonts
            style["style"] = {style_key: old_para.style}
        paras = []
        for old_run in old_para.runs:
            style["run_style"].append(old_run)
            paras.append({old_run.text: style_key})
        style["alignment"] = {style_key: old_para.alignment}
        images = copy_inline_shapes(old_para)
        if len(images):
            style["image"] = images
            paras.append({"Image_None": "Image_None"})
        return style, paras


    def clone_document(old_doc_path):
        """Split a document into style entries and an ordered list of content items."""
        try:
            old_doc = Document(old_doc_path)
            # Body-level paragraphs and tables, consumed in document order
            para_iter = iter(old_doc.paragraphs)
            table_iter = iter(old_doc.tables)
            body_style = []
            body_paras = []
            for element in old_doc.element.body:
                if element.tag == qn('w:p'):
                    style, paras = clone_paragraph(next(para_iter))
                    body_style.append(style)
                    body_paras += paras
                    # Record a page-break marker after the paragraph that carries it
                    if is_page_break(element):
                        body_paras.append("br")
                elif element.tag == qn('w:tbl'):
                    body_paras.append({"table": docx_table_to_html(next(table_iter))})
            return body_style, body_paras
        except Exception as e:
            print(f"Error while copying the document: {e}")
            raise


    # Usage example
    if __name__ == "__main__":
        body_s, body_p = clone_document('专报1.docx')
Generating an editable template
    from docx import Document
    from docx.enum.style import WD_STYLE_TYPE
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    # Create a new Word document
    doc = Document()

    # Collect the available paragraph styles once (other style types are skipped)
    paragraph_styles = [
        style for style in doc.styles
        if style.type == WD_STYLE_TYPE.PARAGRAPH
    ]

    # Print how many styles were found
    print(f"Found {len(paragraph_styles)} paragraph styles:")
    for style in paragraph_styles:
        print(f"- {style.name}")

    for align in [WD_ALIGN_PARAGRAPH.LEFT, WD_ALIGN_PARAGRAPH.RIGHT, WD_ALIGN_PARAGRAPH.CENTER, None]:
        for bold_flag in [True, False]:
            # Add a labelled sample paragraph for every style
            for style in paragraph_styles:
                heading = doc.add_paragraph()
                run = heading.add_run(f"Style name: {style.name}")
                run.bold = bold_flag

                para = doc.add_paragraph(f"This is a sample paragraph using the '{style.name}' style.", style=style)
                para.alignment = align

                # Separator line (optional)
                doc.add_paragraph("-" * 40)

    # Save as demo_template.docx
    doc.save("demo_template.docx")
    print("\n✅ Generated a template file containing every paragraph style: demo_template.docx")