Converting WordPress XML to HTML – 礼君之家


2025-4-08 1:20


To convert a WordPress XML export into HTML and extract the post content, a Python program along the following lines can be used:

import xml.etree.ElementTree as ET
import os
import re
from html import escape

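def sanitize_filename(title):
    """Replace characters that are unsafe in filenames and cap the length."""
    # In Python 3, \w already matches CJK characters; the explicit range is kept for clarity
    filename = re.sub(r'[^\w\u4e00-\u9fff]', '_', title)
    filename = re.sub(r'_+', '_', filename).strip('_')
    # Fall back to a placeholder when nothing usable is left (e.g. a title made only of punctuation)
    return (filename[:50] or 'untitled') + '.html'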

def process_xml(xml_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    # XML namespaces used by the WordPress export
    namespaces = {
        'wp': 'http://wordpress.org/export/1.2/',
        'content': 'http://purl.org/rss/1.0/modules/content/'
    }

    # Parse the XML file
    tree = ET.parse(xml_path)
    root = tree.getroot()

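    # Walk every <item> node
    for item in root.findall('.//item'):
        # Skip anything that is not a regular post (pages, attachments, ...)
        post_type = item.find('wp:post_type', namespaces)
        if post_type is None or post_type.text != 'post':
            continue

        # Extract the title and the body; guard against a missing <title> element
        title_el = item.find('title')
        title = title_el.text if title_el is not None else None
        content = item.find('content:encoded', namespaces)

        if title is None or content is None:
            continue

        # Escape HTML special characters in the title only;
        # the body is already HTML and is written out unchanged
        safe_title = escape(title)
        safe_content = content.text if content.text else ''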

        # Build the HTML fragment
        html = f'''<div class="post">
    <h1>{safe_title}</h1>
    {safe_content}
</div>'''

        # Generate the filename and write the file
        filename = sanitize_filename(title)
        output_path = os.path.join(output_dir, filename)

        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(html)

if __name__ == '__main__':
    # Usage example
    process_xml('WordPress.2025-03-25.xml', 'test')

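How the program works:

XML parsing: uses ElementTree to parse the document and resolves the WordPress namespaces (wp and content)
Content filtering: only items whose post type is post are processed
HTML generation:
• Escapes special characters in the title; the body is written out unescaped, so this alone does not protect against XSS if the export contains untrusted markup
• Wraps each post in a <div class="post"> container
• Preserves the original HTML formatting of the body
Saving files:
• The filename is generated from the title
• Characters other than letters, digits, underscores and Chinese characters are replaced with underscores, and the name is truncated to 50 characters
• Files are written to the specified output directory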
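
The parsing and filtering steps can be tried in isolation on a small inline document. The sketch below is only an illustration: SAMPLE is a made-up, stripped-down export, and extract_posts is a hypothetical helper name that does not appear in the script above.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal export used only for this illustration
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello &amp; welcome</title>
      <content:encoded><![CDATA[<p>First post</p>]]></content:encoded>
      <wp:post_type>post</wp:post_type>
    </item>
    <item>
      <title>About</title>
      <content:encoded><![CDATA[<p>A page</p>]]></content:encoded>
      <wp:post_type>page</wp:post_type>
    </item>
  </channel>
</rss>"""

NAMESPACES = {
    'wp': 'http://wordpress.org/export/1.2/',
    'content': 'http://purl.org/rss/1.0/modules/content/',
}


def extract_posts(xml_text):
    """Return (title, body_html) pairs for items whose post type is 'post'."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.findall('.//item'):
        # The prefixed name is resolved through the NAMESPACES mapping
        if item.findtext('wp:post_type', namespaces=NAMESPACES) != 'post':
            continue
        title = item.findtext('title', default='')
        body = item.findtext('content:encoded', default='', namespaces=NAMESPACES)
        posts.append((title, body))
    return posts


posts = extract_posts(SAMPLE)
print(posts)
```

Only the first item is returned, because the second one has post type page. The parser has already decoded the &amp; entity in the title, which is why the title needs to be escaped again before it is written into HTML.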

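Usage:

Save the XML file exported from WordPress as WordPress.2025-03-25.xml in the directory the script is run from
Run the script; the test directory is created automatically
Each post is written to its own HTML file, named after the sanitized title with an .html extension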
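
Each generated file contains only the post's div, with no doctype and no charset declaration, so a browser opening it directly may guess the encoding wrongly. If standalone pages are wanted, the fragment could be wrapped as sketched below. wrap_fragment and its lang parameter are hypothetical names introduced here for illustration, not part of the original script.

```python
from html import escape


def wrap_fragment(title, body_html, lang='zh-CN'):
    """Return a complete HTML5 page around an already-rendered post body.

    The title is escaped; body_html is inserted unchanged, as in the script.
    """
    return (
        '<!DOCTYPE html>\n'
        f'<html lang="{escape(lang)}">\n'
        '<head>\n'
        '  <meta charset="utf-8">\n'
        f'  <title>{escape(title)}</title>\n'
        '</head>\n'
        '<body>\n'
        '<div class="post">\n'
        f'  <h1>{escape(title)}</h1>\n'
        f'  {body_html}\n'
        '</div>\n'
        '</body>\n'
        '</html>\n'
    )


page = wrap_fragment('Tom & Jerry', '<p>Hello</p>')
print(page)
```

The charset declaration matches the utf-8 encoding the script uses when writing files.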

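Notes:

Requires Python 3.6 or later (the script uses f-strings)
Image links and other resource references in the original HTML are kept as-is; the files they point to are not downloaded
Chinese and English characters are kept in filenames; spaces and punctuation are replaced with underscores
The output directory is created automatically; there is no need to create it by hand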
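
One case the script does not cover: two posts whose titles sanitize to the same filename are written to the same path, so the later one overwrites the earlier one. A possible workaround, assuming a numeric suffix is acceptable, is sketched below; unique_path is a hypothetical helper, not part of the original script.

```python
import os
import tempfile


def unique_path(output_dir, filename):
    """Return a path inside output_dir that does not exist yet.

    On a collision, 'post.html' becomes 'post_2.html', 'post_3.html', ...
    """
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(output_dir, filename)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(output_dir, f'{stem}_{counter}{ext}')
        counter += 1
    return candidate


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    first = unique_path(tmp, 'post.html')
    open(first, 'w', encoding='utf-8').close()  # simulate the first post being saved
    second = unique_path(tmp, 'post.html')
    names = (os.path.basename(first), os.path.basename(second))

print(names)
```

In the script, this would replace the plain os.path.join call that builds output_path.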

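The program processes a WordPress XML export in a single pass: it converts the format, extracts the post content, and writes each post to its own file.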
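
ET.parse reads the whole export into memory, which is fine for a typical blog but may become a concern for very large exports. A streaming variant based on ET.iterparse is sketched below as one possible alternative. iter_post_titles and SAMPLE are made up for this illustration, and the sketch only yields titles to stay short.

```python
import io
import xml.etree.ElementTree as ET

NAMESPACES = {'wp': 'http://wordpress.org/export/1.2/'}


def iter_post_titles(source):
    """Yield the titles of 'post' items one at a time.

    source is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != 'item':
            continue
        if elem.findtext('wp:post_type', namespaces=NAMESPACES) == 'post':
            yield elem.findtext('title', default='')
        # Drop the item's children and text once it has been handled
        elem.clear()


# Made-up sample data used only for this illustration
SAMPLE = b"""<rss xmlns:wp="http://wordpress.org/export/1.2/"><channel>
<item><title>First</title><wp:post_type>post</wp:post_type></item>
<item><title>About</title><wp:post_type>page</wp:post_type></item>
<item><title>Second</title><wp:post_type>post</wp:post_type></item>
</channel></rss>"""

titles = list(iter_post_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Clearing each item reduces memory use but does not eliminate it, since the emptied elements remain attached to their parent.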
