盒子
Posts List
  1. I. Install beautifulsoup4
  2. II. Install the lxml plugin
    1. Install wheel
    2. Download the lxml wheel
    3. Install lxml
  3. III. HTML processing
    1. Set the HTML parser
    2. Remove tags
    3. Change content
    4. Full example

Modifying HTML with Python

I. Install beautifulsoup4

pip install beautifulsoup4 -i https://pypi.douban.com/simple

II. Install the lxml plugin

Install wheel

pip install wheel

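Download the lxml wheel

Open the wheel download page:

http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml

Download the lxml build that matches your system and Python version. For 64-bit Windows with Python 2.7, for example:
lxml-3.6.4-cp27-cp27m-win_amd64.whl

Install lxml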

pip install lxml-3.6.4-cp27-cp27m-win_amd64.whl
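
After both installs, a quick import check can confirm that the packages are visible to the interpreter. The following is only a minimal sketch: the helper name check_modules is made up for illustration, and it only tests whether each module can be imported, not whether the versions match.

```python
def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    result = {}
    for name in names:
        try:
            __import__(name)
            result[name] = True
        except ImportError:
            result[name] = False
    return result


if __name__ == "__main__":
    # bs4 is the import name of beautifulsoup4; lxml is the parser plugin.
    for name, ok in sorted(check_modules(["bs4", "lxml"]).items()):
        print("%s: %s" % (name, "installed" if ok else "missing"))
```

If either line reports "missing", rerun the matching pip command from the steps above.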

III. HTML processing

Set the HTML parser

soup = BeautifulSoup(content,"lxml")
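
The second argument selects the parser. If lxml is not installed, BeautifulSoup raises FeatureNotFound, and one option is to fall back to Python's built-in "html.parser". The sketch below assumes that fallback is acceptable; the helper make_soup and the sample markup are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(content, parser="lxml"):
    """Parse content with the requested parser, falling back to html.parser."""
    try:
        return BeautifulSoup(content, parser)
    except FeatureNotFound:
        # lxml is not installed; the built-in parser needs no extra package.
        return BeautifulSoup(content, "html.parser")


soup = make_soup(
    "<html><head><title>demo</title></head><body><p>hello</p></body></html>"
)
print(soup.title.string)
print(soup.p.get_text())
```

The two parsers can produce different trees for incomplete markup: lxml wraps a bare fragment in html and body tags, while html.parser leaves it as is.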

Remove tags

for tag in soup.find_all('script'):
    tag.extract()
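
To see the effect on a concrete document, here is a small sketch with made-up markup. It uses the built-in "html.parser" so that it runs even without lxml. extract() removes a tag from the tree and returns it; decompose() is the alternative when the removed tag is not needed afterwards.

```python
from bs4 import BeautifulSoup

html = "<div><script>alert(1)</script><p>kept</p><script src='a.js'></script></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like snapshot, so removing tags while looping is safe.
removed = [tag.extract() for tag in soup.find_all("script")]

print(len(removed))  # number of script tags taken out
print(soup)          # remaining markup
```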

Change content

# href=True skips <link> tags that have no href attribute
for tag in soup.find_all('link', href=True):
    tag['href'] = tag['href'].replace("/XXXX/", "/")
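
Attributes behave like dictionary entries, so tag['href'] raises KeyError on a link tag that has no href. One way to avoid this is to filter with href=True, as in the sketch below. The sample markup is invented, and "html.parser" is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<head>'
    '<link rel="stylesheet" href="/XXXX/css/site.css">'
    '<link rel="icon">'
    '</head>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <link> tags that actually carry an href attribute.
for tag in soup.find_all("link", href=True):
    tag["href"] = tag["href"].replace("/XXXX/", "/")

# tag.get returns None instead of raising when the attribute is absent.
hrefs = [tag.get("href") for tag in soup.find_all("link")]
print(hrefs)
```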

Full example

# coding=utf-8
import sys

from bs4 import BeautifulSoup

# To read an argument from the command line:
# channelId = sys.argv[1]


def readFile(index):
    # The input files are named by number, e.g. "358".
    with open(str(index), 'r') as file_object:
        return file_object.read()


def writeFile(fname, fcontent):
    with open(fname, 'w') as file_object:
        return file_object.write(fcontent)


def html2Soup(content):
    return BeautifulSoup(content, "lxml")


def removeAllScript(soup):
    # Remove every <script> tag from the document.
    for tag in soup.find_all('script'):
        tag.extract()


def replaceCSS(soup):
    # Strip the "/XXXX/" prefix from <link> hrefs; skip tags without href.
    for tag in soup.find_all('link', href=True):
        tag['href'] = tag['href'].replace("/XXXX/", "/")


def setCoding():
    # Python 2 only: set the default string encoding to UTF-8.
    if sys.version_info[0] == 2:
        reload(sys)
        sys.setdefaultencoding('utf-8')


if __name__ == '__main__':
    setCoding()
    start = 358
    end = 359
    # range(start, end) excludes end, so this run processes only file 358.
    for i in range(start, end):
        soup = html2Soup(readFile(i))
        removeAllScript(soup)
        replaceCSS(soup)
        writeFile(str(i) + ".html", str(soup))
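
The full example depends on sys.setdefaultencoding, which exists only in Python 2. As a possible alternative, the sketch below reads and writes with an explicit UTF-8 encoding through io.open, which should work on both Python 2.7 and Python 3. The function name clean_file, the temporary file, and the use of "html.parser" instead of lxml are assumptions made for the demonstration.

```python
import io
import os
import tempfile

from bs4 import BeautifulSoup


def clean_file(src, dst, prefix="/XXXX/"):
    """Strip <script> tags and rewrite <link> hrefs, using UTF-8 explicitly."""
    with io.open(src, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for tag in soup.find_all("script"):
        tag.extract()
    for tag in soup.find_all("link", href=True):
        tag["href"] = tag["href"].replace(prefix, "/")
    with io.open(dst, "w", encoding="utf-8") as f:
        # decode() returns the document as a Unicode string.
        f.write(soup.decode())


# Demonstration on a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "358")
dst = src + ".html"
with io.open(src, "w", encoding="utf-8") as f:
    f.write(u'<html><head><link href="/XXXX/a.css"><script>x()</script></head>'
            u'<body><p>\u4f60\u597d</p></body></html>')
clean_file(src, dst)
with io.open(dst, "r", encoding="utf-8") as f:
    output = f.read()
print(output)
```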