How to Cleverly Find the E-books You Want to Read

news/2024/7/24 4:48:15 Tags: database, sqlite, python

I have a friend who loves reading e-books, but he is quite picky: nothing unfinished, nothing with low ratings, nothing that finished too long ago, and so on.

One day he came to me with an e-book site and asked whether I could pull some data from it. One glance told me this was just requests practice, so, busy as I am, I agreed right away — mainly because he treated me to fruit tea.

Let's look at the result first; this is the ranking by xiancao (upvotes).

Analysis

    Actually, not much data is needed — just the following items:

  1. Book title
  2. Author
  3. File size
  4. Xiancao/ducao (upvote/downvote) counts
  5. Synopsis
  6. Genre

       For page analysis, bs4 is a very handy Python module. Once the pages are parsed, the data is saved — here to a SQLite file, which makes later querying easy.
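To illustrate how bs4 extracts links, here is a minimal, self-contained sketch. The HTML snippet and its layout are assumptions for demonstration only; the real parsing code appears below.

```python
from bs4 import BeautifulSoup

# Hypothetical listing snippet: book links are assumed to sit inside <dt> tags,
# mirroring the markup the scraping code below relies on.
html = '<dl><dt><a href="http://zxcs.me/post/123">SomeBook</a></dt></dl>'
soup = BeautifulSoup(html, "html.parser")

# Collect (title, url) pairs from links whose parent is a <dt> tag
links = [(a.get_text(), a.get("href"))
         for a in soup.find_all("a") if a.parent.name == "dt"]
print(links)  # → [('SomeBook', 'http://zxcs.me/post/123')]
```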

Implementation

First decide on the table fields, then write a database helper class and create the database.

class Sql_Utils:
    def __init__(self):
        self.filepath = "./book.data"
        self.get_conn()
        self.check_init()

    def get_conn(self):
        try:
            self.conn = sqlite3.connect(self.filepath)
        except Exception as e:
            self.conn = None
            print(e)
        return self.conn

    def check_init(self):
        conn = self.conn
        if conn:
            c = conn.cursor()
            try:
                cursor = c.execute('''
                select * from bookinfos limit 10;
                ''')
                # for row in cursor:
                #     print(row)
            except Exception as e:
                print(e)
                self.__init_database()
            finally:
                conn.close()
        else:
            print("Failed to open the database")

    def __init_database(self):
        conn = self.conn
        c = conn.cursor()
        c.execute('''CREATE TABLE bookinfos
               (ID INTEGER  PRIMARY KEY    NOT NULL,
               BOOKID            INT     NOT NULL UNIQUE,
               NAME           CHAR(200)    NOT NULL UNIQUE,
               INFOURL        CHAR(300) NOT NULL,
               DOWNLOADURL        CHAR(300) NOT NULL,
               FLOWERS        CHAR(300) NOT NULL,
               DESCRIPTION    CHAR(1000) NOT NULL,
               SIZE CHAR(30) NOT NULL,
               TYPES  CHAR(30) NOT NULL);''')
        conn.commit()
        conn.close()

    def insert(self, *args, **kwargs):
        bookid = args[0]
        books = self.query(bookid)
        if len(books) == 0:
            if len(kwargs) == 7:
                bookname = kwargs.get("bookname")
                infourl = kwargs.get("infourl")
                downloadurl = kwargs.get("downloadurl")
                size = kwargs.get("size")
                types = kwargs.get("types")
                description = kwargs.get("description")
                flowers = kwargs.get("flowers")
            else:
                raise KeyError
        else:
            bookname = kwargs.get("bookname", books[0].get("bookname"))
            infourl = kwargs.get("infourl", books[0].get("infourl"))
            downloadurl = kwargs.get("downloadurl", books[0].get("downloadurl"))
            size = kwargs.get("size", books[0].get("size"))
            types = kwargs.get("types", books[0].get("types"))
            description = kwargs.get("description", books[0].get("description"))
            flowers = kwargs.get("flowers", books[0].get("flowers"))

        conn = self.get_conn()
        c = conn.cursor()
        try:
            c.execute("""INSERT INTO bookinfos(BOOKID,NAME,INFOURL,DOWNLOADURL,FLOWERS,DESCRIPTION,SIZE,TYPES) 
                VALUES(?,?,?,?,?,?,?,?)""", (bookid, bookname, infourl, downloadurl, flowers, description, size, types))
            conn.commit()
            print("%s  inserted" % bookname)
        except Exception as e:
            print("%s  insert failed: %s" % (bookname, e))
        finally:
            conn.close()

    def query(self, bookid):
        self.get_conn()
        conn = self.conn
        c = conn.cursor()
        books = []
        try:
            if bookid == -1:
                cursor = c.execute("""select * from bookinfos""")
            else:
                cursor = c.execute("""select * from bookinfos where BOOKID =?""", (bookid,))

            for row in cursor:
                books.append({
                    # "bookid": row[1],
                    "bookname": row[2],
                    "infourl": row[3],
                    # "downloadurl": row[4],
                    "flowers": row[5],
                    # "description":row[6],
                    "size": row[7],
                    "types": row[8],

                })
        except Exception as e:
            print("Query error: %s" % e)
        finally:
            conn.close()
        return books
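Before moving on, here is a minimal, self-contained sketch of the insert-if-absent pattern the class implements, using an in-memory database and a trimmed-down table (the column set is reduced for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookinfos (BOOKID INT NOT NULL UNIQUE, NAME CHAR(200))")

def insert_if_absent(bookid, name):
    # Query first, insert only when the book id is not present yet
    if conn.execute("SELECT 1 FROM bookinfos WHERE BOOKID = ?", (bookid,)).fetchone():
        return False
    conn.execute("INSERT INTO bookinfos(BOOKID, NAME) VALUES (?, ?)", (bookid, name))
    conn.commit()
    return True

print(insert_if_absent(123, "SomeBook"))  # → True, first call inserts
print(insert_if_absent(123, "SomeBook"))  # → False, second call is a no-op
```

The parameterized `?` placeholders also guard against SQL injection from scraped titles.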

The database helper is done; next comes page analysis to find the useful data.

Open the browser's developer tools with F12.

1. Get the e-book download URL

def get_book_download_url(bookid, headers):
    baseurl = "http://zxcs.me/post/%s" % bookid
    req = requests.get(baseurl, headers=headers)
    if req.status_code == 200:
        html_body = req.content
        soup = BeautifulSoup(html_body, "html.parser")
        for a in soup.find_all('a'):
            if "点击下载" in a.get("title", ""):
                endpagurl = a.get("href")
                return endpagurl
    else:
        print("url: %s ERROR" % baseurl)
        return "ERROR"

2. Get the xiancao/ducao rating data. No page parsing is needed here; it can be fetched directly with one request.

def get_book_flowers(bookid, get_flower_tep, headers):
    flower_url = get_flower_tep % (bookid, random.random())
    req = requests.get(flower_url, headers=headers)
    return req.text
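The article never shows the response format; judging from the sorting code later (a `split(",")` taking the first and last fields), it appears to be comma-separated counts with xiancao first and ducao last. A hedged parsing sketch under that assumption:

```python
def parse_flowers(text):
    # Assumed format: comma-separated counts, e.g. "1234,56,78,90,12",
    # where the first field is xiancao (upvotes) and the last is ducao (downvotes).
    parts = text.split(",")
    return int(parts[0]), int(parts[-1])

print(parse_flowers("1234,56,78,90,12"))  # → (1234, 12)
```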

3. Get the e-book size and synopsis

def get_book_infos(bookurl, headers, bookname):
    req = requests.get(bookurl, headers=headers)
    if req.status_code == 200:
        html_body = req.content
        soup = BeautifulSoup(html_body, "html.parser")
        sinfo = []
        for a in soup.find_all('p'):
            if "内容简介" in str(a):
                for i in a:
                    si = str(i)
                    if "link" in si:
                        break
                    if "<br/>" in si or "\n" == si:
                        continue
                    sinfo.append(str(i).strip("\r\n").strip("\xa0").strip("\t").strip("\u3000"))
        try:
            return sinfo[0].split(":")[1], "".join(sinfo[2:])
        except Exception as e:
            print("ERROR  %s size/description fetch failed: %s" % (bookname, e))
            return "fetch failed", "fetch failed"
    # Non-200 response: keep the two-value return shape so the caller can unpack
    print("url: %s ERROR" % bookurl)
    return "fetch failed", "fetch failed"

4. Follow the links, gather the info, and store it in the database

def get_this_page_url(baseurl, headers, types):
    xurl = []
    req = requests.get(baseurl, headers=headers)
    if req.status_code == 200:
        html_body = req.content
        soup = BeautifulSoup(html_body, "html.parser")
        # print(soup.prettify())
        count = 0
        for a in soup.find_all('a'):
            # print(a)
            if a.parent.name == "dt":  # book title links sit inside <dt> tags
                # print({"%s"%a.string:a.get("href")})
                xurl.append({int(a.get("href").split("/")[-1]): {"%s" % a.string: a.get("href")}})
                sql_conn = Sql_Utils()
                infourl = a.get("href")
                bookid = infourl.split("/")[-1]
                books = sql_conn.query(bookid)
                bookname = a.string

                if len(books) != 0:
                    print("%s already exists, skipping" % bookname)
                    continue
                downloadurl = get_book_download_url(bookid, headers)
                flowers = get_book_flowers(bookid, get_flower_tep, headers)
                size, description = get_book_infos(infourl, headers, bookname)

                sql_conn.insert(bookid, bookname=bookname, infourl=infourl,
                                flowers=flowers, downloadurl=downloadurl, size=size,
                                description=description, types=types)
    return xurl

def get_this_all_url(baseurl, headers, types):
    endurl = get_this_end_url(baseurl, headers)

    endnumlist = endurl.split("/")
    all_url = []
    for page in range(1, int(endnumlist[-1]) + 1):
        endnumlist[-1] = str(page)
        print("*" * 200)
        print("%s: %s pages total, now on page %s" % (types, endurl.split("/")[-1], page))
        this_page_url = get_this_page_url("/".join(endnumlist), headers, types)
        # print("page:%s this_page_url:%s"%(page,this_page_url))
        all_url.extend(this_page_url)
    return all_url
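Note that `get_this_end_url()` is called above but never defined in the article. A plausible sketch, assuming the category pages expose a "last page" (尾页) pagination link whose href ends with the final page number — the markup here is an assumption, not taken from the site:

```python
import requests
from bs4 import BeautifulSoup

def find_end_url(html_body):
    # Look for the pagination link labelled 尾页 ("last page") and return its href
    soup = BeautifulSoup(html_body, "html.parser")
    for a in soup.find_all("a"):
        if "尾页" in a.get_text():
            return a.get("href")
    return None

def get_this_end_url(baseurl, headers):
    req = requests.get(baseurl, headers=headers)
    if req.status_code == 200:
        return find_end_url(req.content)
    return None
```

Splitting the parsing out of the HTTP call keeps `find_end_url()` testable without hitting the network.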

Finally, here is the rest of the code, tidied up:

import random
import sqlite3

import requests
from bs4 import BeautifulSoup

zxcs_urls = {
    "奇幻科幻": "http://zxcs.me/sort/26",  # fantasy
    "都市娱乐": "http://zxcs.me/sort/23",  # urban fiction
    "武侠仙侠": "http://zxcs.me/sort/25",  # wuxia / xianxia
    "科幻灵异": "http://zxcs.me/sort/27",  # sci-fi / supernatural
    "历史军事": "http://zxcs.me/sort/28",  # history / military
    "竞技游戏": "http://zxcs.me/sort/29",  # sports / gaming
    "二次元": "http://zxcs.me/sort/55",    # anime
}
headers = {'User-Agent': 'Mozilla/8.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit'}
get_flower_tep = "http://zxcs.me/content/plugins/cgz_xinqing/cgz_xinqing_action.php?action=show&id=%s&m=%s"




def make_book_data():
    for types, baseurl in zxcs_urls.items():
        get_this_all_url(baseurl, headers, types)


def get_book_l():
    sql_u = Sql_Utils()
    booklist = sql_u.query(-1)

    # Ranking by xiancao (upvotes)
    booklist.sort(key=lambda x: int(x.get("flowers").split(",")[0]), reverse=True)

    # Ranking by ducao (downvotes)
    # booklist.sort(key=lambda x: int(x.get("flowers").split(",")[-1]), reverse=True)

    # Ranking by xiancao percentage
    # booklist.sort(key=lambda x: float(int(x.get("flowers").split(",")[0]) * 10 / (int(x.get("flowers").split(",")[-1]) + int(x.get("flowers").split(",")[0]))), reverse=True)

    for book in booklist:
        if book.get("types") == "竞技游戏":
            continue
        print(book)


if __name__ == '__main__':
    '''
    Run make_book_data() first;
    once it finishes, you can run the queries
    '''
    make_book_data()
    get_book_l()

Run make_book_data() first to store the data in the database; the result looks like this:

 Books already in the database are not stored again.

Then use get_book_l() to query the xiancao/ducao rankings.
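The commented-out percentage ranking in get_book_l() sorts by the share of xiancao among xiancao plus ducao. A self-contained sketch of that sort key (the book data here is made up for illustration):

```python
def xiancao_ratio(flowers):
    # flowers is the comma-separated count string; first field = xiancao, last = ducao
    parts = flowers.split(",")
    good, bad = int(parts[0]), int(parts[-1])
    return good / (good + bad) if good + bad else 0.0

books = [{"bookname": "A", "flowers": "90,1,2,3,10"},  # 90 / 100 = 0.90
         {"bookname": "B", "flowers": "50,1,2,3,5"}]   # 50 / 55  ≈ 0.91
books.sort(key=lambda b: xiancao_ratio(b["flowers"]), reverse=True)
print([b["bookname"] for b in books])  # → ['B', 'A']
```

Ranking by ratio rather than raw count keeps a book with few but uniformly positive votes from being buried under merely popular ones.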

