Example Overview
[Core Code]
    Structuring Crawlers
    Crawling Sites Through Search
    Crawling Sites Through Links
    Crawling Multiple Page Types
    Thinking About Web Crawler Models

5. Scrapy
    Installing Scrapy
    Initializing a New Spider
    Writing a Simple Scraper
    Spidering with Rules
    Creating Items
    Outputting Items
    The Item Pipeline
    Logging with Scrapy
    More Resources

6. Storing Data
    Media Files
    Storing Data to CSV
    MySQL
    Installing MySQL
    Some Basic Commands
    Integrating with Python
    Database Techniques and Good Practice
    “Six Degrees” in MySQL
    Email

Part II. Advanced Scraping

7. Reading Documents
    Document Encoding
    Text
    Text Encoding and the Global Internet
    CSV
    Reading CSV Files
    PDF
    Microsoft Word and .docx

8. Cleaning Your Dirty Data
    Cleaning in Code
    Data Normalization
    Cleaning After the Fact
    OpenRefine

9. Reading and Writing Natural Languages
    Summarizing Data
    Markov Models
    Six Degrees of Wikipedia: Conclusion
    Natural Language Toolkit
    Installation and Setup
    Statistical Analysis with NLTK
    Lexicographical Analysis with NLTK
    Additional Resources

10. Crawling Through Forms and Logins
    Python Requests Library
    Submitting a Basic Form
    Radio Buttons, Checkboxes, and Other Inputs
    Submitting Files and Images
    Handling Logins and Cookies
    HTTP Basic Access Authentication
    Other Form Problems

11. Scraping JavaScript
    A Brief Introduction to JavaScript
    Common JavaScript Libraries
    Ajax and Dynamic HTML
    Executing JavaScript in Python with Selenium
    Additional Selenium Webdrivers
    Handling Redirects
    A Final Note on JavaScript

12. Crawling Through APIs
    A Brief Introduction to APIs
    HTTP Methods and APIs
    More About API Responses
    Parsing JSON
    Undocumented APIs
    Finding Undocumented APIs
    Documenting Undocumented APIs
    Finding and Documenting APIs Automatically
    Combining APIs with Other Data Sources
    More About APIs

13. Image Processing and Text Recognition
    Overview of Libraries
    Pillow
    Tesseract
    NumPy
    Processing Well-Formatted Text
    Adjusting Images Automatically
    Scraping Text from Images on Websites
    Reading CAPTCHAs and Training Tesseract
    Training Tesseract
    Retrieving CAPTCHAs and Submitting Solutions

14. Avoiding Scraping Traps
    A Note on Ethics
    Looking Like a Human
    Adjust Your Headers
    Handling Cookies with JavaScript
    Timing Is Everything
    Common Form Security Features
    Hidden Input Field Values
    Avoiding Honeypots
    The Human Checklist

15. Testing Your Website with Scrapers
    An Introduction to Testing
    What Are Unit Tests?
    Python unittest
    Testing Wikipedia
    Testing with Selenium
    Interacting with the Site
    unittest or Selenium?

16. Web Crawling in Parallel
    Processes versus Threads
    Multithreaded Crawling
    Race Conditions and Queues
    The threading Module
    Multiprocess Crawling
    Multiprocess Crawling
    Communicating Between Processes
    Multiprocess Crawling—Another Approach

17. Scraping Remotely
    Why Use Remote Servers?
    Avoiding IP Address Blocking
    Portability and Extensibility
    Tor
    PySocks
    Remote Hosting
    Running from a Website-Hosting Account
    Running from the Cloud
    Additional Resources

18. The Legalities and Ethics of Web Scraping
    Trademarks, Copyrights, Patents, Oh My!
    Copyright Law
    Trespass to Chattels
    The Computer Fraud and Abuse Act
    robots.txt and Terms of Service
    Three Web Scrapers
    eBay versus Bidder’s Edge and Trespass to Chattels
    United States v. Auernheimer and The Computer Fraud and Abuse Act
    Field v. Google: Copyright and robots.txt
    Moving Forward

Index