
A Spider Program Written in C# (Also Called a "Thief" Program)

  • Language: C#
  • Size: 0.57 MB
  • Downloads: 27
  • Views: 1355
  • Published: 2015-09-11
  • Category: C# network programming
  • Publisher: bszz312
  • File format: .rar
  • Points required: 2
 Related tags: C#, scraping

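About this example

[Overview] A "spider" is a very useful kind of program on the Internet. Search engines use spiders to collect Web pages into databases; companies use them to monitor competitors' websites and track changes; individual users use them to download Web pages for offline reading; developers use them to scan their own sites for broken links, and so on. Different users put spiders to different uses. So how does a spider actually work?
A spider is a semi-automatic program. Just as a real spider travels across its web, a spider program travels in a similar way across the web woven from hyperlinks. It is semi-automatic because it always needs an initial link (a starting point), but after that it decides for itself how to proceed: it scans the links contained in the start page, visits the pages those links point to, and then analyzes and follows the links on those pages. In theory, the spider will eventually visit every page on the Internet, because almost every page is referenced by at least a few other pages.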
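The traversal described above can be sketched as a breadth-first walk over a link graph. This is only an illustration under simplifying assumptions: the `CrawlSketch` class, the `Crawl` method, and the in-memory `links` dictionary are hypothetical stand-ins for real page downloads and HTML parsing, and are not part of the downloadable sample.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Breadth-first traversal starting from one page.
    // `links` maps each page to the pages it links to (hypothetical, in-memory).
    public static List<string> Crawl(string start, IDictionary<string, string[]> links)
    {
        var visited = new HashSet<string> { start }; // pages already discovered
        var pending = new Queue<string>();           // pages waiting to be processed
        pending.Enqueue(start);
        var order = new List<string>();              // pages in the order they were processed

        while (pending.Count > 0)
        {
            string page = pending.Dequeue();
            order.Add(page);

            // A page with no entry is treated as having no outgoing links.
            if (!links.TryGetValue(page, out string[] targets)) continue;

            foreach (string target in targets)
            {
                // HashSet.Add returns false for duplicates, so each page is queued once.
                if (visited.Add(target)) pending.Enqueue(target);
            }
        }
        return order;
    }
}
```

Pages that are not reachable from the starting page are never visited, which is why the "every page" claim holds only in theory.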


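Before testing the download, copy the program to another directory. The program has a few bugs:

1. After the program is closed it does not exit completely. You can kill the process in Task Manager, or change the code to call Application.Exit().

2. Copy the program to another directory before running or debugging it, because it has a directory bug: directories whose names contain symbols are not recognized (an internal error is raised).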
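A process that lingers after its window closes is often kept alive by worker threads that are still running as foreground threads. Whether that is the cause here is an assumption, since DocumentWorker.cs is not shown. If it is, Application.Exit() alone may not help, because it closes the application's windows and message loops but does not stop other foreground threads. A minimal sketch of the alternative, with the hypothetical helper name `CreateBackgroundWorker`:

```csharp
using System;
using System.Threading;

public static class WorkerThreads
{
    // Creates (but does not start) a worker thread marked as a background thread.
    // Background threads do not keep the process alive once all foreground
    // threads, such as the UI thread, have finished.
    public static Thread CreateBackgroundWorker(ThreadStart work)
    {
        var thread = new Thread(work);
        thread.IsBackground = true;
        return thread;
    }
}
```

Background threads are stopped abruptly when the process ends, so this suits workers whose unfinished downloads can simply be discarded.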

[Core code]


using System;
using System.Collections;
using System.Net;
using System.IO;
using System.Threading;

namespace Spider
{
	/// <summary>
	/// The main class for the spider. This spider can be used with the 
	/// SpiderForm form that has been provided. The spider is completely 
	/// self-contained. If you would like to use the spider with your own
	/// application just remove the references to m_spiderForm from this file.
	/// 
	/// The files needed for the spider are:
	/// 
	/// Attribute.cs - Used by the HTML parser
	/// AttributeList.cs - Used by the HTML parser
	/// DocumentWorker.cs - Used to "thread" the spider
	/// Done.cs - Allows the spider to know when it is done
	/// Parse.cs - Used by the HTML parser
	/// ParseHTML.cs - The HTML parser
	/// Spider.cs - This file
	/// SpiderForm.cs - Demo of how to use the spider
	/// 
	/// This spider is copyright 2003 by Jeff Heaton. However, it is
	/// released under a Limited GNU Public License (LGPL). You may 
	/// use it freely in your own programs. For the latest version visit
	/// http://www.jeffheaton.com.
	///
	/// </summary>
	public class Spider
	{
		/// <summary>
		/// The URL's that have already been processed.
		/// </summary>
		private Hashtable m_already;

		/// <summary>
		/// URL's that are waiting to be processed.
		/// </summary>
		private Queue m_workload;

		/// <summary>
		/// The first URL to spider. All other URL's must have the
		/// same hostname as this URL. 
		/// </summary>
		private Uri m_base;

		/// <summary>
		/// The directory to save the spider output to.
		/// </summary>
		private string m_outputPath;

		/// <summary>
		/// The form that the spider will report its 
		/// progress to.
		/// </summary>
		private SpiderForm m_spiderForm;

		/// <summary>
		/// How many URL's has the spider processed.
		/// </summary>
		private int m_urlCount = 0;

		/// <summary>
		/// When did the spider start working
		/// </summary>
		private long m_startTime = 0;

		/// <summary>
		/// Used to keep track of when the spider might be done.
		/// </summary>
		private Done m_done = new Done();		

		/// <summary>
		/// Used to tell the spider to quit.
		/// </summary>
		private bool m_quit;

		/// <summary>
		/// The status for each URL that was processed.
		/// </summary>
		enum Status { STATUS_FAILED, STATUS_SUCCESS, STATUS_QUEUED };


		/// <summary>
		/// The constructor
		/// </summary>
		public Spider()
		{
			reset();
		}

		/// <summary>
		/// Call to reset from a previous run of the spider
		/// </summary>
		public void reset()
		{
			m_already = new Hashtable();
			m_workload = new Queue();
			m_quit = false;
		}

		/// <summary>
		/// Add the specified URL to the list of URI's to spider.
		/// This is usually only used by the spider, itself, as
		/// new URL's are found.
		/// </summary>
		/// <param name="uri">The URI to add</param>
		public void addURI(Uri uri)
		{
			Monitor.Enter(this);
			if( !m_already.Contains(uri) )
			{
				m_already.Add(uri,Status.STATUS_QUEUED);
				m_workload.Enqueue(uri);
			}
			Monitor.Pulse(this);
			Monitor.Exit(this);
		}

		/// <summary>
		/// The URI that is to be spidered
		/// </summary>
		public Uri BaseURI 
		{
			get
			{
				return m_base;
			}

			set
			{
				m_base = value;
			}
		}

		/// <summary>
		/// The local directory to save the spidered files to
		/// </summary>
		public string OutputPath
		{
			get
			{
				return m_outputPath;
			}

			set
			{
				m_outputPath = value;
			}
		}

		/// <summary>
		/// The object that the spider reports its
		/// results to.
		/// </summary>
		public SpiderForm ReportTo
		{
			get
			{
				return m_spiderForm;
			}

			set
			{
				m_spiderForm = value;
			}
		}

		/// <summary>
		/// Set to true to request the spider to quit.
		/// </summary>
		public bool Quit
		{
			get
			{
				return m_quit;
			}

			set
			{
				m_quit = value;
			}
		}

		/// <summary>
		/// Used to determine if the spider is done, 
		/// this object is usually only used internally
		/// by the spider.
		/// </summary>
		public Done SpiderDone
		{
			get
			{
				return m_done;
			}

		}

		/// <summary>
		/// Called by the worker threads to obtain a URL
		/// to process. Blocks until a URL is available.
		/// </summary>
		/// <returns>The next URL to process.</returns>
		public Uri ObtainWork()
		{
			Monitor.Enter(this);
			while(m_workload.Count<1)
			{
				Monitor.Wait(this);
			}


			Uri next = (Uri)m_workload.Dequeue();
			if(m_spiderForm!=null)
			{
				m_spiderForm.SetLastURL(next.ToString());
				m_spiderForm.SetProcessedCount(""+(m_urlCount++));
				long etime = (System.DateTime.Now.Ticks-m_startTime)/10000000L;
				long urls = (etime==0)?0:m_urlCount/etime;
				m_spiderForm.SetElapsedTime( etime/60 + " minutes (" + urls +" urls/sec)" );
			}

			Monitor.Exit(this);
			return next;
		}

		/// <summary>
		/// Start the spider.
		/// </summary>
		/// <param name="baseURI">The base URI to spider</param>
		/// <param name="threads">The number of threads to use</param>
		public void Start(Uri baseURI,int threads)
		{
			// init the spider
			m_quit = false;

			m_base = baseURI;
			addURI(m_base);
			m_startTime = System.DateTime.Now.Ticks;
			m_done.Reset();
		
			// startup the threads

			// Note: i runs from 1 to threads-1, so this starts threads-1 workers.
			for(int i=1;i<threads;i++)
			{				
				DocumentWorker worker = new DocumentWorker(this);
				worker.Number = i;
				worker.start();
			}

			// now wait to be done

			m_done.WaitBegin();
			m_done.WaitDone();			
		}
	}
}
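
The addURI and ObtainWork methods above form a monitor-based producer/consumer queue with duplicate filtering. The following standalone sketch shows the same pattern using the `lock` statement and generic collections. The names `WorkQueue`, `Add`, and `Take` are hypothetical and chosen for illustration; this is not a drop-in replacement for the Spider class.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class WorkQueue
{
    private readonly object m_lock = new object();
    private readonly HashSet<Uri> m_seen = new HashSet<Uri>();  // every URI ever queued
    private readonly Queue<Uri> m_pending = new Queue<Uri>();   // URIs waiting for a worker

    // Queues the URI unless it has been seen before; returns true if it was queued.
    public bool Add(Uri uri)
    {
        lock (m_lock)
        {
            if (!m_seen.Add(uri)) return false;
            m_pending.Enqueue(uri);
            Monitor.Pulse(m_lock); // wake one waiting consumer
            return true;
        }
    }

    // Blocks until a URI is available, then removes and returns it.
    public Uri Take()
    {
        lock (m_lock)
        {
            while (m_pending.Count == 0)
            {
                Monitor.Wait(m_lock); // releases the lock while waiting
            }
            return m_pending.Dequeue();
        }
    }
}
```

Locking a private object instead of `this` keeps outside code from interfering with the monitor, and the `lock` statement releases the monitor even if an exception is thrown inside the block.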


