Encoding problems encountered when crawling web pages

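When crawling web pages with HttpClient, the downloaded bytes have to be decoded with the page's own encoding. That encoding is declared in the charset field of the HTML meta tag (the content attribute of the Content-Type meta tag), so to avoid garbled text we need to read the encoding from that charset field.
Approach: first download the page source in the default way and store it in a byte array; then call the findCharSet method, which uses a regular expression to extract the encoding from the meta tag; finally, build a new String (the final page source) from the byte array and the extracted encoding.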
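
The decode-twice idea behind this approach can be tried without any network access. The sketch below is only an illustration under stated assumptions: the class name RedecodeSketch, the sample page, and the choice of ISO-8859-1 for the first pass are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RedecodeSketch {
    // First pass: ISO-8859-1 maps each byte to exactly one char, so no byte is lost
    // and ASCII markup such as the meta tag stays readable.
    static String roughDecode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Final pass: rebuild the String from the same bytes with the declared charset.
    static String redecode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Hypothetical page body as a server might send it, encoded as UTF-8.
        // \u7f16\u7801 is a two-character Chinese word, written as escapes so the
        // example does not depend on the source file's own encoding.
        String page = "<meta charset=\"utf-8\"><p>\u7f16\u7801</p>";
        byte[] raw = page.getBytes(StandardCharsets.UTF_8);

        String rough = roughDecode(raw);
        // The meta tag survives the rough decode
        System.out.println(rough.contains("charset=\"utf-8\""));
        // The Chinese text does not: each of its bytes became a separate char
        System.out.println(rough.equals(page));
        // Decoding the same bytes with the declared charset restores the text
        System.out.println(redecode(raw, "utf-8").equals(page));
    }
}
```

The byte array is the only thing that has to be kept between the two passes; the roughly decoded String is used just to locate the charset declaration and is then discarded.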

The code is as follows:

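// Imports used by the methods below
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public static String findCharSet(String html){
		// Default encoding, returned when the page declares none
		String defaultCharset = "UTF-8";
		// Matches both <meta charset="..."> and
		// <meta http-equiv="Content-Type" content="text/html; charset=...">,
		// in any letter case, with or without quotes around the value
		String regex = "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9][\\w.:-]*)";
		Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
		Matcher m = p.matcher(html);

		// Use the first declaration found; group(1) is the charset name
		if(m.find()){
			return m.group(1);
		}
		return defaultCharset;
	}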
	
	public static byte[] crawl(String url){
		
		// Raw bytes of the page; stays null if the request fails
		byte[] result = null;
		
		// Create the HttpClientBuilder
		HttpClientBuilder httpClientBuilder = HttpClientBuilder.create();
		// Build the HttpClient
		CloseableHttpClient closeableHttpClient = httpClientBuilder.build();

		HttpGet httpGet = new HttpGet(url);
		try {
			// Set the socket (read) and connect timeouts, in milliseconds
			RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(5000).setConnectTimeout(5000).build();
			httpGet.setConfig(requestConfig);
			// Execute the GET request
			HttpResponse httpResponse = closeableHttpClient.execute(httpGet);
			// Get the response entity
			HttpEntity entity = httpResponse.getEntity();
			// Only read the body when the status is 200 OK
			if(httpResponse.getStatusLine().getStatusCode() == HttpStatus.SC_OK){
				// The entity is null when the response has no body
				if (entity != null) {
					result = EntityUtils.toByteArray(entity);
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		}finally {
			try {
				// Close the client and release its resources
				closeableHttpClient.close();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
		return result;
	}
	
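	public static String downPage(String url){
		
		// Skip URLs that point to documents, images, or archives rather than HTML
		String regex = "[\\S]*\\.xls|[\\S]*\\.jpg|[\\S]*\\.doc|[\\S]*\\.docx"+
				"|[\\S]*\\.rar|[\\S]*\\.pdf";
		if(url.matches(regex))
			return null;
		
		// Final page source
		String result = null;
		
		// Fetch the raw bytes of the page
		byte []b = crawl(url);
		if(b==null)
			return null;
		
		try {
			// First pass: ISO-8859-1 maps every byte to one char regardless of the
			// platform default charset, so the ASCII meta tag can always be found
			String s = new String(b, "ISO-8859-1");
			// Second pass: rebuild the String from the same bytes with the declared charset
			result = new String(b,findCharSet(s));
		} catch (UnsupportedEncodingException e) {
			// The declared charset is not supported by this JVM; result stays null
			e.printStackTrace();
		}
		
		return result;
	}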
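
The charset extraction step can be checked in isolation against the two common forms of the meta tag. The following is a minimal sketch, not part of the original code or of HttpClient: the class name MetaCharsetSketch, the extract helper, and the pattern itself are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSketch {
    // Assumed pattern: the charset value may stand alone (<meta charset="...">)
    // or sit inside a content="text/html; charset=..." attribute, quoted or not.
    static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9][\\w.:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first declared charset name, or the given fallback when none is found.
    static String extract(String html, String fallback) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        // Older form: charset inside the content attribute, upper-case tag name
        String oldForm = "<META http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        // HTML5 short form
        String shortForm = "<meta charset=\"utf-8\">";

        System.out.println(extract(oldForm, "UTF-8"));
        System.out.println(extract(shortForm, "UTF-8"));
        System.out.println(extract("<p>no meta tag here</p>", "UTF-8"));
    }
}
```

Capturing the name in a group avoids index arithmetic on the matched tag, and restricting the first character to a letter or digit keeps the result within what java.nio.charset.Charset accepts as a legal charset name.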

Version used: httpclient-4.3.1
