How to parse a large (50 GB) XML file in Java
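I am currently trying to use a SAX parser, but about three quarters of the way through the file it freezes completely. I have tried allocating more memory, but that brought no improvement. Is there any way to speed this up? Is there a better method?

I stripped the program down to the bare bones, so I now have the following code, and when run from the command line it is still not as fast as I would like. Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error at around the 700000 mark.

Main:
import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        ArrayList<Page> pages = XMLManager.getPages();
    }
}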
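XMLManager:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {
        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("..enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();
            // The handler accumulates every page in memory before returning
            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return pages;
    }
}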
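PageHandler:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {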
    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;
    public PageHandler(){
        super();
    }
    @Override
    public void startElement(String uri,String localName,String qName,Attributes attributes) throws SAXException {
        stringBuilder = new StringBuilder();
         if (qName.equals("page")){
            page = new Page();
            idSet = false;
        } else if (qName.equals("redirect")){
             if (page != null){
                 page.setRedirecting(true);
             }
        }
    }
     @Override
     public void endElement(String uri,String localName,String qName) throws SAXException {
         if (page != null && !page.isRedirecting()){
             if (qName.equals("title")){
                 page.setTitle(stringBuilder.toString());
             } else if (qName.equals("id")){
                 if (!idSet){
                     page.setId(Integer.parseInt(stringBuilder.toString()));
                     idSet = true;
                 }
             } else if (qName.equals("text")){
                 String articleText = stringBuilder.toString();
                 articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>"," "); //remove references
                 articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}"," "); //remove links underneath headings
                 articleText = articleText.replaceAll("(?s)==See also==.+"," "); //remove everything after see also
                 articleText = articleText.replaceAll("\\|"," "); //Separate multiple links
                 articleText = articleText.replaceAll("\n"," "); //remove new lines
                 articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]"," "); //remove all non alphanumeric except dashes and spaces
                 articleText = articleText.trim().replaceAll(" +"," "); //convert all multiple spaces to 1 space
                 Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                 Matcher matcher = pattern.matcher(articleText);
                 //find() returns false when the cleaned text is empty, so fall back to "None"
                 page.setSummaryText(matcher.find() ? matcher.group() : "None");
                 page.setText(articleText);
             } else if (qName.equals("page")){
                 pages.add(page);
                 page = null;
            }
        } else {
            page = null;
        }
     }
     @Override
     public void characters(char[] ch,int start,int length) throws SAXException {
         stringBuilder.append(ch,start,length); 
     }
     public ArrayList<Page> getPages() {
         return pages;
     }
}
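Solution

Your parsing code is probably working fine, but the volume of data you are loading is probably just too large to hold in memory in that ArrayList. You need some sort of pipeline that passes the data on to its actual destination without ever holding all of it in memory at once.

What I have sometimes done in this kind of situation is similar to the following.

Create an interface for processing a single element:
public interface PageProcessor {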
    void process(Page page);
} 
Supply an implementation of this interface to the PageHandler through its constructor:
public class Read {
    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}
public class XMLManager {
    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);
            parser.parse(file,pageHandler);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
} 
Send the data to this processor instead of putting it in the list:
public class PageHandler extends DefaultHandler {
    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;
    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }
    @Override
    public void startElement(String uri,String localName,String qName,Attributes attributes) throws SAXException {
         //Unchanged from your implementation
    }
    @Override
    public void characters(char[] ch,int start,int length) throws SAXException {
         //Unchanged from your implementation
    }
    @Override
    public void endElement(String uri,String localName,String qName) throws SAXException {
            //  Code that needs no change is elided here
            } else if (qName.equals("page")){
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
} 
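Since the endElement body above is partly elided, the following is a minimal, self-contained sketch of the same callback idea that can be run against a tiny in-memory document. Several things here are assumptions and not part of the original post: the class name StreamingDemo, the simplified Page stand-in (the real Page class is not shown), the parse helper, the sample XML, and the use of a Java 8 lambda for PageProcessor. The redirect handling is also simplified so that a redirecting page is skipped when its closing tag is reached, and the text clean-up is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingDemo {

    // Hypothetical, simplified stand-in for the Page class from the post.
    static class Page {
        String title;
        int id;
        boolean redirecting;
    }

    interface PageProcessor {
        void process(Page page);
    }

    static class PageHandler extends DefaultHandler {
        private final PageProcessor processor;
        private final StringBuilder text = new StringBuilder();
        private Page page;
        private boolean idSet;

        PageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // reuse one buffer for the text of each element
            if (qName.equals("page")) {
                page = new Page();
                idSet = false;
            } else if (qName.equals("redirect") && page != null) {
                page.redirecting = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (page == null) {
                return;
            }
            if (qName.equals("title")) {
                page.title = text.toString();
            } else if (qName.equals("id") && !idSet) {
                // Only the first <id> inside a page is the page id
                page.id = Integer.parseInt(text.toString().trim());
                idSet = true;
            } else if (qName.equals("page")) {
                if (!page.redirecting) {
                    processor.process(page); // hand the page off right away
                }
                page = null;                 // keep no reference to it
            }
        }
    }

    // Hypothetical helper: parses XML held in a String, for demonstration only.
    static void parse(String xml, PageProcessor processor) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new PageHandler(processor));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki>"
            + "<page><title>A</title><id>1</id><revision><id>10</id></revision></page>"
            + "<page><title>B</title><redirect title=\"A\"/><id>2</id></page>"
            + "<page><title>C</title><id>3</id></page>"
            + "</mediawiki>";
        // Prints "1 A" and "3 C"; the redirecting page B is skipped
        parse(xml, page -> System.out.println(page.id + " " + page.title));
    }
}
```

The handler holds at most one Page at a time, so memory use no longer grows with the number of pages in the file. For the real dump, the same handler would be passed to parser.parse(file, pageHandler) as in XMLManager.load.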
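Of course, you could make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically sending the list off for processing and clearing it.

Or (perhaps better) you could implement the PageProcessor interface as defined here and build the logic into it that buffers the data and sends it on for further processing in chunks.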
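The second option, a PageProcessor that buffers pages and forwards them in chunks, might look roughly like the sketch below. The names BatchingDemo and BatchingPageProcessor, the batch size, the Consumer-based sink, and the minimal Page stand-in are all assumptions made for illustration; none of them come from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingDemo {

    // Hypothetical minimal stand-in for the Page class from the post.
    static class Page {
        final int id;

        Page(int id) {
            this.id = id;
        }
    }

    interface PageProcessor {
        void process(Page page);
    }

    // Buffers incoming pages and hands them downstream in fixed-size batches.
    static class BatchingPageProcessor implements PageProcessor {
        private final int batchSize;
        private final Consumer<List<Page>> sink;
        private List<Page> buffer = new ArrayList<>();

        BatchingPageProcessor(int batchSize, Consumer<List<Page>> sink) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            this.batchSize = batchSize;
            this.sink = sink;
        }

        @Override
        public void process(Page page) {
            buffer.add(page);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        // Call once after parsing finishes so the last partial batch is not lost.
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            List<Page> batch = buffer;
            buffer = new ArrayList<>(); // fresh list, so the sink may keep the old one
            sink.accept(batch);
        }
    }

    public static void main(String[] args) {
        BatchingPageProcessor processor = new BatchingPageProcessor(
            3, batch -> System.out.println("batch of " + batch.size()));
        for (int i = 1; i <= 7; i++) {
            processor.process(new Page(i));
        }
        processor.flush();
        // Prints "batch of 3", "batch of 3", "batch of 1"
    }
}
```

Keeping the buffering inside the processor leaves PageHandler unchanged. The one extra obligation is calling flush() after parser.parse returns, because the final batch is usually smaller than the batch size.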
                  
