Loading the Wiktionary XML data dump into a MySQL database using PHP

Question:

Alright, I'm just trying to parse the Wiktionary data dump provided by Wikimedia.

My intention is to parse that XML data dump into a MySQL database. I couldn't find proper documentation on the structure of this XML. I also can't open the file to look at it, because it is in fact really huge (~1 GB).

I thought of parsing it with a PHP script, but I don't know enough about the XML structure to proceed. So if anyone has already parsed it into MySQL using PHP (or knows of a tool that does), please share the details. If there is nothing in PHP, other methods are fine too.

I followed this post (http://www.igrec.ca/lexicography/installing-a-local-copy-of-wiktionary-mysql/) but it didn't work out. :( If anybody has succeeded with this process, please help. Thanks in advance.

Those files can be parsed in PHP with XMLReader operating on a compress.bzip2:// stream. As an example, the structure of the file you have looks like this (peeking into roughly the first 3000 elements):

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (196)
    |-title (196)
    |-ns (196)
    |-id (196)
    |-restrictions (2)
    |-revision (196)
    | |-id (196)
    | |-parentid (194)
    | |-timestamp (196)
    | |-contributor (196)
    | | |-username (182)
    | | |-id (182)
    | | \-ip (14)
    | |-comment (183)
    | |-text (195)
    | |-sha1 (195)
    | |-model (195)
    | |-format (195)
    | \-minor (99)
    \-redirect (5)
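
A summary like the one above can be gathered by letting XMLReader walk the stream and counting each element path it meets. The following is only a minimal sketch of that idea, not the script that produced the tree: the function name `countElementPaths` and the `$maxElements` parameter are made up for illustration, and reading through `compress.bzip2://` assumes PHP's bz2 extension is loaded.

```php
<?php
declare(strict_types=1);

/**
 * Count how often each element path occurs in an XML stream.
 * Hypothetical helper: stops after $maxElements start tags so that
 * a huge dump can be sampled without reading it to the end.
 *
 * @return array<string,int> map of "parent/child" path => occurrences
 */
function countElementPaths(string $uri, int $maxElements = 3000): array
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    $stack  = [];  // names of the currently open elements
    $counts = [];
    $seen   = 0;

    while ($seen < $maxElements && $reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        // Cut the stack back to the current depth, then push this element.
        array_splice($stack, $reader->depth);
        $stack[] = $reader->name;

        $path = implode('/', $stack);
        $counts[$path] = ($counts[$path] ?? 0) + 1;
        $seen++;
    }

    $reader->close();
    return $counts;
}

// Usage (the file name is a placeholder for whichever dump was downloaded):
// print_r(countElementPaths('compress.bzip2://enwiktionary-latest-pages-articles.xml.bz2'));
```

Only the stack of open element names is kept in memory, so the cost depends on nesting depth rather than file size.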

The full file is larger than that sample, so it takes quite some time to process. Alternatively, do not work on the XML dumps at all and just import the SQL dumps with the mysql command-line tool. SQL dumps are available on the site as well; see the list of all dump formats for the English Wiktionary.

The overall file was a little larger, with more than 66 849 000 elements:

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (3993913)
    |-title (3993913)
    |-ns (3993913)
    |-id (3993913)
    |-restrictions (552)
    |-revision (3993913)
    | |-id (3993913)
    | |-parentid (3572237)
    | |-timestamp (3993913)
    | |-contributor (3993913)
    | | |-username (3982087)
    | | |-id (3982087)
    | | \-ip (11824)
    | |-comment (3917241)
    | |-text (3993913)
    | |-sha1 (3993913)
    | |-model (3993913)
    | |-format (3993913)
    | \-minor (3384811)
    |-redirect (27340)
    \-DiscussionThreading (4698)
      |-ThreadSubject (4698)
      |-ThreadPage (4698)
      |-ThreadID (4698)
      |-ThreadAuthor (4698)
      |-ThreadEditStatus (4698)
      |-ThreadType (4698)
      |-ThreadSignature (4698)
      |-ThreadParent (3605)
      |-ThreadAncestor (3605)
      \-ThreadSummaryPage (11)
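
With the structure known, the pages can be streamed one at a time and written to MySQL. What follows is a rough sketch under stated assumptions rather than a finished importer: the function names `readPages` and `importPages`, the table `wiktionary_page` and its columns are all hypothetical, only `id`, `ns`, `title` and the revision `text` from the tree above are stored, and the bz2 and PDO MySQL extensions are assumed to be available.

```php
<?php
declare(strict_types=1);

/**
 * Stream <page> elements from a MediaWiki XML dump, one at a time.
 * Yields arrays with the page id, namespace, title and revision text.
 */
function readPages(string $uri): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    $doc = new DOMDocument();

    // Skip ahead to the first <page> element.
    while ($reader->read() && $reader->name !== 'page') {
    }

    while ($reader->name === 'page') {
        // Expand only the current page into a small in-memory tree.
        $page = simplexml_import_dom($doc->importNode($reader->expand(), true));
        yield [
            'id'    => (int) $page->id,
            'ns'    => (int) $page->ns,
            'title' => (string) $page->title,
            'text'  => (string) $page->revision->text,
        ];
        // Jump to the next sibling <page>, skipping the subtree just read.
        $reader->next('page');
    }
    $reader->close();
}

/**
 * Insert pages through one prepared statement, committing in batches.
 * Assumed (hypothetical) table:
 *   CREATE TABLE wiktionary_page (
 *     id INT UNSIGNED PRIMARY KEY, ns INT NOT NULL,
 *     title VARCHAR(255) NOT NULL, wikitext MEDIUMTEXT
 *   ) DEFAULT CHARSET = utf8mb4;
 */
function importPages(PDO $pdo, iterable $pages, int $batchSize = 1000): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO wiktionary_page (id, ns, title, wikitext) VALUES (?, ?, ?, ?)'
    );
    $count = 0;
    $pdo->beginTransaction();
    foreach ($pages as $p) {
        $stmt->execute([$p['id'], $p['ns'], $p['title'], $p['text']]);
        if (++$count % $batchSize === 0) {
            // Commit periodically so a single transaction does not grow unbounded.
            $pdo->commit();
            $pdo->beginTransaction();
        }
    }
    $pdo->commit();
    return $count;
}

// Usage (DSN, credentials and file name are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=wiktionary;charset=utf8mb4', 'user', 'secret',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// importPages($pdo, readPages('compress.bzip2://enwiktionary-latest-pages-articles.xml.bz2'));
```

Expanding one `page` at a time keeps memory use roughly constant regardless of dump size, and batching the commits tends to be much faster than committing every row. For a full copy of the wiki, the SQL dumps mentioned above are probably still the quicker route.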