How to bulk-convert text strings to unique IDs with relationships in MySQL

Question:

I am working on a movie database, and before I start on the PHP side I want to make sure the database is solid.

I mass-imported data scraped from the web, so the genre and actor columns currently hold plain text strings. I want to convert them to unique IDs and create relationship tables.

Essentially how it is now:

Movie Table

Movie ID - Movie name - Genres - Actors
1        - Inception  - Sci Fi - Leonardo Di Caprio, Ellen Page

How I want it:

Movie Table

Movie ID - Movie Name
1        - Inception

Genre Table

Genre ID - Genre Name
1        - Sci Fi

Actor Table

Actor ID - Actor Name
1        - Leonardo Di Caprio
2        - Ellen Page

Genre Relationships Table

Movie ID - Genre ID
1        - 1

Actor Relationships Table

Movie ID - Actor ID
1        - 1
1        - 2

If it were just the genres I could do this by hand, but with thousands of movies and actors I am struggling to come up with a simple approach to convert all this data.

I have a CSV dump of all the data and figure it could be done with a PHP script that imports it in this format. I don't know whether it is also possible to run SQL commands that restructure the data this way (the database has over 200,000 movies).

Any hints or ideas on how to accomplish this would be much appreciated!

Answer:

Something like this will roughly work:

For each record
    Do
       Select from genre table using genre string to get genre ID
       If select did not return ID, INSERT new genre string to add new genre ID
    While select did not return ID
    For each actor
       Do
         Select from actor table using actor string to get actor ID
         If select did not return ID, INSERT new actor string to add new actor ID
       While select did not return ID
  ...
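
The loop above can be sketched in runnable form. This is only a minimal illustration, not a definitive implementation: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the table names (`movie`, `genre`, `actor`, `movie_genre`, `movie_actor`) and helper functions are hypothetical names chosen for the example. It also assumes each text column is a comma-separated list, as in the sample row from the question.

```python
import sqlite3

# Hypothetical schema for illustration; SQLite stands in for MySQL here.
SCHEMA = """
CREATE TABLE movie (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE genre (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE movie_genre (movie_id INTEGER, genre_id INTEGER,
                          PRIMARY KEY (movie_id, genre_id));
CREATE TABLE movie_actor (movie_id INTEGER, actor_id INTEGER,
                          PRIMARY KEY (movie_id, actor_id));
"""


def split_names(text):
    """Split a comma-separated string into trimmed, non-empty names."""
    return [part.strip() for part in text.split(",") if part.strip()]


def get_or_create_id(cur, table, name):
    """Select the ID for `name`; insert a new row if the select finds nothing."""
    if table not in ("genre", "actor"):
        # The table name is interpolated into SQL, so only fixed names are allowed.
        raise ValueError("unexpected table: " + table)
    cur.execute(f"SELECT id FROM {table} WHERE name = ?", (name,))
    row = cur.fetchone()
    if row is not None:
        return row[0]
    cur.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,))
    return cur.lastrowid


def normalize(conn, rows):
    """Load (movie_id, movie_name, genres, actors) rows into the split tables."""
    conn.executescript(SCHEMA)
    cur = conn.cursor()
    for movie_id, movie_name, genres, actors in rows:
        cur.execute("INSERT INTO movie (id, name) VALUES (?, ?)",
                    (movie_id, movie_name))
        for genre in split_names(genres):
            genre_id = get_or_create_id(cur, "genre", genre)
            cur.execute("INSERT OR IGNORE INTO movie_genre VALUES (?, ?)",
                        (movie_id, genre_id))
        for actor in split_names(actors):
            actor_id = get_or_create_id(cur, "actor", actor)
            cur.execute("INSERT OR IGNORE INTO movie_actor VALUES (?, ?)",
                        (movie_id, actor_id))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # The sample row from the question.
    normalize(conn, [(1, "Inception", "Sci Fi",
                      "Leonardo Di Caprio, Ellen Page")])
    print(conn.execute("SELECT id, name FROM actor ORDER BY id").fetchall())
    print(conn.execute("SELECT * FROM movie_actor ORDER BY actor_id").fetchall())
```

The unique constraint on `name` is what keeps each string mapped to a single ID. In MySQL a similar effect could likely be had with a unique index plus `INSERT IGNORE`, but that variant is not shown or tested here.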

But there will be problems:

  • Movies with the same name
  • Different spellings of genre names (sf, sci fi, science fiction)
  • Different spellings of actor names. On IMDb, for example, an actor might be credited as Mike or Michael, with or without a middle initial, and women might use their married name in some movies but not others
  • Actors with the same name

To fix that you'd need access to some existing database where you can get the same ID code for any variation of an actor's name, for a genre name, and for movies when supplying an actor list.
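
For the genre spellings, a small hand-maintained alias table may be enough in the absence of such a database. The sketch below is a hypothetical illustration in Python: the `GENRE_ALIASES` mapping and the `canonical_genre` helper are assumptions made for the example, and the aliases would have to be collected from your own data. It does nothing for the harder cases of actors or movies that share a name.

```python
# Hypothetical alias table: maps known variant spellings to one canonical name.
GENRE_ALIASES = {
    "sf": "science fiction",
    "sci fi": "science fiction",
    "sci-fi": "science fiction",
}


def canonical_genre(raw):
    """Lowercase, collapse whitespace, then map known aliases to one spelling."""
    key = " ".join(raw.lower().split())
    # Unknown genres pass through in their cleaned-up form.
    return GENRE_ALIASES.get(key, key)
```

Running each genre string through a function like this before looking up its ID means that "sf", "Sci Fi" and "science fiction" would all end up sharing one row in the genre table.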