Repeated DNA Sequences - leetcode
Repeated DNA Sequences -- leetcode
Ascii码表中,A: 0x41, C: 0x43, G: 0x47, T: 0x54. 可以用编码的后三bit位来区分这个字母。
区分4个字母,其实2个bit位也足够。 但这里用到3个,是为了运算符方便。
这里是10个字母,3个bit位表示一个,总共需要30个bit位。也在int表示范围内。由于3个bit位能准确表示这4个字符,故这里算出的hash值能唯一代表一种字符串,不需要考虑冲突的情况。
https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
基本思路:
统计字串出现次数,一般会用滚动hash。Ascii码表中,A: 0x41, C: 0x43, G: 0x47, T: 0x54. 可以用编码的后三bit位来区分这个字母。
区分4个字母,其实2个bit位也足够。 但这里用到3个,是为了运算符方便。
这里是10个字母,3个bit位表示一个,总共需要30个bit位。也在int表示范围内。由于3个bit位能准确表示这4个字符,故这里算出的hash值能唯一代表一种字符串,不需要考虑冲突的情况。
class Solution { public: vector<string> findRepeatedDnaSequences(string s) { vector<string> ans; if (s.size() < 10) return ans; unordered_map<int, int> count; int hash = 0; for (int i=0; i<9; i++) hash = hash<<3 | s[i] & 7; const int mask = ~(7 << 30); for (int i=9; i<s.size(); i++) { hash = hash<<3 & mask | s[i] & 7; if (++count[hash] == 2) ans.push_back(s.substr(i-9, 10)); } return ans; } };
https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c
版权声明:本文为博主原创文章,未经博主允许不得转载。