>>> 'x\u2009 a'.split()
['x', 'a']
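(For context: `str.split()` with no arguments splits on any character for which `str.isspace()` is true, and U+2009 THIN SPACE is one of them — a quick check:)

```python
# U+2009 (THIN SPACE) is Unicode whitespace, so str.split() with no
# arguments treats it as a separator just like an ASCII space.
print('\u2009'.isspace())   # True
print('x\u2009a'.split())   # ['x', 'a']
```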
# incorrect: in bytes mode, `\S` doesn't know about Unicode whitespace
>>> list(re.finditer(br'\S+', 'x\u2009 a'.encode()))
[<re.Match object; span=(0, 4), match=b'x\xe2\x80\x89'>, <re.Match object; span=(7, 8), match=b'a'>]
# correct: in Unicode (str) mode
>>> list(re.finditer(r'\S+', 'x\u2009 a'))
[<re.Match object; span=(0, 1), match='x'>, <re.Match object; span=(5, 6), match='a'>]

import mmap, codecs
from collections import Counter
def word_count(filepath):
    freq = Counter()
    decode = codecs.getincrementaldecoder('utf-8')().decode
    carry = ''  # word fragment that may continue into the next chunk
    with open(filepath, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for chunk in iter(lambda: mm.read(65536), b''):
            text = carry + decode(chunk)
            words = text.split()
            # hold back a word that may straddle the chunk boundary
            if words and not text[-1].isspace():
                carry = words.pop()
            else:
                carry = ''
            freq.update(words)
    freq.update((carry + decode(b'', final=True)).split())
    return freq

Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those are disposed of on each pass through the loop.) I don't know that there's really a way around that. Given that the file is barely a kilobyte, I rather doubt the illustrated techniques are going to move the needle.
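One way to put a number on that suspicion is `tracemalloc`, which reports Python-level allocations. A rough sketch with synthetic data (the word shapes and counts here are made up for illustration; actual figures vary by platform and Python version):

```python
import tracemalloc
from collections import Counter

tracemalloc.start()
# build a Counter roughly the shape a small word-count would produce:
# 1000 distinct keys, 100k total observations
freq = Counter(f'word{i % 1000}' for i in range(100_000))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(len(freq))     # 1000 distinct keys
print(current > 0)   # some bytes attributed to the Counter and its keys
```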
In fact, it looks as though the entire data structure (whether a dict, Counter, etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
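Separately, the incremental decoder in the snippet above is what makes chunked reads safe: a multi-byte UTF-8 sequence split across two reads would make a plain `bytes.decode()` raise, while the incremental decoder buffers the partial sequence. A small demonstration:

```python
import codecs

decode = codecs.getincrementaldecoder('utf-8')().decode
data = '€'.encode('utf-8')      # b'\xe2\x82\xac', three bytes

# feed the bytes in two chunks that split the character mid-sequence
first = decode(data[:2])        # incomplete sequence: returns ''
rest = decode(data[2:])         # completes it: returns '€'
tail = decode(b'', final=True)  # nothing left buffered: returns ''

print(repr(first + rest + tail))   # '€'
```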
edit: OP's fully native C++ version using Pystd