collect top 1000 names from forums

aim:
find some generic usernames that are popular across many forums.
generic means not unique, eg. slayer is generic, urbanadventurer is not.

hypothesis:

simple, generic usernames are registered earlier
names become more complex as the userbase grows
eg.

mike, dragon will appear in the top a
mike79, red_dragon/black_dragon will appear in the top b
mike_atlanta_79 and white_dragon_eyes will appear in the top c

where a, b and c are standard deviations (or whatever) apart.

----------------------------------------------------------------------


phpBB

sign up to http://www.phpbb.com/community/
	user:	annaboomkey@mailinator.com / M8R-7135mr @mailinator.com 
	pass:	xe3UlT44
	prove im human

members list
http://www.phpbb.com/community/memberlist.php?sid=a11c59c59669bec6d809a606fdf525ba

note: members are listed chronologically
page to collect:
http://www.phpbb.com/community/memberlist.php?
http://www.phpbb.com/community/memberlist.php?start=25
http://www.phpbb.com/community/memberlist.php?start=50
times 40 

----------------------------------------------------------------------
* registration sux
some phpBB forums dont' require registration for memberlists



#find a selection of phpBB forums:
./gggooglescan -d 2 -l phpbb-forums.txt -s 3 -e ./wordlist "phpbb forum"

# scan them with whawteb to check whether they allow unregistered users to view the memberslist
~/projects/whatweb/whatweb-new/whatweb --log-brief phpbb-forums3.txt  --custom-plugin ":text=>'viewprofile'"  -U 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1' --url-suffix "/memberlist.php" -i ./phpbb-forums2.txt

# make a list of those that allow unregistered users to view the memberslist
grep Custom-Plugin phpbb-forums3.txt | cut -d ' ' -f 1 | sort -u > phpbb-forums4.txt

# write a scraper

# make the list of targets
cat phpbb-forums4.txt | sed 's/memberlist.php//g' > phpbb-forums5.txt


-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
# make 2nd list
./gggooglescan -s 2 -l phpbb-forums2.log -d 1 -n 100 -e ./wordlist "memberlist goto page"

# got over 500,000 URLs
wc -l set2.log 
522300 set2.log

set2.log
http://forums.next-gen.biz/viewforum.php?f=7&amp;sid=e4f0073635176ab70254f737ac650b78
http://www.pinkbike.com/forum/listthreads/?forumid=19
http://www.lthforum.com/bb/viewforum.php?f=32
http://www.lthforum.com/bb/index.php


cat set2.log | sed 's/\/[^\/]*$//g' | sort |uniq > set2b.log
http://0000free.com/forum
http://100amazonia.trustpass.alibaba.com/product/110007899-101600547
http://100years.forumotion.net
http://101science.com
http://10feettallincdr.16.forumer.com

# down to 45,000
#wc -l set2b.log 
46884 set2b.log


 ./whatweb -p Custom-Plugin,Title --log-brief phpbb-forums.txt --custom-plugin ":text=>'>Goto page <b'" --follow-redirect never  -U 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1' --url-suffix "/memberlist.php" -i ~/projects/username-anarchy/forum-names/find-forums/set2b.log 

grep Custom-Plugin phpbb-forums.txt| cut -d ' ' -f 1 | sed 's/memberlist.php//g' > /home/dc/projects/username-anarchy/forum-names/find-forums/set2c.log
# 11041 set2c.log

run scraper

# find duplicates in files 
for i in `ls`; do echo $i; sort $i | uniq -c | grep -v " 1 "; done 

# fix dups manually or move to unused folder



require 'fileutils'
require 'md5'
hashes={};Dir.entries(".").map {|x| [x,Digest::MD5.file(x).hexdigest] unless ['.','..'].include?(x) or File.directory?(x) }.compact.map {|k,v| hashes[v].nil? ? hashes[v] = [k] : hashes[v] << k }
hashes.select {|k,v| v.size>1}.map {|k,v| v[1..-1] }.flatten.compact.each {|f| FileUtils.move(f,'./duplicates/') }


# remove dups in lists
# cp into a new folder
for i in `ls`; do cat "$i" | tr "[:upper:]" "[:lower:]" | sort -u > $i.2; mv $i.2 $i; done


# find all common names. more than x occurances
cat *|tr '[:upper:]' '[:lower:]' |sort|uniq -c|sort -nr > common-forum-names.txt
cat common-forum-names.txt | sed 's/^[ ]*//g' | sed 's/[ ]/,/' > common-forum-names.csv



it would be better to automatically detect and control for spamming names.
eg. have they been banned? usually have 0 posts?

# check which of hte forum names are for spamming
for 1st 100, check which are spamming names
stivrichardoff
valar2006

checking anamolous names:

stivrichardoff	joined forums between 25-03-07 - 29-03-07. no posts. Richard Stiveson. user website: www.orgnews.org 
valar2006	joined 04-07-05 - 07-11-2005. link spamming for porn, xanax
hellcoming	joined 01-2007. ?
jtfoe1974	joined 02/2006. user website: www.tempus-fugit.info
barbafim	joined april-07. ?
uropian		joined 04-2007 link spamming




