Monday, August 3, 2009

Anonymizing CCK08 forum network files

I hit a wall last week when I tried to anonymize the forum-level network VNA (NetDraw) files. First I tried converting them to ODS (OpenOffice.org Calc) files and using VLOOKUP, but that only worked when the search values for the lookup table were numbers. Since I don't have Excel I couldn't use VB macros, and I don't know how to write OpenOffice.org macros. I am not a programmer and I don't know SQL, and bash scripting with sed and awk makes my hair stand on end. I've forgotten my elementary Perl and Python as well.

My problem is that I have to anonymize each forum-level subset network using a lookup table built from the vertex labels of the union of all networks, which was anonymized in Pajek. If I anonymized each forum network file separately in Pajek, the files would not be comparable, because Pajek would renumber the vertices of each file from 1 to n. I need them to be comparable so that I can track egos across forums. For example:

original file
"Roel Cantada" "Juan dela Cruz"
...

lookup table
"Roel Cantada" "v1"
"Juan dela Cruz" "v2"
...

target anonymized file
"v1" "v2"
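
The idea in the example above can be sketched in a few lines of Python. This is only an illustration, not the script I ended up with: the third name and the two small edge lists are made up, and the function name anonymize_edges is hypothetical. It shows that one shared lookup table gives the same person the same code in every forum.

```python
# Hypothetical lookup table, built once from the union of all forum networks.
# "Maria Clara" is an invented label used only for illustration.
lookup = {"Roel Cantada": "v1", "Juan dela Cruz": "v2", "Maria Clara": "v3"}

# Two made-up forum edge lists that share one participant.
forum_a = [("Roel Cantada", "Juan dela Cruz")]
forum_b = [("Maria Clara", "Roel Cantada")]

def anonymize_edges(edges, lookup):
    """Replace both vertex labels of every edge with codes from the shared table."""
    return [(lookup[first], lookup[second]) for first, second in edges]

anon_a = anonymize_edges(forum_a, lookup)
anon_b = anonymize_edges(forum_b, lookup)
print(anon_a)  # [('v1', 'v2')]
print(anon_b)  # [('v3', 'v1')]
```

"Roel Cantada" is v1 in both results, which is what makes it possible to follow the same ego from one forum to another.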


I couldn't find a tool for the purpose. rpl appears to require me to enter the 537 codes one by one. So I ended up writing a Python script. In the script, anonall is the lookup table, origfile is the VNA file to convert, and newfile is the anonymized text file. The Python file needs to be in the same folder as the VNA files, and the name of each VNA file has to be typed in manually. In addition, the output needs cleanup, because the header lines of the VNA come out wrapped in double quotes. But it's better than manually anonymizing all the network files. Here it is.

# Written for Python 2 (uses raw_input and dict.iteritems)
import csv

origfile = raw_input('csv filename to anonymize: ')
newfile = origfile + 'new.txt'
table = {}
anonall = csv.reader(open('anonall_code.csv'), delimiter=' ', quotechar='"')
forum1 = csv.reader(open(origfile), delimiter=' ', quotechar='"')
output = open(newfile, 'a')

# Build the lookup table: original label -> anonymized code
for row in anonall:
    table[row[0]] = row[-1]

for row in forum1:
    # Replace the first vertex label (fourth field from the end)
    for i, j in table.iteritems():
        if i in row[-4]:
            row[-4] = j
    # Replace the second vertex label (third field from the end)
    for i, j in table.iteritems():
        if i in row[-3]:
            row[-3] = j
    output.write('"' + row[-4] + '" "' + row[-3] + '" "' + row[-2] + '" "' + row[-1] + '"\n')

output.close()
Python file: http://www.mediafire.com/file/3ttzggdnmtk/anonymizecck08.py.zip (0.43KB)
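
For anyone on Python 3, here is a rough sketch of the same idea. It is not the script in the zip file, and it rests on a few assumptions: data rows have exactly four space-delimited fields (as the script above implies), anything else is a header or other line that can be copied through unquoted, and a label must match a lookup key exactly rather than merely contain it. The function names load_table and anonymize are my own, and the sample lines are made up.

```python
import csv
import io

def load_table(lines):
    """Build {original label: code} from space-delimited, double-quoted lines."""
    reader = csv.reader(lines, delimiter=' ', quotechar='"')
    return {row[0]: row[-1] for row in reader if row}

def anonymize(lines, table, out):
    """Copy lines to out, replacing the first two fields of four-field rows."""
    for row in csv.reader(lines, delimiter=' ', quotechar='"'):
        if len(row) == 4:
            # Exact-match lookup; labels missing from the table stay unchanged.
            row[0] = table.get(row[0], row[0])
            row[1] = table.get(row[1], row[1])
            out.write(' '.join('"%s"' % field for field in row) + '\n')
        else:
            # Assumed to be a header or other non-data line: written unquoted.
            out.write(' '.join(row) + '\n')

# Small in-memory demonstration with made-up content instead of real files.
table = load_table(io.StringIO('"Roel Cantada" "v1"\n"Juan dela Cruz" "v2"\n'))
result = io.StringIO()
anonymize(io.StringIO('*Tie data\n"Roel Cantada" "Juan dela Cruz" "1" "2"\n'),
          table, result)
print(result.getvalue())
```

With real files, the same two functions can be given objects returned by open() in place of the io.StringIO ones. A header line that happens to have four fields would still come out quoted, so this sketch does not remove the cleanup step mentioned above.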

I'll be sharing the VNA files for the individual forums within this week.


 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.