Open Educational Tools: Workflow of the merger and anonymization of the CCK09 forum datasets

Thursday, February 3, 2011

This is my workflow for anonymizing and merging the VNA datasets of the CCK09 Moodle forum. I wrote some Python scripts to automate some of the steps, but the process still requires some manual work with a spreadsheet.

Python Scripts

Download the Python scripts:

http://www.mediafire.com/file/u7mo4524kfkssgg/datasetscripts09.zip (3.11 kb)

Environment:

Ubuntu Lucid amd64
Python 2.6.5
Moodle SNA Tool

I do not know if the extraction script will work with SNAPP.

Notes common to all scripts

1. Changing the default encoding (this is needed for the scripts to work)
Add the following lines to /etc/python2.6/sitecustomize.py

#change default python system encoding
import sys
sys.setdefaultencoding("utf-8")

My sitecustomize.py is included in the downloadable archive above. Back up your original file as sitecustomize.py.orig and place mine in /etc/python2.6.
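If editing sitecustomize.py is not an option on your machine, a script can decode its input explicitly instead. The following is only a minimal sketch and is not taken from the downloadable scripts: it assumes the files are UTF-8, and the helper name read_text is hypothetical. io.open exists in Python 2.6 and later.

```python
import io


def read_text(path, encoding="utf-8"):
    """Return the contents of path, decoded with an explicit encoding."""
    # io.open decodes while reading, so the system default encoding is not used
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()
```

With this approach the encoding is stated where the file is opened, so the script does not depend on a system-wide setting.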

2. Setting the permissions of the scripts

Right-click the Python script and select Properties.
In the Permissions tab of Properties, enable Execute: Allow executing file as program.

Then open a terminal and run the script with a command like this
./extractvna.py

If you cannot change the permissions, issue the following command in a terminal instead

python extractvna.py

3. A possible error
If you get an error like this:

Traceback (most recent call last):
File "anoncck09.py", line 29, in <module>
inputfile = open(vnafilename, 'rb')
IOError: [Errno 2] No such file or directory:

That means you forgot to set nfile, the number of input files, to the correct number of files to process.
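A quick check before running a script can show which numbered input files are missing. This is only a sketch, not part of the downloadable scripts: the function name missing_inputs is hypothetical, and it assumes the inputs are named 1, 2, ... n plus an optional extension, as described in this post. Note that it takes a plain directory path, not a file:/// URL.

```python
import os


def missing_inputs(directory, nfile, extension=""):
    """Return the expected input paths (1..nfile) that do not exist."""
    expected = [os.path.join(directory, str(number) + extension)
                for number in range(1, nfile + 1)]
    return [path for path in expected if not os.path.isfile(path)]
```

An empty list means every file from 1 to nfile is present; otherwise either lower nfile or add the missing files.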

Extracting VNA files from the saved HTML page output of the Moodle SNA Tool

1. Browse to a Moodle forum.
2. Click the bookmark link to the Moodle SNA Tool.
3. Save each page containing the exported VNA data as 1.html, 2.html, 3.html and so on in a folder. I number my folders forum1, forum2 etc. Naming the files with the numbers 1 to n is important.
4. Download and extract extractvna.py into the folder where the VNA files will be saved. Let's say folder vna1.
5. Open extractvna.py in a text editor and change the following values.

idirectory = 'file:///home/juan/Documents/CCK09/VNA/forum1/' #input directory as a local file URL. Do not use a plain path such as /home/juan... because it will result in a Python error. Always put a trailing slash / at the end. There must be exactly 3 slashes after file: i.e. file:///

fextension = '.html' #file extension. Make it empty ('') if the files have no extension. Note: NO space between the quotes.

nfile = 1 #number of input files

6. Open a terminal and issue the command
./extractvna.py

or

python extractvna.py

7. The terminal should scroll with the VNA data being processed. If an error occurs, leave me a comment about it, and make sure you have the same Python version. If it is successful it will say "finished extracting text".

8. You will have files named 1 to n, without the vna extension, in your folder.
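The idirectory value in step 5 is the easiest one to get wrong, so here is a small sketch of how a folder path maps to the file:/// URLs of the numbered input pages. It is an illustration only, not code from extractvna.py: the function name input_urls is hypothetical, and it uses pathlib, which needs Python 3.4 or later rather than the Python 2.6 the scripts were written for.

```python
from pathlib import PurePosixPath


def input_urls(folder, nfile, extension=".html"):
    """Build file:/// URLs for the inputs numbered 1..nfile inside folder."""
    # as_uri() requires an absolute path and always yields three slashes
    base = PurePosixPath(folder).as_uri()
    return ["{0}/{1}{2}".format(base, number, extension)
            for number in range(1, nfile + 1)]


print(input_urls("/home/juan/Documents/CCK09/VNA/forum1/", 2))
# ['file:///home/juan/Documents/CCK09/VNA/forum1/1.html',
#  'file:///home/juan/Documents/CCK09/VNA/forum1/2.html']
```

A relative path such as home/juan/... makes as_uri() raise a ValueError, which is a convenient early warning for a missing leading slash.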

Anonymizing the dataset

When I anonymized the CCK08 dataset, I waited until the union was complete before creating the codesheet. For CCK09 I used a cumulative anonymization method. That is, I just add new names to the codesheet whenever I encounter them. This also allowed me to tag new CCK09 students in the dataset.
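The cumulative method can be pictured as a dictionary that only ever grows. The sketch below is an illustration under assumptions, not the code in anoncck09.py: the function name extend_codesheet and the alias format (a prefix plus a running number) are hypothetical, and it assumes every alias already in the codesheet was assigned the same way so that new numbers do not collide.

```python
def extend_codesheet(codesheet, names, prefix="P"):
    """Add an alias for every name not yet in codesheet; return the new names."""
    added = []
    for name in names:
        if name not in codesheet:
            # hypothetical alias scheme: next running number after the current size
            codesheet[name] = "{0}{1:03d}".format(prefix, len(codesheet) + 1)
            added.append(name)
    return added


codesheet = {"Ann": "P001"}           # carried over from an earlier forum
new_names = extend_codesheet(codesheet, ["Ann", "Bob", "Cy"])
print(new_names)                      # ['Bob', 'Cy']
```

The returned list contains exactly the names first seen in the current batch, which is what makes it possible to tag newcomers.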

With an existing codesheet

If you already have a codesheet, go to step 1. If not, first create a codesheet manually from the union of the first set of VNA files.

1. Download and extract anoncck09.py into the same folder as the extracted VNA files. In the example above this is vna1.

2. Open anoncck09.py in a text editor, e.g. gedit, then change the following values.

###CHANGEABLE VALUES
idirectory = '/home/juan/Documents/CCK09/VNA/vna1/' #input file directory
csfilename = '/home/juan/Documents/CCK09/VNA/codesheet08.csv' #codesheet
odirectory = '/home/juan/Documents/CCK09/VNA/vna1/' #output directory
nfile = 35 #number of input files
###END OF CHANGEABLE VALUES

3. Open a terminal and issue the command
./anoncck09.py

or

python anoncck09.py

4. If it is successful it will say "finished anonymyzing vna files". You will have 1.csv, 2.csv, 3.csv ... in your folder.

5. Repeat for the other forum sets, adding new names and aliases to the codesheet as you go.
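For readers curious what applying a codesheet amounts to, here is a rough sketch. It is not the logic of anoncck09.py: it assumes the codesheet is a two-column CSV (real name, alias), which may not match your file, and the function names load_codesheet and anonymize_text are hypothetical. It is written for Python 3.

```python
import csv
import re


def load_codesheet(path):
    """Read name,alias rows from a two-column CSV into a dict (assumed layout)."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def anonymize_text(text, codesheet):
    """Replace every real name in text with its alias in a single pass."""
    names = sorted(codesheet, key=len, reverse=True)  # longest names first
    if not names:
        return text
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: codesheet[match.group(0)], text)
```

Matching the longest names first keeps a short name from being replaced inside a longer one, and doing the replacement in one pass means an alias is never rewritten a second time.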

Anonymizing a Union of datasets

1. Download and extract anoncck09union.py into the extracted VNA folder.
2. Open anoncck09union.py in a text editor and change the following values.
###CHANGE THESE VALUES
idirectory = '/home/juan/Documents/CCK09/VNA/1/1union.csv' #input file (full path to the union csv, not just a directory)
csfilename = '/home/juan/Documents/CCK09/VNA/codesheet1.csv' #codesheet
odirectory = '/home/juan/Documents/CCK09/VNA/1/' #output directory
fname = 'outputfile' #filename of output file
###END OF CHANGEABLE VALUES

3. Open a terminal and issue the command
./anoncck09union.py

or

python anoncck09union.py

4. If it is successful it will say "finished anonymyzing vna files". You will have a file named outputfileunion.csv.
5. Repeat for the other forum union datasets, adding new names and aliases to the codesheet.

Merging the datasets into a union dataset using an OpenOffice.org spreadsheet

This part will be very tedious if you have a lot of data.

1. Open each anonymized csv file.
Select Character set: Unicode (UTF-8).
Enable Separated by: Space and set the text delimiter to ".
2. Create a new spreadsheet with separate worksheets for node data and tie data. Copy the node data and tie data from every file into this spreadsheet for merging.
3. Select the data columns without the headers, i.e. without *Node data and ID posts, and without *Tie data and from to talk strength.
4. Then sort the data with the menu Data - Sort.
5. Once the names are sorted you will see the duplicate entries. Sum their numbers and erase the duplicate entries.
6. Copy your merged node data and tie data into another worksheet, then save it as csv.
7. Select space as the field delimiter and " as the text delimiter. Disable "Save cell content as shown".
8. Open the saved csv file in a text editor and remove the quotes from the headers: *Node data, ID posts, *Tie data, and from to talk strength. Otherwise you will not be able to open the file in Netdraw.
9. Change the file extension to vna and open the file in Netdraw.
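If the spreadsheet work becomes too tedious, steps 3 to 8 could in principle be scripted. The sketch below is not something I used for CCK09; it only illustrates the idea. The function names merge_rows and format_vna are hypothetical, it assumes all value columns are whole numbers, and it sums every numeric column of duplicate rows, including talk, which you should check against your own data.

```python
from collections import OrderedDict


def merge_rows(rows, key_len):
    """Sum the numeric columns of rows that share the same first key_len fields."""
    totals = OrderedDict()
    for row in rows:
        key = tuple(row[:key_len])
        values = [int(value) for value in row[key_len:]]
        if key in totals:
            totals[key] = [a + b for a, b in zip(totals[key], values)]
        else:
            totals[key] = values
    return [list(key) + values for key, values in totals.items()]


def format_vna(nodes, ties):
    """Render merged rows as VNA text with unquoted section and column headers."""
    lines = ["*Node data", "ID posts"]
    lines += ['"{0}" {1}'.format(*row) for row in nodes]
    lines += ["*Tie data", "from to talk strength"]
    lines += ['"{0}" "{1}" {2} {3}'.format(*row) for row in ties]
    return "\n".join(lines) + "\n"


nodes = merge_rows([["P001", 2], ["P002", 1], ["P001", 3]], key_len=1)
ties = merge_rows([["P001", "P002", 1, 2], ["P001", "P002", 1, 4]], key_len=2)
print(format_vna(nodes, ties))
```

Node rows are keyed on the ID alone and tie rows on the from and to pair, which mirrors sorting by name and summing the duplicates by hand.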