Week Six – DataONE Notebooks

This week I focused on integrating other useful information, such as work role and sector, with the rest of the visualization data. These data are not simple integers; they also include unstructured phrases and sentences. For that reason, I switched the programming language to Python. It was not easy at first, but Python turned out to be better suited to our data.

I also improved the algorithm that merges similar names and removes duplicate information. I created a "fuzzy" match function that matches people who share the same first or last name and whose other name is similar. It successfully reduced the number of near-duplicate names. Now the task becomes harder: I am cross-referencing by hand to match records of the same person whose name variants follow no pattern. Moreover, some information collected from SNS contains irregular letters and words, and those need to be cleaned up by hand, too. Because the dataset is large, cleaning up all of it will take a little longer. The task for next week is to identify groups of people by their work and geographic information, which have just been integrated. The SNS data has a lot of noise, so it will take some time to integrate it and convert it for visualization. Still, it will be very interesting to see the new visualization results.
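To illustrate the fuzzy match idea described above, here is a minimal sketch in Python. It is not necessarily the function used in this project: the helper names (`similarity`, `is_fuzzy_match`, `deduplicate`), the use of the standard library's `difflib.SequenceMatcher`, and the 0.8 threshold are all assumptions chosen for illustration. Two names are treated as the same person when one part (first or last) matches exactly and the other part is similar enough.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(name_a, name_b, threshold=0.8):
    """Compare two (first, last) tuples.

    They match when one part is identical (ignoring case) and the
    other part is at least `threshold` similar. The threshold value
    is an assumption and would need tuning on real data.
    """
    first_a, last_a = name_a
    first_b, last_b = name_b
    same_first = first_a.lower() == first_b.lower()
    same_last = last_a.lower() == last_b.lower()
    if same_first and same_last:
        return True
    if same_first:
        return similarity(last_a, last_b) >= threshold
    if same_last:
        return similarity(first_a, first_b) >= threshold
    return False


def deduplicate(names, threshold=0.8):
    """Keep the first occurrence of each group of fuzzy-matching names."""
    unique = []
    for name in names:
        if not any(is_fuzzy_match(name, kept, threshold) for kept in unique):
            unique.append(name)
    return unique


if __name__ == "__main__":
    people = [
        ("Katherine", "Lee"),
        ("Katharine", "Lee"),
        ("Jon", "Smith"),
        ("John", "Smith"),
        ("Mary", "Jones"),
    ]
    for person in deduplicate(people):
        print(person)
```

A rule like this only catches variants that share one exact name part. Names that differ in both parts, or that follow no pattern at all, would still need the manual cross-referencing mentioned above.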

In addition, I regenerated the visualization many times this week to improve its visual effect. I tried different clustering algorithms, changed the size and color of nodes, edges, and labels, and explored ways to present the information more clearly to the audience. Gephi is very powerful, and I experimented with many of its functions. I will apply what I have learned over these weeks to create better visualizations in the coming weeks.
