GSoCIdeas
This is the ideas page for Google Summer of Code. We have listed a handful of interesting project ideas that would benefit not only the S-Space package, but also the many researchers worldwide who use the package for their projects. Our key goals are to make the package more useful, more flexible, and more reliable.
Our projects are designed to give you a taste of both high-quality development and exposure to interesting research questions. We want to feed your passion for development and research with interesting, challenging, and meaningful projects!
This list is just an overview of some major projects that the team has been thinking of doing. If any of them interest you, or you would like more information on any of them, please email our development list, [email protected]. This list is also not comprehensive: if you have ideas beyond what we've listed, we're very interested in hearing them, so please share them via the mailing list and we can find a way to turn an idea into a full GSoC project.
New things you will learn as a part of working with us:
- Industrial-quality Java development with a focus on scalability, memory efficiency and concurrency
- All about the [Distributional Hypothesis](http://en.wikipedia.org/wiki/Distributional_hypothesis) and [distributional semantics](http://en.wikipedia.org/wiki/Statistical_semantics)
- New and interesting ideas in [Natural Language Processing](http://en.wikipedia.org/wiki/Natural_language_processing) and [Computational Linguistics](http://en.wikipedia.org/wiki/Computational_linguistics)
Tools you will learn in your projects (if you don't know them already):
- [Maven](http://maven.apache.org/)
- [Git](http://git-scm.com/)
- [Concurrent](http://java.sun.com/developer/technicalArticles/J2SE/concurrency/) Java programming
- Writing unit tests with [JUnit](http://www.junit.org/)
Things you can expect from us:
- Guidance to help you select your project and future directions. We want you to have a clear vision of what you're getting into and hopefully a lot of excitement as well.
- Full support via email, IM, and IRC for all your questions to ensure you are able to keep making progress. Getting stuck or not knowing what to do next is frustrating; we want your development experience to be both fun and challenging!
- Constant encouragement. Your work really matters to researchers around the world and we want you to know it.
- Respect for your passion and development skills.
How to get started:
- Read up on [distributional semantics](http://en.wikipedia.org/wiki/Statistical_semantics) to get an idea of what we do and what kinds of problems people are solving with the S-Space Package.
- Download or check out the latest source code
- Join our mailing list, [email protected]
- Download a corpus to play around with some of the algorithms. A `txt` version of one of Project Gutenberg's [top 100 books](http://www.gutenberg.org/browse/scores/top) makes a great start. (If you have some space available, the [Open American National Corpus](http://americannationalcorpus.org/OANC/index.html#download) or a [Wikipedia snapshot](http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) make for interesting corpora.)
- Run any of the algorithms over your corpus to build a semantic space. Load it into [SemanticSpaceExplorer](http://code.google.com/p/airhead-research/wiki/SemanticSpaceExplorer) and see what you can discover.
  - Do the neighbors of words seem to capture some related properties?
  - Are neighbors synonyms or are they related in a different way?
  - How does word frequency affect your results?
  - How do the neighbors change between different algorithms?
- If you're feeling ambitious, consider reading Turney and Pantel's (2010) [survey](http://www.patrickpantel.com/cgi-bin/web/tools/getfile.pl?type=paper&id=2010/jair10.pdf) of the different distributional approaches. This will provide more context for what you might be working on, but don't feel obligated to read all of it. :)
- Email us! We love to hear from our users.
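
To make the neighbor questions above concrete, here is a minimal sketch of what "nearest neighbors in a semantic space" means. It does not use the S-Space API: the class `NeighborSketch`, the helper `toySpace`, and the four hand-written co-occurrence vectors are all hypothetical, chosen only for illustration. Cosine similarity is one common way to compare word vectors; it is an assumption here, not necessarily the measure a given S-Space algorithm or tool uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the S-Space Package.
final class NeighborSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns up to k words most similar to the query word, most similar first. */
    static List<String> nearest(Map<String, double[]> space, String query, int k) {
        double[] q = space.get(query);
        List<String> others = new ArrayList<>(space.keySet());
        others.remove(query);
        others.sort(Comparator.comparingDouble(
                (String w) -> cosine(q, space.get(w))).reversed());
        return others.subList(0, Math.min(k, others.size()));
    }

    /** Made-up counts over four context dimensions: purr, pet, engine, road. */
    static Map<String, double[]> toySpace() {
        Map<String, double[]> space = new LinkedHashMap<>();
        space.put("cat", new double[] {5, 4, 0, 0});
        space.put("dog", new double[] {1, 5, 0, 1});
        space.put("car", new double[] {0, 0, 5, 4});
        space.put("truck", new double[] {0, 1, 4, 5});
        return space;
    }

    public static void main(String[] args) {
        // prints [dog, truck]
        System.out.println(nearest(toySpace(), "cat", 2));
    }
}
```

With a real corpus the vectors would come from one of the algorithms rather than being typed in by hand, and comparing the neighbor lists produced by different algorithms is one way to approach the questions in the list.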
Task | Difficulty | Description | Rationale | Ideal Deliverables | Skill Requirements |
--- | --- | --- | --- | --- | --- |
Implement an interactive GUI for the S-Space Package | Easy | Currently, the S-Space package is only accessed via the command line. For many people, this is an unfamiliar model. An S-Space Package GUI should give users the flexibility to run a variety of Semantic Space models over a variety of corpora. Ideally, a user would be able to select parameters for each Semantic Space model, such as the number of dimensions, a matrix transform, the form of dimensionality reduction, word filters, and so on, from a series of simple menus. The user could then select their corpus of choice, possibly have a chance to clean the corpus, and then decide where to save the final semantic space. | A GUI significantly lowers the bar for playing around with a variety of semantic space algorithms. With a GUI, users would be able to easily select one or more algorithms, run them, and then plug them into their application. For a good example of an easy-to-use interface for a complex set of algorithms, see Weka's GUI. | A GUI that exposes the ability to | |
Integrate S-Spaces with Graph Visualization and Community Detection | Medium to Hard | One of the nice properties of a semantic space is that it is a space. You can think about the connections between words in terms of distances, angles, vectors, etc. This project seeks to extend this idea by visualizing a Semantic Space model as a graph. The project will start with the idea of visualizing the space by representing words as vertices and connecting nearby neighbors with edges. We then would like to add support for [community detection](http://en.wikipedia.org/wiki/Community_structure) to help group related words into semantic categories. Our target graph visualization platform is [Gephi](http://gephi.org/). | The complete structure of Semantic Space models is currently hard to visualize using a command-line interface. By visualizing these spaces as graphical models, researchers will be better able to distinguish the features of each algorithm's semantic space. They can then visually determine which type of space best captures the semantics for solving their particular type of problem. Furthermore, community structure helps researchers assess global properties of the space, such as its conceptual organization. | | Java; familiarity with Gephi or other graphing software; familiarity with graph data structures and algorithms |
Create a web service around Semantic Space models | Medium | Semantic spaces for millions of words can often grow into the tens of gigabytes and take significant resources to compute. While the model only needs to be created once, sharing its data is prohibitively intensive on network bandwidth. This project focuses on exposing semantic space data as a network application. Users can query the different semantic spaces with remote method calls to access information much like they would if the data were local. Our target platform for this is [Google App Engine](https://appengine.google.com/start). | This project aims at increasing the accessibility of semantic space data. As new semantic space models are built, their contents can be rapidly disseminated via web service without having to download the entire data set. Furthermore, the web service allows applications that use semantic spaces to access the data in a lightweight manner, which opens the possibility of using the data in other web apps. | | |
Implement new clustering algorithms | Medium to Hard | [Clustering](http://en.wikipedia.org/wiki/Cluster_analysis) is fundamental to many data applications and is often essential in analyzing data. The S-Space package is currently integrating a variety of innovative and efficient clustering algorithms such as [hierarchical](http://en.wikipedia.org/wiki/Hierarchical_clustering) clustering and [spectral](http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering) clustering. Our ultimate goal is to provide a robust, *diverse* library of algorithms for researchers to use in analyzing data. Your task would be to select one or more clustering algorithms and implement them according to our clustering API. This ensures that other researchers can easily assimilate your work and use it in new applications. Ideally, we would like you to select an algorithm with two key aspects: efficient computation time with sparse data sets and an ability to infer the number of clusters via parameters. (We can help guide you on different algorithms.) | A number of clustering algorithms have been proposed in different research fields, often outside the machine learning literature. We would like the S-Space package to be a good resource of effective clustering algorithms for word spaces. These algorithms will let researchers discover more relations both within the word spaces and in broader research areas. | | |
Implement an S-Space algorithm in Hadoop | Hard | All but one of the S-Space algorithms are implemented with the default Java threading framework. This form of parallelism has a number of limitations as the number of cores increases. The Hadoop framework is an effective method of utilizing a large number of parallel machines for highly parallel tasks. This task would require the student to find similar processing patterns in the S-Space algorithms, and create a simplified Hadoop processing system for as many of those shared parallel patterns as possible. | Hadoop's parallelism can scale to a massive number of nodes, which is becoming increasingly necessary as the amount of text data available increases. This project would let researchers already using Hadoop leverage their existing setup to make full use of the S-Space package. | | |
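
As a rough illustration of the graph visualization idea in the table (words as vertices, nearby words joined by edges), the sketch below turns a toy semantic space into an edge list. Everything in it is an assumption made for illustration: the class `NeighborGraphSketch`, the hand-written vectors, and the similarity threshold are hypothetical and not part of the S-Space Package. It also swaps "connect each word to its nearest neighbors" for the simpler "connect any pair whose cosine similarity meets a threshold". The `Source,Target,Weight` column layout is assumed to suit a spreadsheet-style edge import in a tool such as Gephi; check the tool's documentation before relying on it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of the S-Space Package.
final class NeighborGraphSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Returns CSV lines: a header, then one undirected edge for every pair of
     * words whose cosine similarity is at least the threshold.
     */
    static List<String> edgeList(Map<String, double[]> space, double threshold) {
        List<String> words = new ArrayList<>(space.keySet());
        List<String> lines = new ArrayList<>();
        lines.add("Source,Target,Weight");
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                double sim = cosine(space.get(words.get(i)), space.get(words.get(j)));
                if (sim >= threshold) {
                    lines.add(String.format(Locale.ROOT, "%s,%s,%.3f",
                            words.get(i), words.get(j), sim));
                }
            }
        }
        return lines;
    }

    /** Made-up counts over four context dimensions: purr, pet, engine, road. */
    static Map<String, double[]> toySpace() {
        Map<String, double[]> space = new LinkedHashMap<>();
        space.put("cat", new double[] {5, 4, 0, 0});
        space.put("dog", new double[] {1, 5, 0, 1});
        space.put("car", new double[] {0, 0, 5, 4});
        space.put("truck", new double[] {0, 1, 4, 5});
        return space;
    }

    public static void main(String[] args) {
        // Print the edge list; redirect to a file to load it into a graph tool.
        for (String line : edgeList(toySpace(), 0.5)) {
            System.out.println(line);
        }
    }
}
```

On this toy data a threshold of 0.5 keeps only the cat-dog and car-truck edges, so the graph falls into two small groups. The all-pairs loop is quadratic in the vocabulary size, so a real semantic space would need a more scalable neighbor search.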