So yesterday, I decided to learn Python. Been a .NET guy primarily for the last n years, had some people work in it around me, but never was inclined to try it out. DUH!!!! Such a nice language. It took a couple minutes to get my bearings, but I figured…why not! Everyone in the Valley is so anti-MS and so pro-(Python, MySQL, PHP) one needs to embrace the flow.
For the last couple of years I’ve been using a very simple yet (what I believe to be) strong POS tagger built by Mark Watson and based on Eric Brill’s work. Written in C#, it gave me a very straightforward paring knife to do tokenization and POS tagging quickly and easily in .NET. Now Monty Tagger and NLTK are definitely incredible resources for NLP in Python, but I wanted something very straightforward and portable without all the bells and whistles so I can build on the core myself. Not to mention I wanted something fun for my first outing in Python. Well…ta da! Here it is.
It’s comprised of two (count them 2) VERY simple source files. The first is the basic hashing and pickling utility if you want to make changes to the lexicon (I believe I’m using the same lexicon file as Monty Tagger), and the second is the actual tagger/tokenizer.
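To give a feel for how those two pieces fit together, here is a minimal sketch. It is not the code from the download. It assumes a Brill-style lexicon text file with one `word TAG1 TAG2 ...` entry per line, where the first tag is the most likely one. The function names, the tiny inline lexicon and the fallback rules for unknown words are all made up for illustration.

```python
import pickle
import re

# Tiny stand-in for a Brill-style lexicon file (hypothetical sample data).
SAMPLE_LEXICON = """\
the DT
dog NN VB
barks VBZ NNS
. .
"""

def build_lexicon(lines):
    """Hash each word to its list of tags; the first tag is taken as most likely."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            lexicon[parts[0]] = parts[1:]
    return lexicon

def tokenize(text):
    """Split text into words (with an optional apostrophe suffix) and punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def tag(tokens, lexicon):
    """Give each token its most likely tag from the lexicon, with simple fallbacks."""
    tagged = []
    for tok in tokens:
        tags = lexicon.get(tok) or lexicon.get(tok.lower())
        if tags:
            tagged.append((tok, tags[0]))
        elif tok[0].isdigit():
            tagged.append((tok, "CD"))    # assumption: unknown numbers are cardinals
        elif tok[0].isupper():
            tagged.append((tok, "NNP"))   # assumption: unknown capitalized word is a proper noun
        else:
            tagged.append((tok, "NN"))    # assumption: everything else defaults to noun
    return tagged

# Pickle the hashed lexicon and load it back, as you would between the two scripts.
lexicon = build_lexicon(SAMPLE_LEXICON.splitlines())
restored = pickle.loads(pickle.dumps(lexicon))

print(tag(tokenize("The dog barks."), restored))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('.', '.')]
```

In real use you would write the pickle to disk once with `pickle.dump` and have the tagger read it at startup, so the lexicon text only gets parsed when it changes. A full Brill tagger also applies contextual rules after this lookup step, which the sketch leaves out.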
I’ve made some additional tweaks to the versions I run and plan to port some of them to Python as well. If you’re interested in additions, add a comment and I’ll do my best to share/accommodate.
You can download my Python NLP Part-of-Speech Tagger here.
P.S./Caveat/blahblah:
This is my first anything outside of some Hello World stuff in Python. It definitely works, and does so at a decent clip (speed wise), but I’m sure I could have done some of the operations a little more elegantly. Leave comments though with recommendations/suggestions/!flames.


24 Responses to "Simple NLP Part-of-Speech tagger in Python"
This looks great Jason… I’m not too familiar with Python so how do I install this to my Linux Apache server? All I have is FTP access to it.
Hi Brandon-
Thanks. You can run it from PHP if you only have FTP access. Here’s a good link to use as reference. Let me know if you need further help. -J.
http://www.csh.rit.edu/~jon/projects/pip/
Hi Jason,
Where can I get the .NET (C#) version of the Simple NLP POS?
Thanks
Dan
Hi-
It’s available above by clicking on Mark Watson’s name
J.
Hi Jason. It’s good to hear you’re getting into Python still, but I hope your code doesn’t look like *that* anymore!
I’ve edited your code to make it a whole lot more pythonesque, and would like to send it to you, but feel that this textarea is not the place =)
Hi,
I have downloaded the “Simple NLP Part-of-Speech tagger in Python”. I would like to integrate it in a C# .NET program. Can you please guide me?
Wow, very interested in this. I have tried to download it but the website is down. Can you post a new link, please?
Kind Regards
Mark
NLP Exeter
Really fantastic work. I am doing a project on extracting the usefulness of reviews using NLP. Can I use your tool for this work?
Feel free, but if you plan on releasing the product commercially, please make sure you comply with the LGPL.
Is it difficult to split Brill_lexicon into two files to get the file size down? I’m trying to use this on App Engine and the file is over the max size limit.
This is exactly what I’ve been looking for! I’ve encountered one stumbling block though: I’m not sure what all the tags mean.
Some of them are obvious, but it would be great to get a full list. I know there are several different tagging schemes around for POS, so I thought I’d ask you directly: Which scheme does the tagger use? Thanks!
[...] the Digg stories into a single twitter post was accomplished with a really nice little POS tagger written in python that I came across. It tags all the words in the text I give it with their [...]
Hi Jason,
I’ve tried your code using Python. What a brilliant program! I am looking for a C# version of this code and referring to Mark Watson’s blog. Unfortunately I couldn’t find a link to the Brill tagger he implemented.
Hello Jason,
The link to the zip file seems to be dead. It points to VersionTracker, then redirects to some Fujitsu drivers for some reason. Could you please provide a direct link?
Thanks!
Hey Jason, the link to your Python POS tagger brings me to the Fujitsu website. Is there somewhere else that I can download the source? If not, where can I find information on the algorithm that you used?
Hi. I didn’t realize the versiontracker dumped the file. I’m hosting it now. http://jasonwiener.com/dld/_nlplib_pyc.zip
When I tried to download the file I was redirected to a Fujitsu page. Can you post a working download link?
Thx
Felix
Hi. I didn’t realize the versiontracker dumped the file. I’m hosting it now. http://jasonwiener.com/dld/_nlplib_pyc.zip
Incredibly simple and effective part-of-speech tagger! Was looking for something just like this.
Thanks Jason!
Hi
I have downloaded the “Simple NLP Part-of-Speech tagger in Python” in order to use it with a Spanish controlled language.
[...] I used this guy’s code to determine the part of speech, plus a few modifications of my own. [...]
I have downloaded Mark Watson’s C# POS tagger and I need to use it in my program. Could you help me? Whenever I try to run the code it asks me for a lex.dat file. Please help me.
For Mark’s C# library to work, I believe you need to run the .bat file to create the .dat from the lexicon text file. He’s very good about responding to questions if you reach out to him as well. Otherwise feel free to ping me and I’ll do my best to help.