The right place to know more about cucco

Cucco

Introduction

Cucco is an easy-to-use text normalization library for Python that allows users to make text more consistent, and therefore makes data easier to process, by removing or replacing accent marks, white spaces, stop words, unnecessary characters and punctuation, and lots more.

How to install

The easiest way to install cucco is by using pip, as you get the latest release version. Simply run the following command:

$ pip install cucco

Another way to install cucco is by installing it directly from the code using git. To do this, clone the repository from GitHub and install it like this:

$ git clone https://github.com/davidmogar/cucco.git
$ cd cucco
$ python setup.py install

After using either of the two methods, you should have cucco installed and ready to use.

Quickstart

Using cucco is very simple, which makes it stand out from other normalization tools. To get started, all you need is to have an up-to-date version of cucco installed. Check out the How to Install section if you have not done this already.

The basic idea is that you pass in the text to be processed, specify the normalizations to apply, and receive the normalized text as output. Let's see some examples.

A minimal application

You can normalize text within your code using the default settings. In the default mode, punctuation, extra white spaces and symbols are removed. A predefined list of English stop words is also removed.

For example, to apply all normalizations to the text "Who let the cucco out?":

from cucco import Cucco

cucco = Cucco()
print(cucco.normalize('Who let the cucco out?'))

The output would be the following, since the punctuation, extra white spaces and the stop words would be removed:

cucco

Defining custom normalizations

To define custom normalizations, you can send a list of normalizations to apply, which will be executed in order.

For example:

from cucco import Cucco

cucco = Cucco()

normalizations = [
    'remove_extra_whitespaces',
    ('replace_punctuation', {'replacement': ' '})
]

print(cucco.normalize('Who    let   the cucco out?', normalizations))

In this case, the output would be the following, since only extra white spaces are removed and punctuation is replaced by a white space:

Who let the cucco out
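
The effect of running normalizations in order can be pictured with a small dispatcher. This is not cucco's implementation: apply_normalizations, REGISTRY and the two helper functions are hypothetical stand-ins that only mimic the list format shown above, where each entry is either a name or a (name, parameters) tuple.

```python
import re
import string

# Hypothetical stand-ins for two normalizations; not cucco's own code.
def remove_extra_whitespaces(text):
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

def replace_punctuation(text, replacement=''):
    # Replace every ASCII punctuation character with the replacement.
    return ''.join(replacement if ch in string.punctuation else ch for ch in text)

REGISTRY = {
    'remove_extra_whitespaces': remove_extra_whitespaces,
    'replace_punctuation': replace_punctuation,
}

def apply_normalizations(text, normalizations):
    # Each entry is either a name or a (name, kwargs) tuple; run them in order.
    for entry in normalizations:
        name, kwargs = (entry, {}) if isinstance(entry, str) else entry
        text = REGISTRY[name](text, **kwargs)
    return text

steps = [
    'remove_extra_whitespaces',
    ('replace_punctuation', {'replacement': ' '}),
]
in_order = apply_normalizations('Who    let   the cucco out?', steps)
reversed_order = apply_normalizations('Who    let   the cucco out?', steps[::-1])
print(repr(in_order))        # 'Who let the cucco out '
print(repr(reversed_order))  # 'Who let the cucco out'
```

In this sketch the order matters: replacing the question mark with a space after whitespace has already been collapsed leaves a trailing space, while the reverse order does not.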

Normalizations

Now that you have cucco installed and know the basics of how to process text, here is a list of the normalization functions that cucco provides.

Remove accent marks

This function removes accent marks from the text. You can exclude specific accent marks from this effect through the excluded parameter.

def remove_accent_marks(text, excluded=None):
    """
    Args:
        text: The text to be processed.
        excluded: Set of unicode characters to exclude.
    """

Remove white spaces

This function removes white spaces from the beginning and the end of the string, as well as duplicated white spaces between words.

def remove_extra_whitespaces(text):
    """
    Args:
        text: The text to be processed.
    """

Remove stop words

Stop words in over 50 languages are supported. The language of stop words to be loaded is specified during instantiation, according to the language code received. You can specify whether to take the letter case into account or not. English is the default language.

def remove_stop_words(self, text, ignore_case=True, language=None):
    """
    Args:
        text: The text to be processed.
        ignore_case: Whether or not to ignore case.
        language: Code of the language to use (defaults to 'en').
    """

Replace characters

You can remove specific characters from the input text or replace them with a specified string.

def replace_characters(self, text, characters, replacement=''):
    """
    Args:
        text: The text to be processed.
        characters: Characters that will be replaced.
        replacement: New text that will replace the custom characters.
    """

Replace emails

You can remove email addresses from the input text or replace them with a specified string.

def replace_emails(text, replacement=''):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace email addresses.
    """

Replace emojis

You can remove emojis, emoticons, or smileys from the input text, or replace them with a specified string.

def replace_emojis(text, replacement=''):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace emojis.
    """

Replace hyphens

It's also possible to remove hyphens from the input text or replace them with a white space or a specified string.

def replace_hyphens(text, replacement=' '):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace hyphens.
    """

Replace punctuation

Punctuation can be easily removed or replaced with a string if specified.

def replace_punctuation(self, text, excluded=None, replacement=''):
    """
    Args:
        text: The text to be processed.
        excluded: Set of characters to exclude.
        replacement: New text that will replace punctuation.
    """

Replace symbols

You can remove symbols from the input text or replace them with a specified string.

def replace_symbols(text, form='NFKD', excluded=None, replacement=''):
    """
    Args:
        text: The text to be processed.
        form: Unicode form.
        excluded: Set of unicode characters to exclude.
        replacement: New text that will replace symbols.
    """

Replace URLs

URLs can be removed from the text or replaced with a string if specified.

def replace_urls(text, replacement=''):
    """
    Args:
        text: The text to be processed.
        replacement: New text that will replace URLs.
    """

Command Line Interface

Introduction

Cucco provides a command line tool that can be used directly from the terminal and is installed along with the library. This section covers the usage and features of this tool.

The configuration file

Although it's not strictly necessary, the command line tool lets you specify a configuration file with all the normalizations to apply to a given text. This is an example of a configuration file:

normalizations:
  - remove_extra_whitespaces
  - replace_punctuation:
      replacement: ' '
  - remove_stop_words

The cucco configuration file uses the YAML format and expects a root element named normalizations. This element has to contain a list of all the normalizations to apply and their respective parameters.

To use this file, pass the option -c path_to_config.yaml on the command line.
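
The YAML structure lines up with the list format accepted by normalize in Python code. The sketch below shows one way that mapping could work. It starts from the data a YAML parser would typically produce for the example file, so no YAML library is needed. The function to_normalization_list is hypothetical, and how cucco itself loads the file is not shown here.

```python
# Data a YAML parser would typically return for the example file above.
parsed_config = {
    'normalizations': [
        'remove_extra_whitespaces',
        {'replace_punctuation': {'replacement': ' '}},
        'remove_stop_words',
    ]
}

def to_normalization_list(config):
    # Hypothetical helper: turn the parsed config into names and
    # (name, parameters) tuples, preserving order.
    result = []
    for entry in config['normalizations']:
        if isinstance(entry, str):
            result.append(entry)
        else:
            # A one-key mapping of the form {name: {parameter: value}}.
            name, params = next(iter(entry.items()))
            result.append((name, params or {}))
    return result

normalizations = to_normalization_list(parsed_config)
print(normalizations)
```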

Normalizing input

This section demonstrates how to use the tool to normalize text from different types of inputs.

User defined text

The most straightforward way to use cucco is to supply the text as an argument when running it. For example:

$ cucco normalize 'Who let the cucco out?'
cucco

Pipe text

In a similar way to the example above, it is possible to normalize text coming from another command using pipes. This example illustrates how to do it:

$ echo "Who let the cucco out?" | cucco normalize
cucco

Files on disk

Finally, cucco allows you to apply normalizations to all of the files in a given directory. This can be achieved in the following way:

$ cucco batch path_to_directory
2017-06-18 23:28:14,455 I Processing files in "path_to_directory"

Option -r or --recursive can be used to find files recursively.

At the moment, it is not possible to apply normalizations to a single file, but this functionality will be available soon. Follow the progress here.

Watch mode

One of the most advanced features of the command line tool is its "watch mode." In this mode, cucco monitors a given path for changes and applies normalizations to every new or modified file.

$ cucco batch --watch path_to_directory
2017-06-18 23:24:40,633 I Initializing watcher for path "path_to_directory"
2017-06-18 23:24:40,634 I Starting watcher
2017-06-18 23:24:40,635 I Waiting for file events

Option -r or --recursive can be used to find new and modified files recursively.
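
To make the notion of "new or modified file" concrete, here is a minimal polling sketch that compares two snapshots of a directory. It is only an illustration: the log output above suggests the tool reacts to file events, and polling is swapped in here for simplicity. The names snapshot and changed_files are hypothetical.

```python
import os
import tempfile

def snapshot(path):
    # Map each regular file directly under path to its modification time.
    return {entry.path: entry.stat().st_mtime_ns
            for entry in os.scandir(path) if entry.is_file()}

def changed_files(before, after):
    # A file counts as changed if it is new or its modification time differs.
    return sorted(path for path, mtime in after.items()
                  if before.get(path) != mtime)

with tempfile.TemporaryDirectory() as directory:
    before = snapshot(directory)
    new_file = os.path.join(directory, 'note.txt')
    with open(new_file, 'w') as handle:
        handle.write('Who let the cucco out?')
    after = snapshot(directory)
    detected = changed_files(before, after)
    print(detected == [new_file])  # True
```

A real watcher would repeat the comparison in a loop and, for recursive mode, walk subdirectories as well.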

API

Introduction

Another way to use cucco is through its API. This section gives you the details about its usage. To learn more about the routes or how to reach them, you can check this link.

Endpoints

The cucco API has two kinds of endpoints, public and private. The private endpoints are being tested at the moment.

The difference between the two kinds of endpoints is the way they are accessed. Public endpoints have routes preceded by /v1, such as /v1/accents, and are rate limited, while private ones are preceded by /private, such as /private/accents, and require authentication.

Authentication

Private endpoints require the user to be authenticated. Authentication is performed using an API key that can be obtained through this page (coming soon). Once you are registered, you will be able to get an API key from your profile. With the key in hand, query the /private/token endpoint, sending the following header:

"api_key": "[your_api_key]"

This returns a short-lived token, which you should use to query other endpoints. Its usage is similar to the previous query, but the header name should be changed to token.
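
The two-step flow can be sketched with the standard library by building, but not sending, the two requests. Several details are assumptions: BASE_URL is a placeholder host, the helper names are hypothetical, and the HTTP method and the format of the token response are not described in this section, so they are left out. urllib capitalizes header names when storing them, which is harmless because HTTP header names are case-insensitive.

```python
from urllib.request import Request

# Hypothetical base URL; the real host is not given in this document.
BASE_URL = 'https://cucco.example'

def build_token_request(api_key):
    # Step 1: present the API key to the token endpoint.
    return Request(BASE_URL + '/private/token', headers={'api_key': api_key})

def build_private_request(path, token):
    # Step 2: same idea, but the header is named "token".
    return Request(BASE_URL + path, headers={'token': token})

token_request = build_token_request('[your_api_key]')
accents_request = build_private_request('/private/accents', '[your_token]')

print(token_request.full_url, token_request.header_items())
print(accents_request.full_url, accents_request.header_items())
```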

Rate limits

Public endpoints are rate limited, which means that users can make only a limited number of requests in a given period of time. Once the limit is exceeded, an error message is returned for any further requests in the same time window.

HTTP headers can be checked to find out the reason for a rate limit error. The following headers are returned with every request:

  • X-Rate-Limit-Limit: the rate limit ceiling for that given endpoint
  • X-Rate-Limit-Remaining: the number of requests left for the current window
  • X-Rate-Limit-Reset: the remaining window before the rate limit resets, in UTC epoch seconds
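
A client can use these headers to decide when to retry. The following sketch reads them from a plain dictionary. It assumes that X-Rate-Limit-Reset carries an absolute UTC epoch timestamp, which is one reading of the description above. The function name summarize_rate_limit and the sample values are made up for illustration.

```python
import time

def summarize_rate_limit(headers, now=None):
    # ASSUMPTION: X-Rate-Limit-Reset is an absolute UTC epoch timestamp.
    now = time.time() if now is None else now
    reset_at = int(headers['X-Rate-Limit-Reset'])
    return {
        'limit': int(headers['X-Rate-Limit-Limit']),
        'remaining': int(headers['X-Rate-Limit-Remaining']),
        'seconds_until_reset': max(0, reset_at - int(now)),
    }

# Made-up header values for an exhausted 30-per-hour window.
sample_headers = {
    'X-Rate-Limit-Limit': '30',
    'X-Rate-Limit-Remaining': '0',
    'X-Rate-Limit-Reset': '1500000600',
}
summary = summarize_rate_limit(sample_headers, now=1500000000)
print(summary)
# {'limit': 30, 'remaining': 0, 'seconds_until_reset': 600}
```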

The next table shows rate limits for the different routes:

Endpoint               Rate limit
/api/v1/accents        30 per hour
/api/v1/characters     30 per hour
/api/v1/emails         30 per hour
/api/v1/emojis         30 per hour
/api/v1/hyphens        30 per hour
/api/v1/normalize      10 per hour
/api/v1/punctuation    30 per hour
/api/v1/stopwords      10 per hour
/api/v1/symbols        30 per hour
/api/v1/urls           30 per hour
/api/v1/whitespaces    30 per hour