Preparing Data

The first step in using text categorization is preparing your input file and your test file. Both files must contain tab-separated values in the following format:
  • UTF-8 encoding
  • Tab-separated data in two columns: the first column contains the category name (for example, "Patient" or "Provider") and the second column contains the data for that category (as shown in the example below)

Your data should look like this:

Patient	John Smith dob04181963 224 Main St. Atl GA 30311
Provider	Mark Johnson M.D. NPI5489512047 412 Washington Atl GA 30301
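
If you generate these files from a script, the format above can be sketched as follows. This is a minimal Python illustration, not part of the product: the function names (format_row, write_rows, find_bad_lines) and the file name in the usage note are hypothetical. It writes UTF-8, puts exactly one tab between the category and its data, and reports any line that does not split into two non-empty columns.

```python
def format_row(category, text):
    """Return one line of the file: category, a tab, then the data."""
    for field in (category, text):
        # A tab or line break inside a field would break the two-column layout.
        if any(ch in field for ch in ("\t", "\n", "\r")):
            raise ValueError(f"field contains a tab or line break: {field!r}")
    return f"{category}\t{text}"


def write_rows(path, rows):
    """Write (category, text) pairs to path as UTF-8, one pair per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for category, text in rows:
            f.write(format_row(category, text) + "\n")


def find_bad_lines(path):
    """Return the 1-based numbers of lines that are not two non-empty columns."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
                bad.append(number)
    return bad
```

For example, write_rows("input.tsv", [("Patient", "John Smith dob04181963 224 Main St. Atl GA 30311")]) produces one line in the layout shown above, and find_bad_lines("input.tsv") returns an empty list when every line is well formed. Running the same check on the test file before use catches lines where spaces were typed instead of a tab.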