OWL+

Ownership and Leadership: Pathway for (Endangered) Languages’ Use in School

Data Collection 2: Text Samples


PREPARATION

Before you start collecting samples, consider the following (or read the guide Data Collection 1: Preliminary Planning):

  1. Define your objectives: What language level, themes, or linguistic features are you focusing on?
  2. Identify your sources: Plan where you’ll collect samples from (e.g., libraries, websites, public spaces).

You might prefer to collect your samples casually and take some photos whenever convenient. However, we recommend that you keep track of the domains you are covering to keep your “mini corpus” balanced and varied.
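If you keep your sample log in a spreadsheet, a few lines of code can show which domains are still thin. The following is a minimal sketch in Python, not a required part of the workflow. The file names, domain labels, and the threshold of two samples per domain are all made up for illustration.

```python
from collections import Counter

# Hypothetical sample log: one (file name, domain) pair per collected sample.
sample_log = [
    ("menu_cafe.jpg", "food"),
    ("bus_timetable.pdf", "transport"),
    ("bakery_sign.jpg", "food"),
    ("school_newsletter.pdf", "education"),
    ("recipe_blog.png", "food"),
]

def domain_counts(log):
    """Count how many samples were collected for each domain."""
    return Counter(domain for _, domain in log)

def underrepresented(log, planned_domains, minimum=2):
    """Return the planned domains that have fewer than `minimum` samples."""
    counts = domain_counts(log)
    # A Counter returns 0 for domains that have no samples yet.
    return sorted(d for d in planned_domains if counts[d] < minimum)

# Hypothetical list of domains you planned to cover.
planned = ["food", "transport", "education", "health"]
print(domain_counts(sample_log))
print(underrepresented(sample_log, planned))
```

Here the log is heavy on food texts, so the second call lists the domains that would benefit from more samples on your next collection round.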

COLLECTING SAMPLES

Digital Sources

  1. Websites: Use your browser’s save function or screenshot tool to capture web pages.
  2. E-books: If permitted, copy relevant passages or save as PDFs.
  3. Social media: Screenshot conversations or posts (ensure you have permission if the content is private).
  4. Digital newspapers and magazines: Save articles as PDFs or use the “Print to PDF” function.

Physical Sources

  1. Books and print media: Use your scanner to digitize relevant pages.
  2. Ephemera (menus, tickets, flyers): Scan or photograph these items.
  3. Handwritten notes or letters: Scan these to preserve authentic samples of handwriting.

Real-world Texts

  1. Public signage: Photograph signs, posters, or billboards.
  2. Menus: Ask restaurants if you can keep a menu to scan, or take a clear photo.
  3. Product packaging: Flatten packaging and scan, or take clear photos of text.

PROCESSING SAMPLES

  1. OCR: Convert image-based text to editable text using OCR software or apps.
  2. Clean-up: Edit the OCR output to correct any errors and format consistently.
  3. Anonymization: Remove or change any personal identifying information to protect privacy.
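Parts of the clean-up and anonymization steps can be automated. The sketch below is one possible approach in Python using only the standard library. The function names and the two patterns are illustrative assumptions, not a complete solution: the patterns only catch e-mail addresses and phone-like digit sequences, so personal names and addresses still need a manual check.

```python
import re

def clean_ocr_text(text):
    """Apply simple, conservative clean-up to raw OCR output."""
    # Rejoin words split by a hyphen at a line break: "lan-\nguage" -> "language".
    # Caution: this also joins genuine hyphenated compounds that break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim stray whitespace at the start and end of each line.
    return "\n".join(line.strip() for line in text.splitlines())

# Illustrative patterns only; adapt them to the formats used in your region.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s()-]{6,}\d")

def anonymize(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

raw = "Contact  Anna at anna@example.org\nor call +44 20 7946 0958 for the lan-\nguage class."
print(anonymize(clean_ocr_text(raw)))
```

Note that the name "Anna" is left untouched in the output. Always read through each sample yourself before sharing it with students.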

Area of Interest: Documentation and text collection

Skills:

Competences:

Age Bracket: 11 – 15, 16 – 18, and Adult Education

Time Commitment: 30 – 60 minutes

Affordability:

Materials:

This guide is designed to help you gather a diverse range of authentic text materials to enhance your teaching. By systematically collecting text samples, you’ll create a valuable resource for vocabulary acquisition, reading comprehension, and cultural understanding. The guide emphasises a digital-first approach while also incorporating basic principles of lexicography to help you build a small but well-rounded corpus of texts to enrich your lessons.

EQUIPMENT NEEDED
For this digital-first approach, you’ll need:
1. Laptop or desktop computer
2. Scanner (preferably portable for on-the-go scanning)
3. Smartphone (for quick captures and OCR apps)
4. External hard drive or cloud storage subscription
5. Text processing software (e.g., Microsoft Word, Google Docs)
6. Spreadsheet software (e.g., Microsoft Excel, Google Sheets)

Optional but useful:
1. OCR (Optical Character Recognition) software or app.
2. Digital camera (if your smartphone camera isn’t sufficient).

RECOMMENDED OCR APPS (listed from free to most expensive)
1. Google Drive (iOS/Android): Free with Google account.
Pros: Seamless integration with Google Docs, automatic OCR for PDFs and images.
Cons: OCR accuracy can be inconsistent.

2. Microsoft Lens, formerly Office Lens (iOS/Android): Free.
Pros: Integrates well with Microsoft Office, good for document scanning.
Cons: OCR features are more limited than in specialized apps.

3. Tesseract (Open-source): Free.
Pros: Highly customizable, supports many languages.
Cons: Requires technical knowledge to set up and use effectively.

4. Adobe Scan (iOS/Android): Free with basic features, subscription for advanced features.
Pros: Easy to use, good accuracy, automatic cloud storage.
Cons: Some features require subscription.

5. ABBYY FineReader (Desktop/Mobile): Paid with free trial, most expensive option.
Pros: High accuracy, advanced OCR features, supports many languages.
Cons: Expensive for casual users.

We recommend that you choose an OCR solution based on your specific needs, budget, and technical comfort level. Start with the free options and see which ones work best for your language before investing in any subscriptions.

Expert recommendations:

LEXICOGRAPHIC PRINCIPLES TO CONSIDER

Understanding basic lexicographic principles is crucial when collecting text samples for language teaching. These principles help you analyze and organize your samples more effectively, leading to better teaching materials and a deeper understanding of language use. By applying these concepts, you can identify patterns in word usage, understand how context affects meaning, and recognize the nuances of language that might not be immediately apparent. This knowledge allows you to create more comprehensive and accurate resources for your students, helping them to develop a more authentic and nuanced understanding of the target language.

As you collect and organize your samples, keep these basic lexicographic principles in mind:
1. Frequency: Note how often certain words or phrases appear across your samples.
2. Context: Record the context in which words are used, as this can affect meaning.
3. Collocation: Pay attention to words that frequently appear together.
4. Register: Note the level of formality in each text.
5. Semantic fields: Group related words from your samples into thematic categories.
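Once your samples are available as plain text, the first and third principles (frequency and collocation) are easy to explore with a short script. The following Python sketch is illustrative only. The three sample sentences are invented, and counting adjacent word pairs is a simple stand-in for collocation analysis rather than a full statistical measure. The tokenizer also splits on apostrophes and hyphens, which may not suit every language.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def word_frequencies(texts):
    """Count how often each word appears across all samples."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def bigram_frequencies(texts):
    """Count adjacent word pairs within each sample as a rough collocation signal."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Pair each token with the one that follows it.
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Invented example texts standing in for cleaned samples.
samples = [
    "Fresh bread every morning.",
    "Fresh bread and fresh milk.",
    "Buy fresh milk here.",
]

print(word_frequencies(samples).most_common(3))
print(bigram_frequencies(samples).most_common(2))
```

With real samples, the most frequent words and pairs give you a starting list of vocabulary and fixed expressions to teach. Context, register, and semantic fields still call for your own judgment.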

ETHICAL CONSIDERATIONS
1. Copyright: Make sure you are not violating copyright law. Use materials that are in the public domain or covered by an educational exception (such as fair use, where it applies in your country).
2. Privacy: Always anonymize personal information in text samples.
3. Consent: If collecting samples from individuals (e.g., WhatsApp conversations), obtain explicit permission.