Product Page Processing Methods

From Salish Sea Wiki
  1. For all pages containing {{Product}}
    1. If namespace is not "File" or "Main" then flag as "Product in other namespace" and END
      • No other namespaces contain: {{Product}} or {{product}}.
      • There are 447 pages in the File namespace containing either: {{Product}} or {{product}}, see list here.
      • There are 186 pages in the Main namespace containing: {{Product}} or {{product}}, see list here.
        • Note: there is 1 page whose name ends in .pdf (here)
        • Regex for search: (\{\{[Pp]roduct\}\})
        • Regex search for pagenames: \t(.*)\n.*\n
        • Regex replace for pagenames in Main: # [[$1]]\n
        • Regex replace for pagenames in File: # [[:$1|$1]]\n
    2. Where free text contains [[Category:Document]] Then:
      1. Look for four digit number in name string, and make YEAR = four-digit Number, if there is no number then flag as "Document with no year"
        • There are 87 pages that don't include a 4-digit year, see list here. (Note: this is the count across all pages with {{Product}} or {{product}}, not only those that also contain [[Category:Document]].) This is expected, and some of these will need to shift to a new template, so some mechanism for replacing the Product template with the Cosmetic Graphic template will be useful.
        • This sounds fine. If a page is supposed to be moved from the Product Template to the Cosmetic Graphic Template, we'd just need a mapping. Let me know if this should be done before the data scraping (can change template using replace text) or if it can be integrated into the spreadsheet review (preference) - PLEASE INSTRUCT
      2. Take text to left of the number and make AUTHOR = left text string (strip the rightmost space)
        • Assuming when there's no 4 digit year, this needs to be skipped as well.
      3. Take text to right of the number, starting at first non-space character and make TITLE = right text string; if no text then
        • If there is no text, then what should be done? All of these should be redirect pages... which don't have the template, otherwise I will need to change page name manually.
        • With the example of the page David et al 2014, it contains {{product}}, contains [[Category:Document]], has a 4 digit YEAR, has text to the left for AUTHOR, but does not have text to the right. If this is supposed to be a redirect, what does it redirect to? I "think" this may be a manual action, but the page probably should be moved so that it looks like "David et al 2014 Foraging and Growth Potential of Juvenile Chinook Salmon after Tidal Restoration". Then the original page (David et al 2014) would redirect to there? This is an example of a product that was created without the QA of the form... so it just needs a page move. I can do that manually (and just did for David et al), or I can perform the page move as part of the quality control review of page data. PLEASE INSTRUCT
      4. Take all categories and add as comma delimited structured data in the field CATEGORIES
        • This can be done to add the page to the correct category, but since the tree selector is split into 6 parts, we need to add each category into the correct sub-tree (Geographic Place, Political Jurisdiction, Workgroup Origin, Anthropogenic Topic, Ecosystem Topic, Purpose). I presume this requires a lookup table... where you lookup the category, and see the sub-tree assignment? This raises the question: where is the place of truth for the organization of categories?
        • Agreed, a lookup table would be needed to place all the categories in the correct fields. The truth for the organization of these categories is currently the category hierarchy used by Form:Product and defined by the Category pages Salish Sea, Jurisdiction, Workgroup, Anthropogenic Topics, Ecosystem Topics, & Effort. Categories have stabilized for the moment. SHOULD I DEVELOP THIS LOOKUP TABLE?
      5. Where namespace = File Then
        1. Create a new page with page name AUTHOR + YEAR + TITLE
        2. Create a link directly to the File Media at the top of the new page
          • This should just be adding the File Media (pagename) to the structured data (i.e. the "Link To File" section of the Form or the "File" field), then the Product Template would handle laying out that fields value on the page, in this case the top of the page. Understood.
        3. Add all the free text and structured data created above to the new page
        4. Delete the original File:Page content and replace with #redirect[[NEWPAGENAME]] //or alternately a link to the main namespace product page?
          • I'm not sure what this part means, "or alternately a link to the main namespace product page?" I wasn't sure what "best practice" might be... if there is any unexpected problem with using a redirect? Redirect is my impulse so that any old links to the file page go to the new main page?
          • Redirects make sense, I just didn't understand the alternative. Maybe you just meant a link on the page vs. a redirect? If so, I think the redirect makes more sense. Agreed.
    3. Where free text contains [[Category:Dataset]] Then:
      • Are we sure there is no overlap? i.e. pages with both [[Category:Document]] & [[Category:Dataset]].
      • Not sure... but there should not be... use order of operations to manage? Preference: Document sub-typing.
      • Makes sense. If you order things (maybe like they are here) then the order taken should flush things out unless they get re-run.
      1. Look for four digit number in name string, and make YEAR = four-digit Number, if there is no number then flag as "Dataset with no year"
      2. Take text to left of the number and make AUTHOR = left text string (strip the rightmost space)
        • Same note as above section.
      3. Take text to right of the number, starting at first non-space character and make TITLE = right text string; if no text then
        • If there is no text, then what should be done? same as above... requires manual name change.
      4. Take all categories and add as comma delimited structured data in the field CATEGORIES
        • Same note as above section.
    4. Where free text contains [[Category:Graphic]] Then:
      1. Look for four digit number in name string, and make YEAR = four-digit Number, if there is no number then flag as "Graphic with no year"
      2. Take text to left of the number and make AUTHOR = left text string (strip the rightmost space)
      3. Take text to right of the number, starting at first non-space character and make TITLE = right text string; if no text then
      4. Take all categories and add as comma delimited structured data in the field CATEGORIES
    5. Where free text contains [[Category:Image]] or [[Category:Map]] or [[Category:Diagram]]
      • Are we sure there is no overlap? i.e. pages with any of the above as well as [[Category:Document]] or [[Category:Dataset]].
      1. Replace {{Product}} with {{Picture}} //Right now, all the media in the graphic category are just images used to decorate pages. At a later date, they can be "promoted" to full Products if it is merited. We will need to start flagging and managing a set of File namespace media that are in the Cosmetic Image category (as opposed to a graphic that is a product with an author and year).
    6. If contains [[Category:Website]]
      1. AUTHOR = Null, YEAR = Null
      2. TITLE = pagename
      3. Take all categories and add as comma delimited structured data in the field CATEGORIES
        • Same note as above section.
    7. For any remaining product pages flag as "Product without subtype"
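
The ordering and name-splitting rules above can be sketched in Python. This is a minimal illustration, not the tool actually used on the wiki: the function names `classify` and `parse_page_name` are hypothetical, categories are matched as exact `[[Category:...]]` strings (no sort keys, case-sensitive), the year is taken to be the first standalone run of four digits, and the page name is assumed to be passed without its namespace prefix. The procedure leaves the empty-TITLE case open, so the sketch simply returns an empty TITLE for the caller to review.

```python
import re

# Assumption: the YEAR is the first run of exactly four digits
# that is not part of a longer number.
YEAR_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")

# Checked in this order, so a page carrying several subtype categories
# is claimed by the first match (Document before Dataset, and so on).
SUBTYPE_ORDER = ("Document", "Dataset", "Graphic")
PICTURE_CATEGORIES = ("Image", "Map", "Diagram")


def has_category(text, name):
    """True if the free text contains the exact category link."""
    return f"[[Category:{name}]]" in text


def classify(namespace, text):
    """Return the subtype or flag for one page that contains {{Product}}."""
    if namespace not in ("File", "Main"):
        return "Product in other namespace"
    for subtype in SUBTYPE_ORDER:
        if has_category(text, subtype):
            return subtype
    if any(has_category(text, c) for c in PICTURE_CATEGORIES):
        return "Picture"  # step 5: {{Product}} is replaced with {{Picture}}
    if has_category(text, "Website"):
        return "Website"
    return "Product without subtype"


def parse_page_name(name, subtype):
    """Split a page name into AUTHOR, YEAR and TITLE around the year."""
    match = YEAR_RE.search(name)
    if match is None:
        return {"flag": f"{subtype} with no year"}
    return {
        "AUTHOR": name[:match.start()].rstrip(),  # drop trailing space
        "YEAR": match.group(0),
        "TITLE": name[match.end():].lstrip(),     # may be empty
    }
```

For example, `parse_page_name("David et al 2014", "Document")` yields AUTHOR "David et al", YEAR "2014" and an empty TITLE, which is the case discussed above as needing a manual page move.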

//This is to clean up a number of pages in File that contain media that are categorized irregularly, and also assigns them to the template {{Cosmetic Graphic}} so they can be managed more easily later. I will need to do some manual work to sort these... some may become products, and otherwise I'll have to start managing cosmetic graphics separate from products.

  1. Where page does NOT contain {{Product}} but does contain [[Category:Graphic]] or [[Category:Image]] or [[Category:Map]] or [[Category:Diagram]], add {{Cosmetic Graphic}} at the top of free text. //This is to support the future cleanup described above.
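
A minimal Python sketch of this rule, under stated assumptions: the function name `add_cosmetic_graphic` is hypothetical, the template test reuses the search regex given earlier on this page, categories are matched as exact strings, and the check that skips pages already carrying {{Cosmetic Graphic}} is an addition not in the rule itself, included so the step can be re-run safely.

```python
import re

# Same pattern as the search regex listed above, without the capture group.
PRODUCT_RE = re.compile(r"\{\{[Pp]roduct\}\}")
GRAPHIC_CATEGORIES = ("Graphic", "Image", "Map", "Diagram")
COSMETIC = "{{Cosmetic Graphic}}"


def add_cosmetic_graphic(text):
    """Return the page text, prefixed with {{Cosmetic Graphic}} if the rule applies."""
    if PRODUCT_RE.search(text):
        return text  # product pages are handled by the main procedure
    if COSMETIC in text:
        return text  # assumption: do not add the template twice
    if any(f"[[Category:{c}]]" in text for c in GRAPHIC_CATEGORIES):
        return COSMETIC + "\n" + text
    return text
```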


Notes

The Form and structured data for Products have the following fields, where "(*)" marks a field as mandatory. For all mandatory fields, we likely need to have a value for every page we are converting:

  1. Type Group (*)
  2. Authors Dataset (*)
  3. Year Dataset (*)
  4. Title Dataset (*)
  5. FileOrCitation Dataset (*)
  6. Type Document (*)
  7. Authors Document (*)
  8. Year Document (*)
  9. Title Document (*)
  10. FileOrCitation Document (*)
  11. Type Graphic (*)
  12. Authors Graphic (*)
  13. Year Graphic (*)
  14. Title Graphic (*)
  15. FileOrCitation Graphic (*)
  16. Type Website (*)
  17. Title Website (*)
  18. File
  19. Distribution (*)
  20. Places
  21. Jurisdictions
  22. Workgroups
  23. AnthroTopic
  24. EcoTopic
  25. Effort
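
The last six fields correspond to the six sub-trees of the category selector, so the lookup table discussed above could be sketched as follows. This is an illustration only: the entries in `CATEGORY_TO_FIELD` are hypothetical placeholders, not real wiki categories, and the real table would have to be built from the category hierarchy used by Form:Product. Skipping the subtype categories (Document, Dataset, and so on) is also an assumption, made because those are carried by the Type fields rather than the six category fields.

```python
import re

# Captures the category name, ignoring an optional sort key after "|".
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")

FIELDS = ("Places", "Jurisdictions", "Workgroups",
          "AnthroTopic", "EcoTopic", "Effort")

# Hypothetical placeholder entries; the real table is not defined here.
CATEGORY_TO_FIELD = {
    "Example Place": "Places",
    "Example Jurisdiction": "Jurisdictions",
    "Example Topic": "EcoTopic",
}

# Assumption: subtype categories are not sorted into the six fields.
SUBTYPE_CATEGORIES = {"Document", "Dataset", "Graphic",
                      "Image", "Map", "Diagram", "Website"}


def sort_categories(text):
    """Return (fields, unmapped): comma-delimited values per field, plus
    any categories missing from the lookup table for manual review."""
    buckets = {field: [] for field in FIELDS}
    unmapped = []
    for raw in CATEGORY_RE.findall(text):
        category = raw.strip()
        if category in SUBTYPE_CATEGORIES:
            continue
        field = CATEGORY_TO_FIELD.get(category)
        if field is None:
            unmapped.append(category)
        else:
            buckets[field].append(category)
    fields = {field: ", ".join(values) for field, values in buckets.items()}
    return fields, unmapped
```

Returning the unmapped categories separately keeps the conversion from silently dropping a category that the lookup table does not yet cover.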