Save the Data - .docx im Jahre 2133

The future of data in the cantonal administration of the Canton of Bern


Much of the data produced in the cantonal administration is created in the .docx format. .pdf is the main competitor, but the bulk of the administration's data production is in the OOXML format (.docx).

Although .docx is the most frequently used format, it is not accepted as an archival format. The different Word versions in circulation and certain functions (e.g. automatically updated fields) can lead to mutable content. It is therefore recommended to convert every Word document to PDF/A-2u for archiving. The conversion, however, leads to information loss.

According to Wikipedia, Word is "by far the most widely used word processor in the world". It has also been on the market for 40 years. Will Word exist for another 40 years? The possibility exists, and so we ask ourselves whether it would be possible to archive records in .docx, that is, in their original format.

In this challenge we ask for software that analyzes .docx documents and describes the elements that are critical with regard to the mutability of .docx. In addition, we would like a new, archive-ready Word document to be created.
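As a rough illustration of such an analysis, the sketch below lists the field instructions stored in `word/document.xml` and flags the ones that Word recalculates on its own. It uses only the Python standard library. The function names and the choice of `DATE` and `TIME` as the "volatile" set are assumptions made for this example, not part of the challenge description, and nested fields as well as fields in headers and footers are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Field types treated as volatile in this sketch (assumption, not exhaustive).
VOLATILE_FIELDS = {"DATE", "TIME"}


def find_fields(docx):
    """Return the instruction text of each field in word/document.xml.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    instructions = []
    current = None  # collects instrText pieces of the complex field being read
    for el in root.iter():  # document order
        if el.tag == W + "fldSimple":
            # Simple field: the instruction is an attribute.
            instructions.append(el.get(W + "instr", "").strip())
        elif el.tag == W + "fldChar":
            kind = el.get(W + "fldCharType")
            if kind == "begin":
                current = []
            elif kind in ("separate", "end") and current is not None:
                instructions.append("".join(current).strip())
                current = None
        elif el.tag == W + "instrText" and current is not None:
            # Complex field: the instruction may be split over several runs.
            current.append(el.text or "")
    return instructions


def volatile_fields(docx):
    """Return only the instructions whose field type is in VOLATILE_FIELDS."""
    result = []
    for instruction in find_fields(docx):
        parts = instruction.split()
        if parts and parts[0].upper() in VOLATILE_FIELDS:
            result.append(instruction)
    return result
```

Calling `volatile_fields("example.docx")` on a document returns the instructions that would need attention before archiving; an empty list means this sketch found no `DATE` or `TIME` field in the main document part.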

Different versions of Microsoft Word OOXML:

Microsoft Word 2007

Microsoft Word 2010

Microsoft Word 2013

Microsoft Word 2016


Microsoft Word 1.0 - Internet Archive
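For the OOXML samples listed above, the Word release that last saved a file can usually be read from the package itself. The sketch below reads `docProps/app.xml` and `docProps/core.xml` with the Python standard library. The function name `read_metadata`, the returned dictionary keys, and the `WORD_RELEASES` lookup table are assumptions made for this example; both parts are optional in a package, so missing values are reported as absent rather than guessed.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Major AppVersion number -> Word release (illustrative lookup table).
WORD_RELEASES = {
    "12": "Word 2007",
    "14": "Word 2010",
    "15": "Word 2013",
    "16": "Word 2016 or later",
}


def read_metadata(docx):
    """Collect application and core properties from a .docx package.

    `docx` may be a file path or a binary file-like object.
    """
    meta = {}
    with zipfile.ZipFile(docx) as archive:
        names = set(archive.namelist())
        if "docProps/app.xml" in names:
            app = ET.fromstring(archive.read("docProps/app.xml"))
            version = app.findtext("ep:AppVersion", default="", namespaces=NS)
            meta["application"] = app.findtext("ep:Application", namespaces=NS)
            meta["app_version"] = version
            meta["release"] = WORD_RELEASES.get(version.split(".")[0], "unknown")
        if "docProps/core.xml" in names:
            core = ET.fromstring(archive.read("docProps/core.xml"))
            meta["creator"] = core.findtext("dc:creator", namespaces=NS)
            meta["created"] = core.findtext("dcterms:created", namespaces=NS)
            meta["modified"] = core.findtext("dcterms:modified", namespaces=NS)
    return meta
```

The version recorded is that of the application that last saved the file, so a document created in Word 2007 and re-saved in Word 2016 reports the later release.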

Further data for the challenge is available in the Slack channel.

Docx Date Converter App

This simple application lets users select a .docx file to be analyzed. The analysis checks whether the document contains any variable, automatically filled fields, replaces them in the XML source with a static value, and delivers a set of relevant metadata values for the analyzed document. The first version focuses only on variable date fields, which Microsoft Word would otherwise fill in automatically when the document is opened. Date values are replaced by a static date at the time of analysis. The application converts .docx files that contain a dynamic date field and saves them with a static date taken from when the .docx was created. It is a solution to the challenge of preserving the integrity of .docx files in archival systems.
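The replacement step described above could look roughly like the following sketch, which is not the project's actual backend code. It copies a .docx package entry by entry and swaps every simple `DATE` or `TIME` field in `word/document.xml` for a plain text run. The function name `freeze_date_fields`, the `static_date` parameter, and the `dd.mm.yyyy` output format are assumptions for illustration. Complex fields built from `w:fldChar` and `w:instrText` runs, fields in headers or footers, and the original run formatting are not handled.

```python
import re
import zipfile
from datetime import date
from xml.sax.saxutils import escape

# A simple field whose instruction starts with DATE or TIME, either
# self-closing or wrapping its cached result runs.
DATE_FIELD = re.compile(
    r'<w:fldSimple\b[^>]*?\bw:instr="\s*(?:DATE|TIME)\b[^"]*"[^>]*?'
    r'(?:/>|>.*?</w:fldSimple>)',
    re.DOTALL,
)


def freeze_date_fields(src, dst, static_date=None):
    """Copy the package `src` to `dst` with simple DATE/TIME fields made static.

    `src` and `dst` may be paths or binary file-like objects.
    `static_date` defaults to today's date (hypothetical parameter).
    Returns the number of fields replaced.
    """
    text = escape((static_date or date.today()).strftime("%d.%m.%Y"))
    static_run = f"<w:r><w:t>{text}</w:t></w:r>"
    replaced = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumes the part is UTF-8 encoded, as Word writes it.
                xml, replaced = DATE_FIELD.subn(static_run, data.decode("utf-8"))
                data = xml.encode("utf-8")
            # All other parts are copied unchanged.
            zout.writestr(item, data)
    return replaced
```

Working on the raw XML text instead of a parsed tree is a deliberate choice in this sketch: re-serializing with `xml.etree.ElementTree` would rewrite namespace prefixes, which Word documents rely on in attributes such as `mc:Ignorable`.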

The solution consists of three parts:

  • Front-End (React, Electron)

  • API (Python)

  • Back-End (Python)

Docx Date Converter

The Challenge

A large portion of data produced within the Canton Administration is created in the .docx format. Despite .docx being the most frequently used format, it's not permitted for archival due to potential inconsistencies across different Word versions and functions (e.g., automatically updated fields) that can lead to mutable content. Therefore, it's recommended to convert every Word document to PDF/A-2u for archival purposes, but this conversion leads to information loss.

We ask the question: "Could we archive documents in .docx, i.e., in their original format, given that Word has been a prominent document format for 40 years and might continue to be for the next 40 years?"

This challenge aims to develop software that performs an analysis of .docx documents, identifies critical elements concerning the mutability of .docx, and creates a new archive-friendly Word document.

Running the Software

To run the software, you need to have both Python (for the Flask backend) and Node.js (for the React frontend) installed on your machine.

Setup

  1. Clone this repository:

    git clone git@github.com:LucaKern/Save-the-Data---.docx-im-Jahre-2133.git
    
  2. Navigate to the project directory:

    cd Save-the-Data---.docx-im-Jahre-2133
    
  3. Install the Node.js dependencies:

    npm install
    
  4. Navigate to the backend directory:

    cd ./backend
    
  5. Create a virtual environment:

    python -m venv venv
    
  6. Activate the virtual environment:

    • On Windows:

      venv\Scripts\activate
      
    • On Unix or MacOS:

      source venv/bin/activate
      
  7. Install the Python dependencies:

    pip install -r requirements.txt
    
  8. Navigate from the backend directory to the Electron app:

    cd ../frontend
    

Running the App

To start the application, run the following command:

    npm start

Further Case Analysis

DOCX files, commonly used for document storage and sharing, may not be optimal for long-term archiving due to several factors:

  1. Compatibility: DOCX is a file format associated with Microsoft Office applications. While it enjoys widespread support presently, there is no assurance of its future dominance. If you rely on specific software to access DOCX files, there is a risk of obsolescence or incompatibility with newer systems, hindering retrieval of archived documents.

  2. Backward Compatibility: With each new release, Microsoft Office introduces changes to the DOCX format. Opening older DOCX files with newer Office versions can result in formatting errors, missing content, or altered document layout. Such issues pose challenges when accessing archived files after a significant duration.

  3. Data Corruption: Over time, files may experience corruption or degradation due to hardware failures, software bugs, or storage media issues. Due to the dependencies and complexity of the DOCX format, it is relatively more susceptible to integrity problems. Repairing corrupted DOCX files can be challenging, potentially leading to data loss or incomplete retrieval of archived documents.

  4. Long-Term Storage Standards: For long-term archiving, employing open, standardized file formats is recommended. Formats like PDF/A-2u, designed specifically for archiving, ensure document preservation and accessibility over extended periods. These formats are independent of specific software vendors, enhancing compatibility and longevity.
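The data-corruption risk in point 3 is usually addressed with fixity checks rather than repair. As a minimal sketch, assuming a SHA-256 checksum was recorded when the file entered the archive, the following compares that checksum and then verifies the ZIP container that every .docx is built on. The function names and the wording of the reported problems are hypothetical.

```python
import hashlib
import zipfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_docx(path, expected_sha256=None):
    """Return a list of integrity problems found; an empty list means none."""
    problems = []
    if expected_sha256 is not None and sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() re-reads every member and checks its CRC-32.
            bad_member = archive.testzip()
            if bad_member is not None:
                problems.append(f"CRC error in {bad_member}")
            if "word/document.xml" not in archive.namelist():
                problems.append("missing word/document.xml")
    except zipfile.BadZipFile:
        problems.append("not a readable ZIP container")
    return problems
```

A check like this only detects damage. Recovery still depends on keeping redundant copies, which holds for PDF/A files as much as for .docx.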

While converting important documents to standardized formats like PDF/A-2u is generally advised for long-term archiving, it is worth noting that the widespread use and potential future developments surrounding DOCX could contribute to its acceptance as an archival standard. Factors such as popularity, backward compatibility efforts, support from Microsoft, industry acceptance, and technological advancements might enhance the reliability of DOCX as an archiving format. However, until industry-wide acceptance and preservation efforts validate its suitability, adhering to current best practices by relying on standardized formats like PDF/A-2u for long-term archiving remains a prudent approach.

 

Event finish

Edited

1 year ago ~ MilagrosWernicke

Research

updated readme (@Luca Kern)

push image (@Luca Kern)

Readme (@Luca Kern)

added screenshot (@Luca Kern)

Delete frontend/dist directory

added files and screenshot (@Luca Kern)

first commit (@Luca Kern)

Repository updated

1 year ago ~ BrunoRodrigues

Research

Edited

1 year ago ~ oleg

1 year ago ~ LucaKern

changed to Electron should be working now with API

1 year ago ~ LucaKern

Edited

1 year ago ~ MilagrosWernicke

Coming soon

1 year ago ~ MilagrosWernicke

Edited

1 year ago ~ MilagrosWernicke

dummy function (@Luca Kern)

Research

new readme.md (@Luca Kern)

updated README.md (@Luca Kern)

Update README.md

added README.md (@Luca Kern)

first commit (@Luca Kern)

Joined the team

1 year ago ~ Jonas

Research

I've looked into Microsoft Purview, the python-docx library, found a nice technical (XML) intro to .docx, and grabbed me a copy of 💾 Microsoft Word 1.0

1 year ago ~ oleg

Joined the team

1 year ago ~ LucaKern

Project

We are defining deliverables and working strategies and identifying further issues.

1 year ago ~ BrunoRodrigues

For the documentation: https://kost-ceco.ch/cms/ooxml.html

1 year ago ~ MilagrosWernicke

Joined the team

1 year ago ~ BrunoRodrigues

Start

Edited

1 year ago ~ Felix

Joined the team

1 year ago ~ MilagrosWernicke
 

Creative Commons Licence: Unless otherwise stated, the content of this website is licensed under Creative Commons Attribution 4.0 International.