Indiana Online - An Encyclopedia of Indiana


Technology
Executive Summary

An online encyclopedia is a product of technology and as such requires a well-defined, flexible, affordable, and sustainable technical infrastructure that can be implemented and managed successfully. A technology plan has been developed for the Indiana Online Encyclopedia (IOL) with these requirements in mind. This plan:

  • Outlines a description of IOL users.

  • Discusses data, functional, security, reliability, and performance requirements.

  • Describes system design criteria, issues, and architecture.

  • Details the scope of the work and the product.

  • Describes core technologies.

  • Describes system hardware and deployment design.

  • Outlines project risk factors.

Background

On April 4, 2001, the Indiana Humanities Council (IHC) was awarded a $50,000 planning grant from the National Endowment for the Humanities through its initiative to create comprehensive online encyclopedias for all fifty states, all five U.S. territories, and the District of Columbia. IHC subsequently identified The Polis Center (TPC) as its primary partner in the development of Indiana Online, with TPC managing the planning phase under the governance of IHC.

These online encyclopedias are intended to serve as portals to each state's cultural heritage, providing access not only to America but also to the world, raising awareness of the importance of geographical roots, and fostering local pride.

IHC envisions Indiana Online™ as a leading-edge interactive, multimedia resource that will convey traditional history, lore, and knowledge, as well as interactive learning resources and advanced visualizations of data.

Through a series of community focus groups and follow-up interviews, the local goals and requirements of Indiana Online have been further defined.

Project Goals

The goals of the Indiana Online project are:

  1. To provide services and information to leaders, educators, students, researchers and cultural organizations.


  2. To serve a general audience interested in the state, including but not limited to tourism, the media, and Indiana counties.


  3. To serve corporations and individuals investigating areas for investment and location.

Users

There are two groups of IOL users: content developers and content consumers.

Content Developers

The content development staff consists of five types of users:

  • Editor-In-Chief

  • Associate Editors

  • Section Editors

  • Graphic Editors

  • Writers

The functionality required by these users will include content capturing and content editing for all data element types. The weight given to individual users within this group will be determined by the characteristics of the content development workflow as it becomes further refined.

Content Consumers

The content consumer group consists of seven types of users categorized into three levels:

  • Primary Users


    • K through 12 teachers and students

    • Higher education teachers and students

    • General public


  • Secondary Users


    • Businesses

    • Cultural heritage and tourism interests


  • Tertiary Users


    • Government agencies

    • Civic organizations



These users have been categorized into three levels that reflect the importance of their needs in the requirements definition process. The functionality associated with the users in the primary group will be weighted higher, while those in the tertiary group will be weighted lower. Consequently, as constraints and requirement conflicts arise, these divisions will determine the development priority each group will receive.
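The weighting described above could be applied as a simple ordering rule when requirements conflict. The following is a minimal sketch under stated assumptions: the numeric weights, the `prioritize` function, and the requirement descriptions in the usage note are hypothetical. Only the three levels and their member user types come from the plan.

```python
# Hypothetical numeric weights. The plan states only that primary users
# are weighted higher and tertiary users lower.
LEVEL_WEIGHT = {"primary": 3, "secondary": 2, "tertiary": 1}

# User types and levels as listed in the plan.
USER_LEVEL = {
    "K-12 teachers and students": "primary",
    "Higher education teachers and students": "primary",
    "General public": "primary",
    "Businesses": "secondary",
    "Cultural heritage and tourism interests": "secondary",
    "Government agencies": "tertiary",
    "Civic organizations": "tertiary",
}


def prioritize(requirements):
    """Order (description, user_type) pairs from highest to lowest weight.

    sorted() is stable, so requirements with equal weight keep their
    original relative order.
    """
    return sorted(
        requirements,
        key=lambda req: LEVEL_WEIGHT[USER_LEVEL[req[1]]],
        reverse=True,
    )
```

For example, `prioritize([("Bulk export", "Government agencies"), ("Lesson view", "General public")])` would place the general-public requirement first.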

The following sections describe the core requirements generated from the focus groups held for this project. They provide a level of detail that permits cost estimation for the system design and for the development of a software solution. Prior to the design phase, these core requirements will be further analyzed and decomposed, producing an implementation specification document that will drive the development process. Supporting documents for this section include notes from the focus groups as well as the editorial plan presented elsewhere.

Data Requirements

Information Elements

These are the base elements of information that the product will recognize for the purpose of indexing, searching, and metadata assignment.

  • Text

  • Image

  • Audio

  • Video

  • Animation

  • Statistics

  • Maps

  • Links

Entry

Content will appear in four primary entry types, or content classes. Photographs, audio, video, maps, links, and other elements may be associated with any entry.

  • Essay entries are interpretive narratives of more than 2,000 words. An essay can be either an overview essay, which deals with a major section heading and focuses on a broader regional area, or a mini-essay, which addresses a more specific theme within a section heading.

  • Regular entries are non-interpretive, factual entries of events and organizations.

  • Biographical entries are objective narratives of individuals who have contributed significantly to Indiana culture or history. Photographs of the individual and other materials (e.g., audio and video) will be associated with the narrative.

  • Geographic entries are maps, in both static and dynamic format.

For more definitive information regarding entry classes, please review the editorial plan.

The information elements will be associated with keywords that will be used to index and search the data for presentation to the user. There are four thematic strategies that will be employed:

  • Section Headings provide a topical set of categories. These categories can be further deconstructed into subcategories. Please review the editorial plan for details regarding suggested section headings.

  • Geography Levels define the regional level of information. For example, national, state, county and local are geographic levels.

  • Place Names are used to organize data by a specific, recognized place name.

  • Time provides a temporal attribute upon which to organize data.
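The four thematic strategies above can be pictured as four keys attached to each entry. The following is a minimal sketch, not the plan's schema: the `IndexedEntry` class, its field names, the `find_entries` function, the section heading values, and the placeholder records are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedEntry:
    """Hypothetical index record carrying the four thematic keys."""
    title: str
    section_heading: str   # topical category (values here are assumed)
    geography_level: str   # "national", "state", "county", or "local"
    place_name: str        # specific, recognized place name
    year: int              # temporal attribute


def find_entries(entries, section_heading=None, geography_level=None,
                 place_name=None, years=None):
    """Return entries that match every criterion that was supplied.

    `years` is an inclusive (first_year, last_year) tuple.
    """
    matches = []
    for entry in entries:
        if section_heading is not None and entry.section_heading != section_heading:
            continue
        if geography_level is not None and entry.geography_level != geography_level:
            continue
        if place_name is not None and entry.place_name != place_name:
            continue
        if years is not None and not (years[0] <= entry.year <= years[1]):
            continue
        matches.append(entry)
    return matches


# Placeholder records for illustration only; not real encyclopedia content.
SAMPLE_ENTRIES = [
    IndexedEntry("Sample entry A", "Transportation", "county", "Example County", 1850),
    IndexedEntry("Sample entry B", "Transportation", "state", "Indiana", 1900),
    IndexedEntry("Sample entry C", "Education", "county", "Example County", 1950),
]
```

Calling `find_entries(SAMPLE_ENTRIES, geography_level="county")` would return the two county-level placeholder entries. Combining criteria narrows the result further.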

Standards

With a commitment to developing an interoperable information system, we will incorporate open standards from a number of standards organizations, including the World Wide Web Consortium (W3C), Open GIS Consortium (OGC), Federal Geographic Data Committee (FGDC), International Organization for Standardization (ISO), American National Standards Institute (ANSI), and National Information Standards Organization (NISO).

We will create an appropriate associated metadata element for every information element.

  • Citations will be displayed when possible and, at a minimum, will be accessible whenever an information element is displayed.

  • Copyrights will be displayed when possible and, at a minimum, will be accessible whenever an information element is displayed.

Functional Requirements

Functional requirements are the fundamental subject matter of the system and are measured by concrete means such as data values, decision-making logic, and algorithms. This section defines the actions that the product must be able to take and tasks that the product must do.

Content Development Tools

Content development will initially be conducted with the standard desktop products that are familiar to the authors and editorial staff and suited to the particular class of information. At first, a manual process will be used to review, process, and post the content. As the workflow becomes better defined, we will automate and integrate it into the product.

Navigation Tools

Search function

Users will be able to navigate the site for related entries and resources. A search function will allow the user to identify all relevant entries and topics, thereby providing the user with the ability to create a comprehensive reference/reading list. The user will be able to search by more than one criterion. Keyword, thematic, time, Boolean search and natural language query search options will be provided. "Hot links" will direct the user to specific references and resources elsewhere on the site.
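As an illustration of the keyword and Boolean search options, here is a minimal sketch of an inverted index with AND and OR queries. It is not the search engine the plan will select; the function names and the sample texts in the usage note are assumptions.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def build_index(documents):
    """Map each lowercased word to the set of entry ids containing it."""
    index = defaultdict(set)
    for entry_id, text in documents.items():
        for raw_word in text.lower().split():
            word = raw_word.strip(PUNCTUATION)
            if word:
                index[word].add(entry_id)
    return index


def search_all(index, terms):
    """Boolean AND: entries that contain every term."""
    term_sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*term_sets) if term_sets else set()


def search_any(index, terms):
    """Boolean OR: entries that contain at least one term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))
```

With placeholder documents such as `{"e1": "Canal construction in the county.", "e2": "Railroad and canal routes."}`, `search_all(index, ["canal", "county"])` would return only `e1`, while `search_any` would return both.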

Theme navigation

The product will provide a set of navigation tools that allows the consumer to review the information based upon section headings and their subcategories.

Timeline

The product will provide a tool that permits the consumer to navigate through the information elements based on when the event associated with each information element occurred.

Bibliographic reference

An interface must be provided that will allow the consumer to navigate through the citation information of resources used in the encyclopedia as well as related resources that may be of interest.

Links

When appropriate, links to related sites will be offered to the user. When selected, the consumer must be informed through some means that they are no longer on the IOL site.
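One way to decide when the "leaving the site" notice is needed is to compare a link's host with the site's own host. The sketch below assumes a hypothetical host name, `iol.example.org`, and hypothetical function names. The plan does not specify how the notice will be delivered.

```python
from urllib.parse import urlparse

IOL_HOST = "iol.example.org"  # hypothetical host name, for illustration only


def is_external(url, site_host=IOL_HOST):
    """Return True when `url` points to a host other than the IOL site."""
    host = urlparse(url).hostname
    # Relative links have no host component and therefore stay on the site.
    return host is not None and host != site_host.lower()


def leaving_site_notice(url):
    """Return a notice for external links, or None for on-site links."""
    if is_external(url):
        return "You are now leaving the Indiana Online site."
    return None
```

A page template could call `leaving_site_notice` for each link and show the returned text before following the link.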

Performance Requirements

  • All modules of the system must respond to the user in 5 seconds or less.

  • The map generation system must send a map to the user in no more than 3 seconds.

  • The audio/video streaming must respond to the user in no more than 2 seconds.
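These limits could be checked automatically during testing. The following is a minimal sketch: the module keys (`"map"`, `"streaming"`) and the helper names are assumptions, and only the 5, 3, and 2 second limits come from the requirements above.

```python
import time

# Response-time limits in seconds, taken from the requirements above.
# The dictionary keys are hypothetical module names.
RESPONSE_BUDGET_SECONDS = {"default": 5.0, "map": 3.0, "streaming": 2.0}


def within_budget(module, elapsed_seconds):
    """Return True if the elapsed time meets the limit for `module`."""
    budget = RESPONSE_BUDGET_SECONDS.get(module, RESPONSE_BUDGET_SECONDS["default"])
    return elapsed_seconds <= budget


def timed_call(module, func, *args, **kwargs):
    """Call `func`, returning (result, elapsed_seconds, met_budget)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, within_budget(module, elapsed)
```

Modules without a specific entry fall back to the general 5-second limit.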

Security Requirements

  • Only editors or users with defined privileges may access the data manipulation system.

  • Database updates will be committed to the database only after the managers have approved them.

  • The casual user interfaces must not have privileges to view or alter private information.

  • Requirements regarding content versioning and integrity of the data will be further defined as the editorial workflow is further defined.
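The first two requirements, privileged access and approval before commit, can be sketched as follows. The role names, the `PendingUpdate` class, and the use of a plain list as a stand-in database are assumptions made for illustration. The actual privilege model is still to be defined.

```python
# Assumed role names. The plan says only that editors or users with
# defined privileges may use the data manipulation system, and that
# managers approve updates.
EDIT_ROLES = {"editor", "writer"}
APPROVER_ROLES = {"manager"}


class PendingUpdate:
    """A proposed change that waits for approval before it is committed."""

    def __init__(self, submitted_by_role, payload):
        if submitted_by_role not in EDIT_ROLES | APPROVER_ROLES:
            raise PermissionError(f"role {submitted_by_role!r} may not edit data")
        self.payload = payload
        self.approved = False

    def approve(self, approver_role):
        if approver_role not in APPROVER_ROLES:
            raise PermissionError(f"role {approver_role!r} may not approve updates")
        self.approved = True


def commit(update, database):
    """Append the payload to `database` (a list stand-in) once approved."""
    if not update.approved:
        raise RuntimeError("update has not been approved")
    database.append(update.payload)
```

Keeping the approval flag on the update object means that an unapproved change can be stored and reviewed without ever reaching the published data.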

Design Issues

The primary challenge in developing Indiana Online (IOL) is the creation of a process that allows editors to meaningfully and seamlessly integrate a diverse variety of media types, such as text, audio, and video, as well as GIS-enabled live maps, to communicate information about Indiana. A system of collaborative and interoperable databases and applications will be acquired and/or developed based on a well-defined and flexible data schema that allows editors of the Indiana Online web site to easily, methodically, and continuously compose and integrate content of varying media types.

Another challenge is the development of a system that can incorporate existing tools used in the creation of these diverse media types. Appropriate, commercially available tools will be selected and interfaces developed to allow users of varying expertise to expand content easily and methodically.

The developed system will be capable of supporting continuous growth coupled with richer content. This system will be easy to scale in response to the growth of content and the user base. A process will be implemented that controls the release of, and changes to, content.

IOL users, content producers and content consumers, are expected to have a wide variation in their technical proficiency. To support their needs, a wide variety of intuitive and consistent user interfaces will be developed for the Indiana Online database. These interfaces will be used to both develop and view content.

The system is distributed across multiple locations and on varying types of computer platforms. Therefore, development tools and methodologies will be selected that support cross-platform implementations. An infrastructure will be provided that is suitable for the management of such a system.

Management of intellectual property and protection of ownership and authenticity of the content is another challenge. To meet this challenge, a system will be developed that automates the process of safeguarding the ownership and content authenticity.

Design Criteria

This section describes the overall IOL system design criteria.

Consistent Content

The core persistent data will be stored in a centralized location to avoid duplication and achieve greater consistency.

Multiple User Interfaces

To support a variety of uses, IOL will provide multiple types of interfaces that allow the user to choose one that is appropriate for the task at hand.

Common Look and Feel

Interfaces to IOL information will have a consistent and common look and feel to allow the user to intuit where to look for desired information.

Security

Since the encyclopedia is a factual document, the authenticity of the content and its ownership will be preserved by an information security scheme and will be enforced by a software authentication and verification mechanism. The privacy and security of users will also be maintained using a similar mechanism.

Extensibility

Encyclopedia functionality will be designed to keep up with participant growth, participant activities and interaction, as well as technological change.

Modularity

Design of the encyclopedia will be developed using modules that interact with well-defined interfaces. This will allow developers to work independently, enhance maintainability and testability, and provide opportunities for using purchased components and outsourcing some development.

Minimize Network Traffic

The application will avoid transmitting data needlessly or redundantly.

Reliability

Encyclopedia application modules will be reliable and support continuous operation without interruptions.

Maintainability

As best practices in programming and technologies evolve, modular components must be adaptable to take advantage of these changes.

Code Reusability

Application modules will be developed using reusable components. This will reduce errors and reduce the cost of maintaining the application.

Portability

Application modules will be developed to run on a variety of operating environments without major changes to program code.

Performance

Performance under real time use will be optimized to give users the best experience possible.

Scalability

Components will be deployable across multiple servers to support increases in demand.

Architectural Consideration

To accomplish the above design goals, an efficient computing architecture will be employed with the following characteristics:

  • An n-tier open architecture based on XML web services for interoperation and integration of modular and distributed applications.


  • A scalable computing architecture that scales out distributed modular applications and scales up key application modules and data stores.


  • A balanced computing system that eliminates bottlenecks throughout the system architecture.


  • Highly interactive, Internet-connected, and high volume data development and visualization applications.


  • Support for all recognized browser technologies, with compatibility extending three versions back.


  • Readiness for integration with well-known indexing and search engines.

Systems Design

The project can be conceptually divided into two primary functional categories. The first category is Content Development. This includes the functionality necessary for the collection of resources, authoring of information, as well as management and implementation of editorial and publishing decisions. The second primary functional category is Content Consumption. This category is concerned with the presentation of information to the consumers and methodologies they will use to navigate, search, and manipulate the information.

The proposed project is ambitious and full deployment will likely take three to four years. To accommodate the inevitable evolution in system requirements, versioned products will be developed, released, and evaluated, with the initial version of the product addressing the most necessary requirements. With this strategy in mind, the initial focus will be in the development of the functionality required for consumers. This will provide a more immediate product for us to evaluate.

Presently, content development tools are readily available. The integration of these tools into the product can proceed incrementally while the content development process is managed, in the meantime, in an independent and manual fashion. Another opportunity to deliver a quality versioned product early is to limit the initial deployment to mainstream datasets such as text and photography. More complex datasets, such as audio/video streaming and dynamic mapping, will be introduced in later versions. This will allow for the creation of a solid product foundation while also permitting the evaluation of off-the-shelf products that handle these more difficult datasets as they become available.

The Scope of the Work

The process data flow diagram illustrates the manner by which the content development staff will input raw data into the IOL archival system and transform it into meaningful information that is viewable by content consumers through the use of a web browser.





[Figure: process data flow diagram]


The process starts with the resource data, metadata collection, and selection stage. At this stage the development staff will select pertinent sources of raw data. The data can be in either electronic format (word documents, shapefiles, image files, etc.) or non-electronic format (books, pictures, maps, etc.). If the data is in non-electronic format, it must be digitized so that it can be stored in the IOL archival system. The end result of this stage is data and metadata stored and available in electronic format.

Once the raw data is archived, it is available for use in the next stage of the process, resource transformation and manipulation. The development staff will use the raw data in the construction of the four primary IOL entry types (essay, regular, biographical, geographic). The entries will have corresponding records that are stored in the content database and the metadata database. The content database will contain the text of the document and the category to which it pertains. The metadata database will contain information about the entry (the author, spatial coverage, temporal context, etc.).

The data is now ready to be encoded and assembled for the web, the next stage in the process. This stage will involve making the IOL entry "web ready" by adding appropriate HTML tags to the text and images. This is the stage at which the Editor-In-Chief can approve the document for publication on the web.

Every effort will be made to enhance the speed of the web site. Indexing and caching will be used to get data to the consumer in a timely manner. The IOL entry will be made available to the general public at the end of the publication stage.

The Scope of the Product

The Indiana Online project will use several modules to manage the development and deployment of information about Indiana. Below is a description of each of the modules that appears in the diagram.





[Figure: diagram of the IOL product modules]


Security Module

The security module will manage and determine the rights and privileges of IOL users as they pertain to IOL resources. Resources include IOL modules, web pages, IOL files, IOL databases, and other IOL data sources. The module will manage issues such as whether a user has "read only" privileges for a web page or is able to update the page. The module will also manage issues pertaining to the database (whether a user has write access to a specific database table). The database administrator and web administrator will administer the module.

Data Access Module

The data access module will manage the retrieval and updating of records from databases. It will also manage the details of retrieving data from other IOL sources (such as digital libraries). The module will retrieve information for IOL interfaces and web pages. The data accessed through this interface will be used for a variety of purposes from populating a drop down list to drawing an image. The data will be transferred via XML and will be accessed via methods provided by web services.
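To make the XML transfer concrete, the sketch below serializes a few records, such as the items of a drop-down list, to XML and reads them back. It uses Python's standard `xml.etree.ElementTree`. The element names `records` and `record`, the function names, and the sample values are assumptions, not a schema defined by the plan.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of flat dictionaries to an XML string.

    Dictionary keys become child element names, so they must be valid
    XML names; values are converted to text.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_records(xml_text):
    """Parse XML produced by records_to_xml back into dictionaries."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in item} for item in root]
```

A web service method could return the string from `records_to_xml`, and an interface page could rebuild its drop-down list from `xml_to_records`. All values come back as text.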

Metadata Module

The metadata module will provide methods and properties for the management of metadata. Metadata will be collected about IOL entries (maps, pictures, essays, etc.). Data developers will provide the information. Individual structures will be developed to store the various types of supported metadata (Dublin Core, FGDC, etc.). The metadata module will populate the structure for a given entity (image, picture, document) and use the data access module to place the structure into the database. All IOL entries should have associated metadata.
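As one possible shape for such a structure, the sketch below builds a small Dublin Core record as XML. The namespace URI and the fifteen element names are those of the Dublin Core Metadata Element Set, version 1.1. The wrapper element `metadata`, the function name, and the sample values are assumptions.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements of the Dublin Core Metadata Element Set 1.1.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build an XML string holding one Dublin Core element per field."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a Dublin Core element")
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `dublin_core_record({"title": "Sample entry", "coverage": "Indiana"})` produces a record that the data access module could store alongside the entry.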

Customization and Personalization Template Module

The customization and personalization templates module manages user profiles and user templates. Data consumers and developers will each have user profiles. The profile will determine what the user sees when they access data via the editorial and data collection module and also the IOL public interface module. The module will manage the templates that will be available to data developers for editing their web entries. Templates will be developed for the different types of entries. Templates will allow the data developer to insert text, maps, pictures, and audio/visual data. Both data developers and consumers will be able to adjust parameters in their profiles to customize their IOL environment.

Editorial and Data Collection Interface Module

The editorial and data collection interface module allows the data developer to construct IOL entries. The module allows a data developer to pull together the text, images, and other resources that are necessary for the project via a template. Multiple templates will be available to the developers. The templates mandate a uniform “look and feel” to the entries. The module will provide a mechanism for saving and retrieving the entries.

Capturing and Editing Modules

Several capturing and editing modules will be provided. They handle the various formats of data (text, image, audio/video, GIS, on-line resources) that will make up IOL entries. The capturing modules will allow the data developers to digitize data and get it into the system. The editing modules will allow the developers to modify the different forms of data. The modules will be used in conjunction with the data access module and metadata module to format and store the data into the database.

Versioning and Archiving Module

The versioning and archiving module will manage the versioning and archiving of IOL web entries and data. The data will be saved into a defined file structure and backed up monthly.

Encoding Module

The encoding module will manage the assembly of web entries into a "web ready" format. The module will add tags to the web entries and possibly local validation routines (client-side JavaScript). The tags and JavaScript that the module adds will be cross-browser compatible.

Web-publishing Module

The web-publishing module will manage the details of whether an IOL entry is ready for web publication. An entry would not be published unless it had met a checklist of requirements. Requirements for publication would include the following: "Approval by the editor", "Satisfactory completion of QA testing", "Copyright granted or rights to entry statement included". If the checklist items have been met then the web page would be made available to data consumers.
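The checklist logic can be sketched in a few lines. The three item keys below are shorthand assumptions for the three requirements quoted above, and the function names are hypothetical.

```python
# Shorthand keys (assumed) for the three publication requirements.
PUBLICATION_CHECKLIST = (
    "editor_approved",   # "Approval by the editor"
    "qa_passed",         # "Satisfactory completion of QA testing"
    "rights_cleared",    # "Copyright granted or rights to entry statement included"
)


def missing_requirements(entry_status):
    """Return the checklist items that the entry has not yet met."""
    return [item for item in PUBLICATION_CHECKLIST
            if not entry_status.get(item, False)]


def ready_to_publish(entry_status):
    """An entry is publishable only when no checklist item is missing."""
    return not missing_requirements(entry_status)
```

Returning the list of missing items, rather than only a yes or no answer, lets an interface tell the editor what still blocks publication.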

Copyrights Administration Module

The copyrights module will provide a mechanism for handling user requests concerning the use of copyrighted property that appears on the IOL website. It will provide a means for streamlining IOL copyright policies.

Integrity Verification and QA Administration Module

The QA administration module will provide a checklist for QA administrators that ensures a web entry is ready for publication. A web entry will not be published without meeting each of the quality assurance requirements. The module will also provide a template for bug reporting and a means for updating and checking the status of a bug.

Performance Tuning and Administration Module

The performance-tuning module will be used to monitor the performance of the web site and database. It will assist the web administrator in identifying any bottlenecks in the system. Remedies might include adding a new index to a database table or using bind variables in a database query. More computer resources might also be needed to improve the performance of the system.
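The two tuning measures mentioned here, an added index and bind variables, are illustrated below. This sketch uses Python's built-in `sqlite3` module as a stand-in for the Oracle database named in the plan, so the placeholder syntax (`?`) is SQLite's. The table, the column names, and the rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, place_name TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO entries (place_name, title) VALUES (?, ?)",
    [
        ("Example County", "Sample entry B"),
        ("Example County", "Sample entry A"),
        ("Other County", "Sample entry C"),
    ],
)

# An index on the column used in the WHERE clause lets the database
# avoid scanning the whole table for each lookup.
conn.execute("CREATE INDEX idx_entries_place ON entries (place_name)")


def titles_for_place(connection, place_name):
    """Look up titles with a bound parameter instead of string pasting.

    The SQL text stays identical for every place name, so the database
    can reuse the prepared statement, and the value is never interpreted
    as SQL.
    """
    rows = connection.execute(
        "SELECT title FROM entries WHERE place_name = ? ORDER BY title",
        (place_name,),
    )
    return [row[0] for row in rows]
```

The same idea carries over to other databases, although the placeholder syntax differs between database drivers.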

Technology

Well-known commercial technologies that allow the development of modular applications and interoperation capabilities will be incorporated into the IOL system.

XML-Based Data Exchange Between Applications

XML-based data exchange standards will be used because of their inherent platform independence and their ability to be self-describing in each new environment. Leading IT vendors, including IBM, Sun, Microsoft, Oracle, SAS, ESRI, Macromedia, and many others, are actively engaged in building XML-based solutions for enterprise applications.

GIS Technologies

ESRI GIS technologies that provide a wide range of well-used tools to develop, view, manage, and disseminate spatial data over the web will be employed.

Multimedia Technologies

Macromedia products and Adobe products for developing animation content, audio/video content and graphics content will be used. RealMedia will be the primary media dissemination technology.

Database Technologies

An Oracle database platform for centralized content storage will be used. Oracle supports single server storage for text, GIS, and multimedia content.

Hardware Consideration

Cost-effective server solutions driven by open and competitive market conditions will be used.

Front-end Server Farm

The front-end server farm is the portal to the encyclopedia for users and content developers. We will incorporate a multi-server front-end web farm with load balancers and caching applications. Traditionally, web servers served static HTML content that put only a modest load on the servers. In contrast, XML-based inter-application web services involve more data-rich interaction and processor-intensive translations. Because XML is more complex and flexible than HTML, its markup is more computationally intensive to process. We will incorporate the latest Intel Xeon-based multi-CPU computer systems. As application interaction becomes more complex and dynamic, we will be able to easily scale up and scale out the server pool to achieve consistent performance.

Middle Tier, Multi-threaded Business Applications

Once the front end receives an incoming request, if it requires further processing, it is forwarded to the middle-tier application servers. Such applications include custom-developed integrating and composer applications as well as commercial middleware applications such as ESRI spatial applications and audio/video streaming applications. Middle-tier servers must handle hundreds or thousands of XML requests concurrently. Because mid-tier applications require high availability and run computationally intensive processes, the latest Intel multi-CPU Xeon-based computer systems will be incorporated. The Xeon processor’s Hyper-Threading Technology executes two threads simultaneously on each processor, increasing the number of XML transactions that can be executed concurrently. As application interaction becomes more complex and dynamic, we will be able to easily scale up and scale out the mid-tier server pool.

Back-end Database Tier

All necessary database queries from the middle tier are forwarded to the back-end database servers via SQL or XML. Because of the expected high volume of GIS and multimedia data and the complex data indexing requirements, the database will be built on clusters of 4-way servers that run a high-end clustering product.

All these computer systems will incorporate a large volume of RAM and disk space to boost performance on demand. We will design a system with fail-over power and server redundancies to achieve high availability and quick recovery in the event of a disaster.

Basic Deployment Design





[Figure: Basic Deployment Design diagram]


The Basic Deployment Design diagram illustrates the hardware deployment architecture of Indiana Online. The system is distributed and is scalable to handle increased demand if necessary.

The data warehouse will house the database and will use a mirrored RAID array to ensure reliability. When data is written, it is written to more than one disk drive. Should a drive fail, the other drive will have the necessary information so that the system can continue to run.

The back-end data access server(s) will contain the application database and will also be used to cache frequently used data. The application database differs from the back-end database in that it will contain the "driver" tables. These tables will contain information about what data is available and where it is located. This information will probably be accessed more frequently than the data contained in the back-end database. The application database will not contain information such as spatial and statistical data. This type of information will be stored in the back-end database. Frequently accessed data from both the back-end database and the application database will be stored in cache on this machine.
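The caching of frequently accessed data can be illustrated with a small least-recently-used (LRU) cache. The plan does not name an eviction policy, so LRU is an assumption here, as are the class name and the `loader` callback that stands in for a database read.

```python
from collections import OrderedDict


class LRUCache:
    """Keep the most recently used items, up to a fixed capacity."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss."""
        if key in self._items:
            # Mark the item as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        value = loader(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Discard the least recently used item.
            self._items.popitem(last=False)
        return value
```

In this sketch the loader would query either the application database or the back-end database. Repeated requests for the same key are then served from memory.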

The middleware server will have GIS application services, multimedia streaming services, and media integration services. The services will use XML to transfer and receive data originating from the front-end server. The services on this machine will assemble the images and media streams into the necessary format for the client. For example, map images may be requested from more than one location. The services at this server would request the images and merge them into one.

The front-end server will handle "load balancing" and also house static web pages. It will transfer the assembled pages back to the client. In addition to HTML and XML, the front-end server will output XHTML and WML for non-PC platforms.

The customer/client and data developers will need a web browser to access Indiana Online.

Data Formats, Digitization, and Storage

Spatial Data

Spatial data will comprise vector data (e.g., lines, points, polygons) and raster data (e.g., ortho-rectified imagery). The spatial data will be collected in a variety of formats, processed, and imported into an Oracle RDBMS. The data will be accessed primarily through ESRI ArcSDE by ESRI client applications, particularly ESRI ArcIMS. The data will be stored and tuned for these applications. Metadata for the final datasets will likewise be collected and stored in Oracle through ESRI ArcCatalog and viewed through the ArcIMS Metadata Server.

The primary tool used to manipulate and translate the data will be ESRI ArcGIS. When necessary, the applications that originally created the data will be used as well as general translation tools such as Safe Software’s FME.

Each layer (feature dataset) will be evaluated for quality, content and intended method of access in order to determine spatial storage and display characteristics such as projection, scale, offset, display scale and symbology.

Likely sources of spatial data are USGS, TIGER, local county GIS initiatives and university research projects.

Image Data

We will scan all images, photographs, and historical documents at the highest possible resolution, and at no less than 600 dots per inch (dpi), in TIFF format. We will store these image scans on tape for archival purposes and for retrieval for printing and other image manipulation. We will scan color images in 24-bit RGB format, with 3 samples per pixel and 8 bits per sample, and with a pixel dimension of 4000 to 5000 pixels on the long side, based on the specifications now being used by the Library of Congress. We will scan grayscale images with 1 sample per pixel and 8 bits per sample. For screen display, we will resample raw images to JPEG at 72 pixels per inch (ppi) with a pixel dimension of 600 pixels on the long side. For thumbnail display, we will resample raw images to JPEG at 72 ppi with a pixel dimension of 200 pixels on the long side. We will store screen images and thumbnail images in the Oracle database for web access.
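The arithmetic behind these specifications is easy to work through. The sketch below computes the uncompressed size of a master scan and the pixel dimensions of the screen and thumbnail derivatives. The function names are assumptions, and the 5000 by 4000 pixel scan in the usage note is an assumed example, since the plan fixes only the long side.

```python
def uncompressed_bytes(width_px, height_px, samples_per_pixel, bits_per_sample=8):
    """Size in bytes of the raw pixel data, before any file compression."""
    return width_px * height_px * samples_per_pixel * bits_per_sample // 8


def fit_long_side(width_px, height_px, long_side_px):
    """Scale dimensions so that the longer side equals long_side_px.

    The aspect ratio is preserved, and results are rounded to whole pixels.
    """
    longest = max(width_px, height_px)
    return (
        round(width_px * long_side_px / longest),
        round(height_px * long_side_px / longest),
    )
```

For an assumed 5000 by 4000 pixel scan, `uncompressed_bytes(5000, 4000, 3)` gives 60,000,000 bytes for 24-bit color and `uncompressed_bytes(5000, 4000, 1)` gives 20,000,000 bytes for 8-bit grayscale. `fit_long_side(5000, 4000, 600)` gives (600, 480) for screen display, and `fit_long_side(5000, 4000, 200)` gives (200, 160) for the thumbnail.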

Audio/Video

We will capture, create and store audio/video content in widely used formats. We will store audio content in WAVE format for downloading audio files and archive them on tapes. We will create RealAudio files for audio streaming. We will store moving image data in MPEG format for high resolution downloads and store these on tapes. We will create RealVideo files for video streaming and store them in Oracle.

Text

We will scan all text documents at the highest possible resolution, and at no less than 600 dpi, in TIFF format. We will store these scans on tape for archival purposes, for retrieval for printing, and for image manipulation for character recognition. We will use Intelligent Character Recognition (ICR), Optical Character Recognition (OCR), and Optical Mark Recognition (OMR) technologies to recognize and convert these scanned text documents into digital data. We will store these text documents in the Oracle database for web access.

Technology Risk Factors

No Operation History

The content development process and site use load cannot be fully comprehended and determined as of today. This will have a direct impact on the computer software development and hardware acquisition as well as server sizing.

We propose to mitigate this by adopting a scalable hardware and software system architecture. This will allow the scaling up and out of hardware components without any redevelopment to software components upon the increase in user load.

System Capacity Constraints

Future use may require extensive and costly increases in system capacity to achieve reasonable performance.

We propose to mitigate this by incorporating a maintenance and on-going operation procedure that could handle unforeseen costs.

Network Capacity

Estimating network infrastructure needs is difficult because distribution of services and data is not fully defined or understood.

We propose to mitigate this by developing a system using an open architecture and World Wide Web protocols. Moving the system to a high-bandwidth network will compensate for increases in content and usage. IUPUI is connected to Internet2.

Application Related Risk Factors

The functionality of various application modules is not fully defined or understood as of today. It will be further defined prior to the in-house development, purchase, or outsourced development of these modules.

We propose to mitigate this by using pluggable modules of interoperable software components. This approach will enable the addition of necessary software components that were not previously anticipated. At the end of the requirement elicitation and design phase we will have a better understanding of the required components.

Technologies are yet to be developed to efficiently and effectively integrate all media types together to achieve a seamless navigation between them. Research and partnering with institutes of higher education for development of such components will be done to address this.

Off-the-shelf components may not interoperate with each other, leading to extra expenditure in developing conversion tools to match data formats. Appropriate plug-ins will be developed, and the interoperability of all components under consideration will be assessed prior to purchase.



 



The Indiana Humanities Council



The Polis Center