FSDB

IBM FS data base (draft)

For me, the next 11 years or so were a technical odyssey. The adventure started with learning something about the ways files and data bases were structured and accessed. It ended years later with defining a programming language that considered all kinds of data, program-local and persistent, as conceptually similar, accessible via the same kind of simple reference, possibly usable at many stages of systems design and implementation.

But that's getting far ahead of things. In the period from 1972 to 1975, discussed here, I was worrying about data base design tools within a complex organizational and technical situation, which is what this part of the story was intended to be about. Unfortunately, building a coherent account of this activity was complicated by a number of problems. One was that I had forgotten much of what happened. However, I had lots of saved paper, which gave rise to a lengthy exercise in detection. Then, having worked out much of what the detailed records said about the history, I wrote it down, which created the next problem: the result was far too long and detailed to be readable. But I didn't want to discard it entirely, because it seemed that it might be interesting, or at least useful as a piece of history. It included an account of some possibly surprising technical directions and of some challenging problems that arose. It also illustrated some of the difficulties, organizational and cultural, that plagued IBM development projects for many years. So instead of giving up, I just tried to remove some of the more tedious elements.

A sort of remaining problem was that while this chapter was supposed to be part of a personal technical memoir, including just enough of the surrounding technical and organizational considerations to establish context, in this case -- partly because the entire project has been largely forgotten -- the surrounding matters seemed quite important. So the bulk of what follows contains at least as much material on the background for, and evolution of, the overall FS data base effort, as for the data base design area I was most concerned with.

To begin: in late 1971 IBM launched a major initiative, called Future System (FS). This was accompanied by assignments -- not always clear -- to many locations, charging them with making the envisioned new system a reality. One of those locations was San Jose, then a development center for data management software. At the time, IBM offered a range of such products, and new types of data bases were on the near horizon.

To the people in San Jose, this required that they develop an understanding of their responsibilities with respect to overall FS structure that was consistent (a) with their existing responsibilities and (b) with their views on how data-related software might evolve. And then begin to do something about it.

In order to make what happened intelligible, I have to start with an overview of the relevant data-related technologies. Then I'll outline the FS system, as initially envisioned and as later modified, and then, finally, I'll move to an account of the organizational and technical evolution of the data base subsystem, looking mainly at the core data base specifications and the related design tools which were my specific area of concern.

Data Management Software c. 1971

This is a rather deep dive, for which I apologize, but, for people unfamiliar with the area in the time period, it is necessary background to the work of the next several years.

IBM data-related software offerings in the early 1970s largely mirrored traditional ways of organizing corporate data in filing cabinets. The filing cabinets contained file-folders, each representing an individual entity of a specific type, arranged in a prescribed order (e.g., by name, or identification number, or ...), and with information about each entity often accompanied by material pertaining to other kinds of entities to which it was related. This kind of hierarchic organization was naturally reflected on early i/o devices, such as punched cards and magnetic tape. But it also persisted after the introduction of disk storage (and other direct access storage technologies) which permitted more flexible approaches.

By the time I started to worry about data and data bases there were two levels of explicit data accessing services available. File management(often itself called data management) software mostly reflected the traditional view. Each full record in a tape or disk file might be equivalent to a file folder about an entity. To use a classic example, a bill-of-materials file for a manufacturing operation might contain a physical record for each part, containing some attributes of the part itself, followed by a list of its component parts and their respective quantities. Programs that used file management were linked to the actual files via declarations to the operating system, and generally accessed the data by navigating from one record to the next in the existing order. (However, some file management services allowed records with a given value in a key field to be accessed directly, sometimes to serve as a starting point for further navigation.)

File management facilities were the primary means of storing and accessing corporate data at the time, and remained so for many years in the future [BP_3]. But by 1971 there were also more powerful data base (DB) approaches available, and others under consideration and/or development. Data bases appeared because, on the one hand, it was understood as desirable to store corporate data in a centralized way and, on the other hand, different areas within a firm tended to see data differently. For example, one group might be interested in in relationships between products and their component parts and quantities, while another would more interested in the sources of parts: their suppliers, prices charged, shipment times, supplier reliability, etc. To accommodate these different interests, data base management systems provided different ways of viewing and accessing the same data by different applications.

The increasing centralization of data storage led to increased attention to ensuring the integrity of the data. As long as only one process accessed a collection of data at any one time, it could be certain that they were accessing the latest version of the contained information. But, for example, if two processes were using some form of time-sharing to simultaneously check and update inventory for the same part, the correctness of the data could not be assured without additional provisions. In the period under discussion, this was partially addressed by the wide use of a facility called CICS (Customer Information Control System, pronounced kicks), accessible by instruction sequences from remote terminals, as well as from the COBOL and PL/I programming languages. A CICS instruction sequence indicated not only what data was to be accessed, but what was to be done with it. This allowed work units, called transactions, to be scheduled based on what they were going to do. Non-updating transactions could be run in parallel, with only updating ones requiring sequencing. Also, the approach facilitated the collection and use of information about what data had to be restored if a transaction failed. Little more will be said about data integrity here, except that two people in San Jose, Charlie Davies and Larry Bjork, did an excellent job in the early 1970s of analyzing and abstracting the problem in terms of multiple, potentially nested transactions, so that alternative approaches could be better understood [see TRANS_1, TRANS_2].

To provide multiple ways of viewing and accessing the same data, a few different approaches had emerged, most importantly:

The IBM IMS system (and a smaller, earlier version) provided alternate hierarchic views of the data,
The CODASYL DBTG (Data Base Task Group) report specified a network approach which, at the time, had various partial, non-IBM, implementations, and
The Relational Data Model, introduced in a few papers by Ted Codd of IBM Research, viewed a data base as a collection of tables. It was clearly interesting, and the subject of considerable discussion in San Jose. (Note: The term data model is generally be used here to mean a conception of how data is structured.)

In more detail: IMS (see [IMS_1, IMS_2]) adopted a traditional understanding of persistent data, in the sense that it was assumed to consist of records, which were about entities, like parts and suppliers, and their attributes, which could be simple values, or could refer to other entities. IMS represented data to a user program as a tree, called a logical data base, with one entity type (e.g. parts) serving as the root, and related to one or more types of children (e.g., suppliers), and the children were related ... etc.

Other logical data bases could be defined with overlapping content but different hierarchic structure. The logical data bases were implemented via connections within one or more physical data bases, also tree-structured, but potentially organized differently from the logical structures, and individual physical data bases could be supported by different underlying file management services.

Accesses to IMS data bases were primarily navigational, although a limited amount of value-based access was provided. If one accessed a logical data base starting at the root, the first record retrieved would be one of the root type, for example a record about a particular part. If the root record had children at the next level in the hierarchy, the next record retrieved would be one of those children, for example, a record about a supplier of the part. The child record could contain attributes pertaining to the child entity (like the name of a supplier), and to the relationship between the parent and the child (like the price charged by that supplier for the part). See [IMS_1 section 2.3] for an example of navigating within a logical data base.

The CODASYL DBTG report [DBTG_1, DBTG_2] instead called for data bases structured as networks of typed entity records connected via relationships, and the relationships were also represented by records which might also contain attributes (presumably of the relationships). (Note: DBTG networks should not be confused with later semantic networks relating individual items rather than records.)

Also, because the DBTG task group was an offshoot of CODASYL, which had originally defined the COBOL language, the report included specifications of extensions to COBOL for data base access. That language, like the IMS access language, was essentially navigational. IBM, not least because of its large data management and IMS customer base, opposed adoption of the DBTG report as a standard. And it should be said that while the network concept was appealing on the surface, because it avoided the need to define alternative logical hierarchies, the result included many details of data definition and navigation, and storage-related considerations; the DBTG report ran to 264 pages.

Ted Codd's relational data bases [REL_1, REL_2] were conceptually much simpler. A relational data base consisted of tables, called relations, and, like mathematical relations, were unordered sets of n-tuples. The named columns of the table corresponded to positions within the tuples. One or more columns together functioned as keys, and were assumed to identify unique rows. Other columns were considered attributes of whatever was denoted by the key. Importantly, there was no concept of an entity. This approach removed the need for people designing data bases to make decisions about what things were to be considered entities, much less what hierarchic structures were needed. (Note: the references [REL_1, REL2] are to externally published versions of the material; earlier versions were circulated internally.)

[REL_1] motivates and introduces the model, and specifies the beginnings of a relational calculus, a collection of succinct operations on relations. For example,

A selection operator π is used to obtain any desired permutation, projection, or combination of two operations.
Thus if L is a list of k indices L = i₍₁₎,i₍₂₎,...,i_(k)) and R is an n-ary relation (n≥k), then π_L(R) is the k-ary relation whose jth column is column i of R, except that duplication in resulting rows is removed.

[REL_2] describes the more user-friendly data base sublanguage, ALPHA, which was never implemented (to my knowledge), but was very influential. The basic idea was to obtain, in a user workspace, a new relation satisfying some conditions in the data base. For example (from the reference) given relations about Suppliers (here called S), and Supplies (here called Z), one could form a simple, one column relation in a workspace W containing the names of the suppliers who supply part number 3 by a statement such as:

GET W S.SNAME: ∃ Z((S.S#=Z.S#)^(Z.P#=3))

That is, put into workspace W the supplier names (SNAME) in those rows of S such that there exists a row of Z whose supplier number (S#) matches the S# of the S row, and whose part number (P#) is 3. The workspace is then accessible as a temporary relation (and potentially as an array of structures by a conventional programming language).

Deciding on how to divide data into relational tables was based on a then-evolving theory of normalization. A basic normalization consideration is that given a relationship to be stored in the data base between two kinds of things A1 and A2, if for any particular value of A1 the value of A2 is always the same, then A1 should be the key of a relation and A2 a non-key attribute column of that relation. For example, if a manufacturing data base is to contain information about PARTS, including their PART_COLOR, if any particular kind of PART, identified by a part number, always has the same value for an associated PART_COLOR attribute, then PART should be the key column of a relation in which PART_COLOR is a non-key attribute column. In this way, instances of the relationship would not be repeated and become a potential source of data base inconsistencies. (And PART_COLOR is said to be functionally dependent on PARTS).

While at the inception of FS there were only a few papers available about the relational model, it was taken seriously by San Jose development personnel. Not just because of its nice logical simplicity, but also because at the time the development organization was briefly housed in the same building (28) on the San Jose IBM plant site as the research organization; Ted Codd's office was just down the hall from mine. Also, in the early 1970s, relational research and development projects were beginning outside IBM. So, shortly thereafter, the research organization began a practical implementation, System R [SYSR_1].

My First Encounter To bring this down to a personal level. When I first showed up to work on the data side of things, a nice guy named Herb Meltzer, who was staff to a 2nd-level manager, handed me some documents. There were a few papers about relational data bases, some IMS manuals, and a thick copy of all or part of the DBTG report. That evening I stacked them on my nightstand, and started to read. The next day, Herb asked what I thought of the material. I responded that the relational material looked OK, but that I fell asleep reading the IMS manuals (which were quite tedious) and so didn't get to the DBTG stuff. Someone else in the room reacted you read the Codd papers?!

This short episode not only captured some aspects of the situation at the time, but was predictive of how my work in the area developed. As far as the situation at the time, there were some legitimate concerns about the relational approach, most importantly (a) that it was inconsistent with how customers traditionally understood and accessed their data, and (b) that encouraging non-navigational access could unnecessarily reduce application performance. But because of the mathematical terminology used in the early papers, it was off-putting to math-averse developers (and there were some), and a concern that customers would have a similar reaction.

As far as how my work in the area evolved, falling asleep reading IMS manuals was not very surprising. I had no background in commercial data processing, and thus didn't understand the appeal of the provisions for alternative logical structures resting on the same stored data, so as to simultaneously reflect multiple user views and also achieve adequate execution-time performance. Then, as the rationale for those provisions gradually became clear, it also became clear that developing and maintaining the associated specifications for a sizable corporate data base was a painstaking, difficult job. And the job, which was performed by people who were starting to be called data base administrators or DBAs, also included worrying about the security and integrity of the data.

There were some attempts to help. For example, IMS manuals included extensive material about considerations for data structuring and storage decisions. However, to act on these considerations, a DBA would submit low-level structuring and mapping directives for generating and accessing the data bases. This approach was clearly inadequate, and as integrated corporate data continued to broaden in scope, and DBA-like roles became more demanding and ongoing, the problem could only get worse.

So, at some point in learning about all this, I became interested in the problem; not only with respect to DBA tools, but also with respect to their potential relationship to more general systems design tools. There were some early attempts at the latter; in the late 1960s and early 1970s the expanding automation and integration of business systems gave rise to the production of tools to allow non-computer-oriented corporate staff and and systems analysts to describe and evolve their business systems. The tools, many of which are described in [BP_1 and BP_2], often represented business processes in terms of the consumption and production of (real or imagined) documents and their contained data. But it isn't clear to what extent these tools were adopted; they required considerable work to use, and generally resulted only in high-level information for application and data base developers, disconnected from data base description mechanisms. At any rate, I somehow became so interested in the area of business systems that, circa 1975, I enrolled in an evening MBA program (very part time), primarily to get an idea of what business systems were actually about.

FS Context

Moving on now to an overview of the FS project... I must first emphasize that I and many others saw, and kept tabs on, only a small part of the overall project as it was initiated and developed and changed. So this section relies, to a large extent, on Mark Smotherman's web page (qualified as currently under construction) about the IBM Future System (FS) - 1970s, which references source material from the time period as well as retrospectives and an IBM history by Pugh [FS_1, pp. 538-553]. I've also used another IBM history by Pugh [FS_2, pp. 307-311].

IBM's grand FS project was officially initiated in late 1971 with the publication of an impressive document [FS_3] reporting the results of a task force led by John Opel, then a senior vice-president and head of IBM's Data Products Division, and future IBM CEO. The technical motivations for the project seem to have included both contemporary hardware developments and expectations, such as an expected decline in the cost of storage, and as important, the recognition that user application development had become very complex and expensive. While Pugh [FS_1, p.540] suggests that the application development costs were attributable largely to dealing with the proliferation of storage devices, a related problem was the complexity of other interactions with the operating system as well (or, at least, the new interfaces proposed explicitly sought to reduce those complexities).

The system structure outlined in the Opel report evolved out of studies by different groups, the most influential probably being System/A led by George Radin at the IBM Watson Research Laboratory. The major hallmarks of the Opel report system structure were a 3-level architecture, and a single level store. The 3 architectural levels, as they appeared, and as further explicated by Radin [FS_4 (a useful read)], were:

The ADI (Application Development Interface): The ADI had two parts. There was to be a firm, relatively high-level, integrated interface accessible to application developers, and was to be supported by an optimizing compiler. This would allow applications to be written directly to that interface, rather than to the multiple interfaces often required to specify a single application (e.g., interfaces to a compiler, plus to the operating system, linkage editor, and data base facilities). But the ADI was also understood to provide (a) the popular programming languages (Fortran, COBOL, and PL/1) and their translators to the ADI interface, and (b) higher level aspects of data base management systems.
The EDI (Execution Discipline Interface): An interface representing the virtual machine target for the implementation of the ADI interface. It was specified as coincident with much of the 370 instruction set, plus controls for additional functions. It was also shown as the target for simulators of earlier operating systems .
The NMI (New Machine Interface): The common interface to the actual hardware boxes, of different sizes and performance, to be implemented at different IBM facilities. It was to be based on the 370 interface, with significant functional extensions. The specific location of the boundary between the EDI and NMI, as explained by Radin [FS_4], would be the result of attempts to maximize implementation commonality

References to the single-level store, appearing at both the ADI and EDI interfaces, did not differentiate among storage media; all references were symbolic (in a sense..) with placement of data among levels in the storage hierarchy at different times determined by various factors. This was a factor motivating an effort to develop new data management software; while existing software could be simulated, an implementation tailored to the new environment would be desirable.

The Opel report was extremely ambitious, although modest in that it explicitly recognized its dependence on various assumptions, both with respect to business-related projections and the ability to achieve the necessary innovations [FS_3, pp.54-55]. Looking ahead, some of those assumptions proved unwarranted, and other problems arose.

As mentioned above, both prior to and simultaneous with, work on the Opel report were precursor and related efforts. And, for a while, the relationship of these efforts to the Opel Report requirements was unclear. One ambiguous aspect was that of the actual appearance and implementation of the ADI interface. A serious candidate seemed to be the result of a study called HLS and its related AFS effort. The HLS (Higher Level System) study [FS_8] was a short term working group composed largely of senior personnel from the IBM Poughkeepsie development lab. Its report recommended a high level programming language, which was to incorporate the equivalent of operating system commands, and was to be executed interpretively. The AFS (Advanced Future System) effort, led by Carl Conti of Poughkeepsie, produced both a language specification [FS_7] consistent with the HLS recommendations, and a feasibility study [FS_6] reporting favorable results with respect to questions about the approach, including whether an interpretive implementation could deliver good performance. Despite these reported results, the approach was of concern to both the programming language and data base development organizations.

But, when things settled down, it became clear that the ADI interface was to be based on a definition by a group from the Endicott facility headed by Ray Larner. It covered a wide range of programming and control functions with rather lengthy operation names. And didn't look like a language someone would particularly want to code in.

A probably-related description, dated April 1972 [FS_9] outlines, at a high level, the program objects, process objects, data objects, and communication aspects of the ADI interface. Data objects were structures contained in self-describing packets, which also included specifications of access time requirements, shareability, and locks.

And then... according to Pugh [FS_1, p.547] the 3-level architecture was scrapped in September 1972. Instead, it was edicted that the EDI would be the highest architected level of FS, and would be called the FS Machine (FSM). It would be 370-like, but would include the single-level-store as well as some other ADI-like function, and probably would be implemented in hardware. Again according to Pugh, this change came about to some extent as a compromise between those who wanted to retain the 3-level structure and those who wanted to implement the entire ADI almost entirely in hardware. But there were almost certainly other influences; I doubt that the programming language groups were happy with a firm ADI as an initial compilation target.

(Notes: 1. I vaguely recall that an assumption of the Larner ADI interface lasted beyond September 1972. 2. Some aspects of the FSM, at least as further developed, are possibly reflected in a document published later by Radin and Schneider [FS_5]. The document describes, in particular, the advantages and difficulties presented by the single-level-store concept.)

In the ensuing, the efforts at various locations to relate to the ADI were replaced by efforts to design and implement an edicted new software system called FSPS (probably FS Programming System). A document [FS_10], issued about 2 years after the Opel report, reviews the state of FSPS at that time. It lists 17 major components, outlines their function and status as evidenced in a 1200 page external specification, and tries to evaluate their utility and usability with respect to application development. The result wasn't encouraging, not least because there was not yet sufficient information.

And then the entire FS effort was discontinued in early 1975. Pugh [FS_2, p. 310] discusses some of the many reasons. To abbreviate: the complexity of the architecture and the complexity of coordinating work across so many locations were a major cause of huge schedule slippages. These slippages, in turn, would cause existing IBM-originated 370 hardware to become obsolete long before FS availability, while, at the same time, IBM would lose prospective revenues precisely because the 360/370 architecture had become an industry standard. Leasing companies were dependent on continuing the lifespans of existing machines, and the standard was also being preserved by the many producers of plug-compatible devices, and in-process compatible machines from competitors, such as Amdahl.

Getting Organized & First Steps

As indicated earlier, the effect of the FS-related announcements on San Jose data-related development activities was to suggest a new subsystem consistent with the FS architecture. While current data-oriented applications could be supported by simulating current operating systems and their data-related services, migrating those applications to run directly on the new system, with its single level store, implied substantial revisions to the product software. Also, there was the possibility of supporting the new types of data bases (network, relational) on the horizon. So it seemed useful to consider integrating some or all of those facilities in a new subsystem.

In the initial FS-oriented San Jose organization, I was a member of the central Data Management Component (DMC) department, and thus obtained a good picture of early FS-related activity for data base. My records reflect two kinds of initial outputs of that department: early individual contributions and then a first-cut at a specification, in the form of a workbook.

The early individual contributions included critiques of various corporate FS-related specifications, detailed suggestions for initial work, and think pieces about data. All the primary contributors were intelligent, capable individuals, who attitudinally seemed to fall on a continuum between those considering themselves primarily as responsible, technically-astute business people, and those who seemed at least as interested in technical ideas for their own sake. Four people stood out. Bob Engles was a very bright career IBMer who was deeply knowledgeable about then-current data base systems, and had a formidable grasp of their details and of how they were used. Lloyd Harper had similar pragmatic tendencies but tended to abstract and conceptualize various aspects of data management to the point where it was sometimes difficult to relate his analyses to actual facilities. Bill Kent had a seriously contemplative orientation, and wrote some early comments on relational data base concepts. He also had a considerable talent for writing, and so was responsible for much of the department documentation. Finally, Roger Holliday was both thoughtful and creative; his initial contribution was an impressive draft of a basic data accessing language with a strong relational flavor.

As suggested by the above, there were undercurrents, with some (Bob and Lloyd) more interested in established contemporary approaches (IMS, DBTG) and others (Bill and Roger) interested in the new relational approach. Some interchanges I vaguely recall had the flavor of, on one side, are there any customers actually requesting relational data bases, and, on the other how can they request something they never heard of? But, significantly to what followed, Bob and Lloyd had rather strong personalities, whereas Bill and Roger were more understated. (Ironically, Bob later became the overall manager of IBM's flagship relational data base product, DB2.)

The workbook, containing an initial view of the prospective subsystem was produced in November 1972 [SJ_1]. It was a rather pleasant document, and included not only plans for data structure definition and accessing, but also for the accessing of such things as text files and libraries. Bill Kent probably wrote the overall introduction which, given the vagueness of FS specifications and assignments, states some necessary assumptions:

It is assumed that the San Jose Programming Center has the mission responsibility for all data base/data management for FS.
.....
We assume that the FS architecture is currently undefined and that furthermore it may be 6-12 months before a solid architectural definition is reached.

The largest part of the document is devoted to the then-envisioned data base architecture. It describes a cascade of information models (also called logical models, as contrasted with physical models, because they needn't correspond to stored structures) and the operations used to specify them. While inspired by the IMS cascade of structure definitions, the approach was also strongly influenced by relational concepts. Thus, at the lowest level of the cascade was something called the BSIM (basic system information model). The BSIM was made up of a collection of tables that together, and non-redundantly, subsumed all information in the data base, and

also contains[ed] other information relevant to the information content of the data base, such as the domains and meanings of attributes, the significances of the various tables, functional relationships between attributes, and consistency requirements.

Significantly, then, in this initial specification, the BSIM encompassed not only a compact representation of all the information in the data base, but also a great deal of explanatory information.

Above the BSIM level were structures called ESIMs (extended SIMs). They were described as built from the BSIM tables and lower-level ESIMs. The derivations were to be specified by a combination of operations on the lower-level structures plus value-based specifications of linkages among them. So, for example, hierarchic relationships similar to those of IMS were to be specified by matching values in parent and child records. The information models were expressed in a System Information Description Language (SIDL), and the structural descriptions and mappings were stored in a System Information Model (SIM) Dictionary. The actual operations on a data base at the model level were to be expressedin a System Information Manipulation Language (SIML).

In addition to the above, a Stored Data Interface(SDI) is discussed as a somewhat separate facility. It rested on a stored data model, which again consisted of relation-like tables. These tables could vary in how they were to be accessed, e.g., by row and column number, or by specified key values. A major concept was that of a selector, which represented all or part of a stored table. The potential operators on these tables generally took, as input, selectors and other criteria, and returned selectors.

All this seemed very pleasant, but there were hidden potential difficulties (for IBM development personnel and for users). One such difficulty was the answer to an implicit question raised by a stated intention that

there should be enough ... [descriptive function].. so that most desired application structural views can be specified in ESIMs

The implicit question was how much was enough descriptive function, and the answer turned out to be quite a lot. Because the descriptive function had to support not just different views according to a particular data base discipline, but different views in different data base disciplines, all resting on the same data. And not just differences among static views but alternatives with respect to accessing semantics, and beyond... Another potential difficulty was the tedium for DBAs of separately defining all the views and mappings associated with a large data base, with all of these view definitions assumed to be self-contained, in that they included all their associated record and field specifications.

Beyond material on the fundamental architecture, the workbook contains chapters about potential tools assisting in data base administration and data base access. I'm fairly sure I wrote these chapters, together with Herb Meltzer (based on documented chapter responsibilities, drafts with corrections in my handwriting, and later organizational changes giving me related responsibilities).

As far as data base administration tools, a collection of tool types were outlined. Some aspects were explicitly characterized as speculative, because of the newness of the area. However, the necessity of some tools was clear, given a brief contemplation of the difficulties of ensuring, for any substantial data base, that the details of each structure, and its relationship to other structures in the cascade, were correctly specified, and obtained the intended effect on accessing. And specifying storage structures added the problems of obtaining the desired levels of performance. The types of tools envisioned for DBAs were:

Definition tools: These tools were to accept inputs in a prescribed language, check their syntax, translate them to data management component directives, check the permission of the user to issue such directives, check the effects of the directives with respect to the consistency of the overall data base descriptions, and issue the directives.
Information Model Design tools: These tools intended to deal with the volume of definitional material needed, in the context of different development directions, that is, starting with structured views -- ESIMs -- and then helping to generate underlying BSIM table definitions, and/or the reverse. The general approach would be to generate new definitions given some exising ones and brief instructions as to their use in generating new ones. It might be noted that neither at this point nor later was attention given to helping people decide on the structuring of highest level views. A probable reason for this was the increasing importance of the idea of data independence, that is, that logical views should be specified consistent with the planned uses of the data by programs, independent of existing storage structures, to allow those structures to continue to evolve consistent with those views.
Stored Model Design tools: These tools were to provide assistance in making storage structure decisions, based on such things as ESIM structure access time requirements and usage frequency projections, and storage cost limitations and tradeoffs. (Note: the information model and stored model design tools were initially described together, in the context of the different periods in the lifetime of a data base, but they are listed separately here for consistency with what will follow.)
Authorization Tools: Tools to create and maintain authorization information for use by processes responsible for the enforcement of access control to definitional or data base data....
Process Validation Tools: Tools to discover or verify effects of a process with respect to the data base. They would be used prior to the extension of authorization to the process. These tools might simply identify points in code where data is accessed, for checking by the DBA, or might use compiler-related program analysis mechanisms to determine what is being done to the data.
"Tool Configuration" Tools: Because of the recognition of potential differences among user organizations, their sizes and personnel, tools were envisioned to help select and package a set of data-related tools appropriate to the enterprise. (In particular, the tools might be partitioned based on the particular data base technologies used by an enterprise, although this may not have been contemplated at the time.)

A separate chapter of the workbook dealt with tools targeted at people acting in various application-related roles requiring data base access. These might be direct users of ad-hoc query facilities, or data analysts of various corporate records, or application development personnel.

(A note about locations: During the period of these initial steps, plans were underway to move the development groups from the San Jose plant site to new facilities in a more rural area of San Jose. People in the data management groups were initially moved to very temporary quarters near the San Jose airport, then to an IBM-owned building in Palo Alto, and, finally, in 1977, to the brand new Santa Teresa Lab (now, in 2022, called the Silicon Valley Lab). Some time later, in 1986, the research groups were also moved, to the new Almaden Research Lab some distance away.)

Next Steps and Reorganization

After the workbook was produced, work on the core data management component continued. Incremental results were made known internally as the work progressed, and documented more formally about a year later. A major change was that the many kinds of structures to be supported were identified explicitly, as follows:

Linear homogeneous structures, There were relation-like in that they contained only one record type, but different in that they were considered to be separate files (see below).
Linear heterogeneous structures
IMS-like hierarchies
DBTG-like networks
Hierarchies containing only one type of record, to support such things as organization charts and bills-of-material

All of the above were considered types of "files", and fell into two classes: Data base files were given full record- and field-level support by the data management system, and multiple structure types could (in general) be used to access the identical underlying data. "General files" were given less support, and were intended (at least) to assist in migration of file-management data.

The above list of structures, plus the spelling out of all the variations provided for in field-level domains and constraints, as well as in access-time semantics, made more apparent the complexity of the system structure, the the number of specification alternatives available to a DBA, and the amount of information required to actually specify a large data base.

Another change was to the view cascade, which was made somewhat more practical and implementation oriented. Summarizing the changes:

ORIGINAL	NEW
------	LViews: program views
ESIMs: multiple layers, unconstrained mappings	DViews: Only lowest level had complex mappings to BView. Other levels subset only
BSIM: normalized, non-redundant tables	BView: contains TView tables, which can contain derived fields. Also includes relation-classes

A major change was that the lowest logical level, the BView (heretofore BSIM) no longer contained a normalized, non-redundant picture of all the data. Instead it contained additional tables with some columns/fields derived from material in other tables. Also, connections among fields in different tables to be used to support hierarchic and network structures were identified as relation classes. The changes were practical, in that they assisted in the specification and implementation of mappings both from higher-level structures, and to storage structures, but the simplicity of the original was lost. A further change in the direction of practicality was that the core data management dictionary no longer accomodated various kinds of semantic information and commentary.

These changes represented rational engineering developments, but, when spelled out, might have been resonsible for an increased focus on usability, as reflected in a number of activities. One was the creation of a separate group charged with fostering and assessing the usability of the overall data base subsystem. The first memo from the group [SJ_2] begins by discussing the well-known difficulties of specifying just IMS data bases, and the small projects underway at various IBM locations to produce some tools to help. And then insists that more emphasis and personnel be immediately applied to the usability of the new data base subsystem, and describes the kinds of activities required.

Another related activity was the identification of a prospective subsystem component initially called "Data Usability Aids" (DUAC). By request, I developed a presentation describing the component as containing DBA assists similar, but not identical, to those described in the workbook. Some functions were added and some were removed, most importantly the simple acceptance and checking of structure definitions. This would have a problematic effect on subsequent developments.

The resultant list of DUAC functions was:

"Requirement Collection" tools to gather information: (a) from programming-language-related descriptions, and (b) from non-DBAs in the form of "documents" input to, and output from, business processes (a high level was of describing application I/O).
"Information Model Design" tools to reduce the tedium of view definition by generating new descriptors supporting both top-down and bottom-up design approaches. This was to include the collection and/or verification of information needed for view generation which was not (or no longer) provided by the core data management dictionary, including dependency information needed for normalization, and the identification of informationally identical relationships in different views, needed for automated top-down structure generation.
"Stored View Design" tools to (a) collect (or predict), statistics about structure accessing, (b) model expected cost and performance based on those statistics, and (c) iteratively help design stored view using the modeled and actual results.
"Data Base Debugging" tools to check whether specified descriptors actually produced the intended results, by statically tracing data elements through the layered descriptions, and also by dynamically tracking actual accesses via interrupts and reports.
"Population Aids" to create or modify (real or test) data bases, to correct data base content based on new constraints, and to tune stored views.

And then, in mid-1973, after I and others produced draft functional outlines and staffing estimates for various components, there was a major reorganization of data base subsystem responsibilities. It separated, into different second-level organizations, responsibilities for developing the basic architecture and those for specifying the interactive aspects of the subsystem respectively.

(Note: In the IBM technology of the time, interactive facilities could be provided via typewriter-like terminals, or via planned display terminals that could accept typed lines and/or light pen menu selections.)

Two aspects of the reorganization were most relevant to the ensuing work. First, the specification of the core data structures, the DMC component, belonged to the basic architecture organization, while user facilities for designing and defining those structures were placed in the interactive organization. (DMC would have programmatic definitional facilities). This might not have been a bad idea; it sharpened the focus on usability, and possibly provided a degree of intellectual independence from the rather strong personalities of some of those designing the core structures.

A second aspect of the reorganization was not as useful: the interactive organization was further divided into several first-level departments, with different ones responsible for, respectively, data structure definition tools (a department eventually called RDL, for resource definition language), and data structure design tools (the instantiation of DUAC, eventually called DPA, for design and planning aids). Within the IBM development culture, that separation was a recipe for duplication of effort and for conflict.

Also, the above did not exhaust the organizations concerned with data specification for FS. In a totally separate IBM division, "Advanced Systems Development" (ASDD), headquartered in Mohansic, NY, a substantial group was concerned with "application methodology", which was concerned with high level specification of applications and their data.

Design and Planning Aids: Subsequent Development (DPA)

Returning to the memoir aspect more directly, DPA was the department to which I was assigned after the reorganization, as technical lead. In that era and later, many managers, even of small departments, focused primarily on administrative and personnel-related functions. There were about five other people in the department (the number varied over time). One of the more senior people, Ed Brink, and probably some of the others, came from an organization focused on smaller systems, and had probably been involved in the development of an IMS-like facility for such systems.

Unfortunately, saved records I have of departmental activity don't record much progress between the outline of DUAC function of early 1973, through the reorganization in mid-1973, to the DPA documentation of mid-1974. Some of this was beyond my control: a premature division of technical responsibilities, plus a significant expansion of staffing before some basics had been agreed upon, led to time wasted in inter-departmental conflict and in changes to assigned responsibilities. Nevertheless, I probably was partially responsible for the inadequate progress made. Oh well.

I'll use most of this section to talk about what happened in the more significant areas of responsibility for the department. So, starting with requirements collection....

In the initial DUAC outline, requirements collection covered collection of requirements from two sources: (a) from programmers via programming language structures, and (b) from systems analysts, who expressed requirements in terms of input and output "documents", an approach that had gained popularity over the prior several years. The programming-language source disappeared early, almost certainly based on discussions with representatives of the high-level language area, so there was no time wasted on that aspect. But after the DPA department was formed, the provisions for interaction with systems analysts was well specified in detail, probably by Terry Matsumoto. Unllke other work of this type, the specification included discussion of interaction between systems analysts and DBAs, so that the collection of information was not a paper exercise. And then this area disappeared from our responsibilities as a result of negotiations with ASDD.

These disappeared facilities were expected to provide valuable input to subsequent logical and physical designs, and so they were later replaced by the assumption (presumably justified) of the existence of transaction descriptors. These included identification of the data used by the so-delimited functional units, in terms of program-level views or in terms of tables, and indications of how the data was used.

Continuing, information model design was my specific area of responsibility. Based on the DMC definitional requirements, each data structure definition, on each level, included a definition of the structure proper, and of its contained records, and of the record-contained fields. So there was a potentially huge amount of DBA work required to establish a sizable corporate data base. As mentioned earlier, to address this problem (especially for large, distributed organizations), facilities had to be suited to a variety of design directions, that is, top-down starting from LVIEWs or DVIEWs, and/or bottom up starting from BVIEW tables, and/or sideways i.e., generating slightly modified structures for different application purposes.

Developing the approach to be taken went through two stages. A first cut was general and interesting but probably unworkable: a collection of existing structure definitions was placed in a work area, and specifying operations on and among the structures served to create new structure definitions. The final specification of the approach instead involved a collection of goal-specific functions, e.g., generate a structure to support two higher level ones. It also provided for the exercise of additional control by requesting that the functions be performed in a stepwise manner, exposing and interacting with the steps taken by the system to carry out the request.

In developing this approach, several issues had to be addressed. One, most relevant when generating structures in a top-down direction, was providing means of specifying when different structures referred to the same data, even when when the same names were used. I, and others in various ways, knew that the required identities were among facts, that is, as potentially named and explained relationships between (simple or compound) fields. But I doubt that I ever got around to seeing that as a part of the tool. It took another year or two before some of us (in a next project) explicitly used that idea.

Another issue was the obvious overlap with the work of the RDL department, responsible for obtaining and checking declarative definitions. People in both the RDL and DPA departments knew that it made no sense to develop definitional interfaces separately from design interfaces. So the RDL department proceeded to consider both types of facilities, and explicit inter-departmental negotiations yielded no obvious way of separating the two functions. But here the culture was at fault, because there was no reason that there could not have been cooperation in the area.

A final issue was whether it could be assumed that there would be a consolidated data base in which all design and definitional information, including that from the programming languages, could be stored, separately from the core DMC dictionary, and accessible interactively. This would allow usability-oriented facilities to be coherently developed and explained in terms of the content of that data base, and not be dependent on the core DMC dictionary, which was becoming increasingly implementation oriented.

Moving now to stored view design, which was the most challenging area of the project. What was needed was a reasonably accurate method for predicting the performance and costs of a design, given parameters consisting of (a) statistics, actual or predicted, about data access frequencies and data volumes, and also (b) details of the design (such as horizontal or vertical decompositions of the data, level(s) of storage assigned, use of indexes). To do this required extended contacts at least with the core group developing stored view alternatives and accessing mechanisms, and optimally with operating system and hardware groups as well. And, beyond building a predictive model, it was necessary to explore possible techniques for automating or semi-automating the generation of good alternative decisions. Techniques for doing this were just beginning to be explored in the IBM research division and externally, and there were some publications in the area, such as [STOR_1] and [STOR_2].

Unfortunately, only one person was assigned to this problem. He was bright and enthusiastic, but the job required more help. Early on, he proposed a way of characterizing the problem for optimization purposes that initially looked interesting but on closer consideration was found rested on seriously questionable assumptions. I'll talk about this at some length, not to criticize the individual involved, but because the situation was similar to other sorts of problematic directions taken in various development efforts over the years.

Roughly, the proposal was to map the sequences of expected hierarchic accesses onto a critical path method (CPM) network. CPM analyses are used for planning project schedules and directing attention to areas influencing overall project time. In a CPM network, the nodes represent tasks and have associated times, and the links between tasks represent prerequisite relationships (e.g., acquiring bricks is a prerequisite task to building a brick wall). A critical path is a longest path in the network, and is found by adding up the task times along the paths.

In the proposed adaptation, the tasks represented prospective accesses to the BView tables, and the paths represented hierarchic relationships among tables as specified by DViews. The values (storage-access-time x access-frequency) were placed on the links and added to find the longest paths, which were to be the subject of DBA attention. The adaptation made a number of fundamental, but unjustified, assumptions with respect to the target environment, although they may have reflected aspects of a small IMS system familiar to the proposer and/or some others in the department.

The rush to a questionable design may have been related to cultural problems sometimes affecting IBM development work. One was that it would have been difficult to convince people of the difficulty of the problem and thus the amount of effort required. This may sound strange, but a general attitude was that we are not here to do research.... (even if warranted). A related cultural problem was insularity; it would have been difficult to convince some in the department, especially those who had never worked outside of IBM development, to look beyond the local organization and its methods. In any event, after some time elapsed, the CPM network approach was reconsidered as simply a way of depicting structural interrelationships.

The other areas of departmental responsibility, namely structure definition debugging and data base loading, probably didn't survive unchanged, but I'm not sure; the most recent specifications I have, undated but probably from mid-1974, are incomplete.

After developing the specifications, there were two further major departmental activities, both led by Ed Brink, who was in charge of ongoing project planning. One was a DPA prototype effort, initiated in mid-1974, and announced as intended to

provide a vehicle for the investigation of interface techniques, solidify functional specifications in some relatively well-developed areas, and explore some key implementation questions.

How much of that goal was achieved? A report from the second activity gives some idea. This activity was a two month study by Ed and two others (plus occasional outside assistance) ending in March 1975, just about when FS was cancelled. The report proposed a collection of IMS/CICS data base specification products for smaller systems... possibly reverting to his/their earlier concerns but with respect to a different platform. Without describing the content in detail, it is fair to say that while it represents an advance in understanding over a year earlier, as well as a readiness to look at work outside the organization, it is still very fuzzy about critical aspects of information model design and stored view design.

What did I do during the latter part of 1974? I don't recall participating in the prototype effort, although I may have. I remember only participating in an interdivisional working group on Data Base Design Methodology. The group was headed by Frank King, manager of System R [SYSR_1] the new relational data base research project, and included one other person from research, two from ASDD, and me. And was great fun. What we did was dream up a fairly realistic, partially automated business organization, described it in great detail, and defined a detailed system + data base design process, illustrating the inputs and results of each stage. And came up with some recommendations; nothing startling. But it gave me some confidence that it was possible to provide decent, reasonably comprehensive, tools for business automation.

Comments and Aftermaths

At the beginning of this writeup I justified not discarding it by saying that

it seemed that it might be interesting, or at least useful as a piece of history....It included an account of some possibly surprising technical directions and of some challenging problems that arose.

Well, it sort of does those things. I doubt that there currently exists another account of the rather large FS data base subproject, even though it includes a fairly surprising technical direction: the deep integration of most types of large scale file and data base management approaches of the time. And the writeup discusses the technical challenges of helping users to cope with the complexities of specifying the associated data structures in exhaustive (and exhausting) detail.

I also indicated that the material illustrated some of the cultural difficulties that plagued IBM development projects for many years . And it does that to some extent as well. It describes what I consider a too-early expansion of the organization, with a large number of people all concerned with data structures and means of specifying them; too many for the work required at that stage, and too many to allow for rapid iterative development and revision of ideas affecting all the work. Similar kinds of organizational structures were, and remained, a source of some perennial IBM difficulties, including communication problems and sometimes overly complex, awkward technology. Finally, I talked about the lack of progress in the DPA area, and some partially responsible cultural factors including both the tendency to inter-departmental conflict and attitudes related to addressing very challenging technical problems.

Aftermaths One outgrowth of the data base effort was to establish the potential utility of a high-level data dictionary. So, after FS was cancelled, Roger Holliday, Bill Kent, and I ended up in a dictionary project, which I'll discuss in another section. It had some nice technical results, and, as usual, a conflict with another department, but this one was more interesting, because it rested more on technical issues than on territory.

I have little recollection about what happened following FS in the larger IBM data base world; during the dictionary project I strongly focused on that alone and, afterwards, in 1977, I transferred to the IBM Los Angeles Scientific Center, an applied research organization.

However, some information about data base development projects subsequent to FS can be found, strangely enough, in an entertaining record of the 1995 reunion of the System R research project, edited by Paul McJones [SYSR_2, pp.28ff]. Apparently, a large project was formed to carry out the DMC idea of simultaneously supporting many types of data structures, and proceeded for some time, but was eventually cancelled. And then work began on what eventually became DB2 [DB2_1], a very weighty relational system, and ever-growing, but nevertheless a huge success into the next century. IMS survived as well, but separately from DB2.