@w1024020103 2017-03-13T11:41:43.000000Z 字数 14191 阅读 548

Proj2 HeapFile

CS61B

HeapFile

Access methods provide a way to read or write data from disk that is arranged in a specific way. Common access methods include heap files (unsorted files of tuples) and B-trees; for this assignment, you will only implement a heap file access method, and we have written some of the code for you.
访问方法提供了一种渠道去读写磁盘里那些以特定方式排列的数据，常用的访问方法有heap file(未排序的rows文件)和B-trees;这个作业，你仅仅需要实施heap file访问方法，我们已经写了一些代码。

A HeapFile object is arranged into a set of pages, each of which consists of a fixed number of bytes for storing tuples, (defined by the constant BufferPool.PAGE_SIZE), including a header. In SimpleDB, there is one HeapFile object for each table in the database. Each page in a HeapFile is arranged as a set of slots, each of which can hold one tuple (tuples for a given table in SimpleDB are all of the same size). In addition to these slots, each page has a header that consists of a bitmap with one bit per tuple slot. If the bit corresponding to a particular tuple is 1, it indicates that the tuple is valid; if it is 0, the tuple is invalid (e.g., has been deleted or was never initialized.) Pages of HeapFile objects are of type HeapPage which implements the Page interface. Pages are stored in the buffer pool but are read and written by the HeapFile class.
一个HeapFile对象被按一个pages集合排列，每个page由固定数量的bytes构成以便储存rows（这个固定值由 BufferPool.PAGE_SIZE定义），包括了每row的一个header. 在SimpleDB里，每个table由一个对应的HeapFile对象。HeapFile里的每个page按照一个slots集合被安排，每个slot可以装下一个row(在SimpleDB里一个给定table的所有rows是一样大小的)。除开这些slots外，每个page还有一个header，这个header由一个每个row slot有一个bit的bitmap构成。如果一个特定row所对应的bit是1，这表明这个row是valid的；如果是0，则说明这个row是invalid(比如被删除了或者未被初始化)。 HeapFile对象的Pages是HeapPage类型的，HeapPage implements Page接口. Pages被保存在bufferpool里面，但是是由HeapFile类读写的。

SimpleDB stores heap files on disk in more or less the same format they are stored in memory. Each file consists of page data arranged consecutively on disk. Each page consists of one or more bytes representing the header, followed by the BufferPool.PAGE_SIZE - # header bytes bytes of actual page content. Each tuple requires tuple size * 8 bits for its content and 1 bit for the header. Thus, the number of tuples that can fit in a single page is:
tupsPerPage = floor((BufferPool.PAGE_SIZE * 8) / (tuple size * 8 + 1))
SimpleDB在磁盘上存储heap files跟在内存里储存是差不多同样的格式。每个file由在磁盘上连续排列的page数据构成。每个page由代表header的一个或更多的bytes，以及紧跟着的BufferPool.PAGE_SIZE - # header bytes这么多个bytes的实际page内容构成。每个row需要row size *8这么多bits来放它的内容，另外还需要1bit来放header。这样的话，每单个page可以存放的rows数量为：
tupsPerPage = floor((BufferPool.PAGE_SIZE * 8) / (tuple size * 8 + 1))

Where tuple size is the size of a tuple in the page in bytes. The idea here is that each tuple requires one additional bit of storage in the header. We compute the number of bits in a page (by mulitplying PAGE_SIZE by 8), and divide this quantity by the number of bits in a tuple (including this extra header bit) to get the number of tuples per page. The floor operation rounds down to the nearest integer number of tuples (we don't want to store partial tuples on a page!)
这个公式里，tuple size是在page里面每一个row由bytes表示的大小。这里要注意到每个row都需要额外的1bit来储存header.我们用 PAGE_SIZE * 8计算每个page拥有的bits数目，然后再把这个数除以每一个row所占的bits数目（包括一个额外的header bit）来得到每个page的rows数目。Floor操作按四舍五入法调低来得到整数个rows（我们不想只保存不到一个的部分rows来存到page上）

Once we know the number of tuples per page, the number of bytes required to store the header is simply:
headerBytes = ceiling(tupsPerPage/8)
The ceiling operation rounds up to the nearest integer number of bytes (we never store less than a full byte of header information.)
一旦我们知道了每个page上的rows数目，那么储存header需要的bytes数就很简单了(1 Byte == 8 bit)：
headerBytes = ceiling(tupsPerPage/8)
ceiling操作按四舍五入法调高取整得到整数个bytes（我们不会像要少于一完整的byte的header信息）

The low (least significant) bits of each byte represents the status of the slots that are earlier in the file. Hence, the lowest bit of the first byte represents whether or not the first slot in the page is in use. Also, note that the high-order bits of the last byte may not correspond to a slot that is actually in the file, since the number of slots may not be a multiple of 8. Also note that all Java virtual machines are big-endian.
每个byte里最低位(最不重要)的bits 表示这个file里更早的那些slots的状态。这样的话，第一个byte里最最低序位的bit就代表这个page里的第一个slot是不是被用到。同时要注意高序位的bits或许并不与file里的slot对应，因为slots的数量可能不是8的倍数（那些高序位的0可能是拿来补位的，因为slots的数如果不是8的倍数，就构不成完整的byte(= 8 bits)，所以那些补位的0不代表slots是不是被用到）。同时注意所有的JVM都是big-endian.

When overriding equals() in HeapFileId, I was wondering what's the difference of getClass and instanceof(), two ways are both ok.
For example:

public boolean equals(Object o) {
        // some code goes here
       if(o == null){
           return false;
       }else if (!(o instanceof PageId)) {
            return false;
       }
       PageId o1 = (PageId) o;
       return this.tableId == o1.getTableId() && this.pageNum == o1.pageNumber();
    }

public boolean equals(Object o) {
        // some code goes here
       if(o == null){
           return false;
       }else if (this.getClass() != o.getClass() {
            return false;
       }
       PageId o1 = (PageId) o;
       return this.tableId == o1.getTableId() && this.pageNum == o1.pageNumber();
    }

Got an answer says that instanceof is better:
屏幕快照 2017-03-10 下午4.09.01.png-100.5kB

When writing hashCode() method for Rowid class and HeapFileId class, I really don't know how should I override hashCode() along with equals().

When working on HeapPage.java, I got confused about the member variables it has, so I first draw a diagram to try to understand the structure. I don't know how to do isSlotUsed() method.

屏幕快照 2017-03-10 下午5.53.42.png-118.3kB

/**
     * Returns true if associated slot on this page is filled.
     */
    public boolean isSlotUsed(int i) {
        // some code goes here
        //what is this i? num of Row?
    }

I did some search on bitmap, here's what the last instruction means:
1.png-76.4kB

Some background info about Bitwise:

3.JPG-33.6kB
value << num
num 指定要移位值value 移动的位数。
左移的规则只记住一点：丢弃最高位，0补最低位
如果移动的位数超过了该类型的最大位数，那么编译器会对移动的位数取模。如对int型移动33位，实际上只移动了33%32=1位。

4.jpg-66.6kB

掩码

6.JPG-42.1kB

5.jpg-74kB

7.JPG-67.7kB

8.JPG-33.5kB

留下一个未解决的疑问，现在进行下一步：

屏幕快照 2017-03-12 下午9.40.55.png-108kB

After you have implemented HeapPage, you will write methods for HeapFile in this lab to calculate the number of pages in a file and to read a page from the file. You will then be able to fetch tuples from a file stored on disk.
写HeapFile去计算一个file一面的page数目，以及读取file里的一页page.你讲能够从储存在磁盘上的file里提取rows.

Exercise 5. Implement the skeleton methods in:
src/java/simpledb/HeapFile.java
To read a page from disk, you will first need to calculate the correct offset in the file. Hint: you will need random access to the file in order to read and write pages at arbitrary offsets. You should not call BufferPool methods when reading a page from disk.
为了从磁盘读取page，你首先要计算file里的正确的偏移量。
提示：你需要Random Access访问file来读写位于任意偏移量的pages. 当从磁盘读取page时，不应该用BufferPool的method.（经查，random access是Java I/O里的知识，查了一些资料见下）

You will also need to implement the HeapFile.iterator() method, which should iterate through through the tuples of each page in the HeapFile. The iterator must use the BufferPool.getPage() method to access pages in the HeapFile. This method loads the page into the buffer pool and will eventually be used (in a later lab) to implement locking-based concurrency control and recovery. Do not load the entire table into memory on the open() call -- this will cause an out of memory error for very large tables.
这个iterator会迭代HeapFile里的每页page上的rows.这个iterator必须用BufferPool.getPage()去访问HeapFile里的pages. 这个方法将page载入到buffer pool，然后再被使用来实施locking -based 控制和恢复。不要在open()里将整个table载入到内存,如果table很大这将导致内存溢出。

At this point, your code should pass the unit tests in HeapFileReadTest.

Try..Catch

屏幕快照 2017-03-13 上午11.15.29.png-37.6kB

在java中使用try{}catch{}语句对异常进行处理，处理方法如下：

try子句：
1.其后跟随可能产生异常的代码块；
2.产生异常后，位于try内的产生异常的代码块之后的代码不会再运行

catch子句：
1.其后跟随异常处理语句，通常用到两个方法：
getMessage() – 返回一个字符串对发生的异常进行描述；
printStackTrace() – 给出方法的调用序列，一直到异常的产生位置，通常在测试和调试时有用；
2.在异常发生时，程序将查找第1个能处理异常的catch子句：抛出的异常类型和异常参数类型相同或者是后者的子类；
3.catch子句处理完毕后，程序将忽略任何与try块相关的其他子句，并从try/catch后的第1行代码开始运行。

Random Access

做readPage()的时候，Hint说了不能用BufferPool里面的方法，要用Random Access来访问file.
Here’s a sample program that uses a RandomAccessFile to read file :

import java.io.*;
public class RandomAccessFileTest
{
    public static void main(String[] args)
    {
        try(
            RandomAccessFile raf =  new RandomAccessFile(
                "RandomAccessFileTest.java" , "r"))
        {
            // 获取RandomAccessFile对象文件指针的位置，初始位置是0
            System.out.println("RandomAccessFile的文件指针的初始位置：" + raf.getFilePointer());
            // 移动raf的文件记录指针的位置
            raf.seek(300);
            byte[] bbuf = new byte[1024];
            // 用于保存实际读取的字节数
            int hasRead = 0;
            // 使用循环来重复“取水”过程
            while ((hasRead = raf.read(bbuf)) > 0 )
            {
                // 取出“竹筒”中水滴（字节），将字节数组转换成字符串输入！
                System.out.print(new String(bbuf , 0 , hasRead ));
            }
        }
        catch (IOException ex)
        {
            ex.printStackTrace();
        }
    }
}

写readPage(), 一开始一头雾水，参考了别人的代码，现在算是理解了：

 /**
     * Read the specified page from disk.
     *
     * @throws IllegalArgumentException if the page does not exist in this file.
     */
    // see DbFile.java for javadocs
    public Page readPage(PageId pid) {
        // some code goes here
        // pageNumber starts from 0?
        // offset is the size in bytes of the total of all previous pages.
        // data is the bytes info of this page.
        try{
            RandomAccessFile f = new RandomAccessFile(this.file, "r");
            int offset = BufferPool.PAGE_SIZE * pid.pageNumber();
            byte[] data = new byte[BufferPool.PAGE_SIZE];
            if(offset + BufferPool.PAGE_SIZE > f.length()){
                System.err.print("ERROR: page offset exceeds max size");
                System.exit(1);
            }
            //move the pointer to the beginning of the page we want to read.(or the end of the previous page)
            f.seek(offset);
            //read the whole page into the byte array data.
            f.readFully(data);
            //close the RandomAccessFile f.
            f.close();
        }
        catch(FileNotFoundException e){
            System.err.print("FileNotFoundException" + e.getMessage());
        }
        catch(IOException e){
            System.err.print("IOException" + e.getMessage());
        }
    }

留意一下randomaccessfile.readFully(byte [])方法：

屏幕快照 2017-03-13 下午2.21.06.png-56.8kB
屏幕快照 2017-03-13 下午2.21.17.png-79.5kB

参考资料：
Java 输入/输出 I/O流 RandomAccessFile
RandomAccessFile类的使用
 Java RandomAccessFile用法

写numPages()方法时，用到了file.length()，它返回的是该file的byte长度：

/**
     * Returns the number of pages in this HeapFile.
     */
    public int numPages() {
        // some code goes here
        //file.length() method returns the length of this file, measured in bytes.
        return (int) Math.ceil(this.file.length() / BufferPool.PAGE_SIZE);
    }

写HeapFile的iterator,不是很懂题目要求，特别是还有奇怪的open(), rewind(), close()这样的我以前从来没在iterator里面见过的方法。参考了别人的代码，算是理解了，但要自己完全独立写，可能还很难，有些东西根本不知道怎么想到的，更别提会用了。因为这个iterator要求较复杂，所以新建了一个类HeapFileIterator来写以满足要求。
原方法：要求必须迭代所有储存在这个DbFile(这里是HeapFile)里的rows.而且，iterator必须用BufferPool的getPage方法而不是HeapFile里的readPage方法来迭代pages. IDEA会自动提示要处理exception,如果throw不行，就用try catch。

 /**
     * Returns an iterator over all the rows stored in this DbFile. The
     * iterator must use {@link BufferPool#getPage}, rather than
     * {@link #readPage} to iterate through the pages.
     *
     * @return an iterator over all the rows stored in this DbFile.
     */
    public DbFileIterator iterator(TransactionId tid){
        // some code goes here
        try{
            return new HeapFileIterator(this,tid);
        }
        catch(DbException dbe){
            dbe.printStackTrace();
        }
        catch(TransactionAbortedException tae){
            tae.printStackTrace();
        }
        return null;
    }

新建一个helper class HeapFileIterator, 思考它的成员变量：
* 最基本要有一个对应的HeapFile file;;
* 类比于HeapPage的iterator是iterate through each row starting from row 0，那么HeapFile的iterator应该是iterate through each page (然后调用这页HeapPage自身的iterator来iterate through each row)，所以需要一个int currentPageNumber = 0（类比其他iterator的int index = 0）;
* 一个代表iterator是否open的boolean isOpen值；
* 一个HeapFile里面每一页HeapPage自身的rowIterator(iterator里面的iterator)
* transactionId tid;(题目要求里提到需要用到BufferPool里的getPage(TransactionId tid, PageId pid, Permissions perm),所以需要tid这个变量。

成员变量：

    public HeapFile heapfile;
    public TransactionId tid;
    public int currentPageNumber;
    public Iterator<Row> rowIterator;
    public boolean isOpen;

构造方法：初始化HeapFileIterator时，int currentPageNumber 默认值为0； boolean isOpen 的默认值为 false；而Iterator<Row> row Iterator可以用helper method新建，所以构造器只需要heapfile和tid两个参数.

Constructor:

public HeapFileIterator(HeapFile hf, TransactionId tid) throws DbException, TransactionAbortedException {
        this.heapfile = hf;
        this.tid = tid;
        this.currentPageNumber = 0;
        this.rowIterator = getIterator(this.currentPageNumber);
        this.isOpen = false;
    }

现在来看getIterator这个helper method,实际上就是我们new一个HeapPage的iterator用作我们HeapFileIterator内部专门在指定Page上iterate through rows的iterator.那我们只要找到当前对应的HeapPage,就可以直接return它的iterator了。所以这个getIterator需要一个参数pageNumber史来表示它是哪一页HeapPage的iterator. 现在我们有个这个pageNumber，那How to get HeapPage? 于是我们想到之前题目说的要用到BufferPool的getPage，我们看到getPage()需要的参数：

public Page getPage(TransactionId tid, PageId pid, Permissions perm)

tid我们有，perm可以直接写READ_ONLY或者READ_WRITE,所以我们现在还差PageId pid. 回到HeapPageId去看看，它的构造器是：

public HeapPageId(int tableId, int pgNo) {
        // some code goes here
        this.tableId = tableId;
        this.pageNum = pgNo;
    }
    /** @return the table associated with this PageId */
    public int getTableId() {
        // some code goes here
        return this.tableId;
    }
    /**
     * @return the page number in the table getTableId() associated with
     *   this PageId
     */
    public int pageNumber() {
        // some code goes here
        return this.pageNum;
    }

那我们如果知道了这个HeapFile所对应的table的tableid, 加上我们已经有的pageNumber,就可以new一个HeapPageId对象，从而帮助我们进而得到HeapPage. 注意这里的tableid,根据HeapFle与table一一对应，以及HeapFile的getId的Javadoc，其实HeapFile的getId返回值就是tableid.

 /**
     * Returns an ID uniquely identifying this HeapFile. Implementation note:
     * you will need to generate this tableid somewhere ensure that each
     * HeapFile has a "unique id," and that you always return the same value for
     * a particular HeapFile. We suggest hashing the absolute file name of the
     * file underlying the heapfile, i.e. f.getAbsoluteFile().hashCode().
     *
     * @return an ID uniquely identifying this HeapFile.
     */
    public int getId() {
        // some code goes here
        return this.file.getAbsoluteFile().hashCode();
    }

于是我们得到HeapPageId pid:

HeapPageId pid = new HeapPageId(this.getId(), this.currenPageNumber);

现在我们再回到BufferPool.getPage(),便可以找到对应的Page:

HeapPage heapPage = (HeapPage) Database.getBufferPool().getPage(this.tid, pid, Permissions.READ_ONLY);

现在，我们便可以直接return 这个HeapPage的iterator作为我们现在要写的HeapFileIterator的成员变量之一（或者称之为HeapFileIterator内部的iterator)

 return heapPage.iterator();

至此，我们才完成getIterator这个helper method.
重新回到这个HeapFileIterator的Big picture.我们刚刚声明完成员变量，写好了构造器。接下来看看题目提供的API:

 @Override
    public void open() throws DbException, TransactionAbortedException { }
    @Override
    public boolean hasNext() throws DbException, TransactionAbortedException { }
    @Override
    public Row next() throws DbException, TransactionAbortedException, NoSuchElementException { }
    @Override
    public void rewind() throws DbException, TransactionAbortedException { }
    @Override
    public void close() {}

open()方法因为我们自己设定的boolean isOpen而变得很简单：

@Override
    public void open() throws DbException, TransactionAbortedException {
        isOpen = true;
    }

hasNext()方法很有趣。首先，我们要确保iterator是open的。然后，我们可以先看看内部iterator即所在HeapPage页的rowIterator的hasNext()是否为真。如果是，则说明这一页HeapPage还有下一row，那我们的大iterator即HeapFile的iterator肯定也有下一row,hasNext()为true;如果该页HeapPage的rowIterator的hasNext()为false，即没有下一row，我们的大iterator却可能有下一row（翻到下一页从第一row继续iterate),也有可能没有下一row了（已经到了最后一页）.所以我们要分情况讨论.只要现在所在的页数小于最后一页（注意：页数是按index(starts at 0)，但HeapFile的numPages是一个纯计数, 所以临界条件是index = numPages - 1),我们都可以让currentPageNumber increment by 1, 然后重新得到新HeapPage对应的iterator, 返回这个小iterator的hasNext()即可；如果已经到最后一页的最后一行了，就return false。

 @Override
    public boolean hasNext() throws DbException, TransactionAbortedException {
        if (!isOpen) {
            return false;
        } else if (rowIterator.hasNext()) {
            return true;
        } else if (currentPageNumber < heapfile.numPages() - 1) {
            currentPageNumber += 1;
            rowIterator = getIterator(currentPageNumber);
            return rowIterator.hasNext();
        } else {
            return false;
        }

next()方法很straightforward,返回一个Row对象；充分利用HeapPage的小iterator自身的next用到这里：

@Override
    public Row next() throws DbException, TransactionAbortedException, NoSuchElementException {
        if (hasNext()) {
            return rowIterator.next();
        } else {
            throw new NoSuchElementException();
        }
    }

rewind()方法比较少见，相当于reset, everything starts from the beginning,那我们就重置currentPageNumber和对应的rowIterator就可以。

@Override
    public void rewind() throws DbException, TransactionAbortedException {
        currentPageNumber = 0;
        rowIterator = getIterator(currentPageNumber);
    }

close() is quite straightforward since we have boolean isOpen with us:

 @Override
    public void close() {
        isOpen = false;
    }

到这里，这个HeapFileIterator就写好了。从来没有写过如此复杂的iterator, 也不代表自己以后就会写了。但通过这种逐个方法的分析，以后再多遇到几次类似的情况，多练几次肯定就掌握于心了。

Proj2 HeapFile

HeapFile

Try..Catch

Random Access

内容目录