1 Translator 101 Lesson
1.1 Translator 101 Lesson 1: Setting the Stage
This is the first post in a series that will explain some of the details of writing a GlusterFS translator, using some actual code to illustrate.
Before we begin, a word about environments. GlusterFS is over 300K lines of code spread across a few hundred files. That’s no Linux kernel or anything, but you’re still going to be navigating through a lot of code in every code-editing session, so some kind of cross-referencing is essential. I use cscope with the vim bindings, and if I couldn’t do “ctrl-\ g” and such to jump between definitions all the time my productivity would be cut in half. You may prefer different tools, but as I go through these examples you’ll need something functionally similar to follow along. OK, on with the show.
The first thing you need to know is that translators are not just bags of functions and variables. They need to have a very definite internal structure so that the translator-loading code can figure out where all the pieces are. The way it does this is to use dlsym to look for specific names within your shared-object file, as follows (from xlator.c):
if (!(xl->fops = dlsym (handle, "fops"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s", dlerror ());
        goto out;
}
if (!(xl->cbks = dlsym (handle, "cbks"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s", dlerror ());
        goto out;
}
if (!(xl->init = dlsym (handle, "init"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s", dlerror ());
        goto out;
}
if (!(xl->fini = dlsym (handle, "fini"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s", dlerror ());
        goto out;
}
In this example, xl is a pointer to the in-memory object for the translator we’re loading. As you can see, it’s looking up various symbols by name in the shared object it just loaded, and storing pointers to those symbols. Some of them (e.g. init) are functions, while others (e.g. fops) are dispatch tables containing pointers to many functions. Together, these make up the translator’s public interface.
Most of this glue or boilerplate can easily be found at the bottom of one of the source files that make up each translator. We’re going to use the rot-13 translator just for fun, so in this case you’d look in rot-13.c to see this:
struct xlator_fops fops = {
        .readv  = rot13_readv,
        .writev = rot13_writev
};
struct xlator_cbks cbks = {};
struct volume_options options[] = {
        { .key = {"encrypt-write"}, .type = GF_OPTION_TYPE_BOOL },
        { .key = {"decrypt-read"},  .type = GF_OPTION_TYPE_BOOL },
        { .key = {NULL} },
};
The fops table, defined in xlator.h, is one of the most important pieces. This table contains a pointer to each of the filesystem functions that your translator might implement – open, read, stat, chmod, and so on. There are 82 such functions in all, but don’t worry; any that you don’t specify here will be seen as null and filled in with defaults from defaults.c when your translator is loaded. In this particular example, since rot-13 is an exceptionally simple translator, we only fill in two entries, for readv and writev.
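To make the "null entries get defaults" idea concrete, here is a minimal standalone sketch. It does not use the real xlator_fops structure or defaults.c; the three-entry table, the default_* functions, and fill_defaults are all hypothetical names invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a dispatch table: three operations only. */
typedef int (*op_fn)(const char *path);

struct mini_fops {
    op_fn open;
    op_fn stat;
    op_fn unlink;
};

/* Pass-through defaults, standing in for the functions in defaults.c. */
int default_open(const char *path)   { (void)path; return 0; }
int default_stat(const char *path)   { (void)path; return 0; }
int default_unlink(const char *path) { (void)path; return 0; }

/* A translator-specific override for one entry. */
int my_stat(const char *path) { return (int)strlen(path); }

/* Fill every entry the translator left NULL with its default. */
void fill_defaults(struct mini_fops *fops)
{
    if (!fops->open)   fops->open   = default_open;
    if (!fops->stat)   fops->stat   = default_stat;
    if (!fops->unlink) fops->unlink = default_unlink;
}
```

A translator that only cares about stat would declare `struct mini_fops fops = { .stat = my_stat };` and, after fill_defaults, every other entry would point at a pass-through default.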
There are actually two other tables, also required to have predefined names, that are also used to find translator functions: cbks (which is empty in this snippet) and dumpops (which is missing entirely). The first of these specifies entry points for when inodes are forgotten or file descriptors are …
The last piece I’ll cover today is options. As you can see, this is a table of translator-specific option names and some information about their types. GlusterFS actually provides a pretty rich set of types (volume_option_type_t in options.h) which includes paths, translator names, percentages, and times in addition to the obvious integers and strings. Also, the volume_option_t structure can include information about …
{ .key = {"data-self-heal-algorithm"},
  .type = GF_OPTION_TYPE_STR,
  .default_value = "",
  .description = "Select between \"full\", \"diff\". The "
                 "\"full\" algorithm copies the entire file from "
                 "source to sink. The \"diff\" algorithm copies to "
                 "sink only those blocks whose checksums don't match "
                 "with those of source.",
  .value = { "diff", "full", "" }
},
{ .key = {"data-self-heal-window-size"},
  .type = GF_OPTION_TYPE_INT,
  .min = 1,
  .max = 1024,
  .default_value = "1",
  .description = "Maximum number blocks per file for which self-heal "
                 "process would be applied simultaneously."
},
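The min, max, and default_value fields in an entry like data-self-heal-window-size suggest a range check at parse time. The following is only a sketch of that idea under my own assumptions: struct int_option and parse_int_option are hypothetical and are not the real volume_option_t or the GlusterFS option parser.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, stripped-down option descriptor (not volume_option_t). */
struct int_option {
    const char *key;
    long        min;
    long        max;
    const char *default_value;
};

/*
 * Parse `text` (or the declared default when text is NULL) and check it
 * against the declared range. Returns 0 and stores the value on success,
 * -1 on a malformed or out-of-range value.
 */
int parse_int_option(const struct int_option *opt, const char *text, long *out)
{
    const char *src = text ? text : opt->default_value;
    char *end = NULL;
    long value;

    if (!src || *src == '\0')
        return -1;

    errno = 0;
    value = strtol(src, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;                      /* overflow or trailing junk */
    if (value < opt->min || value > opt->max)
        return -1;                      /* outside the declared range */

    *out = value;
    return 0;
}
```

With the window-size entry above (min 1, max 1024, default "1"), a missing option would yield 1, "16" would be accepted, and "0" or "2048" would be rejected.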
When your translator is loaded, all of this information is used to parse the options actually provided in the volfile, and then the result is turned into a dictionary and stored as xl->options. This dictionary is then processed by your init function, which you can see being looked up in the first code fragment above. We’re only going to look at a small part of rot-13’s init for now.
priv->decrypt_read = 1;
priv->encrypt_write = 1;
data = dict_get (this->options, "encrypt-write");
if (data) {
        if (gf_string2boolean (data->data, &priv->encrypt_write) == -1) {
                gf_log (this->name, GF_LOG_ERROR,
                        "encrypt-write takes only boolean options");
                return -1;
        }
}
What we can see here is that we’re setting some defaults in our priv structure, then looking to see if an “encrypt-write” option was actually provided. If so, we convert and store it. This is a pretty …
So far we’ve covered the basics of how a translator gets loaded, how we find its various parts, and how we process its options. In my next Translator 101 post, we’ll go a little deeper into other things that init and its companion fini might do, and how some other fields in our xlator_t structure (commonly referred to as this) are commonly used.
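The "set a default, then override only if the option was supplied" pattern from the init fragment earlier in this lesson can be tried outside GlusterFS. This is a rough sketch: string_to_bool is a hypothetical stand-in for gf_string2boolean, and the list of accepted spellings is my assumption, not taken from the GlusterFS source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-in for gf_string2boolean; accepted spellings are assumed. */
int string_to_bool(const char *text, bool *out)
{
    static const char *truthy[] = { "1", "on", "yes", "true", "enable" };
    static const char *falsy[]  = { "0", "off", "no", "false", "disable" };
    size_t i;

    if (!text)
        return -1;
    for (i = 0; i < sizeof(truthy) / sizeof(truthy[0]); i++) {
        if (strcasecmp(text, truthy[i]) == 0) {
            *out = true;
            return 0;
        }
    }
    for (i = 0; i < sizeof(falsy) / sizeof(falsy[0]); i++) {
        if (strcasecmp(text, falsy[i]) == 0) {
            *out = false;
            return 0;
        }
    }
    return -1;                 /* not a recognizable boolean */
}

/* Default first, then override only if the option was actually supplied. */
int load_bool_option(const char *supplied, bool default_value, bool *out)
{
    *out = default_value;
    if (!supplied)
        return 0;              /* option absent: keep the default */
    return string_to_bool(supplied, out);
}
```

Passing NULL models an option that is missing from the volfile; an unparseable value returns -1, which is the point at which the real init logs an error and fails.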
1.2 Translator 101 Lesson 2: init, fini, and private context
In the previous Translator 101 post, we looked at some of the dispatch tables and options processing in a translator. This time we’re going to cover the rest of the “shell” of a translator – i.e. the other global parts not specific to handling a particular request.
Let’s start by looking at the …
int32_t
init (xlator_t *this)
{
        data_t *data = NULL;
        rot_13_private_t *priv = NULL;
        if (!this->children || this->children->next) {
                gf_log ("rot13", GF_LOG_ERROR,
                        "FATAL: rot13 should have exactly one child");
                return -1;
        }
        if (!this->parents) {
                gf_log (this->name, GF_LOG_WARNING,
                        "dangling volume. check volfile ");
        }
        priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0);
        if (!priv)
                return -1;
At the very top, we see the function signature – we get a pointer to the xlator_t object that we’re initializing, and we return an int32_t status. As with most functions in the translator API, this should be zero to indicate success. In this case it’s safe to return -1 for failure, but watch out: in dispatch-table functions, the return value means the status of the function call rather than the request. A request error should be reflected as a callback with a non-zero op_ret value, but the dispatch function itself should still return zero. In fact, the handling of a non-zero return from a dispatch function is not all that robust (we recently had a bug report in HekaFS …
The first thing this init function does is check that the translator is being set up in the right kind of environment. Translators are called by parents and in turn call children. Some translators are “initial” translators that inject requests into the system from elsewhere – e.g. mount/fuse injecting requests from the kernel, protocol/server injecting requests from the network. Those translators don’t need parents, but rot-13 does and so we check for that. Similarly, some translators are “final” translators that (from the perspective of the current process) terminate requests instead of passing them on – e.g. protocol/client passing them to another node, storage/posix passing them to a local filesystem. Other translators “multiplex” between multiple children – passing each parent request on to one (cluster/dht), some (cluster/stripe), or all (cluster/afr) of those children. Rot-13 fits into none of those categories either, so it checks that it has exactly one child. It might be more convenient or robust if translator shared libraries had standard variables describing these requirements, to be checked in a consistent way by the translator-loading infrastructure itself instead of by each separate init function, but this is the way translators work today.
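The earlier point about return values (the function's return value is the status of the call, while the status of the request travels through the callback) is easy to get wrong, so here is a small standalone sketch of it. Nothing here is GlusterFS API: reply_cbk, struct reply, record_reply, and toy_lookup are hypothetical names, and the real mechanism uses STACK_UNWIND rather than a direct function call.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: receives the outcome of the request. */
typedef void (*reply_cbk)(void *cookie, int op_ret, int op_errno);

/* Records what the callback saw, so a caller can inspect it afterwards. */
struct reply {
    int called;
    int op_ret;
    int op_errno;
};

void record_reply(void *cookie, int op_ret, int op_errno)
{
    struct reply *r = cookie;

    r->called = 1;
    r->op_ret = op_ret;
    r->op_errno = op_errno;
}

/*
 * A toy dispatch function. A bad request is reported through the
 * callback with op_ret = -1 and an errno value; the function itself
 * still returns 0, because the call was handled.
 */
int toy_lookup(const char *path, reply_cbk cbk, void *cookie)
{
    if (!path || path[0] != '/') {
        cbk(cookie, -1, EINVAL);
        return 0;
    }
    cbk(cookie, 0, 0);
    return 0;
}
```

In both branches the caller sees a return value of 0; only the callback arguments distinguish a failed request from a successful one.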
The last thing we see in this fragment is allocating our private data area. This can literally be anything we want; the infrastructure just provides the private pointer as a convenience but takes no responsibility for how it’s used. In this case we’re using GF_CALLOC to allocate our own rot_13_private_t structure. This gets us all the benefits of GlusterFS’s memory-leak detection infrastructure, but the way we’re calling it is not quite …
To finish our tour of standard initialization/termination, let’s look at the end of init and the beginning of fini.
        this->private = priv;
        gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
        return 0;
}
void
fini (xlator_t *this)
{
        rot_13_private_t *priv = this->private;
        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);
At the end of init we’re just storing our private-data pointer in the private field of our xlator_t, then returning zero to indicate that initialization succeeded. As is usually the case, our fini is even simpler. All it really has to do is GF_FREE our private-data pointer, which we do in a slightly roundabout way here. Notice how we don’t even have a return value here, since there’s nothing obvious and useful that the infrastructure could do if fini failed.
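The whole init/fini life cycle, including the "exactly one child" check, fits in a few lines once the GlusterFS types are stripped away. This is a sketch under assumptions: struct mini_xlator, struct mini_list, and struct mini_private are hypothetical stand-ins for xlator_t, xlator_list_t, and rot_13_private_t, and plain calloc/free replace GF_CALLOC/GF_FREE.

```c
#include <stdlib.h>

/* Hypothetical miniature types; the real xlator_t has many more fields. */
struct mini_list {
    struct mini_list *next;
};

struct mini_xlator {
    struct mini_list *children;
    void             *private_data;
};

struct mini_private {
    int decrypt_read;
    int encrypt_write;
};

int mini_init(struct mini_xlator *xl)
{
    struct mini_private *priv;

    /* Exactly one child: the list is non-empty and has no second entry. */
    if (!xl->children || xl->children->next)
        return -1;

    /* Standard calloc takes (count, size), in that order. */
    priv = calloc(1, sizeof(*priv));
    if (!priv)
        return -1;

    priv->decrypt_read = 1;
    priv->encrypt_write = 1;
    xl->private_data = priv;
    return 0;
}

void mini_fini(struct mini_xlator *xl)
{
    free(xl->private_data);      /* free(NULL) is a no-op */
    xl->private_data = NULL;
}
```

A translator with no children, or with two, fails mini_init before anything is allocated, so there is nothing for mini_fini to clean up in that case.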
That’s practically everything we need to know to get our translator through loading, initialization, options processing, and termination. If we had defined no dispatch functions, we could actually configure a daemon to use our translator and it would work as a basic pass-through from its parent to a single child. In the next post I’ll cover how to build the translator and configure a daemon to use it, so that we can actually step through it in a debugger and see how it all fits together before we actually start adding functionality.
1.3 Translator 101 Lesson 3: This Time For Real
In the first two parts of this series, we learned how to write a basic translator skeleton that can get through loading, initialization, and option processing. This time we’ll cover how to build that translator, configure a volume to use it, and run the glusterfs daemon in debug mode.
Unfortunately, there’s not much direct support for writing new translators. You can check out a GlusterFS tree and splice in your own translator directory, but that’s a bit painful because you’ll have to update multiple makefiles plus a bunch of autoconf garbage. As part of the HekaFS project, I basically reverse engineered the truly necessary parts of the translator-building process and then pestered one of the Fedora glusterfs package maintainers (thanks daMaestro!) to add a glusterfs-devel package with the required headers. Since then the complexity level in the HekaFS tree has crept back up a bit, but I still remember the simple method and still consider it the easiest way to get started on a new translator. For the sake of those not using Fedora, I’m going to describe a method that doesn’t depend on that header package. What it does depend on is a GlusterFS source tree, much as you might have cloned from GitHub or the Gluster review site. This tree doesn’t have to be fully built, but you do need to run autogen.sh and configure in it. Then you can take the following simple makefile and put it in a directory with your actual source.
# Change these to match your source code.
TARGET = rot-13.so
OBJECTS = rot-13.o
# Change these to match your environment.
GLFS_SRC = /play/glusterfs
GLFS_LIB = /opt/glusterfs/3git/lib64
HOST_OS = GF_LINUX_HOST_OS
# You shouldn't need to change anything below here.
CFLAGS = -fPIC -Wall -O0 -g \
        -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
        -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
        -I$(GLFS_SRC)/contrib/uuid
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread
$(TARGET): $(OBJECTS)
	$(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
Yes, it’s still Linux-specific. Mea culpa. As you can see, we’re sticking with the rot-13 example, so you can just copy the files from …/xlators/encryption/rot-13/src in your GlusterFS tree to follow along. Type “make” and you should be rewarded with a nice little .so file.
[jeff@gfs-i8c-01 xlator_example]$ ls -l rot-13.so
-rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so
Notice that we’ve built with optimization level zero and debugging symbols included, which would not typically be the case for a packaged version of GlusterFS. Let’s put our version of rot-13.so into a slightly different file on our system, so that it doesn’t stomp on the installed version (not that you’d ever want to use that anyway).
[root@gfs-i8c-01 xlator_example]# ls /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/
crypt.so crypt.so.0 crypt.so.0.0.0 rot-13.so rot-13.so.0 rot-13.so.0.0.0
[root@gfs-i8c-01 xlator_example]# cp rot-13.so /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so
These paths represent the current Gluster filesystem layout, which is likely to be deprecated in favor of the Fedora layout; your paths may vary. At this point we’re ready to configure a volume using our new translator. To do that, I’m going to suggest something that’s strongly discouraged except during development (the Gluster guys are going to hate me for this): write our own volfile. Here’s just about the simplest volfile you’ll ever see.
volume my-posix
type storage/posix
option directory /play/export
end-volume
volume my-rot13
type encryption/my-rot-13
subvolumes my-posix
end-volume
All we have here is a basic brick using /play/export for its data, and then an instance of our translator layered on top – no client or server is necessary for what we’re doing, and the system will automatically push a mount/fuse translator on top if there’s no server translator. To try this out, all we need is the following command (assuming the directories involved already exist).
[jeff@gfs-i8c-01 xlator_example]$ glusterfs --debug -f my.vol /play/import
You should be rewarded with a whole lot of log output, including the text of the volfile (this is very useful for debugging problems in the field). If you go to another window on the same machine, you can see that you have a new filesystem mounted.
[jeff@gfs-i8c-01 ~]$ df /play/import
Filesystem 1K-blocks Used Available Use% Mounted on
/play/xlator_example/my.vol
114506240 2706176 105983488 3% /play/import
Just for fun, write something into a file in /play/import, then look at the corresponding file in /play/export to see it all rot-13’ed for you.
[jeff@gfs-i8c-01 ~]$ echo hello > /play/import/a_file
[jeff@gfs-i8c-01 ~]$ cat /play/export/a_file
uryyb
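If you want to check the transformation by hand, rot-13 itself is tiny. The following is a minimal standalone version written for this purpose; it is not copied from rot-13.c, and the name rot13_buf is my own.

```c
#include <stddef.h>

/* Rotate ASCII letters by 13 places, in place; other bytes are untouched. */
void rot13_buf(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        char c = buf[i];

        if (c >= 'a' && c <= 'z')
            buf[i] = (char)('a' + (c - 'a' + 13) % 26);
        else if (c >= 'A' && c <= 'Z')
            buf[i] = (char)('A' + (c - 'A' + 13) % 26);
    }
}
```

Because 13 is half of 26, applying the function twice restores the original bytes, which is why the same routine can serve both the write path and the read path.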
There you have it – functionality you control, implemented easily, layered on top of local storage. Now you could start adding functionality – real encryption, perhaps – and inevitably having to debug it. You could do that the old-school way, with gf_log (preferred) or even plain old printf, or you could run daemons under gdb instead.
1.4 Translator 101 Lesson 4: Debugging a Translator
Now that we’ve learned what a translator looks like and how to build one, it’s time to run one and actually watch it work. The best way to do this is good old-fashioned gdb, as follows (using some of the examples from last time).
[root@gfs-i8c-01 xlator_example]# gdb glusterfs
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
...
(gdb) r --debug -f my.vol /play/import
Starting program: /usr/sbin/glusterfs --debug -f my.vol /play/import
...
[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
If you get to this point, your glusterfs client process is already running. You can go to another window to see the mountpoint, do file operations, etc.
[root@gfs-i8c-01 ~]# df /play/import
Filesystem 1K-blocks Used Available Use% Mounted on
/root/xlator_example/my.vol
114506240 2643968 106045568 3% /play/import
[root@gfs-i8c-01 ~]# ls /play/import
a_file
[root@gfs-i8c-01 ~]# cat /play/import/a_file
hello
Now let’s interrupt the process and see where we are.
^C
Program received signal SIGINT, Interrupt.
0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) info threads
5 Thread 0x7fffeffff700 (LWP 27206) 0x0000003a002dd8c7 in readv ()
from /lib64/libc.so.6
4 Thread 0x7ffff50e3700 (LWP 27205) 0x0000003a0060b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x7ffff5f02700 (LWP 27204) 0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 0x7ffff6903700 (LWP 27203) 0x0000003a0060f245 in sigwait ()
from /lib64/libpthread.so.0
* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Like any non-toy server, this one has multiple threads. What are they all doing? Honestly, even I don’t know. Thread 1 turns out to be in event_dispatch_epoll, which means it’s the one handling all of our network I/O. Note that with the socket multi-threading patch this will change, with one thread in socket_poller per connection. Thread 2 is in glusterfs_sigwaiter, which means signals will be isolated to that thread. Thread 3 is in syncenv_task, so it’s a worker process for synchronous requests such as those used by the rebalance and repair code. Thread 4 is in janitor_get_next_fd, so it’s waiting for a chance to close no-longer-needed file descriptors on the local filesystem. (I admit I had to look that one up, BTW.) Lastly, thread 5 is in fuse_thread_proc, so it’s the one fetching requests from our FUSE interface. You’ll often see many more threads than this, but it’s a pretty good basic set. Now, let’s set a breakpoint so we can actually watch a request.
(gdb) b rot13_writev
Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119.
(gdb) c
Continuing.
At this point we go into our other window and do something that will involve a write.
[root@gfs-i8c-01 ~]# echo goodbye > /play/import/another_file
(back to the first window)
[Switching to Thread 0x7fffeffff700 (LWP 27206)]
Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440, fd=0x7ffff409802c,
vector=0x7fffe8000cd8, count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:119
119 rot_13_private_t *priv = (rot_13_private_t *)this->private;
Remember how we built with debugging symbols enabled and no optimization? That will be pretty important for the next few steps. As you can see, we’re in rot13_writev, with several parameters.
· frame is our always-present frame pointer for this request. Also, frame->local will point to any local data we created and attached to the request ourselves.
· this is a pointer to our instance of the rot-13 translator. You can examine it if you like to see the name, type, options, parent/children, inode table, and other stuff associated with it.
· fd is a pointer to a file-descriptor object (fd_t, not just a file-descriptor index which is what most people use “fd” for). This in turn points to an inode object (inode_t) and we can associate our own rot-13-specific data with either of these.
· vector and count together describe the data buffers for this write, which we’ll get to in a moment.
· offset is the offset into the file at which we’re writing.
· iobref is a buffer-reference object, which is used to track the life cycle of buffers containing read/write data. If you look closely, you’ll notice that vector[0].iov_base points to the same address as iobref->iobrefs[0]->ptr, which should give you some …
OK, now what about that vector? We can use it to examine the data being written, like this.
(gdb) p vector[0]
$2 = {iov_base = 0x7ffff7936000, iov_len = 8}
(gdb) x/s 0x7ffff7936000
0x7ffff7936000: "goodbye\n"
It’s not always safe to view this data as a string, because it might just as well be binary data, but since we’re generating the write this time it’s safe and convenient. With that knowledge, let’s step through things a bit.
(gdb) s
120 if (priv->encrypt_write)
(gdb)
121 rot13_iovec (vector, count);
(gdb)
rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
57 for (i = 0; i < count; i++) {
(gdb)
58 rot13 (vector[i].iov_base, vector[i].iov_len);
(gdb)
rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45
45 for (i = 0; i < len; i++) {
(gdb)
46 if (buf[i] >= 'a' && buf[i] <= 'z')
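The vector and count pair seen in the debugger is an ordinary POSIX struct iovec array, so it can be explored outside GlusterFS. Since the data may be binary, the sketch below builds a printable preview instead of treating the buffers as a C string. The helper names iov_total and iov_preview are hypothetical and are not part of GlusterFS.

```c
#include <ctype.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

/* Total number of bytes described by a vector/count pair. */
size_t iov_total(const struct iovec *vector, int count)
{
    size_t total = 0;
    int i;

    for (i = 0; i < count; i++)
        total += vector[i].iov_len;
    return total;
}

/*
 * Copy at most outsz-1 bytes from the buffers into `out`, replacing
 * non-printable bytes with '.', and NUL-terminate the result.
 * Returns the number of bytes copied.
 */
size_t iov_preview(const struct iovec *vector, int count,
                   char *out, size_t outsz)
{
    size_t n = 0;
    size_t j;
    int i;

    if (outsz == 0)
        return 0;
    for (i = 0; i < count && n + 1 < outsz; i++) {
        const unsigned char *p = vector[i].iov_base;

        for (j = 0; j < vector[i].iov_len && n + 1 < outsz; j++)
            out[n++] = isprint(p[j]) ? (char)p[j] : '.';
    }
    out[n] = '\0';
    return n;
}
```

The lengths come from iov_len rather than from a terminating NUL, which is the same reason the translator's own loop is driven by count and iov_len.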
They weigh in at 224 and 229 lines respectively, with some of that taken up by licenses and white space. Each took less than a day to write. Please bear in mind, though, that these are only prototypes. They exist to teach and to make a point, not – in their current form – to be used in production. Making them suitable for real-world use would at least double their …
negative-lookup / negative.h
#ifndef __NEGATIVE_H__
#define __NEGATIVE_H__
#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif
#include "mem-types.h"
#include "hashfn.h"
#define GHOST_BUCKETS 64
#define GHOST_HASH(x) (SuperFastHash(x,strlen(x)) % GHOST_BUCKETS)
typedef struct _ghost {
        struct _ghost *next;
        char *path;
} ghost_t;
typedef struct {
        ghost_t *ghosts[GHOST_BUCKETS];
} negative_private_t;
enum gf_negative_mem_types_ {
        gf_negative_mt_priv = gf_common_mt_end + 1,
        gf_negative_mt_ghost,
        gf_negative_mt_end
};
#endif /* __NEGATIVE_H__ */
negative-lookup / negative.c
#include <errno.h>
#include <string.h>
#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif
#include "glusterfs.h"
#include "xlator.h"
#include "logging.h"
#include "negative.h"
void
exorcise (xlator_t *this, char *spirit)
{
        negative_private_t *priv = this->private;
        ghost_t *gp = NULL;
        ghost_t **gpp = NULL;
        uint32_t bucket = 0;
        bucket = GHOST_HASH(spirit);
        for (gpp = &priv->ghosts[bucket]; *gpp; gpp = &(*gpp)->next) {
                gp = *gpp;
                if (!strcmp(gp->path,spirit)) {
                        *gpp = gp->next;
                        GF_FREE(gp->path);
                        GF_FREE(gp);
                        gf_log(this->name,GF_LOG_DEBUG,"removed %s",spirit);
                        break;
                }
        }
}
int32_t
negative_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                     int32_t op_ret, int32_t op_errno, inode_t *inode,
                     struct iatt *buf, dict_t *dict, struct iatt *postparent)
{
        negative_private_t *priv = this->private;
        ghost_t *gp = NULL;
        uint64_t ctx = 0;
        uint32_t bucket = 0;
        inode_ctx_get(inode,this,&ctx);
        if (op_ret < 0) {
                gp = GF_CALLOC(1,sizeof(ghost_t),gf_negative_mt_ghost);
                if (gp) {
                        gp->path = (char *)ctx;
                        bucket = GHOST_HASH(gp->path);
                        /* TBD: locking */
                        gp->next = priv->ghosts[bucket];
                        priv->ghosts[bucket] = gp;
                        gf_log(this->name,GF_LOG_DEBUG,"added %s",
                               (char *)ctx);
                        goto unwind;
                }
        }
        else {
                gf_log(this->name,GF_LOG_DEBUG,"found %s", (char *)ctx);
                exorcise(this,(char *)ctx);
        }
        /* Both positive result and allocation failure come here. */
        GF_FREE((void *)ctx);
unwind:
        STACK_UNWIND_STRICT (lookup, frame, op_ret, op_errno, inode, buf,
                             dict, postparent);
        return 0;
}
int32_t
negative_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
                 dict_t *xattr_req)
{
        negative_private_t *priv = this->private;
        ghost_t *gp = NULL;
        uint32_t bucket = 0;
        bucket = GHOST_HASH(loc->path);
        for (gp = priv->ghosts[bucket]; gp; gp = gp->next) {
                if (!strcmp(gp->path,loc->path)) {
                        gf_log(this->name,GF_LOG_DEBUG,"%s (%p) => HIT",
                               loc->path, loc->inode);
                        STACK_UNWIND_STRICT (lookup, frame, -1, ENOENT,
                                             NULL, NULL, NULL, NULL);
                        return 0;
                }
        }
        gf_log(this->name,GF_LOG_DEBUG,"%s (%p) => MISS",
               loc->path, loc->inode);
        inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
        STACK_WIND (frame, negative_lookup_cbk, FIRST_CHILD(this),
                    FIRST_CHILD(this)->fops->lookup, loc, xattr_req);
        return 0;
}
int32_t
negative_create_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                     int32_t op_ret, int32_t op_errno, fd_t *fd, inode_t *inode,
                     struct iatt *buf, struct iatt *preparent,
                     struct iatt *postparent)
{
        uint64_t ctx = 0;
        inode_ctx_get(inode,this,&ctx);
        exorcise(this,(char *)ctx);
        GF_FREE((void *)ctx);
        STACK_UNWIND_STRICT (create, frame, op_ret, op_errno, fd, inode, buf,
                             preparent, postparent);
        return 0;
}
int32_t
negative_create (call_frame_t *frame, xlator_t *this, loc_t *loc, int32_t flags,
                 mode_t mode, fd_t *fd, dict_t *params)
{
        inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
        STACK_WIND (frame, negative_create_cbk, FIRST_CHILD(this),
                    FIRST_CHILD(this)->fops->create, loc, flags, mode, fd,
                    params);
        return 0;
}
int32_t
negative_mkdir_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                    int32_t op_ret, int32_t op_errno, inode_t *inode,
                    struct iatt *buf, struct iatt *preparent,
                    struct iatt *postparent)
{
        uint64_t ctx = 0;
        inode_ctx_get(inode,this,&ctx);
        exorcise(this,(char *)ctx);
        GF_FREE((void *)ctx);
        STACK_UNWIND_STRICT (mkdir, frame, op_ret, op_errno, inode,
                             buf, preparent, postparent);
        return 0;
}
int
negative_mkdir (call_frame_t *frame, xlator_t *this, loc_t *loc, mode_t mode,
                dict_t *params)
{
        inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
        STACK_WIND (frame, negative_mkdir_cbk, FIRST_CHILD(this),
                    FIRST_CHILD(this)->fops->mkdir, loc, mode, params);
        return 0;
}
int32_t
init (xlator_t *this)
{
        negative_private_t *priv = NULL;
        if (!this->children || this->children->next) {
                gf_log ("negative", GF_LOG_ERROR,
                        "FATAL: negative should have exactly one child");
                return -1;
        }
        if (!this->parents) {
                gf_log (this->name, GF_LOG_WARNING,
                        "dangling volume. check volfile ");
        }
        priv = GF_CALLOC (1, sizeof (negative_private_t), gf_negative_mt_priv);
        if (!priv)
                return -1;
        this->private = priv;
        gf_log ("negative", GF_LOG_DEBUG, "negative xlator loaded");
        return 0;
}
void
fini (xlator_t *this)
{
        negative_private_t *priv = this->private;
        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);
        return;
}
struct xlator_fops fops = {
        .lookup = negative_lookup,
        .create = negative_create,
        .mkdir = negative_mkdir,
};
struct xlator_cbks cbks = {
};
struct volume_options options[] = {
        { .key = {NULL} },
};
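The data structure at the heart of negative.c is a chained hash table of paths known not to exist. The sketch below isolates that structure so it can be compiled and tested on its own. It is an illustration under assumptions, not the translator's code: a simple djb2 hash stands in for SuperFastHash, malloc/free replace GF_CALLOC/GF_FREE, and the ghost_find, ghost_add, and ghost_remove names are my own. Like the original, it does no locking.

```c
#include <stdlib.h>
#include <string.h>

#define BUCKETS 64

struct ghost {
    struct ghost *next;
    char         *path;
};

struct ghost_cache {
    struct ghost *buckets[BUCKETS];
};

/* djb2 string hash: an assumed stand-in for SuperFastHash. */
static unsigned bucket_of(const char *path)
{
    unsigned long h = 5381;

    while (*path)
        h = h * 33 + (unsigned char)*path++;
    return (unsigned)(h % BUCKETS);
}

/* Returns 1 if the path is cached as "known missing", 0 otherwise. */
int ghost_find(const struct ghost_cache *c, const char *path)
{
    const struct ghost *g;

    for (g = c->buckets[bucket_of(path)]; g; g = g->next)
        if (strcmp(g->path, path) == 0)
            return 1;
    return 0;
}

/* Remember a failed lookup. Returns 0 on success, -1 on allocation failure. */
int ghost_add(struct ghost_cache *c, const char *path)
{
    unsigned b = bucket_of(path);
    size_t len = strlen(path) + 1;
    struct ghost *g;

    if (ghost_find(c, path))
        return 0;                       /* already cached */
    g = malloc(sizeof(*g));
    if (!g)
        return -1;
    g->path = malloc(len);
    if (!g->path) {
        free(g);
        return -1;
    }
    memcpy(g->path, path, len);
    g->next = c->buckets[b];            /* push on the front of the chain */
    c->buckets[b] = g;
    return 0;
}

/* Unlink through a pointer-to-pointer, so the head needs no special case. */
void ghost_remove(struct ghost_cache *c, const char *path)
{
    struct ghost **gpp;

    for (gpp = &c->buckets[bucket_of(path)]; *gpp; gpp = &(*gpp)->next) {
        struct ghost *g = *gpp;

        if (strcmp(g->path, path) == 0) {
            *gpp = g->next;
            free(g->path);
            free(g);
            return;
        }
    }
}
```

In translator terms, ghost_add corresponds to a lookup callback that came back with an error, ghost_find to the check at the top of the lookup function, and ghost_remove to exorcise being called after a successful lookup, create, or mkdir.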
negative-lookup / Makefile
# Change these to match your source code.
TARGET = negative.so
OBJECTS = negative.o
# Change these to match your environment.
GLFS_SRC = /root/glusterfs_patches
GLFS_ROOT = /opt/glusterfs
GLFS_VERS = 3git
GLFS_LIB = $(GLFS_ROOT)/$(GLFS_VERS)/lib64
HOST_OS = GF_LINUX_HOST_OS
# You shouldn't need to change anything below here.
CFLAGS = -fPIC -Wall -O0 -g \
        -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
        -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
        -I$(GLFS_SRC)/contrib/uuid -I.
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread
$(TARGET): $(OBJECTS)
	$(CC) $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
install: $(TARGET)
	cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features
clean:
	rm -f $(TARGET) $(OBJECTS)
negative-lookup / README.md
This is a very simple translator to cache "negative lookups" for workloads in which the same file is looked up many times in places where it doesn't exist. In particular, web script files with many includes/requires and long paths can generate hundreds of such lookups per front-end request. If we don't cache the negative results, this can mean hundreds of back-end network round trips per front-end request. So we cache. Very simple tests for this kind of workload on two machines connected via GigE show an approximately 3x performance improvement.
This code is nowhere near ready for production use yet. It was originally developed as a pedagogical example, but one that could lead to something truly useful as well. Among other things, the following features need to be added.
· Support for other namespace-modifying operations - link, symlink, mknod, rename, even funky xattr requests.
· Time-based cache expiration to cover the case where another client creates a file that's in our cache because it wasn't there when we first looked it up. This might even include periodic pruning of entries that are already stale but will never be looked up (and therefore never reaped in-line) again.
· Locking on the cache for when we're called concurrently.
This is intended to be a learning tool. I might not get back to this code myself for a long time, but I always have time to help anyone who's learning to write translators. If you want to help move it along, please fork and send me pull requests.
For more information on writing GlusterFS translators, check out my "Translator 101" series:
· http://hekafs.org/index.php/2011/11/translator-101-class-1-setting-the-stage/
· http://hekafs.org/index.php/2011/11/translator-101-lesson-2-init-fini-and-private-context/
· http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/
· http://hekafs.org/index.php/2011/11/translator-101-lesson-4-debugging-a-translator/
Module 2:
bypass / bypass.h
/*
* Copyright (c) 2011 Red Hat
*/
#ifndef __bypass_H__
#define __bypass_H__
#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif
#include "mem-types.h"
/* Deal with casts for 32-bit architectures. */
#define CAST2INT(x) ((uint64_t)(long)(x))
#define CAST2PTR(x) ((void *)(long)(x))
typedef struct {
        xlator_t *target;
} bypass_private_t;
enum gf_bypass_mem_types_ {
        gf_bypass_mt_priv_t = gf_common_mt_end + 1,
        gf_by_mt_int32_t,
        gf_bypass_mt_end
};
#endif /* __bypass_H__ */
bypass / bypass.c
/*
* Copyright (c) 2011 Red Hat
*/
#include <arpa/inet.h>
#include <string.h>
#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif
#include "glusterfs.h"
#include "call-stub.h"
#include "defaults.h"
#include "logging.h"
#include "xlator.h"
#include "bypass.h"
int32_t
bypass_readv (call_frame_t *frame, xlator_t *this, fd_t *fd, size_t size,
              off_t offset)
{
        bypass_private_t *priv = this->private;
        STACK_WIND (frame, default_readv_cbk, priv->target,
                    priv->target->fops->readv, fd, size, offset);
        return 0;
}
dict_t *
get_pending_dict (xlator_t *this)
{
        dict_t *dict = NULL;
        xlator_list_t *trav = NULL;
        char *key = NULL;
        int32_t *value = NULL;
        xlator_t *afr = NULL;
        bypass_private_t *priv = this->private;
        dict = dict_new();
        if (!dict) {
                gf_log (this->name, GF_LOG_WARNING, "failed to allocate dict");
                return NULL;
        }
        afr = this->children->xlator;
        for (trav = afr->children; trav; trav = trav->next) {
                if (trav->xlator == priv->target) {
                        continue;
                }
                if (gf_asprintf(&key,"trusted.afr.%s",trav->xlator->name) < 0) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "failed to allocate key");
                        goto free_dict;
                }
                value = GF_CALLOC(3,sizeof(*value),gf_by_mt_int32_t);
                if (!value) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "failed to allocate value");
                        goto free_key;
                }
                /* Amazingly, there's no constant for this. */
                value[0] = htons(1);
                if (dict_set_dynptr(dict,key,value,3*sizeof(*value)) < 0) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "failed to set up dict");
                        goto free_value;
                }
        }
        return dict;
free_value:
        GF_FREE(value);
free_key:
        GF_FREE(key);
free_dict:
        dict_unref(dict);
        return NULL;
}
int32_t
bypass_set_pending_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                        int32_t op_ret, int32_t op_errno, dict_t *dict)
{
        if (op_ret < 0) {
                goto unwind;
        }
        call_resume(cookie);
        return 0;
unwind:
        STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, NULL, NULL);
        return 0;
}
int32_t
bypass_writev_resume (call_frame_t *frame, xlator_t *this, fd_t *fd,
                      struct iovec *vector, int32_t count, off_t off,
                      struct iobref *iobref)
{
        bypass_private_t *priv = this->private;
        STACK_WIND (frame, default_writev_cbk, priv->target,
                    priv->target->fops->writev, fd, vector, count, off,
                    iobref);
        return 0;
}
int32_t
bypass_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
               struct iovec *vector, int32_t count, off_t off,
               struct iobref *iobref)
{
        dict_t *dict = NULL;
        call_stub_t *stub = NULL;
        bypass_private_t *priv = this->private;
        /*
         * I wish we could just create the stub pointing to the target's
         * writev function, but then we'd get into another translator's code
         * with "this" pointing to us.
         */
        stub = fop_writev_stub(frame, bypass_writev_resume,
                               fd, vector, count, off, iobref);
        if (!stub) {
                gf_log (this->name, GF_LOG_WARNING, "failed to allocate stub");
                goto wind;
        }
        dict = get_pending_dict(this);
        if (!dict) {
                gf_log (this->name, GF_LOG_WARNING, "failed to allocate dict");
                goto free_stub;
        }
        STACK_WIND_COOKIE (frame, bypass_set_pending_cbk, stub,
                           priv->target, priv->target->fops->fxattrop,
                           fd, GF_XATTROP_ADD_ARRAY, dict);
        return 0;
free_stub:
        call_stub_destroy(stub);
wind:
        dict_unref(dict);
        STACK_WIND (frame, default_writev_cbk, FIRST_CHILD(this),
                    FIRST_CHILD(this)->fops->writev, fd, vector, count, off,
                    iobref);
        return 0;
}
/*
* Even applications that only read seem to call this, and it can force an
* unwanted self-heal.
* TBD: there are probably more like this - stat, open(O_RDONLY), etc.
*/
int32_t
bypass_fstat (call_frame_t *frame, xlator_t *this, fd_t *fd)
{
        bypass_private_t *priv = this->private;
        STACK_WIND (frame, default_fstat_cbk, priv->target,
                    priv->target->fops->fstat, fd);
        return 0;
}
int32_t
init (xlator_t *this)
{
        xlator_t *tgt_xl = NULL;
        bypass_private_t *priv = NULL;
        if (!this->children || this->children->next) {
                gf_log (this->name, GF_LOG_ERROR,
                        "FATAL: bypass should have exactly one child");
                return -1;
        }
        tgt_xl = this->children->xlator;
        /* TBD: check for cluster/afr as well */
        if (strcmp(tgt_xl->type,"cluster/replicate")) {
                gf_log (this->name, GF_LOG_ERROR,
                        "%s must be loaded above cluster/replicate",
                        this->type);
                return -1;
        }
        /* TBD: pass target-translator name as an option (instead of first) */
        tgt_xl = tgt_xl->children->xlator;
        priv = GF_CALLOC (1, ...);
        if (!priv)
                return -1;
        priv->target = tgt_xl;
        this->private = priv;
        gf_log (this->name, GF_LOG_DEBUG, "bypass xlator loaded");
        return 0;
}
void
fini (xlator_t *this)
{
        bypass_private_t *priv = this->private;
        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);
        return;
}
struct xlator_fops fops = {
        .readv = bypass_readv,
        .writev = bypass_writev,
        .fstat = bypass_fstat
};
struct xlator_cbks cbks = {
};
struct volume_options options[] = {
        { .key = {NULL} },
};
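Stripped of the GlusterFS machinery, the writev path in the listing above is a two-step sequence: save the real write as a stub, send a prerequisite request, and let that request's callback either resume the stub or report the failure. The following is only a minimal standalone sketch of that control flow. Every name in it (write_stub_t, stub_create, stub_resume, set_pending_done, do_write) is hypothetical and stands in for the real call_stub_t, fop_writev_stub, call_resume and callback machinery; it is not the GlusterFS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a call stub: a saved call plus its arguments. */
typedef struct write_stub {
        int  (*resume) (int fd, size_t len);
        int    fd;
        size_t len;
} write_stub_t;

/* Save the deferred call; returns NULL if allocation fails. */
static write_stub_t *
stub_create (int (*resume) (int, size_t), int fd, size_t len)
{
        write_stub_t *stub = calloc (1, sizeof (*stub));
        if (!stub)
                return NULL;
        stub->resume = resume;
        stub->fd = fd;
        stub->len = len;
        return stub;
}

/* Run the saved call, then release the stub. */
static int
stub_resume (write_stub_t *stub)
{
        int ret = 0;
        assert (stub != NULL && stub->resume != NULL);
        ret = stub->resume (stub->fd, stub->len);
        free (stub);
        return ret;
}

/*
 * Callback for the prerequisite step. On success the deferred write goes
 * ahead; on failure the stub is dropped and the error is passed back.
 */
static int
set_pending_done (write_stub_t *stub, int op_ret)
{
        if (op_ret < 0) {
                free (stub);
                return op_ret;
        }
        return stub_resume (stub);
}

/* The deferred operation itself; in this sketch it only reports the size. */
static int
do_write (int fd, size_t len)
{
        (void) fd;
        return (int) len;
}
```

In this model the stub is released exactly once on either path, which is the property worth checking in any real callback that carries a stub as its cookie.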
bypass/Makefile
# Change these to match your source code.
TARGET = bypass.so
OBJECTS = bypass.o
# Change these to match your environment.
GLFS_SRC = /root/glusterfs_patches
GLFS_ROOT = /opt/glusterfs
GLFS_VERS = 3git
GLFS_LIB = `ls -d $(GLFS_ROOT)/$(GLFS_VERS)/lib*`
HOST_OS = GF_LINUX_HOST_OS
# You shouldn't need to change anything below here.
CFLAGS = -fPIC -Wall -O0 -g \
-DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
-I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
-I$(GLFS_SRC)/contrib/uuid -I.
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread
$(TARGET): $(OBJECTS)
	$(CC) $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
install: $(TARGET)
	cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features
clean:
	rm -f $(TARGET) $(OBJECTS)
bypass/bytest-fuse.vol
volume bytest-posix-0
    type storage/posix
    option directory /export/bytest1
end-volume
volume bytest-locks-0
    type features/locks
    subvolumes bytest-posix-0
end-volume
volume bytest-client-1
    type protocol/client
    option remote-host gfs1
    option remote-subvolume /export/bytest2
    option transport-type tcp
end-volume
volume bytest-replicate-0
    type cluster/replicate
    subvolumes bytest-locks-0 bytest-client-1
end-volume
volume bytest-bypass
    type features/bypass
    subvolumes bytest-replicate-0
end-volume
volume bytest
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes bytest-bypass
end-volume
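In the volfile above, bypass has exactly one subvolume (the replicate volume), and the first subvolume of replicate is the local locks/posix stack. That is the shape init relies on when it picks its target. Here is a rough standalone sketch of that selection logic, using hypothetical miniature types (node_t, node_list_t) and a hypothetical helper (pick_target) rather than the real xlator_t structures. It also rejects a replicate node with no subvolumes, a case the original init does not check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a translator graph node and its child list. */
typedef struct node node_t;
typedef struct node_list {
        node_t           *xlator;
        struct node_list *next;
} node_list_t;
struct node {
        const char  *type;
        node_list_t *children;
};

/*
 * Return the first grandchild of "self" when self has exactly one child
 * and that child is a replicate node; return NULL otherwise.
 */
static node_t *
pick_target (const node_t *self)
{
        node_t *child = NULL;
        assert (self != NULL);
        /* Exactly one child: list is non-empty and has no second entry. */
        if (!self->children || self->children->next)
                return NULL;
        child = self->children->xlator;
        if (strcmp (child->type, "cluster/replicate") != 0)
                return NULL;
        /* Extra guard, not present in the original listing. */
        if (!child->children)
                return NULL;
        return child->children->xlator;
}
```

Picking the first subvolume only works if the local brick is listed first under replicate, as it is in this volfile.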
bypass/README.md
This is a proof-of-concept translator for an [...]
· If multiple clients try to write the same file with bypass turned on, you'll get massive split-brain problems. Solutions might include honoring AFR's quorum-enforcement rules, auto-issuing locks during open to prevent such concurrent access, or simply documenting the fact that users must do such locking themselves. The last might sound like a cop-out, but such locking is already common for the likely use case of serving virtual-machine images.
· We only intercept readv, writev, and fstat. There are many other calls that can trigger self-heal, including plain old lookup. The only way to prevent a lookup self-heal would be to put another translator below AFR to intercept xattr requests and pretend everything's OK. Ick. Remember, though, that this is only a proof of concept. If we really wanted to get serious about this, we could implement the same technique within AFR and do all the necessary coordination there.
· It would be nice if the AFR subvolume to use could be specified as an option (instead of just picking the first child), if bypass could be made selective, etc.
The coolest direction to go here would be to put information about writes we've seen onto a queue, with a separate process listening on that queue to perform asynchronous but nearly immediate self-heal on just those files. As long as the other consistency issues are handled properly, this might be a really easy way to get near-local performance for virtual-machine-image use cases without introducing consistency/recovery nightmares.
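To make the queue idea slightly more concrete, here is a very rough sketch of the data structure alone: a fixed-size FIFO of paths that have seen bypassed writes. Everything in it is an assumption made for illustration (the names heal_queue_t, heal_queue_push and heal_queue_pop, the capacity, and the use of plain path strings). It is single-process and unlocked, so a real version would still need some form of inter-process transport and a policy for what to do when the queue fills up.

```c
#include <assert.h>
#include <stddef.h>

#define HEAL_QUEUE_SLOTS 8  /* arbitrary capacity for the sketch */

/* Hypothetical fixed-size FIFO of paths awaiting self-heal. */
typedef struct heal_queue {
        const char *paths[HEAL_QUEUE_SLOTS];
        size_t      head;   /* index of the oldest entry */
        size_t      count;  /* number of entries in use */
} heal_queue_t;

/* Record a written path; returns 0 on success, -1 if the queue is full. */
static int
heal_queue_push (heal_queue_t *q, const char *path)
{
        assert (q != NULL && path != NULL);
        if (q->count == HEAL_QUEUE_SLOTS)
                return -1;
        q->paths[(q->head + q->count) % HEAL_QUEUE_SLOTS] = path;
        q->count++;
        return 0;
}

/* Hand the oldest path to the healer; returns NULL if nothing is queued. */
static const char *
heal_queue_pop (heal_queue_t *q)
{
        const char *path = NULL;
        assert (q != NULL);
        if (q->count == 0)
                return NULL;
        path = q->paths[q->head];
        q->head = (q->head + 1) % HEAL_QUEUE_SLOTS;
        q->count--;
        return path;
}
```

The write path would push after a successful bypassed write, and the healer would pop and heal one file at a time. A full queue is reported to the caller rather than silently dropping entries, since a dropped entry would mean a file that never gets healed.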